-
ItemSimilarityJob creates no output
Something Something 2012-06-05, 02:36
My job setup is really simple. It looks like this:
public int run(String[] args) throws Exception {
    String datasetDate = args[0];
    String inputPath = args[1];
    String configFile = args[2];
    String ouputLocation = args[3];

    Configuration config = getConf();
    config.addResource(new Path(configFile));
    logger.error("config: " + config.toString());

    File inputFile = new File(inputPath);
    File outputDir = new File(ouputLocation);
    outputDir.delete();
    File tmpDir = new File("/tmp");

    ItemSimilarityJob similarityJob = new ItemSimilarityJob();

    Configuration conf = new Configuration();
    conf.set("mapred.input.dir", inputFile.getAbsolutePath());
    conf.set("mapred.output.dir", outputDir.getAbsolutePath());
    conf.setBoolean("mapred.output.compress", false);

    similarityJob.setConf(conf);

    similarityJob.run(new String[]{"--tempDir", tmpDir.getAbsolutePath(),
        "--similarityClassname", PearsonCorrelationSimilarity.class.getName(),});

    return 0;
}

The input file is sorted by UserId, ItemId & Preference. Preference is always '1'. A few lines from the file look like this:
-1000000334008648908 1 1
-1000000334008648908 70 1
-1000000334008648908 2090 1
-1000000334008648908 12872 1
-1000000334008648908 32790 1
-1000000334008648908 32799 1
-1000000334008648908 32969 1
-1000000397028994738 1 1
-1000000397028994738 12872 1
-1000000397028994738 32790 1
-1000000397028994738 32796 1
-1000000397028994738 32939 1
-100000083781885705 1 1
-100000083781885705 12872 1
-100000083781885705 32790 1
-100000083781885705 32837 1
-100000083781885705 33723 1
-1000001014586220418 1 1
-1000001014586220418 12872 1
-1000001014586220418 32790 1
& so on...
(UserId is created using MemoryIDMigrator)

The job internally runs the following 7 Hadoop jobs, which all run successfully:

PreparePreferenceMatrixJob-ItemIDIndexMapper-Reducer
PreparePreferenceMatrixJob-ToItemPrefsMapper-Reducer
PreparePreferenceMatrixJob-ToItemVectorsMapper-Reducer
RowSimilarityJob-VectorNormMapper-Reducer
RowSimilarityJob-CooccurrencesMapper-Reducer
RowSimilarityJob-UnsymmetrifyMapper-Reducer
ItemSimilarityJob-MostSimilarItemPairsMapper-Reducer

The problem is that the output file is empty! What am I missing? Please help. Thanks.
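One property of the setup above may be worth checking, though the thread does not establish it as the cause: textbook Pearson correlation is undefined when a vector has zero variance, and here every preference is '1'. The sketch below is plain Java, not Mahout's implementation; the class and method names are hypothetical, and how Mahout's distributed job treats this case is not shown in the thread.

```java
public class PearsonOnConstantPrefs {

    /**
     * Textbook Pearson correlation of two equal-length vectors.
     * Returns NaN when either vector has zero variance (0.0 / 0.0).
     */
    public static double pearson(double[] x, double[] y) {
        int n = x.length;
        double meanX = 0.0;
        double meanY = 0.0;
        for (int i = 0; i < n; i++) {
            meanX += x[i];
            meanY += y[i];
        }
        meanX /= n;
        meanY /= n;

        double cov = 0.0;
        double varX = 0.0;
        double varY = 0.0;
        for (int i = 0; i < n; i++) {
            double dx = x[i] - meanX;
            double dy = y[i] - meanY;
            cov += dx * dy;
            varX += dx * dx;
            varY += dy * dy;
        }
        return cov / Math.sqrt(varX * varY);
    }

    public static void main(String[] args) {
        // Two items whose preferences are all '1', as in the posted input.
        double[] itemA = {1, 1, 1, 1};
        double[] itemB = {1, 1, 1, 1};
        System.out.println("constant prefs: " + pearson(itemA, itemB));

        // Two items with varying preferences, for comparison.
        double[] itemC = {1, 2, 3};
        double[] itemD = {2, 4, 6};
        System.out.println("varying prefs:  " + pearson(itemC, itemD));
    }
}
```

If this turns out to matter, a measure meant for boolean data might be a better fit than Pearson, but that is a guess to verify against the Mahout documentation.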
-
Re: ItemSimilarityJob creates no output
Something Something 2012-06-05, 18:13
One thing I noticed is that in step 4 of this process (RowSimilarityJob-VectorNormMapper-Reducer)
Mapper input: 6,925 Mapper output: 3
Reducer input: 3 Reducer output: 0
Most of the values going into the RowSimilarityJob are defaults. Here's what I see in the code:
if (shouldRunNextPhase(parsedArgs, currentPhase)) {
    int numberOfUsers = HadoopUtil.readInt(new Path(prepPath, PreparePreferenceMatrixJob.NUM_USERS), getConf());

    ToolRunner.run(getConf(), new RowSimilarityJob(), new String[] {
        "--input", new Path(prepPath, PreparePreferenceMatrixJob.RATING_MATRIX).toString(),
        "--output", similarityMatrixPath.toString(),
        "--numberOfColumns", String.valueOf(numberOfUsers),
        "--similarityClassname", similarityClassName,
        "--maxSimilaritiesPerRow", String.valueOf(maxSimilarItemsPerItem),
        "--excludeSelfSimilarity", String.valueOf(Boolean.TRUE),
        "--threshold", String.valueOf(threshold),
        "--tempDir", getTempPath().toString() });
}

Any ideas?
-
Re: ItemSimilarityJob creates no output
Lance Norskog 2012-06-06, 03:49
You can single-step these jobs inside Eclipse or IntelliJ.
Lance Norskog [EMAIL PROTECTED]
-
Re: ItemSimilarityJob creates no output
Sean Owen 2012-06-06, 05:59
Is your input very small? It is probably getting mostly pruned as a result, as most of it looks like low-count data. And then there is almost no info on which to compute similarity.
-
Re: ItemSimilarityJob creates no output
Something Something 2012-06-06, 15:57
The input size was about 6 million, so I was expecting to find some similarities. Anyway, I have started a test with the real dataset, which contains 700 million lines. We shall see how that goes. One quick question, though:
I am using MemoryIDMigrator to convert UserIds from String to Long as follows:
static UpdatableIDMigrator migrator = new MemoryIDMigrator();
<some code omitted here...>
migrator.toLongID(strUserID);

Question: If I pass the same userId multiple times to this method, I am guaranteed to get the same 'Long' number back, correct?
-
Re: ItemSimilarityJob creates no output
Sean Owen 2012-06-06, 16:01
That sounds like plenty of data -- doubting that's any issue. Is it very sparse? Meaning many items exist just for one user? It's really sparseness that might produce few or no similarities.
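A quick way to answer the sparseness question is to count how many items were seen by exactly one user. The following is a minimal sketch in plain Java, assuming each input line holds a user ID, an item ID and a preference separated by whitespace or commas; the class and method names are hypothetical and nothing here comes from Mahout.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class SparsenessCheck {

    /** Returns the fraction of distinct items that appear for exactly one user. */
    public static double singleUserItemFraction(List<String> lines) {
        Map<String, Set<String>> usersPerItem = new HashMap<>();
        for (String line : lines) {
            // Assumed layout: userId, itemId, preference (whitespace or comma separated).
            String[] fields = line.trim().split("[\\s,]+");
            if (fields.length < 2) {
                continue; // skip blank or malformed lines
            }
            usersPerItem.computeIfAbsent(fields[1], k -> new HashSet<>()).add(fields[0]);
        }
        if (usersPerItem.isEmpty()) {
            return 0.0;
        }
        long singleUserItems = usersPerItem.values().stream()
                .filter(users -> users.size() == 1)
                .count();
        return (double) singleUserItems / usersPerItem.size();
    }

    public static void main(String[] args) {
        // A few lines copied from the sample posted earlier in the thread.
        List<String> sample = Arrays.asList(
                "-1000000334008648908 1 1",
                "-1000000334008648908 70 1",
                "-1000000397028994738 1 1",
                "-1000000397028994738 32796 1");
        System.out.println("fraction of single-user items: " + singleUserItemFraction(sample));
    }
}
```

A fraction close to 1.0 would mean most items have no second user to co-occur with, which is the situation described above.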
I think something else is at work here but don't know off the top of my head based on the info so far.
Yes it is always the same hash function -- top 8 bytes of the MD5 hash. Same input means same output.
Sean
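To illustrate the "top 8 bytes of the MD5 hash" idea, here is a minimal sketch using only the JDK. It is not Mahout's code: the UTF-8 encoding, the big-endian byte order, and the class and method names are assumptions, so the values it produces may differ from what MemoryIDMigrator returns. It only shows why the same string always maps to the same long.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Md5ToLong {

    /** Maps a string ID to a long built from the first 8 bytes of its MD5 digest. */
    public static long toLongId(String id) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] digest = md5.digest(id.getBytes(StandardCharsets.UTF_8));
            // Interpret the first 8 of the 16 digest bytes as a big-endian long (assumed order).
            return ByteBuffer.wrap(digest, 0, 8).getLong();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available on this JVM", e);
        }
    }

    public static void main(String[] args) {
        String userId = "some-user-id"; // hypothetical ID for illustration
        long first = toLongId(userId);
        long second = toLongId(userId);
        System.out.println(first + " == " + second + " : " + (first == second));
    }
}
```

Because the mapping is a pure function of the input string, it does not depend on the migrator instance or on call order. Negative values, like the user IDs in the sample input, are expected since the long is signed.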
On Wed, Jun 6, 2012 at 4:57 PM, Something Something <[EMAIL PROTECTED]> wrote: > The input size was about 6 Million so I was expecting to find some > similarities. Anyway, I have started a test with the real dataset that > contains 700 million lines. We shall see how that goes. One quick > question, though: > > I am using MemoryIDMigrator to convert UserIds from String to Long as > follows: > > static UpdatableIDMigrator migrator = new MemoryIDMigrator(); > <some code omitted here...> > migrator.toLongID(strUserID); > > Question: If I pass the same userId multiple times to this method, I am > guaranteed to get the same 'Long' number back, correct?
-
Re: ItemSimilarityJob creates no output
Something Something 2012-06-06, 17:10
Hmm... that's what I am thinking.. something is amiss! A few lines from the files are pasted above. The pattern is fairly similar. Is there a place where I can upload part of my file for someone else to try?
OR BETTER YET - Can someone provide a small file that always returns a few similarities? Is a file such as this included in the source?
Thanks for the help.
-
Re: ItemSimilarityJob creates no output
Sean Owen 2012-06-06, 17:20
Just make, say, a completely dense fake data set over 1000 users and items. Something will come out.
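A dense fake data set of this kind can be generated with a few lines of Java. This is a sketch under assumptions: the comma-separated userID,itemID,preference layout, the output file name, the 1 to 5 rating range and the class name are all illustrative choices, not something specified in this thread. Ratings are varied on purpose so that correlation-based measures have some variance to work with.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class DenseFakeData {

    /** Builds one "user,item,pref" line for every user/item pair (a fully dense matrix). */
    public static List<String> generate(int numUsers, int numItems, long seed) {
        Random random = new Random(seed); // fixed seed keeps the file reproducible
        List<String> lines = new ArrayList<>(numUsers * numItems);
        for (int user = 1; user <= numUsers; user++) {
            for (int item = 1; item <= numItems; item++) {
                int pref = 1 + random.nextInt(5); // rating in 1..5
                lines.add(user + "," + item + "," + pref);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // 1000 users x 1000 items = 1,000,000 lines; the file name is hypothetical.
        List<String> lines = generate(1000, 1000, 42L);
        Files.write(Paths.get("dense-prefs.csv"), lines);
        System.out.println("wrote " + lines.size() + " lines");
    }
}
```

If the job still produces nothing on such a file, the input data is unlikely to be the culprit and the job configuration becomes the more likely suspect.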
-
Re: ItemSimilarityJob creates no output
Something Something 2012-06-07, 06:05
I tried with a bigger/denser dataset, but still no output. Here's what I noticed:
In the MergeVectorsReducer, I see the following:
@Override
protected void reduce(IntWritable row, Iterable<VectorWritable> partialVectors, Context ctx)
    throws IOException, InterruptedException {
    Vector partialVector = Vectors.merge(partialVectors);

    if (row.get() == NORM_VECTOR_MARKER) {
        Vectors.write(partialVector, normsPath, ctx.getConfiguration());
    } else if (row.get() == MAXVALUE_VECTOR_MARKER) {
        Vectors.write(partialVector, maxValuesPath, ctx.getConfiguration());
    } else if (row.get() == NUM_NON_ZERO_ENTRIES_VECTOR_MARKER) {
        Vectors.write(partialVector, numNonZeroEntriesPath, ctx.getConfiguration(), true);
    } else {
        ctx.write(row, new VectorWritable(partialVector));
    }
}

There's nothing coming out of this method. Where is the output supposed to go? In other words, what Path is this:

normsPath = new Path(ctx.getConfiguration().get(NORMS_PATH));

There are 150 rows going into this reducer & nothing is coming out. Where is it supposed to go under /tmp? I see the following under HDFS:

-rw-r--r-- 3 root supergroup 7 2012-06-06 21:57 /user/XXX/tmp/maxValues.bin
-rw-r--r-- 3 root supergroup 7 2012-06-06 21:57 /user/XXX/tmp/norms.bin
-rw-r--r-- 3 root supergroup 7 2012-06-06 21:57 /user/XXX/tmp/numNonZeroEntries.bin
drwxrwxrwx - root supergroup 0 2012-06-06 21:57 /user/XXX/tmp/pairwiseSimilarity
drwxrwxrwx - root supergroup 0 2012-06-06 21:55 /user/XXX/tmp/prepareRatingMatrix
drwxrwxrwx - root supergroup 0 2012-06-06 21:58 /user/XXX/tmp/similarityMatrix
drwxrwxrwx - root supergroup 0 2012-06-06 21:57 /user/XXX/tmp/weights