-
Re: How to find the k most similar docs
Suneel Marthi 2012-02-20, 05:00

Hi Pat,

1. Please look at the discussion thread at http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/browser for a description of what the RowSimilarityJob does. The RowSimilarityJob implementation is based on the research paper http://www.csee.ogi.edu/~zak/cs506-pslc/docsim.pdf

   I'll add the details on the Mahout wiki page sometime this week.

2. 'maxSimilaritiesPerRow' returns the best similarities (not the first); if not specified, it defaults to the top 100.

3. If you would like to discard the similarities per row below a certain value, you can specify a threshold with -tr, which limits the results to only those documents that have a similarity value greater than the threshold.

   The similarity values that you get in the final output should give you an idea of what T1 and T2 should be. In my particular use case I was only interested in documents that had a similarity measure of 0.7 or greater, hence 0.7 would be my T2; and the most similar documents have a similarity value of 0.99999 (which was what I used as my T1).

4. 'numberOfColumns' is not optional; but I tend to agree with you that, if not specified, it should be inferred automatically from the size of the input vector. This could be an enhancement to add to the RowSimilarityJob.

   The code snippet below gets the number of columns in a matrix if not specified by the user.

    Path inputMatrixPath = new Path(getInputPath());
    SequenceFile.Reader sequenceFileReader = new SequenceFile.Reader(fs, inputMatrixPath, conf);
    int NumberOfColumns = getDimensions(sequenceFileReader);
    sequenceFileReader.close();

    // Reads the first row of the matrix and returns the size of its vector.
    private int getDimensions(Reader reader) throws IOException, InstantiationException, IllegalAccessException {
      Class keyClass = reader.getKeyClass();
      Writable row = (Writable) keyClass.newInstance();
      if (!reader.getValueClass().equals(VectorWritable.class)) {
        throw new IllegalArgumentException("Value type of sequencefile must be a VectorWritable");
      }
      VectorWritable vw = new VectorWritable();
      if (!reader.next(row, vw)) {
        log.error("matrix must have at least one row");
        throw new IllegalStateException();
      }
      Vector v = vw.get();
      return v.size();
    }

5. RowSimilarityJob also has an excludeSelfSimilarity option (false by default); you need to set it so that you don't end up comparing a document with itself and getting a similarity measure of 1.0 (if using the Cosine measure).

Let me know if you have any more questions.

________________________________
From: Sebastian Schelter <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Sunday, February 19, 2012 4:33 PM
Subject: Re: How to find the k most similar docs

Hi Pat,

'numberOfColumns' is not optional but is only used by a few similarityMeasures (such as loglikelihood ratio). 'maxSimilaritiesPerRow' retains the top similarities.

--sebastian

On 19.02.2012 22:11, Pat Ferrel wrote:
> This looks perfect, thanks.
>
> I had planned to do the RowSimilarityJob after clustering to reduce the
> rows from the entire corpus to only those in a cluster. You mention
> using the distance between similar rows to get an idea of the distances
> for canopy clustering. This seems a very good idea since I have no other
> good way to generate T1 and T2. The downside is that I have to do
> RowSimilarityJob on all docs in the corpus. I assume that since you have
> done this on 10 Million docs that the benefit in getting good canopies
> outweighs doing similarity on all docs as far as processing resources
> needed?
>
> I am new to reading mapreduce code so may I ask some noob questions:
>
> * is the best documentation here?
>
> https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/math/hadoop/similarity/RowSimilarityJob.html#run(java.lang.String[])
>
> * the command line arguments include: numberOfColumns, shouldn't that
> be easily extracted from the input matrix? is this optional?
> [...]
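To make points 2, 3 and 5 of Suneel's reply concrete, here is a minimal plain-Java sketch of how "top k per row", a similarity threshold, and self-exclusion interact for cosine similarity. It is not Mahout code and does not reflect how RowSimilarityJob is implemented internally; the class name, method names and the brute-force loop are assumptions made purely for illustration (Java 8 or later assumed).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Plain-Java illustration only; hypothetical names, not Mahout API. */
class TopKSimilaritySketch {

    /** One (row index, similarity) result. */
    static final class Neighbor {
        final int row;
        final double similarity;

        Neighbor(int row, double similarity) {
            this.row = row;
            this.similarity = similarity;
        }
    }

    /** Cosine similarity of two dense vectors of equal length; 0.0 if either is all zeros. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /**
     * Returns at most k rows most similar to rows[query], best first.
     * Rows whose similarity is not greater than the threshold are discarded,
     * and the query row itself is skipped when excludeSelf is true.
     */
    static List<Neighbor> topK(double[][] rows, int query, int k,
                               double threshold, boolean excludeSelf) {
        List<Neighbor> candidates = new ArrayList<>();
        for (int other = 0; other < rows.length; other++) {
            if (excludeSelf && other == query) {
                continue;
            }
            double similarity = cosine(rows[query], rows[other]);
            if (similarity > threshold) {
                candidates.add(new Neighbor(other, similarity));
            }
        }
        // Sort by similarity, highest first, then keep the best k.
        candidates.sort(Comparator.comparingDouble((Neighbor n) -> n.similarity).reversed());
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    public static void main(String[] args) {
        double[][] docs = { {1, 1, 0}, {1, 1, 0.1}, {0, 1, 1}, {0, 0, 1} };
        for (Neighbor n : topK(docs, 0, 2, 0.0, true)) {
            System.out.println(n.row + " " + n.similarity);
        }
    }
}
```

With excludeSelf set to false, the query row would come back as its own best match with a cosine of (numerically almost exactly) 1.0, which is the situation point 5 warns about.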
-
Re: How to find the k most similar docs
Pat Ferrel 2012-02-20, 19:10

Suneel, this is extremely helpful. I hope it gets to the Mahout wiki.

Some thoughts:

* A threshold for self-similarity seems useful. I'm thinking of mirrored news groups, bulletin boards, and social network posts where the docs may be very, very close but have some surrounding text that doesn't quite match, so similarity 1.0 might not work. This is not an academic question since these are some of the docs we plan to examine. It should be pretty easy to do this in a post-processing step for now.
* I see how you use RowSimilarityJob to guess at good T1 and T2. In my case I am also concerned with the cohesion of the resulting clusters. The outliers will likely never be seen by humans. The intuition here is that well-formed clusters, even if diffuse, will give better results for us than a greater number of poorly-formed clusters. One way we have considered getting this result is to form lots of clusters, perhaps as you describe using T1 and T2 derived from RowSimilarityJob, then throw out ones that do not match some measurement (Dunning mentions entropy). This would allow overfitting but toss the overfit cases.
  http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output#9d3f6a55f4a91cb6
  I don't see that anyone has implemented something like this yet.

Thanks again.
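The post-processing step Pat mentions for near-duplicates (mirrored posts that score just under 1.0) could look roughly like the sketch below. It assumes the similarity results have already been read into memory as (row, row, similarity) triples; the class, its fields, and the 0.98 cut-off are hypothetical and chosen only for illustration, not taken from Mahout.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java illustration only; hypothetical names, not Mahout API. */
class NearDuplicateFilterSketch {

    /** One similarity result: two row ids and their similarity. */
    static final class SimilarPair {
        final int rowA;
        final int rowB;
        final double similarity;

        SimilarPair(int rowA, int rowB, double similarity) {
            this.rowA = rowA;
            this.rowB = rowB;
            this.similarity = similarity;
        }
    }

    /** Keeps only pairs whose similarity is strictly below the near-duplicate threshold. */
    static List<SimilarPair> dropNearDuplicates(List<SimilarPair> pairs,
                                                double nearDuplicateThreshold) {
        List<SimilarPair> kept = new ArrayList<>();
        for (SimilarPair pair : pairs) {
            if (pair.similarity < nearDuplicateThreshold) {
                kept.add(pair);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<SimilarPair> pairs = new ArrayList<>();
        pairs.add(new SimilarPair(0, 1, 0.995)); // e.g. a mirrored post with a different footer
        pairs.add(new SimilarPair(0, 2, 0.80));
        pairs.add(new SimilarPair(0, 3, 0.72));
        for (SimilarPair pair : dropNearDuplicates(pairs, 0.98)) {
            System.out.println(pair.rowA + " -> " + pair.rowB + " : " + pair.similarity);
        }
    }
}
```

The threshold would have to be tuned on real data; a value that is too low would also discard genuinely related but distinct documents.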
-
Re: How to find the k most similar docs
Suneel Marthi 2012-02-20, 20:28

Pat,

You are welcome.

FYI, another option you could consider for determining document similarity would be 'MinHash clustering'.

Mahout comes with a MinHash clustering implementation, but I never had good results from it and I never got it to run successfully on a really large corpus (like a million documents).

Look at the thread at http://www.searchworkings.org/forum/-/message_boards/view_message/359922.

Here is a reference to Andrei Broder's paper on detecting duplicates in documents: http://dl.acm.org/citation.cfm?id=736184

Given a choice between RowSimilarityJob and MinHash clustering, I would prefer the latter, but I chose the former because I had no success with Mahout's MinHash implementation.

Suneel
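For readers unfamiliar with the MinHash idea Suneel refers to, the following is a minimal plain-Java sketch of the general technique: each document's token set is reduced to a short signature of per-hash-function minima, and the fraction of matching signature positions estimates the Jaccard similarity of the two sets. This is not Mahout's MinHash clustering implementation; the class name, the linear hash family, the whitespace tokenizer, and the parameter choices are all assumptions for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/** Plain-Java illustration of MinHash signatures; hypothetical names, not Mahout API. */
class MinHashSketch {

    private static final long PRIME = 2147483647L; // 2^31 - 1

    private final long[] a;
    private final long[] b;

    /** Builds numHashes hash functions of the form (a * x + b) mod PRIME. */
    MinHashSketch(int numHashes, long seed) {
        Random random = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + random.nextInt(Integer.MAX_VALUE - 1); // 1 .. PRIME - 1
            b[i] = random.nextInt(Integer.MAX_VALUE);         // 0 .. PRIME - 1
        }
    }

    /** Splits text into a set of lower-cased whitespace-separated tokens. */
    static Set<String> tokens(String text) {
        return new HashSet<>(Arrays.asList(text.toLowerCase().trim().split("\\s+")));
    }

    /** For each hash function, keeps the minimum hash value over all tokens. */
    long[] signature(Set<String> tokenSet) {
        long[] signature = new long[a.length];
        Arrays.fill(signature, Long.MAX_VALUE);
        for (String token : tokenSet) {
            long x = token.hashCode() & 0x7fffffffL; // force a non-negative value
            for (int i = 0; i < a.length; i++) {
                long hash = (a[i] * x + b[i]) % PRIME;
                if (hash < signature[i]) {
                    signature[i] = hash;
                }
            }
        }
        return signature;
    }

    /** Fraction of signature positions that agree; an estimate of Jaccard similarity. */
    static double estimateJaccard(long[] first, long[] second) {
        int equal = 0;
        for (int i = 0; i < first.length; i++) {
            if (first[i] == second[i]) {
                equal++;
            }
        }
        return (double) equal / first.length;
    }

    public static void main(String[] args) {
        MinHashSketch minHash = new MinHashSketch(128, 42L);
        long[] s1 = minHash.signature(tokens("how to find the k most similar docs"));
        long[] s2 = minHash.signature(tokens("how to find the ten most similar documents"));
        System.out.println("estimated Jaccard: " + estimateJaccard(s1, s2));
    }
}
```

More hash functions give a tighter estimate at the cost of longer signatures; near-duplicate detection then compares signatures instead of full term vectors.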
-
Re: How to find the k most similar docs
Lance Norskog 2012-02-21, 10:37

The RowSimilarityJob discussion is here:
http://www.lucidimagination.com/search/document/d8923398aa5af753
-
Re: How to find the k most similar docs
Pat Ferrel 2012-03-05, 19:29

I'm using Mahout 0.6 compiled from source via 'mvn install'. I used Suneel's code (from his earlier message) to get NumberOfColumns. When I try to run the rowsimilarity job via:

    bin/mahout rowsimilarity -i wikipedia-clusters/tfidf-vectors/ -o /wikipedia-similarity -r 87325 -s SIMILARITY_COSINE -m 10 -ess true

I get the following error:

    12/03/04 19:14:32 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --excludeSelfSimilarity=true, --input=wikipedia-clusters/tfidf-vectors/, --maxSimilaritiesPerRow=10, --numberOfColumns=87325, --output=/wikipedia-similarity, --similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
    2012-03-04 19:14:32.376 java[1090:1903] Unable to load realm info from SCDynamicStore
    12/03/04 19:14:33 INFO input.FileInputFormat: Total input paths to process : 1
    12/03/04 19:14:33 INFO mapred.JobClient: Running job: job_local_0001
    12/03/04 19:14:33 INFO mapred.MapTask: io.sort.mb = 100
    12/03/04 19:14:33 INFO mapred.MapTask: data buffer = 79691776/99614720
    12/03/04 19:14:33 INFO mapred.MapTask: record buffer = 262144/327680
    12/03/04 19:14:34 WARN mapred.LocalJobRunner: job_local_0001
    java.lang.ClassCastException: org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable
        at org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$VectorNormMapper.map(RowSimilarityJob.java:154)
        at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)

The cast error (as I understand it) usually happens when you pass in a classname incorrectly. This seems likely since cooccurrence similarity is being used? I've probably missed something obvious about how to pass in the similarity measure to use?
-
Re: How to find the k most similar docs
Sebastian Schelter 2012-03-05, 19:32

That's the problem:

    org.apache.hadoop.io.Text cannot be cast to org.apache.hadoop.io.IntWritable

RowSimilarityJob expects <IntWritable,VectorWritable> as input; it seems you supply <Text,VectorWritable>.

--sebastian
-
Re: How to find the k most similar docs
Suneel Marthi 2012-03-05, 19:48

Pat,

Your input to RowSimilarity seems to be the tfidf-vectors directory, which is <Text, VectorWritable>.

Before executing the RowSimilarity job you need to run the RowIdJob, which creates a matrix of <IntWritable, VectorWritable>. This matrix should be the input to RowSimilarity.

Also, from your command you seem to be missing the --tempDir argument; you would need that too.

Suneel
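Conceptually, the re-keying step Suneel describes replaces each text key with an integer row id. The sketch below shows that idea with plain Java collections; it is an assumption-laden illustration, not the RowIdJob source. The sequential id assignment, the class name, and the in-memory lookup list for mapping ids back to document names are all hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Plain-Java illustration of re-keying named vectors by integer row id; not Mahout code. */
class RowIdSketch {

    /** Row i of the matrix; the list index plays the role of the integer key. */
    final List<double[]> matrix = new ArrayList<>();

    /** Lookup from integer row id back to the original text key. */
    final List<String> docIndex = new ArrayList<>();

    /** Assigns sequential ids (0, 1, 2, ...) in the iteration order of the input map. */
    static RowIdSketch fromNamedVectors(Map<String, double[]> namedVectors) {
        RowIdSketch result = new RowIdSketch();
        for (Map.Entry<String, double[]> entry : namedVectors.entrySet()) {
            result.docIndex.add(entry.getKey());
            result.matrix.add(entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, double[]> tfidf = new LinkedHashMap<>();
        tfidf.put("/wiki/Apache_Mahout", new double[] {0.2, 0.0, 1.3});
        tfidf.put("/wiki/Apache_Hadoop", new double[] {0.0, 0.7, 1.1});

        RowIdSketch rows = fromNamedVectors(tfidf);
        for (int id = 0; id < rows.matrix.size(); id++) {
            System.out.println(id + " <- " + rows.docIndex.get(id));
        }
    }
}
```

Keeping such a lookup matters because the similarity output is keyed by integer row id, so the ids have to be translated back to document names before the results are useful.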
-
Re: How to find the k most similar docs
Fernando Fernández 2012-03-06, 09:00

I'm surprised no one has mentioned SVD yet. You are supposed to obtain better results using SVD factors instead of the original TF-IDF vectors when computing similarities (this is the theory). Many text mining applications follow these steps:

- Stopword removal.
- TF-IDF computation.
- SVD factorization.
- Clustering or supervised classification using SVD factors.

Mahout has distributed SVD routines you can use (DistributedLanczosSolver); you may want to check them out.

Best,
Fernando.
-
Re: How to find the k most similar docs
Pat Ferrel 2012-03-07, 01:14

Ok, making progress. I created a matrix using rowid and got the following output:

    Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowid -i wikipedia-clusters/tfidf-vectors/ -o wikipedia-matrix --tempDir temp
    ...
    12/03/05 16:52:45 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --input=wikipedia-clusters/tfidf-vectors/, --output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
    2012-03-05 16:52:45.870 java[4940:1903] Unable to load realm info from SCDynamicStore
    12/03/05 16:52:46 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
    12/03/05 16:52:46 INFO compress.CodecPool: Got brand-new compressor
    12/03/05 16:52:47 INFO vectors.RowIdJob: Wrote out matrix with 4838 rows and 87325 columns to wikipedia-matrix/matrix
    12/03/05 16:52:47 INFO driver.MahoutDriver: Program took 1758 ms (Minutes: 0.0293)

So a doc matrix with 4838 docs and 87325 dimensions. Next I ran RowSimilarityJob:

    Maclaurin:mahout-distribution-0.6 pferrel$ bin/mahout rowsimilarity -i wikipedia-matrix/matrix -o wikipedia-similarity -r 87325 --similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp

This gives me output in wikipedia-similarity/part-m-00000 but the size is 97 bytes? Shouldn't it have created 4838 * 10 results? Ten per row? I set no threshold so I'd expect it to pick the 10 nearest even if they are far away.

BTW what is the output format?
-
Re: How to find the k most similar docs
Suneel Marthi 2012-03-07, 02:25

Did the RowSimilarityJob execute successfully? Your output should have been one or more part-r-* files (depending on the number of reducers you have configured in your environment).

You should be able to get a sequence dump of the wikipedia-similarity/part-m-00000 file to see what it contains.

The output format of RowSimilarityJob is <IntWritable, VectorWritable>.
-
Re: How to find the k most similar docs
Sebastian Schelter 2012-03-07, 07:09

Hi Pat,

You are right, these results look strange. RowSimilarityJob has 3 custom counters (ROWS, COOCCURRENCES, PRUNED_COOCCURRENCES); can you give us the numbers for these?

--sebastian
Sebastian Schelter 2012-03-07, 07:09
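The key-type mismatch discussed above (<Text, VectorWritable> from seq2sparse versus the <IntWritable, VectorWritable> that RowSimilarityJob expects) can be pictured with a small sketch. This is not the RowIdJob source: plain Java collections stand in for the Hadoop types, the class and method names (RowIdSketch, assignRowIds) are hypothetical, and the assumption that ids are handed out sequentially in input order, with an index kept back to the original keys, is mine.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not Mahout code.
final class RowIdSketch {

    /** Holds the int-keyed matrix plus an index back to the original text keys. */
    static final class Result {
        final List<double[]> matrix = new ArrayList<>();  // row id = list position
        final List<String> docIndex = new ArrayList<>();  // row id -> original key
    }

    /** Replaces each text key with a sequential int row id (assumed: input order). */
    static Result assignRowIds(Map<String, double[]> namedVectors) {
        Result result = new Result();
        for (Map.Entry<String, double[]> entry : namedVectors.entrySet()) {
            result.docIndex.add(entry.getKey());
            result.matrix.add(entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, double[]> tfidf = new LinkedHashMap<>();
        tfidf.put("/doc/a", new double[] {0.5, 0.0, 1.2});
        tfidf.put("/doc/b", new double[] {0.0, 0.7, 0.0});

        Result result = assignRowIds(tfidf);
        for (int row = 0; row < result.matrix.size(); row++) {
            System.out.println(row + " <- " + result.docIndex.get(row));
        }
    }
}
```

The point of the sketch is only that the similarity step works on integer row ids, so the text keys have to be mapped to ints first and mapped back afterwards to name the similar documents.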
-
Re: How to find the k most similar docs
Pat Ferrel 2012-03-07, 16:38
I have been experimenting with different analyzers and n-grams to clean
up the vectors. Here is a run on a high-dimensionality set of vectors with a
loose analyzer (I think it was the default). The output of the rowid job was:

pat@occam2:~/mahout-distribution-0.6$ bin/mahout rowid -i wikipedia-tfidf-custom-analyzer/tfidf-vectors/ -o wikipedia-matrix --tempDir temp
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
HADOOP_CONF_DIR=/usr/local/hadoop/conf
MAHOUT-JOB: /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/03/06 16:53:29 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --input=wikipedia-tfidf-custom-analyzer/tfidf-vectors/, --output=wikipedia-matrix, --startPhase=0, --tempDir=temp}
12/03/06 16:53:30 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/03/06 16:53:30 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
12/03/06 16:53:30 INFO compress.CodecPool: Got brand-new compressor
12/03/06 16:53:30 INFO compress.CodecPool: Got brand-new compressor
12/03/06 16:53:30 INFO vectors.RowIdJob: Wrote out matrix with 4838 rows and 286907 columns to wikipedia-matrix/matrix
12/03/06 16:53:30 INFO driver.MahoutDriver: Program took 1248 ms (Minutes: 0.0208)

Then I removed temp (shouldn't the jobs do that?) and ran the rowsimilarity job:

pat@occam2:~/mahout-distribution-0.6$ bin/mahout rowsimilarity -i wikipedia-matrix/matrix -o wikipedia-similarity -r 286907 --similarityClassname SIMILARITY_COSINE -m 10 -ess true --tempDir temp
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
HADOOP_CONF_DIR=/usr/local/hadoop/conf
MAHOUT-JOB: /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/03/06 17:00:55 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --excludeSelfSimilarity=true, --input=wikipedia-matrix/matrix, --maxSimilaritiesPerRow=10, --numberOfColumns=286907, --output=wikipedia-similarity, --similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
12/03/06 17:00:56 INFO input.FileInputFormat: Total input paths to process : 1
12/03/06 17:00:56 INFO mapred.JobClient: Running job: job_201203061645_0006
12/03/06 17:00:57 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 17:01:13 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 17:01:25 INFO mapred.JobClient: map 100% reduce 100%
12/03/06 17:01:30 INFO mapred.JobClient: Job complete: job_201203061645_0006
12/03/06 17:01:30 INFO mapred.JobClient: Counters: 26
12/03/06 17:01:30 INFO mapred.JobClient: Job Counters
12/03/06 17:01:30 INFO mapred.JobClient: Launched reduce tasks=1
12/03/06 17:01:30 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=13502
12/03/06 17:01:30 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/06 17:01:30 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/06 17:01:30 INFO mapred.JobClient: Rack-local map tasks=1
12/03/06 17:01:30 INFO mapred.JobClient: Launched map tasks=1
12/03/06 17:01:30 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=10496
12/03/06 17:01:30 INFO mapred.JobClient: File Output Format Counters
12/03/06 17:01:30 INFO mapred.JobClient: Bytes Written=97
12/03/06 17:01:30 INFO mapred.JobClient: FileSystemCounters
12/03/06 17:01:30 INFO mapred.JobClient: FILE_BYTES_READ=40
12/03/06 17:01:30 INFO mapred.JobClient: HDFS_BYTES_READ=122407
12/03/06 17:01:30 INFO mapred.JobClient: FILE_BYTES_WRITTEN=45437
12/03/06 17:01:30 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=118
12/03/06 17:01:30 INFO mapred.JobClient: File Input Format Counters
12/03/06 17:01:30 INFO mapred.JobClient: Bytes Read=122290
12/03/06 17:01:30 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
12/03/06 17:01:30 INFO mapred.JobClient: ROWS=4838
12/03/06 17:01:30 INFO mapred.JobClient: Map-Reduce Framework
12/03/06 17:01:30 INFO mapred.JobClient: Reduce input groups=3
12/03/06 17:01:30 INFO mapred.JobClient: Map output materialized bytes=32
12/03/06 17:01:30 INFO mapred.JobClient: Combine output records=3
12/03/06 17:01:30 INFO mapred.JobClient: Map input records=4838
12/03/06 17:01:30 INFO mapred.JobClient: Reduce shuffle bytes=32
12/03/06 17:01:30 INFO mapred.JobClient: Reduce output records=0
12/03/06 17:01:30 INFO mapred.JobClient: Spilled Records=6
12/03/06 17:01:30 INFO mapred.JobClient: Map output bytes=33
12/03/06 17:01:30 INFO mapred.JobClient: Combine input records=3
12/03/06 17:01:30 INFO mapred.JobClient: Map output records=3
12/03/06 17:01:30 INFO mapred.JobClient: SPLIT_RAW_BYTES=117
12/03/06 17:01:30 INFO mapred.JobClient: Reduce input records=3
12/03/06 17:01:30 INFO input.FileInputFormat: Total input paths to process : 1
12/03/06 17:01:31 INFO mapred.JobClient: Running job: job_201203061645_0007
12/03/06 17:01:32 INFO mapred.JobClient: map 0% reduce 0%
12/03/06 17:01:49 INFO mapred.JobClient: map 100% reduce 0%
12/03/06 17:02:01 INFO mapred.JobClient: map 100% reduce 100%
12/03/06 17:02:06 INFO mapred.JobClient: Job complete: job_201203061645_0007
12/03/06 17:02:06 INFO mapred.JobClient: Counters: 25
12/03/06 17:02:06 INFO mapred.JobClient: Job Counters
12/03/06 17:02:06 INFO mapred.JobClient: Launched reduce tasks=1
12/03/06 17:02:06 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=12989
12/03/06 17:02:06 INFO mapred.JobClient: Total time spent by all
-
Re: How to find the k most similar docs
Sebastian Schelter 2012-03-07, 16:50
Hi Pat,
Something is going completely wrong. The first pass over the data of
RowSimilarityJob transposes the input matrix. From the output of the first
job, it seems as if your input is a 4838 x 3 matrix only:

Map input records=4838
Map output records=3
Combine input records=3
Combine output records=3
Reduce input records=3

Could you have a detailed look at the input to RowSimilarityJob?

--sebastian

On 07.03.2012 17:38, Pat Ferrel wrote:
> 12/03/06 17:02:42 INFO mapred.JobClient: Map input records=0
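A rough way to see why those counters point at a 4838 x 3 matrix: if the first pass transposes row vectors into column vectors, the number of distinct keys coming out is the number of columns that hold at least one non-zero value. The sketch below is an in-memory stand-in under that assumption, not the actual MapReduce code; TransposeSketch, sparseRow and transpose are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; not Mahout code.
final class TransposeSketch {

    /** Converts a dense row into a sparse map of column index -> non-zero value. */
    static Map<Integer, Double> sparseRow(double... dense) {
        Map<Integer, Double> row = new TreeMap<>();
        for (int col = 0; col < dense.length; col++) {
            if (dense[col] != 0.0) {
                row.put(col, dense[col]);
            }
        }
        return row;
    }

    /** Transposes sparse rows into sparse columns keyed by column index. */
    static Map<Integer, Map<Integer, Double>> transpose(List<Map<Integer, Double>> rows) {
        Map<Integer, Map<Integer, Double>> columns = new TreeMap<>();
        for (int rowId = 0; rowId < rows.size(); rowId++) {
            for (Map.Entry<Integer, Double> cell : rows.get(rowId).entrySet()) {
                columns.computeIfAbsent(cell.getKey(), c -> new TreeMap<>())
                       .put(rowId, cell.getValue());
            }
        }
        return columns;
    }

    public static void main(String[] args) {
        List<Map<Integer, Double>> rows = new ArrayList<>();
        rows.add(sparseRow(1.0, 0.0, 0.5, 0.0));
        rows.add(sparseRow(0.0, 0.0, 2.0, 0.0));
        rows.add(sparseRow(0.0, 0.3, 0.0, 0.0));

        Map<Integer, Map<Integer, Double>> columns = transpose(rows);
        System.out.println("rows in: " + rows.size()
            + ", non-empty columns out: " + columns.size());
    }
}
```

Read this way, 4838 input records collapsing to 3 output groups would mean that almost no term columns survived into the matrix, which is why the input itself is the first thing to inspect.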
-
Re: How to find the k most similar docs
Pat Ferrel 2012-03-09, 00:14
OK, back to the beginning. I went through the entire sequence again with
the notable exception that I did not create named vectors. I also tweaked
some of the seq2sparse parameters.

bin/mahout seq2sparse -i wp-seqfiles -o wp-vectors -ow -a org.apache.lucene.analysis.WhitespaceAnalyzer -chunk 100 -wt tfidf -s 5 -md 3 -x 90 -ng 2 -ml 50 -seq -n 2

After doing a rowid on the tfidf vectors I still get an error doing
matrixdump on wp-matrix/matrix. Am I using it wrong? Taking on faith that a
matrix was created I perform the rowsimilarity job and now get a far bigger
file created that looks OK:

bin/mahout rowsimilarity -r 311433 -i wp-matrix/matrix -o wp-similarity -ess -s SIMILARITY_COSINE -m 10
MAHOUT_LOCAL is not set; adding HADOOP_CONF_DIR to classpath.
Running on hadoop, using HADOOP_HOME=/usr/local/hadoop
HADOOP_CONF_DIR=/usr/local/hadoop/conf
MAHOUT-JOB: /home/pat/mahout-distribution-0.6/mahout-examples-0.6-job.jar
12/03/08 15:48:35 INFO common.AbstractJob: Command line arguments: {--endPhase=2147483647, --excludeSelfSimilarity=false, --input=wp-matrix/matrix, --maxSimilaritiesPerRow=10, --numberOfColumns=311433, --output=wp-similarity, --similarityClassname=SIMILARITY_COSINE, --startPhase=0, --tempDir=temp}
12/03/08 15:48:36 INFO input.FileInputFormat: Total input paths to process : 1
12/03/08 15:48:36 INFO mapred.JobClient: Running job: job_201203071745_0040
12/03/08 15:48:37 INFO mapred.JobClient: map 0% reduce 0%
12/03/08 15:48:58 INFO mapred.JobClient: map 17% reduce 0%
12/03/08 15:49:01 INFO mapred.JobClient: map 27% reduce 0%
12/03/08 15:49:04 INFO mapred.JobClient: map 40% reduce 0%
12/03/08 15:49:07 INFO mapred.JobClient: map 47% reduce 0%
12/03/08 15:49:10 INFO mapred.JobClient: map 60% reduce 0%
12/03/08 15:49:13 INFO mapred.JobClient: map 70% reduce 0%
12/03/08 15:49:16 INFO mapred.JobClient: map 80% reduce 0%
12/03/08 15:49:19 INFO mapred.JobClient: map 92% reduce 0%
12/03/08 15:49:22 INFO mapred.JobClient: map 100% reduce 0%
12/03/08 15:49:46 INFO mapred.JobClient: map 100% reduce 33%
12/03/08 15:49:52 INFO mapred.JobClient: map 100% reduce 100%
12/03/08 15:49:57 INFO mapred.JobClient: Job complete: job_201203071745_0040
12/03/08 15:49:57 INFO mapred.JobClient: Counters: 26
12/03/08 15:49:57 INFO mapred.JobClient: Job Counters
12/03/08 15:49:57 INFO mapred.JobClient: Launched reduce tasks=1
12/03/08 15:49:57 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=55564
12/03/08 15:49:57 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/03/08 15:49:57 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/03/08 15:49:57 INFO mapred.JobClient: Rack-local map tasks=1
12/03/08 15:49:57 INFO mapred.JobClient: Launched map tasks=1
12/03/08 15:49:57 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=13565
12/03/08 15:49:57 INFO mapred.JobClient: File Output Format Counters
12/03/08 15:49:57 INFO mapred.JobClient: Bytes Written=45587186
12/03/08 15:49:57 INFO mapred.JobClient: FileSystemCounters
12/03/08 15:49:57 INFO mapred.JobClient: FILE_BYTES_READ=99732287
12/03/08 15:49:57 INFO mapred.JobClient: HDFS_BYTES_READ=17156393
12/03/08 15:49:57 INFO mapred.JobClient: FILE_BYTES_WRITTEN=138104586
12/03/08 15:49:57 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=45587207
12/03/08 15:49:57 INFO mapred.JobClient: File Input Format Counters
12/03/08 15:49:57 INFO mapred.JobClient: Bytes Read=17156283
12/03/08 15:49:57 INFO mapred.JobClient: org.apache.mahout.math.hadoop.similarity.cooccurrence.RowSimilarityJob$Counters
12/03/08 15:49:57 INFO mapred.JobClient: ROWS=4838
12/03/08 15:49:57 INFO mapred.JobClient: Map-Reduce Framework
12/03/08 15:49:57 INFO mapred.JobClient: Reduce input groups=294936
12/03/08 15:49:57 INFO mapred.JobClient: Map output materialized bytes=38326948
12/03/08 15:49:57 INFO mapred.JobClient: Combine output records=2242965
12/03/08 15:49:57 INFO mapred.JobClient: Map input records=4838
12/03/08 15:49:57 INFO mapred.JobClient: Reduce shuffle bytes=38326948
12/03/08 15:49:57 INFO mapred.JobClient: Reduce output records=294933
12/03/08 15:49:57 INFO mapred.JobClient: Spilled Records=3432447
12/03/08 15:49:57 INFO mapred.JobClient: Map output bytes=83168813
12/03/08 15:49:57 INFO mapred.JobClient: Combine input records=5912090
12/03/08 15:49:57 INFO mapred.JobClient: Map output records=3964061
12/03/08 15:49:57 INFO mapred.JobClient: SPLIT_RAW_BYTES=110
12/03/08 15:49:57 INFO mapred.JobClient: Reduce input records=294936
12/03/08 15:49:58 INFO input.FileInputFormat: Total input paths to process : 1
12/03/08 15:49:58 INFO mapred.JobClient: Running job: job_201203071745_0041
12/03/08 15:49:59 INFO mapred.JobClient: map 0% reduce 0%
12/03/08 15:50:19 INFO mapred.JobClient: map 8% reduce 0%
12/03/08 15:50:22 INFO mapred.JobClient: map 12% reduce 0%
12/03/08 15:50:25 INFO mapred.JobClient: map 15% reduce 0%
12/03/08 15:50:28 INFO mapred.JobClient: map 21% reduce 0%
12/03/08 15:50:31 INFO mapred.JobClient: map 23% reduce 0%
12/03/08 15:50:34 INFO mapred.JobClient: map 28% reduce 0%
12/03/08 15:50:37 INFO mapred.JobClient: map 32% reduce 0%
12/03/08 15:50:40 INFO mapred.JobClient: map 33% reduce 0%
12/03/08 15:50:43 INFO mapred.JobClient: map 35% reduce 0%
12/03/08 15:50:46 INFO mapred.JobClient: map 40% reduce 0%
12/03/08 15:50:49 INFO mapred.JobClient: map 42% reduce 0%
12/03/08 15:50:52 INFO mapred.JobClient: map 47% reduce 0%
12/03/08 15:50:55 INFO mapred.JobClient: map 48% reduce 0%
12/03/08 15:50:58 INF
-
Re: How to find the k most similar docs
Suneel Marthi 2012-03-09, 12:26
Pat,
MatrixDump expects an input file of <Text, MatrixWritable>. The matrix that gets created from RowIdJob is <IntWritable, VectorWritable> and you cannot run MatrixDump to see the contents of the matrix. You need to use seqdumper as you had done.
-
Re: How to find the k most similar docs
Pat Ferrel 2012-03-09, 17:50
I assume that the other matrix operations will consume and produce
<Text, MatrixWritable>? If so, how do you create <Text, MatrixWritable> from
the output of rowid, which is <IntWritable, VectorWritable>?

Also, while we are at it, how do you use vectordump? If you do "bin/mahout
vectordump --help" you get some crazy output that is unreadable. I would
have guessed that vectordump would work on either <IntWritable,
VectorWritable> (the output of rowid) or <Text, VectorWritable> (the
contents of tfidf-vectors/part-r-00000), but it doesn't seem to work on
either using "bin/mahout vectordump -s path-to-file"

Thanks
Pat
-
Re: How to find the k most similar docs
Lance Norskog 2012-03-10, 01:57
No, the matrix multiplication operations all (probably) take
<int,vector> where int is the row number. There has to be a universally
unique row number. If there is no row number associated with a row in a
distributed matrix op, how can the reducers know which rows they have? Rows
do not necessarily have to be in order; some sequential programs might
depend on row order (but they should not).

Lance Norskog [EMAIL PROTECTED]
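The point about row numbers can be illustrated with a small sketch: when every row carries its own id, a matrix-times-vector operation can consume the rows in any order and still place each result correctly. This is a single-process toy, not Mahout's distributed implementation, and the names RowKeyedMultiply, Row and times are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not Mahout code.
final class RowKeyedMultiply {

    /** A row vector tagged with its unique row number, like an <int, vector> pair. */
    static final class Row {
        final int id;
        final double[] values;

        Row(int id, double[] values) {
            this.id = id;
            this.values = values;
        }
    }

    /** Computes y = A * x, writing each dot product to the slot named by the row id. */
    static double[] times(List<Row> rows, double[] x, int numRows) {
        double[] y = new double[numRows];
        for (Row row : rows) {
            double dot = 0.0;
            for (int j = 0; j < x.length; j++) {
                dot += row.values[j] * x[j];
            }
            y[row.id] = dot;
        }
        return y;
    }

    public static void main(String[] args) {
        List<Row> rows = new ArrayList<>();
        rows.add(new Row(0, new double[] {1.0, 2.0}));
        rows.add(new Row(1, new double[] {3.0, 4.0}));
        rows.add(new Row(2, new double[] {5.0, 6.0}));
        double[] x = {1.0, 1.0};

        double[] inOrder = times(rows, x, 3);
        Collections.shuffle(rows, new Random(42));  // simulate arbitrary arrival order
        double[] shuffled = times(rows, x, 3);

        System.out.println(Arrays.equals(inOrder, shuffled));
    }
}
```

Without the id field there would be no way to tell which output slot a row belongs to once the rows have been split across workers.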
-
Re: How to find the k most similar Zoo. sune <Alex Merritt 2012-02-19, 15:25
!vmmmakemoney sune
-
How to find the k most similar docs
Pat Ferrel 2012-02-18, 19:39
Given documents that are vectorized into Mahout vectors, have stop words
removed, and a TFIDF dictionary, what is the best distributed way to get k
nearest documents using a measure like cosine similarity (or the others
provided in Mahout)? I will be doing this for every document in the corpus
so the question is partly how best to do this given the existing
mahout+hadoop framework. What is the intuition about processing resources
needed?

Expansion: At some point I'd like to extend this idea to find similar
clusters but expect that the same method should work only with centroids
instead of doc vectors. Also I expect to do canopy clustering to feed into
kmeans clustering. I'll perform the similarity measure only on docs in the
same cluster. I think I understand how to do this preprocessing so the
question is primarily the k most similar docs and/or centroids. This sounds
like k nearest neighbors, if so is this the best way to do it in
mahout+hadoop?
-
Re: How to find the k most similar docs
Suneel Marthi 2012-02-18, 21:27
You might want to look at the RowSimilarityJob in Mahout to determine document similarity.
Here's what you would do:

Assuming that your documents have already been vectorized, first convert
the vectors into an M*N matrix by calling the RowIdJob in Mahout, where M =
No. of rows (or documents in your case) and N = No. of columns (or the
unique terms).

Then run the RowSimilarity job on the matrix generated in the previous
step, specifying a cosine similarity measure. This should generate an
output that gives the most similar documents for each of the documents and
the similarity distance between them. RowSimilarityJob is a mapreduce job
so you should be able to run this on a really large corpus (I had run this
on 10 million web pages).

The output of the RowSimilarity along with the similarity distances that
are generated between document pairs should give an idea as to what the
values of T1 and T2 should be when running canopy clustering. And the
number of clusters generated by running canopy would eventually be fed into
k-means as you had mentioned.
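For intuition about what the similarity step is asked to produce, here is a brute-force, in-memory sketch of "top k most similar rows by cosine similarity, excluding the row itself". It is not how RowSimilarityJob is implemented (the thread describes that as a distributed job working on the transposed matrix); TopKCosineSketch, Neighbor, cosine and topK are hypothetical names, and the bounded min-heap is just one way to keep the best k instead of the first k.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; not Mahout code.
final class TopKCosineSketch {

    /** A candidate neighbor: the other row's id and its similarity to the query row. */
    static final class Neighbor {
        final int row;
        final double similarity;

        Neighbor(int row, double similarity) {
            this.row = row;
            this.similarity = similarity;
        }
    }

    /** Cosine similarity of two equal-length vectors; 0 if either has zero norm. */
    static double cosine(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0;
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Returns the k rows most similar to the query row, best first, excluding itself. */
    static List<Neighbor> topK(double[][] rows, int query, int k) {
        // Min-heap on similarity: the weakest kept neighbor is evicted first.
        PriorityQueue<Neighbor> heap =
            new PriorityQueue<>((x, y) -> Double.compare(x.similarity, y.similarity));
        for (int other = 0; other < rows.length; other++) {
            if (other == query) {
                continue;  // skip self-similarity
            }
            heap.add(new Neighbor(other, cosine(rows[query], rows[other])));
            if (heap.size() > k) {
                heap.poll();
            }
        }
        List<Neighbor> best = new ArrayList<>(heap);
        best.sort((x, y) -> Double.compare(y.similarity, x.similarity));
        return best;
    }

    public static void main(String[] args) {
        double[][] docs = {
            {1.0, 0.0},
            {1.0, 0.1},
            {0.0, 1.0},
            {-1.0, 0.0},
        };
        for (Neighbor n : topK(docs, 0, 2)) {
            System.out.println("row " + n.row + " similarity " + n.similarity);
        }
    }
}
```

This brute-force version compares every pair of rows, so it only serves to show the expected shape of the result: at most k neighbors per row, ordered by similarity.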
-
Re: How to find the k most similar docs
Pat Ferrel 2012-02-19, 21:11
This looks perfect, thanks.
I had planned to run RowSimilarityJob after clustering, to reduce the rows from the entire corpus to only those in a cluster. You mention using the distance between similar rows to get an idea of the distances for canopy clustering. This seems a very good idea since I have no other good way to generate T1 and T2. The downside is that I have to run RowSimilarityJob on all docs in the corpus. Since you have done this on 10 million docs, I assume the benefit of getting good canopies outweighs the processing cost of computing similarity on all docs?

I am new to reading mapreduce code so may I ask some noob questions:

* Is the best documentation here?
  https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/math/hadoop/similarity/RowSimilarityJob.html#run(java.lang.String[])
* The command line arguments include numberOfColumns. Shouldn't that be easily extracted from the input matrix? Is it optional? How do I tell from the docs which arguments are optional?
* The argument maxSimilaritiesPerRow could return the first or the best similarities; it is difficult to see which.

I have the source, but perhaps due to the string-based binding I am finding it hard to track down what code is run, so any tips for reading the code or docs are greatly appreciated.

On 2/18/12 1:27 PM, Suneel Marthi wrote:
> You might want to look at the RowSimilarityJob in Mahout to determine document similarity.
>
> Here's what you would do:
>
> Assuming that your documents have already been vectorized, first convert the vectors into an M*N matrix by calling the RowIdJob in Mahout, where M = number of rows (documents in your case) and N = number of columns (the unique terms).
>
> Then run the RowSimilarityJob on the matrix generated in the previous step, specifying a cosine similarity measure. This should generate an output that gives the most similar documents for each document and the similarity distance between them. RowSimilarityJob is a mapreduce job, so you should be able to run this on a really large corpus (I had run this on 10 million web pages).
>
> The output of RowSimilarityJob, along with the similarity distances generated between document pairs, should give an idea as to what the values of T1 and T2 should be when running canopy clustering. And the number of clusters generated by running canopy would eventually be fed into k-means as you had mentioned.
>
> ________________________________
> From: Pat Ferrel <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Saturday, February 18, 2012 2:39 PM
> Subject: How to find the k most similar docs
>
> Given documents that are vectorized into Mahout vectors, have stop words removed, and a TFIDF dictionary, what is the best distributed way to get k nearest documents using a measure like cosine similarity (or the others provided in Mahout)? I will be doing this for every document in the corpus, so the question is partly how best to do this given the existing mahout+hadoop framework. What is the intuition about processing resources needed?
>
> Expansion: At some point I'd like to extend this idea to find similar clusters, but expect that the same method should work, only with centroids instead of doc vectors. Also I expect to do canopy clustering to feed into kmeans clustering. I'll perform the similarity measure only on docs in the same cluster. I think I understand how to do this preprocessing, so the question is primarily the k most similar docs and/or centroids. This sounds like k nearest neighbors; if so, is this the best way to do it in mahout+hadoop?
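For readers following the thread, the question of "k most similar docs under cosine similarity" can be sketched on a single machine as below. This is only an in-memory illustration of what is being computed, not how RowSimilarityJob is implemented. The class `CosineTopK`, its methods, and the sparse `Map<Integer, Double>` row representation are hypothetical names chosen for the example; none of them come from Mahout.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

// Hypothetical helper, NOT part of Mahout: brute-force k most similar rows.
class CosineTopK {

    // Cosine similarity of two sparse vectors stored as termIndex -> weight.
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double dot = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w;
            }
        }
        double na = norm(a);
        double nb = norm(b);
        // An empty vector has no direction; treat its similarity as 0.
        return (na == 0.0 || nb == 0.0) ? 0.0 : dot / (na * nb);
    }

    // Euclidean length of a sparse vector.
    static double norm(Map<Integer, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }

    // Indices of the k rows most similar to row `query`, most similar first.
    static List<Integer> kMostSimilar(List<Map<Integer, Double>> rows, int query, int k) {
        double[] sims = new double[rows.size()];
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < rows.size(); i++) {
            if (i == query) {
                continue; // a row is not its own neighbour
            }
            sims[i] = cosine(rows.get(query), rows.get(i));
            candidates.add(i);
        }
        // Sort by descending similarity.
        candidates.sort(Comparator.comparingDouble((Integer i) -> -sims[i]));
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    public static void main(String[] args) {
        List<Map<Integer, Double>> rows = new ArrayList<>();
        rows.add(Map.of(0, 1.0, 1, 1.0)); // doc 0
        rows.add(Map.of(0, 1.0, 1, 0.9)); // doc 1: nearly parallel to doc 0
        rows.add(Map.of(2, 1.0));         // doc 2: shares no terms with doc 0
        rows.add(Map.of(0, 1.0));         // doc 3: shares one term with doc 0
        System.out.println(kMostSimilar(rows, 0, 2)); // prints [1, 3]
    }
}
```

The all-pairs loop is quadratic in the number of documents, which is why a distributed job is suggested in the thread for a corpus of millions of pages.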
-
Re: How to find the k most similar docs
Sebastian Schelter 2012-02-19, 21:33
Hi Pat,
'numberOfColumns' is not optional but is only used by a few similarityMeasures (such as loglikelihood ratio). 'maxSimilaritiesPerRow' retains the top similarities.

--sebastian
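To make the "top, not first" behaviour concrete, here is a minimal sketch of retaining the N best similarities for one row with a bounded min-heap. It only illustrates the semantics described above; it is not taken from the RowSimilarityJob source, and the class `TopSimilarities`, its `Entry` type, and the method `retainTop` are hypothetical names invented for this example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration, NOT Mahout code: keep the N best similarities of one row.
class TopSimilarities {

    // One (column, similarity) cell of a similarity-matrix row.
    static final class Entry {
        final int column;
        final double similarity;

        Entry(int column, double similarity) {
            this.column = column;
            this.similarity = similarity;
        }
    }

    // Returns at most maxPerRow entries with the highest similarity, best first,
    // regardless of the order in which the entries arrive.
    static List<Entry> retainTop(List<Entry> row, int maxPerRow) {
        if (maxPerRow <= 0) {
            return new ArrayList<>();
        }
        // Min-heap: the weakest retained entry sits at the head and is evicted first.
        PriorityQueue<Entry> heap =
                new PriorityQueue<>(Comparator.comparingDouble((Entry e) -> e.similarity));
        for (Entry e : row) {
            if (heap.size() < maxPerRow) {
                heap.add(e);
            } else if (e.similarity > heap.peek().similarity) {
                heap.poll();
                heap.add(e);
            }
        }
        List<Entry> result = new ArrayList<>(heap);
        result.sort(Comparator.comparingDouble((Entry e) -> -e.similarity));
        return result;
    }

    public static void main(String[] args) {
        List<Entry> row = List.of(
                new Entry(1, 0.2), new Entry(2, 0.9), new Entry(3, 0.5), new Entry(4, 0.7));
        // Keeping the "first" two would give columns 1 and 2; keeping the top two gives 2 and 4.
        for (Entry e : retainTop(row, 2)) {
            System.out.println(e.column + " " + e.similarity);
        }
    }
}
```

A bounded heap keeps memory per row proportional to the retained count rather than to the row length, which matters when each document is compared against a very large corpus.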