Re: generate similar documents
Sebastian Schelter 2010-10-28, 10:10
You have to supply that number; however, if you don't use it in
the similarity computation (only SIMILARITY_LOGLIKELIHOOD uses it), you
can safely ignore it and pass in any number.

--sebastian

On 28.10.2010 12:02, Divya wrote:
> Hi Sebastian,
> From where can I get the numberOfColumns?
> How can I calculate how many columns my matrix has, as
> SparseVectorsFromSequenceFiles generates vectors in binary format?
>
> Regards,
> Divya
>
> -----Original Message-----
> From: Sebastian Schelter [mailto:[EMAIL PROTECTED]]
> Sent: Thursday, October 28, 2010 4:28 PM
> To: [EMAIL PROTECTED]
> Subject: Re: generate similar documents
>
> Hi Divya,
>
> --similarityClassname should point to an implementation of
> org.apache.mahout.math.hadoop.similarity.vector.DistributedVectorSimilarity,
> you can use any value from
> org.apache.mahout.math.hadoop.similarity.SimilarityType to use a
> predefined similarity measure or you can point to an implementation of
> your own
>
> --numberOfColumns is the number of columns of the input matrix, which
> would be the number of unique terms as I suppose your matrix is
> documents x terms
>
> --sebastian
>
> On 28.10.2010 10:11, Divya wrote:
>> Hi,
>>
>> I have a directory of documents from which I have generated a Sequence file
>> using SequenceFilesFromDirectory and then converted it into vectors using
>> SparseVectorsFromSequenceFiles
>>
>> Now referring to the link below to generate a list of most similar documents
>>
>> http://mail-archives.apache.org/mod_mbox/mahout-user/201007.mbox/%3C4C2E3EED
>> [EMAIL PROTECTED]%3E
>>
>> How can I use RowSimilarityJob to generate a list of similar documents?
>>
>> <ol>
>> *<li>-Dmapred.input.dir=(path): Directory containing a {@link DistributedRowMatrix} as a
>> * SequenceFile<IntWritable,VectorWritable></li>
>> *<li>-Dmapred.output.dir=(path): output path where the computations output
>> should go (a {@link DistributedRowMatrix}
>> * stored as a SequenceFile<IntWritable,VectorWritable>)</li>
>> *<li>--numberOfColumns: the number of columns in the input matrix</li>
>> *<li>--similarityClassname (classname): an implementation of {@link
>> DistributedVectorSimilarity} used to compute the
>> * similarity</li>
>> *<li>--maxSimilaritiesPerRow (integer): cap the number of similar rows per
>> row to this number (100)</li>
>> *</ol>
>> *
>>
>> Which arguments should I pass for numberOfColumns and similarityClassname?
>>
>> Regards,
>> Divya
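As a rough illustration of the --numberOfColumns answer in this thread: for a documents x terms matrix, the column count is the number of distinct terms across all documents. The sketch below uses plain Java collections only and does not read Mahout's binary vectors. The class and method names (ColumnCount, buildDictionary, numberOfColumns) are made up for this example and are not part of Mahout.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration, not Mahout code: models a documents x terms
// matrix as tokenized documents and counts its columns.
final class ColumnCount {
    private ColumnCount() {}

    // Assigns each distinct term the next free column index,
    // in order of first appearance.
    static Map<String, Integer> buildDictionary(List<List<String>> documents) {
        Map<String, Integer> dictionary = new LinkedHashMap<>();
        for (List<String> document : documents) {
            for (String term : document) {
                dictionary.putIfAbsent(term, dictionary.size());
            }
        }
        return dictionary;
    }

    // Number of columns = number of unique terms in the corpus.
    static int numberOfColumns(List<List<String>> documents) {
        return buildDictionary(documents).size();
    }

    public static void main(String[] args) {
        List<List<String>> documents = Arrays.asList(
                Arrays.asList("apache", "mahout", "similarity"),
                Arrays.asList("mahout", "vectors"),
                Arrays.asList("similarity", "vectors", "hadoop"));
        // Five distinct terms, so this prints: --numberOfColumns 5
        System.out.println("--numberOfColumns " + numberOfColumns(documents));
    }
}
```

With real data the count would come from whatever dictionary the vectorization step produced, not from re-tokenizing the documents. The point here is only that the value passed to the job is the size of the term vocabulary.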