Mahout, mail # user - How to find the k most similar docs


Re: How to find the k most similar docs
Suneel Marthi 2012-02-20, 05:00
Hi Pat,
1. Please look at the discussion thread at http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/browser for a description of what the RowSimilarityJob does. The RowSimilarityJob implementation is based on this research paper: http://www.csee.ogi.edu/~zak/cs506-pslc/docsim.pdf

I'll add the details on the mahout wiki page sometime this week.

2. 'maxSimilaritiesPerRow' keeps the best similarities per row (not the first ones encountered). If it is not specified, it defaults to the top 100.
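
To illustrate what "best, not first" means, here is a minimal sketch of keeping the k highest similarities of a single row with a min-heap. The class TopKSimilarities and its Entry type are hypothetical names made up for this example; this is not the code RowSimilarityJob actually runs.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical helper, not part of Mahout: keeps the k highest similarities of one row. */
final class TopKSimilarities {

  /** A (document index, similarity) pair for one row of the similarity matrix. */
  static final class Entry {
    final int index;
    final double similarity;

    Entry(int index, double similarity) {
      this.index = index;
      this.similarity = similarity;
    }
  }

  /** Returns the k entries with the highest similarity, best first. */
  static List<Entry> topK(List<Entry> row, int k) {
    if (k <= 0) {
      return new ArrayList<>();
    }
    // Min-heap on similarity: the weakest retained entry sits at the head.
    PriorityQueue<Entry> heap =
        new PriorityQueue<>(k, Comparator.comparingDouble((Entry e) -> e.similarity));
    for (Entry e : row) {
      if (heap.size() < k) {
        heap.add(e);
      } else if (e.similarity > heap.peek().similarity) {
        // The new entry beats the weakest retained one, so swap them.
        heap.poll();
        heap.add(e);
      }
    }
    List<Entry> result = new ArrayList<>(heap);
    result.sort(Comparator.comparingDouble((Entry e) -> e.similarity).reversed());
    return result;
  }
}
```

The order in which entries arrive does not matter here: an early entry with a low similarity is evicted as soon as k better ones have been seen.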

3. If you would like to discard the similarities in a row that fall below a certain value, you can specify a threshold with -tr, which limits the results to only those documents that have a similarity value greater than the threshold.

   The similarity values that you get as the final output should give you an idea of what T1 and T2 should be. In my particular use case I was only interested in documents that had a similarity measure of 0.7 or greater, hence 0.7 was my T2; the most similar documents had a similarity value of 0.99999, which was what I used as my T1.
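
As a rough illustration of the threshold idea, the sketch below drops every similarity under a cutoff from one row of results. ThresholdFilter is a hypothetical helper written for this example and is not Mahout code. Treating the cutoff as inclusive (0.7 itself is kept) follows my "0.7 or greater" use case and is an assumption about the boundary, not a statement about how -tr behaves.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Mahout: drops similarities below a threshold. */
final class ThresholdFilter {

  /**
   * Keeps only the documents whose similarity is at least the threshold.
   * The inclusive boundary is an assumption of this sketch.
   */
  static Map<String, Double> atLeast(Map<String, Double> similarities, double threshold) {
    // LinkedHashMap preserves the input order of the surviving documents.
    Map<String, Double> kept = new LinkedHashMap<>();
    for (Map.Entry<String, Double> e : similarities.entrySet()) {
      if (e.getValue() >= threshold) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }
}
```

With a threshold of 0.7, a row holding similarities 0.99999, 0.7 and 0.42 would keep the first two documents and drop the third.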

4. 'numberOfColumns' is not optional, but I agree with you that it should be inferred automatically from the size of the input vectors when it is not specified. This could be an enhancement to add to the RowSimilarityJob.

   The code snippet below reads the number of columns from the input matrix when the user does not specify it.

   // fs and conf are the job's FileSystem and Configuration.
   Path inputMatrixPath = new Path(getInputPath());
   SequenceFile.Reader sequenceFileReader = new SequenceFile.Reader(fs, inputMatrixPath, conf);
   int numberOfColumns;
   try {
     numberOfColumns = getDimensions(sequenceFileReader);
   } finally {
     // Close the reader even if reading the first row fails.
     sequenceFileReader.close();
   }

   // Reads the first row of the matrix and returns its size (the number of columns).
   private int getDimensions(SequenceFile.Reader reader)
       throws IOException, InstantiationException, IllegalAccessException {
     if (!reader.getValueClass().equals(VectorWritable.class)) {
       throw new IllegalArgumentException("Value type of sequencefile must be a VectorWritable");
     }
     Class<?> keyClass = reader.getKeyClass();
     Writable row = (Writable) keyClass.newInstance();
     VectorWritable vw = new VectorWritable();
     if (!reader.next(row, vw)) {
       log.error("matrix must have at least one row");
       throw new IllegalStateException("matrix must have at least one row");
     }
     Vector v = vw.get();
     return v.size();
   }

5. RowSimilarityJob also has an 'excludeSelfSimilarity' option, which is false by default. You need to set it so that you don't end up comparing a document with itself and getting a similarity measure of 1.0 (if using the Cosine measure).
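
To see why self-similarity gets in the way, here is a small sketch using plain arrays instead of Mahout vectors. SelfSimilarityDemo and its methods are hypothetical names for this example, and the boolean parameter only mimics the idea behind the option; it is not how the job is implemented. Rows are assumed to be non-zero vectors.

```java
/** Hypothetical helper, not part of Mahout: nearest row by cosine similarity. */
final class SelfSimilarityDemo {

  /** Cosine similarity of two non-zero vectors of equal length. */
  static double cosine(double[] a, double[] b) {
    double dot = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
      normA += a[i] * a[i];
      normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  /** Returns the index of the row most similar to row i, or -1 if there is no candidate. */
  static int mostSimilar(double[][] rows, int i, boolean excludeSelfSimilarity) {
    int best = -1;
    double bestSimilarity = Double.NEGATIVE_INFINITY;
    for (int j = 0; j < rows.length; j++) {
      if (excludeSelfSimilarity && j == i) {
        continue; // skip the comparison of a document with itself
      }
      double similarity = cosine(rows[i], rows[j]);
      if (similarity > bestSimilarity) {
        bestSimilarity = similarity;
        best = j;
      }
    }
    return best;
  }
}
```

Without the exclusion, the most similar row to any document is the document itself, which takes up one of the slots you would rather give to a real neighbour.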

Let me know if you have any more questions.
    
________________________________
 From: Sebastian Schelter <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Sunday, February 19, 2012 4:33 PM
Subject: Re: How to find the k most similar docs
 
Hi Pat,

'numberOfColumns' is not optional but is only used by a few
similarityMeasures (such as loglikelihood ratio).
'maxSimilaritiesPerRow' retains the top similarities.

--sebastian
On 19.02.2012 22:11, Pat Ferrel wrote:
> This looks perfect, thanks.
>
> I had planned to do the RowSimilarityJob after clustering to reduce the
> rows from the entire corpus to only those in a cluster. You mention
> using the distance between similar rows to get an idea of the distances
> for canopy clustering. This seems a very good idea since I have no other
> good way to generate T1 and T2. The downside is that I have to do
> RowSimilarityJob on all docs in the corpus. I assume that since you have
> done this on 10 Million docs that the benefit in getting good canopies
> outweighs doing similarity on all docs as far as processing resources
> needed?
>
> I am new to reading mapreduce code so may I ask some noob questions:
>
>  * is the best documentation here?
>    https://builds.apache.org/job/Mahout-Quality/javadoc/org/apache/mahout/math/hadoop/similarity/RowSimilarityJob.html#run(java.lang.String[])
>
>  * the command line arguments include: numberOfColumns, shouldn't that
>    be easily extracted from the input matrix? is this optional? How do