Mahout, mail # user - How to find the k most similar docs


Re: How to find the k most similar docs
Suneel Marthi 2012-02-20, 20:28
Pat,

You are welcome.

FYI...

Another option you could consider for determining document similarity would be 'MinHash clustering'.
Mahout ships with a MinHash clustering implementation, but I never got good results from it, and I never got it to run successfully on a really large corpus (around a million documents).
Look at the thread at http://www.searchworkings.org/forum/-/message_boards/view_message/359922.

Here is a reference to Andrei Broder's paper for detecting duplicates in documents - http://dl.acm.org/citation.cfm?id=736184
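
For readers unfamiliar with the idea, here is a minimal sketch of MinHash similarity estimation in plain Java. It is not Mahout's implementation and not taken from Broder's paper; the class name MinHashSketch, the word-shingle size, and the hash family (a*x + b mod a prime) are all assumptions chosen for illustration.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Random;
import java.util.Set;

/** Illustrative MinHash sketch (hypothetical class, not part of Mahout). */
public class MinHashSketch {
    // Mersenne prime 2^31 - 1; small enough that a*x + b fits in a long.
    private static final long PRIME = 2147483647L;
    private final long[] a;
    private final long[] b;

    public MinHashSketch(int numHashes, long seed) {
        Random rnd = new Random(seed);
        a = new long[numHashes];
        b = new long[numHashes];
        for (int i = 0; i < numHashes; i++) {
            a[i] = 1 + rnd.nextInt(Integer.MAX_VALUE - 1); // 1 .. PRIME-1
            b[i] = rnd.nextInt(Integer.MAX_VALUE);         // 0 .. PRIME-1
        }
    }

    /** Splits text into lower-cased word shingles of k consecutive tokens. */
    public static Set<String> shingles(String text, int k) {
        String[] tokens = text.toLowerCase().trim().split("\\s+");
        Set<String> out = new HashSet<>();
        for (int i = 0; i + k <= tokens.length; i++) {
            out.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + k)));
        }
        return out;
    }

    /** For each hash function, keeps the minimum hash value over all shingles. */
    public long[] signature(Set<String> shingles) {
        long[] sig = new long[a.length];
        Arrays.fill(sig, Long.MAX_VALUE); // an empty set keeps this sentinel
        for (String s : shingles) {
            long x = Math.floorMod((long) s.hashCode(), PRIME);
            for (int i = 0; i < sig.length; i++) {
                long h = (a[i] * x + b[i]) % PRIME;
                if (h < sig[i]) {
                    sig[i] = h;
                }
            }
        }
        return sig;
    }

    /** Fraction of matching signature positions; estimates Jaccard similarity. */
    public static double estimate(long[] s1, long[] s2) {
        int same = 0;
        for (int i = 0; i < s1.length; i++) {
            if (s1[i] == s2[i]) {
                same++;
            }
        }
        return (double) same / s1.length;
    }
}
```

Both signatures must come from the same MinHashSketch instance (same seed and number of hashes), otherwise the positions are not comparable. More hash functions give a tighter estimate at the cost of longer signatures.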

Given a choice between RowSimilarityJob and MinHash clustering, I would prefer the latter, but I chose the former because I had no success with Mahout's MinHash implementation.
Suneel

________________________________
 From: Pat Ferrel <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Monday, February 20, 2012 2:10 PM
Subject: Re: How to find the k most similar docs
 
Suneel, this is extremely helpful. I hope it gets to the Mahout wiki.

Some thoughts:

  * a threshold for self-similarity seems useful. I'm thinking of
    mirrored news groups, bulletin boards, and social network posts
    where the docs may be very very close but have some surrounding text
    that doesn't quite match so similarity 1.0 might not work. This is
    not an academic question since these are some of the docs we plan to
    examine. It should be pretty easy to do this in a post processing
    step for now.
  * I see how you use RowSimilarityJob to guess at good T1 and T2. In my
    case I am also concerned with the cohesion of the resulting
    clusters. The outliers will likely never be seen by humans. The
    intuition here is that well-formed clusters even if diffuse will
    give better results for us than a greater number of poorly-formed
    clusters. One way we have considered getting this result is to form
    lots of clusters, perhaps as you describe using T1 and T2 derived
    from RowSimilarityJob then throw out ones that do not match some
    measurement (Dunning mentions entropy). This would allow overfitting
    but toss the overfit cases.
    http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output#9d3f6a55f4a91cb6
    I don't see that anyone has implemented something like this yet.
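
The post-processing step from the first bullet could look roughly like the sketch below. It assumes the similarity output has already been read into (docA, docB, similarity) records; the Pair type, the class name, and the threshold value in the usage note are hypothetical and not part of Mahout.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative post-processing filter (hypothetical, not part of Mahout). */
public class NearDuplicateFilter {

    /** One similarity record between two documents. */
    public static final class Pair {
        public final int docA;
        public final int docB;
        public final double similarity;

        public Pair(int docA, int docB, double similarity) {
            this.docA = docA;
            this.docB = docB;
            this.similarity = similarity;
        }
    }

    /**
     * Drops self-pairs and pairs whose similarity is at or above dupThreshold,
     * treating those as mirrored or near-identical copies.
     */
    public static List<Pair> dropNearDuplicates(List<Pair> pairs, double dupThreshold) {
        List<Pair> kept = new ArrayList<>();
        for (Pair p : pairs) {
            if (p.docA != p.docB && p.similarity < dupThreshold) {
                kept.add(p);
            }
        }
        return kept;
    }
}
```

With a threshold of, say, 0.95, a mirrored post scoring 0.98 against its original is dropped while a related post scoring 0.80 is kept. The right cutoff depends on how much surrounding text the mirrors add.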

Thanks again.
On 2/19/12 9:00 PM, Suneel Marthi wrote:
> Hi Pat,
>
>
> 1. Please look at the discussion thread at http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/browser for a description of what the RowSimilarityJob does.  The RowSimilarityJob implementation is based on the research paper  - http://www.csee.ogi.edu/~zak/cs506-pslc/docsim.pdf
>
> I'll add the details on the mahout wiki page sometime this week.
>
> 2. 'maxSimilaritiesPerRow' returns the best similarities (not the first) - by default this returns top 100 if not specified.
>
> 3. If you would like to discard the similarities per row below a certain value you can specify a threshold -tr,  which would limit the results to only those documents that have a similarity value greater than the threshold.
>
>     The similarity values in the final output should give you an idea of what T1 and T2 should be. In my particular use case I was only interested in documents with a similarity of 0.7 or greater, hence 0.7 was my T2; the most similar documents had a similarity of 0.99999, which I used as my T1.
>
> 4. 'numberOfColumns' is not optional; but I tend to agree with you that this should be inferred automatically if not specified by the size of the input vector.  This could be an enhancement to add to the RowSimilarityJob.
>
>     Code snippet below gets the number of columns in a matrix if not specified by the user.
>
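>     // fs and conf are the job's FileSystem and Configuration, set up elsewhere.
>     Path inputMatrixPath = new Path(getInputPath());
>
>     SequenceFile.Reader sequenceFileReader = new SequenceFile.Reader(fs, inputMatrixPath, conf);
>
>     // getDimensions is a helper that is not shown here.
>     int numberOfColumns = getDimensions(sequenceFileReader);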
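
Points 2 and 3 of the quoted message (keep the best similarities per row rather than the first ones, and discard anything not above a threshold) can be sketched in plain Java as follows. This is an illustration of the described behavior only, not RowSimilarityJob's code; the Scored type and the topK method are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Illustrative top-k selection with a threshold (hypothetical, not Mahout code). */
public class TopKSimilarities {

    /** Similarity of one other document to the current row's document. */
    public static final class Scored {
        public final int docId;
        public final double similarity;

        public Scored(int docId, double similarity) {
            this.docId = docId;
            this.similarity = similarity;
        }
    }

    /**
     * Returns at most k entries with similarity strictly greater than
     * threshold, ordered from most to least similar.
     */
    public static List<Scored> topK(List<Scored> row, int k, double threshold) {
        // Min-heap on similarity: the weakest kept entry sits at the head.
        PriorityQueue<Scored> heap =
            new PriorityQueue<>(Comparator.comparingDouble((Scored s) -> s.similarity));
        for (Scored s : row) {
            if (s.similarity <= threshold) {
                continue; // below or at the threshold: discard
            }
            heap.offer(s);
            if (heap.size() > k) {
                heap.poll(); // evict the current weakest entry
            }
        }
        List<Scored> out = new ArrayList<>(heap);
        out.sort(Comparator.comparingDouble((Scored s) -> s.similarity).reversed());
        return out;
    }
}
```

The bounded heap keeps memory at O(k) per row, which matters when a row has many more candidate similarities than the 100 kept by default.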