Mahout, mail # user - Re: How to find the k most similar docs


Re: How to find the k most similar docs
Pat Ferrel 2012-02-20, 19:10
Suneel, this is extremely helpful. I hope it gets to the Mahout wiki.

Some thoughts:

  * a threshold for self-similarity seems useful. I'm thinking of
    mirrored news groups, bulletin boards, and social network posts
    where the docs may be nearly identical but have some surrounding
    text that doesn't quite match, so testing for a similarity of
    exactly 1.0 might not work. This is not an academic question since
    these are some of the docs we plan to examine. It should be pretty
    easy to do this in a post-processing step for now.
  * I see how you use RowSimilarityJob to guess at good T1 and T2. In my
    case I am also concerned with the cohesion of the resulting
    clusters. The outliers will likely never be seen by humans. The
    intuition here is that well-formed clusters even if diffuse will
    give better results for us than a greater number of poorly-formed
    clusters. One way we have considered getting this result is to form
    lots of clusters, perhaps as you describe using T1 and T2 derived
    from RowSimilarityJob then throw out ones that do not match some
    measurement (Dunning mentions entropy). This would allow overfitting
    but toss the overfit cases.
    http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output#9d3f6a55f4a91cb6
    I don't see that anyone has implemented something like this yet.
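
The post-processing step from the first point could be sketched roughly as follows. This is plain Java, not Mahout code: the class NearDuplicateFilter, the Scored holder, and the two cutoffs (0.7 and 0.95) are hypothetical names and values chosen only for illustration, and the sketch assumes the similarities have already been read out of the job output into memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of Mahout: filters one row of similarity output.
final class NearDuplicateFilter {

    /** One (document id, similarity) entry from a row of similarity output. */
    static final class Scored {
        final int docId;
        final double similarity;

        Scored(int docId, double similarity) {
            this.docId = docId;
            this.similarity = similarity;
        }
    }

    /**
     * Keeps entries that are similar enough to be related (>= minSimilarity)
     * but not so similar that they are probably mirrored copies
     * (< nearDuplicateThreshold).
     */
    static List<Scored> dropNearDuplicates(List<Scored> row,
                                           double minSimilarity,
                                           double nearDuplicateThreshold) {
        List<Scored> kept = new ArrayList<>();
        for (Scored s : row) {
            if (s.similarity >= minSimilarity && s.similarity < nearDuplicateThreshold) {
                kept.add(s);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up similarities for one document's row.
        List<Scored> row = Arrays.asList(
            new Scored(1, 0.999), new Scored(2, 0.97),
            new Scored(3, 0.80), new Scored(4, 0.40));
        for (Scored s : dropNearDuplicates(row, 0.7, 0.95)) {
            System.out.println(s.docId + " " + s.similarity);
        }
    }
}
```

The upper cutoff would have to be tuned on real data, since how much the surrounding text lowers the score for a mirrored post depends on the corpus.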

Thanks again.
On 2/19/12 9:00 PM, Suneel Marthi wrote:
> Hi Pat,
>
>
> 1. Please look at the discussion thread at http://mail-archives.apache.org/mod_mbox/mahout-user/201202.mbox/browser for a description of what the RowSimilarityJob does.  The RowSimilarityJob implementation is based on the research paper  - http://www.csee.ogi.edu/~zak/cs506-pslc/docsim.pdf
>
> I'll add the details on the mahout wiki page sometime this week.
>
> 2. 'maxSimilaritiesPerRow' returns the best similarities (not the first ones encountered); if not specified, it defaults to the top 100.
>
> 3. If you would like to discard the similarities per row below a certain value you can specify a threshold -tr,  which would limit the results to only those documents that have a similarity value greater than the threshold.
>
>     Depending on the similarity measures that you get as the final output, it should give you an idea of what T1 and T2 should be.  In my particular use case I was only interested in documents that had a similarity measure of 0.7 or greater, hence 0.7 would be my T2; and the most similar documents had a similarity value of 0.99999 (which was what I used as my T1).
>
> 4. 'numberOfColumns' is not optional, but I tend to agree with you that, if not specified, it should be inferred automatically from the size of the input vectors.  This could be an enhancement to add to the RowSimilarityJob.
>
>     The code snippet below gets the number of columns in a matrix if not specified by the user.
>
>     // Uses org.apache.hadoop.fs.Path, org.apache.hadoop.io.SequenceFile,
>     // org.apache.hadoop.io.Writable and org.apache.mahout.math.VectorWritable.
>     Path inputMatrixPath = new Path(getInputPath());
>     SequenceFile.Reader sequenceFileReader = new SequenceFile.Reader(fs, inputMatrixPath, conf);
>     int numberOfColumns;
>     try {
>       numberOfColumns = getDimensions(sequenceFileReader);
>     } finally {
>       // Close the reader even if reading the first row fails.
>       sequenceFileReader.close();
>     }
>
>     // Reads the first row of the matrix and returns the size of its vector.
>     private int getDimensions(SequenceFile.Reader reader)
>         throws IOException, InstantiationException, IllegalAccessException {
>       if (!reader.getValueClass().equals(VectorWritable.class)) {
>         throw new IllegalArgumentException("Value type of sequencefile must be a VectorWritable");
>       }
>       Class<?> keyClass = reader.getKeyClass();
>       Writable row = (Writable) keyClass.newInstance();
>       VectorWritable vw = new VectorWritable();
>       if (!reader.next(row, vw)) {
>         log.error("matrix must have at least one row");
>         throw new IllegalStateException("matrix must have at least one row");
>       }
>       return vw.get().size();
>     }
>
> 5. RowSimilarityJob also has an excludeSelfSimilarity option (false by default). You need to set it so that you don't end up comparing a document with itself and getting a similarity measure of 1.0 (if using the Cosine measure).
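
The behaviour described in points 2, 3 and 5 of the quoted message (best k per row, a similarity threshold, and skipping self-similarity) can be illustrated with a small in-memory sketch. This is not RowSimilarityJob's implementation; the class RowTopK, its method, and the sample values are hypothetical, and it assumes one row's similarities fit in a map keyed by document id.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; Mahout does this inside a MapReduce job.
final class RowTopK {

    /**
     * Returns up to k entries of one row, best similarity first, keeping only
     * similarities strictly greater than threshold and optionally skipping the
     * row's own document.
     */
    static List<Map.Entry<Integer, Double>> topK(int rowId,
                                                 Map<Integer, Double> similarities,
                                                 int k,
                                                 double threshold,
                                                 boolean excludeSelf) {
        List<Map.Entry<Integer, Double>> candidates = new ArrayList<>();
        for (Map.Entry<Integer, Double> e : similarities.entrySet()) {
            if (excludeSelf && e.getKey() == rowId) {
                continue; // do not compare a document with itself
            }
            if (e.getValue() > threshold) {
                candidates.add(new AbstractMap.SimpleImmutableEntry<Integer, Double>(
                    e.getKey(), e.getValue()));
            }
        }
        // Highest similarity first; ties broken by document id for determinism.
        candidates.sort((a, b) -> {
            int bySimilarity = Double.compare(b.getValue(), a.getValue());
            return bySimilarity != 0 ? bySimilarity : Integer.compare(a.getKey(), b.getKey());
        });
        return new ArrayList<>(candidates.subList(0, Math.min(k, candidates.size())));
    }

    public static void main(String[] args) {
        // Made-up similarities of document 0 to documents 0..4.
        Map<Integer, Double> row = new HashMap<>();
        row.put(0, 1.0);
        row.put(1, 0.9);
        row.put(2, 0.75);
        row.put(3, 0.6);
        row.put(4, 0.99);
        System.out.println(topK(0, row, 2, 0.7, true));
    }
}
```

Sorting the filtered candidates is fine for a single row in memory; for k much smaller than the row length, a bounded priority queue would avoid sorting everything.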