Mahout, mail # user - Re: Question on RowSimilarityJob


Re: Question on RowSimilarityJob
Suneel Marthi 2012-01-31, 21:40
Sebastian,

Question on the RowSimilarity job.

a) I created sequence files from my document corpus.
b) I created vectors from the sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1.
c) I ran the RowId job on the vectors generated in (b). This gives me an M * N matrix where M = number of documents in my collection and N = cardinality of the vectors. Correct?

d) I ran the RowSimilarity job on the matrix generated in (c) with the Cosine Similarity measure and N as the number of columns. This gives me an M * R matrix where R < N.
    I am not sure I fully understand what happens inside the RowSimilarity job. I read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while, but I think I need some help.
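
For orientation, the general idea in the paper linked above (pairwise similarity computed through an inverted index) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Mahout's implementation: the function names `normalize` and `pairwise_cosine` and the toy `docs` data are hypothetical, and everything runs in memory rather than as MapReduce passes.

```python
from collections import defaultdict
from itertools import combinations
from math import sqrt


def normalize(row):
    """Scale a sparse row {term_id: weight} to unit Euclidean length."""
    norm = sqrt(sum(w * w for w in row.values()))
    return {t: w / norm for t, w in row.items()} if norm else {}


def pairwise_cosine(rows):
    """rows: {doc_id: {term_id: weight}} -> {(i, j): cosine} for i < j."""
    # Step 1: invert the matrix, so each term (column) lists the
    # documents (rows) that contain it, with their normalized weights.
    postings = defaultdict(list)
    for doc_id, row in rows.items():
        for term, w in normalize(row).items():
            postings[term].append((doc_id, w))

    # Step 2: every term contributes one partial product to each pair of
    # documents that share it; summing the partial products gives the
    # dot product of the unit vectors, i.e. the cosine.
    sims = defaultdict(float)
    for plist in postings.values():
        for (i, wi), (j, wj) in combinations(sorted(plist), 2):
            sims[(i, j)] += wi * wj
    return dict(sims)


# Hypothetical toy corpus: 4 documents (M = 4) over 3 terms (N = 3).
docs = {
    0: {0: 1.0, 1: 1.0},
    1: {0: 1.0, 1: 1.0},
    2: {2: 3.0},
    3: {1: 2.0, 2: 2.0},
}
sims = pairwise_cosine(docs)
```

In this toy version the result is keyed by pairs of rows, so it relates documents to documents, and pairs that share no term (such as documents 0 and 2) never appear at all.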

I am willing to put some docs on the wiki for RowSimilarityJob once I am done.

Thanks for your help.

________________________________
 From: Sebastian Schelter <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Friday, January 20, 2012 12:58 PM
Subject: Re: Question on RowSimilarityJob
 
Hi,

'maxSimilaritiesPerRow' denotes the maximum number of similar rows
(documents in your use case) to keep per document.
'excludeSelfSimilarity' means that rows (documents) should not be
compared to themselves.
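
To make the two parameters concrete, here is a minimal brute-force sketch in plain Python. It is an illustration only and not Mahout's code: the function `row_similarities`, its snake_case arguments and the toy `docs` data are hypothetical stand-ins that mirror the meaning described above.

```python
import heapq
from math import sqrt


def cosine(a, b):
    """Cosine similarity of two sparse rows given as {column: weight}."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = sqrt(sum(w * w for w in a.values()))
    nb = sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def row_similarities(rows, max_similarities_per_row, exclude_self_similarity):
    """Return {row_id: [(other_row_id, similarity), ...]}, best first."""
    result = {}
    for i, row_i in rows.items():
        candidates = []
        for j, row_j in rows.items():
            # excludeSelfSimilarity: never compare a row with itself.
            if exclude_self_similarity and i == j:
                continue
            s = cosine(row_i, row_j)
            if s > 0.0:
                candidates.append((j, s))
        # maxSimilaritiesPerRow: keep only the k most similar rows.
        result[i] = heapq.nlargest(
            max_similarities_per_row, candidates, key=lambda pair: pair[1]
        )
    return result


# Hypothetical toy corpus: 4 documents over 3 terms.
docs = {
    0: {0: 1.0, 1: 1.0},
    1: {0: 1.0, 1: 1.0},
    2: {2: 3.0},
    3: {1: 2.0, 2: 2.0},
}
top = row_similarities(docs, max_similarities_per_row=2,
                       exclude_self_similarity=True)
```

With self-similarity excluded, document 2 keeps only document 3 as a neighbour; with it included, document 2 would rank itself first with a similarity of 1.0.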

Sorry for the lack of documentation; RowSimilarityJob was originally only
an internal job for the recommendation code. I'll try to add something
on the wiki in the next few days.

--sebastian
On 20.01.2012 17:38, Suneel Marthi wrote:
> I am trying to determine document similarity for a corpus using RowSimilarity.
>
> Questions:
>
> a) What do the parameters 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean?
> b) Are there any docs available on RowSimilarityJob? This is the best I could find, on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ .
>
> c) Also, do we have any docs on RowIdJob?
>
> Thanks and Regards,
> Suneel
>