Home | About | Sematext search-lucene.com search-hadoop.com
 Search Lucene and all its subprojects:

Switch to Plain View
Mahout, mail # user - Question on RowSimilarityJob


+
Suneel Marthi 2012-01-20, 16:38
+
Sebastian Schelter 2012-01-20, 17:58
+
Suneel Marthi 2012-01-31, 21:40
+
Sebastian Schelter 2012-02-01, 11:06
+
Vicky 2012-02-02, 11:47
+
Sebastian Schelter 2012-02-02, 13:08
Copy link to this message
-
Re: Question on RowSimilarityJob
Suneel Marthi 2012-02-02, 15:51
Thanks for the details, Sebastian. This is indeed very helpful and I completely understand now what you had on your whiteboard.

I had been normalizing my vectors during seq2sparse (with L2 norm since I would be using Cosine/Tanimoto Similarity in the RowSimilarityJob). But since the first step in the RowSimilarity job is normalizing the vectors I don't have to do that in seq2sparse; doing that would not impact the final outcome anyways. 
This has been a very useful thread and thanks for all your help.

________________________________
 From: Sebastian Schelter <[EMAIL PROTECTED]>
To: Vicky <[EMAIL PROTECTED]>
Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Sent: Thursday, February 2, 2012 8:08 AM
Subject: Re: Question on RowSimilarityJob
 
The heart of RowSimilarityJob is simple cooccurrence counting (with some
tweaks and optimizations). I'll give a simplified example:

We have a 3x3 matrix (could represent items rated by users, docs with
word counts e.g.)

row1: column1, column2
row2: column1, column3
row3: column1, column2

In the first MapReduce pass, we created an inverted index over the
columns (which means we transpose the matrix)

column1: row1, row2, row3
column2: row1, row3
column3: row2

In the second MapReduce pass, we the mapper emits all coocurring row
pairs per column:

for column1:

(row1,row2)
(row1,row3)
(row2,row1)
(row2,row3)
(row3,row1)
(row3,row2)

for column2:

(row1,row3)
(row3,row1)

for column3 there's nothing to emit.

The reducer receives all cooccurrences for a particular row and simply
needs to sum them up:

for row1:

(row1,row2)
(row1,row3)
(row1,row3)

row1,row2 -> 1 cooccurrence
row1,row3 -> 2 cooccurrences

for row2:

(row2,row1)
(row2,row3)

row2,row1 -> 1 cooccurrence
row2,row3 -> 1 cooccurrence

for row3:

(row3,row1)
(row3,row2)
(row3,row1)

row3,row1 -> 2 cooccurrences
row3,row2 -> 1 cooccurrence
Hope that helps with understanding the approach.

--sebastian
On 02.02.2012 12:47, Vicky wrote:
> Sebastian,
> when you said,
>>> The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared
>
>
> if there are three columns in the document (columnA,columnB,columnC,columnD). 
>
> Doc1: ColumnA, columnB, columnC,columnD
> Doc2:ColumnA,columnB,columnC,columnD
>
> Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise?
>
>
>
>
> ----- Original Message -----
> From: Sebastian Schelter <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc:
> Sent: Wednesday, February 1, 2012 6:06 AM
> Subject: Re: Question on RowSimilarityJob
>
> Here's a brief description of RowSimilarityJob:
>
> The goal is to compute all pairwise similarities between the rows of a
> sparse matrix A.
>
> The computation should be executed in a way that only rows that have at
> least one non-zero value in the same dimension (column) are compared. We
> need this to avoid a quadratic number of pairwise comparisons.
> Furthermore we should be able to 'embed' arbitrary similarity measures
> and we should always be able to use a combiner in all MapReduce steps.
>
> The computation is executed using three MapReduce passes:
>
> In the first step, the rows of A are preprocessed via
> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
> scaled to unit-length), a single number for each row of A is computed
> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
>
> The second steps operates on the rows of A' (the columns of A). The
> mapper sums up all pairwise cooccurrences using
> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
> called 'stripes' pattern). The reducers sums up all cooccurrence vectors
> for one row and uses the similarity measure and the precomputed numbers
> from step one to compute all similarities via
> VectorSimilarityMeasure.similarity().
>
> The third step ensures that only the top k similar rows per row are kept.
+
Lance Norskog 2012-02-03, 07:44
+
Sebastian Schelter 2012-02-04, 17:23
+
Dan Brickley 2012-02-02, 12:59
+
Sebastian Schelter 2012-02-02, 13:08
+
Suneel Marthi 2012-01-20, 22:02
+
Lance Norskog 2012-01-20, 22:48
+
Lance Norskog 2012-02-01, 04:29