Home | About | Sematext search-lucene.com search-hadoop.com
 Search Lucene and all its subprojects:

Switch to Threaded View
Mahout, mail # user - Question on RowSimilarityJob


Copy link to this message
-
Re: Question on RowSimilarityJob
Sebastian Schelter 2012-02-02, 13:08
The heart of RowSimilarityJob is simple cooccurrence counting (with some
tweaks and optimizations). I'll give a simplified example:

We have a 3x3 matrix (could represent items rated by users, docs with
word counts e.g.)

row1: column1, column2
row2: column1, column3
row3: column1, column2

In the first MapReduce pass, we created an inverted index over the
columns (which means we transpose the matrix)

column1: row1, row2, row3
column2: row1, row3
column3: row2

In the second MapReduce pass, we the mapper emits all coocurring row
pairs per column:

for column1:

(row1,row2)
(row1,row3)
(row2,row1)
(row2,row3)
(row3,row1)
(row3,row2)

for column2:

(row1,row3)
(row3,row1)

for column3 there's nothing to emit.

The reducer receives all cooccurrences for a particular row and simply
needs to sum them up:

for row1:

(row1,row2)
(row1,row3)
(row1,row3)

row1,row2 -> 1 cooccurrence
row1,row3 -> 2 cooccurrences

for row2:

(row2,row1)
(row2,row3)

row2,row1 -> 1 cooccurrence
row2,row3 -> 1 cooccurrence

for row3:

(row3,row1)
(row3,row2)
(row3,row1)

row3,row1 -> 2 cooccurrences
row3,row2 -> 1 cooccurrence
Hope that helps with understanding the approach.

--sebastian
On 02.02.2012 12:47, Vicky wrote:
> Sebastian,
> when you said,
>>> The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared
>
>
> if there are three columns in the document (columnA,columnB,columnC,columnD).  
>
> Doc1: ColumnA, columnB, columnC,columnD
> Doc2:ColumnA,columnB,columnC,columnD
>
> Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise?
>
>
>
>
> ----- Original Message -----
> From: Sebastian Schelter <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Cc:
> Sent: Wednesday, February 1, 2012 6:06 AM
> Subject: Re: Question on RowSimilarityJob
>
> Here's a brief description of RowSimilarityJob:
>
> The goal is to compute all pairwise similarities between the rows of a
> sparse matrix A.
>
> The computation should be executed in a way that only rows that have at
> least one non-zero value in the same dimension (column) are compared. We
> need this to avoid a quadratic number of pairwise comparisons.
> Furthermore we should be able to 'embed' arbitrary similarity measures
> and we should always be able to use a combiner in all MapReduce steps.
>
> The computation is executed using three MapReduce passes:
>
> In the first step, the rows of A are preprocessed via
> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
> scaled to unit-length), a single number for each row of A is computed
> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
>
> The second steps operates on the rows of A' (the columns of A). The
> mapper sums up all pairwise cooccurrences using
> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
> called 'stripes' pattern). The reducers sums up all cooccurrence vectors
> for one row and uses the similarity measure and the precomputed numbers
> from step one to compute all similarities via
> VectorSimilarityMeasure.similarity().
>
> The third step ensures that only the top k similar rows per row are kept.
>
> It's hard to see from the code but actually the job performs the matrix
> multiplication AA' via outer products with a modified (similarity
> measure specific) dot product.
>
> /s
>
>
>
> On 31.01.2012 22:40, Suneel Marthi wrote:
>> Sebastian,
>>
>> Question on the RowSimilarity job.
>>
>> a) I created sequence files from my document corpus.
>> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1
>> c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of  documents in my collection, N = cardinality of the vector - Correct?
>>
>> d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where  R < N.