Home | About | Sematext search-lucene.com search-hadoop.com
 Search Lucene and all its subprojects:

Switch to Threaded View
Mahout, mail # user - Question on RowSimilarityJob


Copy link to this message
-
Re: Question on RowSimilarityJob
Sebastian Schelter 2012-02-02, 13:08
Great! Thank you!

On 02.02.2012 13:59, Dan Brickley wrote:
> On 1 February 2012 12:06, Sebastian Schelter <[EMAIL PROTECTED]> wrote:
>> Here's a brief description of RowSimilarityJob:
>>
>> The goal is to compute all pairwise similarities between the rows of a
>> sparse matrix A.
>>
>> The computation should be executed in a way that only rows that have at
>> least one non-zero value in the same dimension (column) are compared. We
>> need this to avoid a quadratic number of pairwise comparisons.
>> Furthermore we should be able to 'embed' arbitrary similarity measures
>> and we should always be able to use a combiner in all MapReduce steps.
>>
>> The computation is executed using three MapReduce passes:
>>
>> In the first step, the rows of A are preprocessed via
>> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or
>> scaled to unit-length), a single number for each row of A is computed
>> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed.
>>
>> The second steps operates on the rows of A' (the columns of A). The
>> mapper sums up all pairwise cooccurrences using
>> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so
>> called 'stripes' pattern). The reducers sums up all cooccurrence vectors
>> for one row and uses the similarity measure and the precomputed numbers
>> from step one to compute all similarities via
>> VectorSimilarityMeasure.similarity().
>>
>> The third step ensures that only the top k similar rows per row are kept.
>>
>> It's hard to see from the code but actually the job performs the matrix
>> multiplication AA' via outer products with a modified (similarity
>> measure specific) dot product.
>
> Thanks for this. I've taken liberty of dropping it in the Wiki at
> https://cwiki.apache.org/confluence/display/MAHOUT/RowSimilarityJob
> (and linking this thread). Hope that's ok...
>
> cheers,
>
> Dan