|
Lance Norskog
2012-02-01, 04:29
Suneel Marthi
2012-01-20, 16:38
Sebastian Schelter
2012-01-20, 17:58
Suneel Marthi
2012-01-31, 21:40
Sebastian Schelter
2012-02-01, 11:06
Vicky
2012-02-02, 11:47
Sebastian Schelter
2012-02-02, 13:08
Lance Norskog
2012-02-03, 07:44
Sebastian Schelter
2012-02-04, 17:23
Suneel Marthi
2012-02-02, 15:51
Dan Brickley
2012-02-02, 12:59
Sebastian Schelter
2012-02-02, 13:08
Suneel Marthi
2012-01-20, 22:02
Lance Norskog
2012-01-20, 22:48
|
-
Re: Question on RowSimilarityJobLance Norskog 2012-02-01, 04:29
In fact I will back off there. It is possible to use Hadoop to create
a data model, which you can then use to do recommendations for your interactive search. It is similarly possible to do your clustering job and use a disk-backed store to fetch a cluster. I have not teased apart the various ways to compose a Hadoop-based recommender job. There seem to be more options than I understood. Lance On Mon, Jan 23, 2012 at 6:06 AM, Suneel Marthi <[EMAIL PROTECTED]> wrote: > Thanks Lance for your reply. > > I cannot go with any in-memory implementations as I am working with a corpus > of ~600K documents. > > As regards clustering, I did try Mahout's various clustering implementations > - kmeans, canopy and minhash. Minhash would have been the right algorithm > for > what I am trying to do but I never had good results from Mahout's minhash > implementation. > > I had never tried any recommenders in Mahout, which one would you suggest; I > would definitely need a Hadoop based solution as I am dealing with large > data. > > Thanks, > Suneel > > ________________________________ > From: Lance Norskog <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED]; Suneel Marthi <[EMAIL PROTECTED]> > Sent: Friday, January 20, 2012 5:48 PM > > Subject: Re: Question on RowSimilarityJob > > This can be done as a recommender or with clustering. As a > recommender, you would say, "find similar documents". In clustering, > you can take one document and find its neighbors in the cluster graph. > > How much data do you have? An important point here is that most of > Mahout's recommender algorithms are in-memory programs. There is only > one Hadoop-based recommender. There are several Hadoop-based > clustering algorithms. > > On Fri, Jan 20, 2012 at 2:02 PM, Suneel Marthi <[EMAIL PROTECTED]> > wrote: >> Thanks for the reply Sebastian. >> >> >> For the task I am working on - 'determine similar documents in a >> collection' - would RowSimilarity be the right approach? >> >> >> >> >> >> >> ________________________________ >> From: Sebastian Schelter <[EMAIL PROTECTED]> >> To: [EMAIL PROTECTED] >> Sent: Friday, January 20, 2012 12:58 PM >> Subject: Re: Question on RowSimilarityJob >> >> Hi, >> >> 'maxSimilaritiesPerRow' denotes the maximum number of similar rows >> (documents in your use case) to keep per document. >> 'excludeSelfSimilarity' means that rows (documents) should not be >> compared to themselves. >> >> Sry for the lack of documentation, RowSimilarityJob was originally only >> an internal job for the recommendation code. I'll try to add something >> on the wiki in the next days. >> >> --sebastian >> >> >> On 20.01.2012 17:38, Suneel Marthi wrote: >>> I am working on determining document similarity of a corpus I am working >>> with using RowSimilarity. >>> >>> Questions:- >>> >>> a) What do the parameters - 'maxSimilaritiesPerRow' and >>> 'excludeSelfSimilarity' mean? >>> b) Are there any docs available on RowSimilarityJob available, this is >>> the best I could find on Sebastian's blog - >>> http://ssc.io/rowsimilarityjob-on-steroids/ . >>> >>> c) Also do we have any docs on RowIdJob ? >>> >>> Thanks and Regards, >>> Suneel >>> > > > > -- > Lance Norskog > [EMAIL PROTECTED] > > -- Lance Norskog [EMAIL PROTECTED] +
Lance Norskog 2012-02-01, 04:29
-
Question on RowSimilarityJobSuneel Marthi 2012-01-20, 16:38
I am working on determining document similarity of a corpus I am working with using RowSimilarity.
Questions:- a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean? b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ . c) Also do we have any docs on RowIdJob ? Thanks and Regards, Suneel +
Suneel Marthi 2012-01-20, 16:38
-
Re: Question on RowSimilarityJobSebastian Schelter 2012-01-20, 17:58
Hi,
'maxSimilaritiesPerRow' denotes the maximum number of similar rows (documents in your use case) to keep per document. 'excludeSelfSimilarity' means that rows (documents) should not be compared to themselves. Sry for the lack of documentation, RowSimilarityJob was originally only an internal job for the recommendation code. I'll try to add something on the wiki in the next days. --sebastian On 20.01.2012 17:38, Suneel Marthi wrote: > I am working on determining document similarity of a corpus I am working with using RowSimilarity. > > Questions:- > > a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean? > b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ . > > c) Also do we have any docs on RowIdJob ? > > Thanks and Regards, > Suneel > +
Sebastian Schelter 2012-01-20, 17:58
-
Re: Question on RowSimilarityJobSuneel Marthi 2012-01-31, 21:40
Sebastian,
Question on the RowSimilarity job. a) I created sequence files from my document corpus. b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1 c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of documents in my collection, N = cardinality of the vector - Correct? d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where R < N. I am not sure I completely understand as to what's happening in the RowSimilarity Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while to understand what's happening, but guess I need some help. I am willing to put some docs on the wiki for RowSimilarityJob once I am done. Thanks for your help. ________________________________ From: Sebastian Schelter <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Friday, January 20, 2012 12:58 PM Subject: Re: Question on RowSimilarityJob Hi, 'maxSimilaritiesPerRow' denotes the maximum number of similar rows (documents in your use case) to keep per document. 'excludeSelfSimilarity' means that rows (documents) should not be compared to themselves. Sry for the lack of documentation, RowSimilarityJob was originally only an internal job for the recommendation code. I'll try to add something on the wiki in the next days. --sebastian On 20.01.2012 17:38, Suneel Marthi wrote: > I am working on determining document similarity of a corpus I am working with using RowSimilarity. > > Questions:- > > a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean? > b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ . > > c) Also do we have any docs on RowIdJob ? > > Thanks and Regards, > Suneel > +
Suneel Marthi 2012-01-31, 21:40
-
Re: Question on RowSimilarityJobSebastian Schelter 2012-02-01, 11:06
Here's a brief description of RowSimilarityJob:
The goal is to compute all pairwise similarities between the rows of a sparse matrix A. The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared. We need this to avoid a quadratic number of pairwise comparisons. Furthermore we should be able to 'embed' arbitrary similarity measures and we should always be able to use a combiner in all MapReduce steps. The computation is executed using three MapReduce passes: In the first step, the rows of A are preprocessed via VectorSimilarityMeasure.normalize() (they could e.g. be binarized or scaled to unit-length), a single number for each row of A is computed via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed. The second steps operates on the rows of A' (the columns of A). The mapper sums up all pairwise cooccurrences using VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so called 'stripes' pattern). The reducers sums up all cooccurrence vectors for one row and uses the similarity measure and the precomputed numbers from step one to compute all similarities via VectorSimilarityMeasure.similarity(). The third step ensures that only the top k similar rows per row are kept. It's hard to see from the code but actually the job performs the matrix multiplication AA' via outer products with a modified (similarity measure specific) dot product. /s On 31.01.2012 22:40, Suneel Marthi wrote: > Sebastian, > > Question on the RowSimilarity job. > > a) I created sequence files from my document corpus. > b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1 > c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of documents in my collection, N = cardinality of the vector - Correct? > > d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where R < N. > > > I am not sure I completely understand as to what's happening in the RowSimilarity Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while to understand what's happening, but guess I need some help. > > I am willing to put some docs on the wiki for RowSimilarityJob once I am done. > > Thanks for your help. > > > > ________________________________ > From: Sebastian Schelter <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Sent: Friday, January 20, 2012 12:58 PM > Subject: Re: Question on RowSimilarityJob > > Hi, > > 'maxSimilaritiesPerRow' denotes the maximum number of similar rows > (documents in your use case) to keep per document. > 'excludeSelfSimilarity' means that rows (documents) should not be > compared to themselves. > > Sry for the lack of documentation, RowSimilarityJob was originally only > an internal job for the recommendation code. I'll try to add something > on the wiki in the next days. > > --sebastian > > > On 20.01.2012 17:38, Suneel Marthi wrote: >> I am working on determining document similarity of a corpus I am working with using RowSimilarity. >> >> Questions:- >> >> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean? >> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ . >> >> c) Also do we have any docs on RowIdJob ? >> >> Thanks and Regards, >> Suneel >> +
Sebastian Schelter 2012-02-01, 11:06
-
Re: Question on RowSimilarityJobVicky 2012-02-02, 11:47
Sebastian,
when you said, >>The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared if there are three columns in the document (columnA,columnB,columnC,columnD). Doc1: ColumnA, columnB, columnC,columnD Doc2:ColumnA,columnB,columnC,columnD Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise? ----- Original Message ----- From: Sebastian Schelter <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Cc: Sent: Wednesday, February 1, 2012 6:06 AM Subject: Re: Question on RowSimilarityJob Here's a brief description of RowSimilarityJob: The goal is to compute all pairwise similarities between the rows of a sparse matrix A. The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared. We need this to avoid a quadratic number of pairwise comparisons. Furthermore we should be able to 'embed' arbitrary similarity measures and we should always be able to use a combiner in all MapReduce steps. The computation is executed using three MapReduce passes: In the first step, the rows of A are preprocessed via VectorSimilarityMeasure.normalize() (they could e.g. be binarized or scaled to unit-length), a single number for each row of A is computed via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed. The second steps operates on the rows of A' (the columns of A). The mapper sums up all pairwise cooccurrences using VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so called 'stripes' pattern). The reducers sums up all cooccurrence vectors for one row and uses the similarity measure and the precomputed numbers from step one to compute all similarities via VectorSimilarityMeasure.similarity(). The third step ensures that only the top k similar rows per row are kept. It's hard to see from the code but actually the job performs the matrix multiplication AA' via outer products with a modified (similarity measure specific) dot product. /s On 31.01.2012 22:40, Suneel Marthi wrote: > Sebastian, > > Question on the RowSimilarity job. > > a) I created sequence files from my document corpus. > b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1 > c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of documents in my collection, N = cardinality of the vector - Correct? > > d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where R < N. > > > I am not sure I completely understand as to what's happening in the RowSimilarity Job, I did read the paper at http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf and have been staring at your whiteboard (http://ssc.io/wp-content/uploads/2011/08/v2.jpg) for a while to understand what's happening, but guess I need some help. > > I am willing to put some docs on the wiki for RowSimilarityJob once I am done. > > Thanks for your help. > > > > ________________________________ > From: Sebastian Schelter <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Sent: Friday, January 20, 2012 12:58 PM > Subject: Re: Question on RowSimilarityJob > > Hi, > > 'maxSimilaritiesPerRow' denotes the maximum number of similar rows > (documents in your use case) to keep per document. > 'excludeSelfSimilarity' means that rows (documents) should not be > compared to themselves. > > Sry for the lack of documentation, RowSimilarityJob was originally only > an internal job for the recommendation code. I'll try to add something > on the wiki in the next days. > > --sebastian > > > On 20.01.2012 17:38, Suneel Marthi wrote: >> I am working on determining document similarity of a corpus I am working with using RowSimilarity. +
Vicky 2012-02-02, 11:47
-
Re: Question on RowSimilarityJobSebastian Schelter 2012-02-02, 13:08
The heart of RowSimilarityJob is simple cooccurrence counting (with some
tweaks and optimizations). I'll give a simplified example: We have a 3x3 matrix (could represent items rated by users, docs with word counts e.g.) row1: column1, column2 row2: column1, column3 row3: column1, column2 In the first MapReduce pass, we created an inverted index over the columns (which means we transpose the matrix) column1: row1, row2, row3 column2: row1, row3 column3: row2 In the second MapReduce pass, we the mapper emits all coocurring row pairs per column: for column1: (row1,row2) (row1,row3) (row2,row1) (row2,row3) (row3,row1) (row3,row2) for column2: (row1,row3) (row3,row1) for column3 there's nothing to emit. The reducer receives all cooccurrences for a particular row and simply needs to sum them up: for row1: (row1,row2) (row1,row3) (row1,row3) row1,row2 -> 1 cooccurrence row1,row3 -> 2 cooccurrences for row2: (row2,row1) (row2,row3) row2,row1 -> 1 cooccurrence row2,row3 -> 1 cooccurrence for row3: (row3,row1) (row3,row2) (row3,row1) row3,row1 -> 2 cooccurrences row3,row2 -> 1 cooccurrence Hope that helps with understanding the approach. --sebastian On 02.02.2012 12:47, Vicky wrote: > Sebastian, > when you said, >>> The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared > > > if there are three columns in the document (columnA,columnB,columnC,columnD). > > Doc1: ColumnA, columnB, columnC,columnD > Doc2:ColumnA,columnB,columnC,columnD > > Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise? > > > > > ----- Original Message ----- > From: Sebastian Schelter <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Cc: > Sent: Wednesday, February 1, 2012 6:06 AM > Subject: Re: Question on RowSimilarityJob > > Here's a brief description of RowSimilarityJob: > > The goal is to compute all pairwise similarities between the rows of a > sparse matrix A. > > The computation should be executed in a way that only rows that have at > least one non-zero value in the same dimension (column) are compared. We > need this to avoid a quadratic number of pairwise comparisons. > Furthermore we should be able to 'embed' arbitrary similarity measures > and we should always be able to use a combiner in all MapReduce steps. > > The computation is executed using three MapReduce passes: > > In the first step, the rows of A are preprocessed via > VectorSimilarityMeasure.normalize() (they could e.g. be binarized or > scaled to unit-length), a single number for each row of A is computed > via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed. > > The second steps operates on the rows of A' (the columns of A). The > mapper sums up all pairwise cooccurrences using > VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so > called 'stripes' pattern). The reducers sums up all cooccurrence vectors > for one row and uses the similarity measure and the precomputed numbers > from step one to compute all similarities via > VectorSimilarityMeasure.similarity(). > > The third step ensures that only the top k similar rows per row are kept. > > It's hard to see from the code but actually the job performs the matrix > multiplication AA' via outer products with a modified (similarity > measure specific) dot product. > > /s > > > > On 31.01.2012 22:40, Suneel Marthi wrote: >> Sebastian, >> >> Question on the RowSimilarity job. >> >> a) I created sequence files from my document corpus. >> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1 >> c) Ran the RowId job on the vectors generated in (2) - this gives me an M * N matrix where M = number of documents in my collection, N = cardinality of the vector - Correct? >> >> d) Ran the Rowsimilarity job on the matrix generated in (3) with Cosine Similarity measure and 'N' as the number of columns - this gives me an M * R matrix where R < N. +
Sebastian Schelter 2012-02-02, 13:08
-
Re: Question on RowSimilarityJobLance Norskog 2012-02-03, 07:44
Please add these details to the wiki page. I had no idea what this
thing does: I though it was doing matrix multiplies. On Thu, Feb 2, 2012 at 5:08 AM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: > The heart of RowSimilarityJob is simple cooccurrence counting (with some > tweaks and optimizations). I'll give a simplified example: > > We have a 3x3 matrix (could represent items rated by users, docs with > word counts e.g.) > > row1: column1, column2 > row2: column1, column3 > row3: column1, column2 > > In the first MapReduce pass, we created an inverted index over the > columns (which means we transpose the matrix) > > column1: row1, row2, row3 > column2: row1, row3 > column3: row2 > > In the second MapReduce pass, we the mapper emits all coocurring row > pairs per column: > > for column1: > > (row1,row2) > (row1,row3) > (row2,row1) > (row2,row3) > (row3,row1) > (row3,row2) > > for column2: > > (row1,row3) > (row3,row1) > > for column3 there's nothing to emit. > > The reducer receives all cooccurrences for a particular row and simply > needs to sum them up: > > for row1: > > (row1,row2) > (row1,row3) > (row1,row3) > > row1,row2 -> 1 cooccurrence > row1,row3 -> 2 cooccurrences > > for row2: > > (row2,row1) > (row2,row3) > > row2,row1 -> 1 cooccurrence > row2,row3 -> 1 cooccurrence > > for row3: > > (row3,row1) > (row3,row2) > (row3,row1) > > row3,row1 -> 2 cooccurrences > row3,row2 -> 1 cooccurrence > > > Hope that helps with understanding the approach. > > --sebastian > > > > > On 02.02.2012 12:47, Vicky wrote: >> Sebastian, >> when you said, >>>> The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared >> >> >> if there are three columns in the document (columnA,columnB,columnC,columnD). >> >> Doc1: ColumnA, columnB, columnC,columnD >> Doc2:ColumnA,columnB,columnC,columnD >> >> Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise? >> >> >> >> >> ----- Original Message ----- >> From: Sebastian Schelter <[EMAIL PROTECTED]> >> To: [EMAIL PROTECTED] >> Cc: >> Sent: Wednesday, February 1, 2012 6:06 AM >> Subject: Re: Question on RowSimilarityJob >> >> Here's a brief description of RowSimilarityJob: >> >> The goal is to compute all pairwise similarities between the rows of a >> sparse matrix A. >> >> The computation should be executed in a way that only rows that have at >> least one non-zero value in the same dimension (column) are compared. We >> need this to avoid a quadratic number of pairwise comparisons. >> Furthermore we should be able to 'embed' arbitrary similarity measures >> and we should always be able to use a combiner in all MapReduce steps. >> >> The computation is executed using three MapReduce passes: >> >> In the first step, the rows of A are preprocessed via >> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or >> scaled to unit-length), a single number for each row of A is computed >> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed. >> >> The second steps operates on the rows of A' (the columns of A). The >> mapper sums up all pairwise cooccurrences using >> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so >> called 'stripes' pattern). The reducers sums up all cooccurrence vectors >> for one row and uses the similarity measure and the precomputed numbers >> from step one to compute all similarities via >> VectorSimilarityMeasure.similarity(). >> >> The third step ensures that only the top k similar rows per row are kept. >> >> It's hard to see from the code but actually the job performs the matrix >> multiplication AA' via outer products with a modified (similarity >> measure specific) dot product. >> >> /s >> >> >> >> On 31.01.2012 22:40, Suneel Marthi wrote: >>> Sebastian, >>> >>> Question on the RowSimilarity job. >>> >>> a) I created sequence files from my document corpus. >>> b) Created vectors from sequence files with ngrams = 3, normalization = 2, min document frequency = 1 and minimum support = 1 Lance Norskog [EMAIL PROTECTED] +
Lance Norskog 2012-02-03, 07:44
-
Re: Question on RowSimilarityJobSebastian Schelter 2012-02-04, 17:23
Well these details are only a very, very simplified view of the job's
actual implementation, so it might confuse the readers. And actually RowSimilarityJob is executing a kind of matrix multiplication. If A is 0-1 matrix and we create AA' by summing outer products, we end up with a computation that is very similar to the example. --sebastian On 03.02.2012 08:44, Lance Norskog wrote: > Please add these details to the wiki page. I had no idea what this > thing does: I though it was doing matrix multiplies. > > On Thu, Feb 2, 2012 at 5:08 AM, Sebastian Schelter <[EMAIL PROTECTED]> wrote: >> The heart of RowSimilarityJob is simple cooccurrence counting (with some >> tweaks and optimizations). I'll give a simplified example: >> >> We have a 3x3 matrix (could represent items rated by users, docs with >> word counts e.g.) >> >> row1: column1, column2 >> row2: column1, column3 >> row3: column1, column2 >> >> In the first MapReduce pass, we created an inverted index over the >> columns (which means we transpose the matrix) >> >> column1: row1, row2, row3 >> column2: row1, row3 >> column3: row2 >> >> In the second MapReduce pass, we the mapper emits all coocurring row >> pairs per column: >> >> for column1: >> >> (row1,row2) >> (row1,row3) >> (row2,row1) >> (row2,row3) >> (row3,row1) >> (row3,row2) >> >> for column2: >> >> (row1,row3) >> (row3,row1) >> >> for column3 there's nothing to emit. >> >> The reducer receives all cooccurrences for a particular row and simply >> needs to sum them up: >> >> for row1: >> >> (row1,row2) >> (row1,row3) >> (row1,row3) >> >> row1,row2 -> 1 cooccurrence >> row1,row3 -> 2 cooccurrences >> >> for row2: >> >> (row2,row1) >> (row2,row3) >> >> row2,row1 -> 1 cooccurrence >> row2,row3 -> 1 cooccurrence >> >> for row3: >> >> (row3,row1) >> (row3,row2) >> (row3,row1) >> >> row3,row1 -> 2 cooccurrences >> row3,row2 -> 1 cooccurrence >> >> >> Hope that helps with understanding the approach. >> >> --sebastian >> >> >> >> >> On 02.02.2012 12:47, Vicky wrote: >>> Sebastian, >>> when you said, >>>>> The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared >>> >>> >>> if there are three columns in the document (columnA,columnB,columnC,columnD). >>> >>> Doc1: ColumnA, columnB, columnC,columnD >>> Doc2:ColumnA,columnB,columnC,columnD >>> >>> Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise? >>> >>> >>> >>> >>> ----- Original Message ----- >>> From: Sebastian Schelter <[EMAIL PROTECTED]> >>> To: [EMAIL PROTECTED] >>> Cc: >>> Sent: Wednesday, February 1, 2012 6:06 AM >>> Subject: Re: Question on RowSimilarityJob >>> >>> Here's a brief description of RowSimilarityJob: >>> >>> The goal is to compute all pairwise similarities between the rows of a >>> sparse matrix A. >>> >>> The computation should be executed in a way that only rows that have at >>> least one non-zero value in the same dimension (column) are compared. We >>> need this to avoid a quadratic number of pairwise comparisons. >>> Furthermore we should be able to 'embed' arbitrary similarity measures >>> and we should always be able to use a combiner in all MapReduce steps. >>> >>> The computation is executed using three MapReduce passes: >>> >>> In the first step, the rows of A are preprocessed via >>> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or >>> scaled to unit-length), a single number for each row of A is computed >>> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed. >>> >>> The second steps operates on the rows of A' (the columns of A). The >>> mapper sums up all pairwise cooccurrences using >>> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so >>> called 'stripes' pattern). The reducers sums up all cooccurrence vectors >>> for one row and uses the similarity measure and the precomputed numbers >>> from step one to compute all similarities via +
Sebastian Schelter 2012-02-04, 17:23
-
Re: Question on RowSimilarityJobSuneel Marthi 2012-02-02, 15:51
Thanks for the details, Sebastian. This is indeed very helpful and I completely understand now what you had on your whiteboard.
I had been normalizing my vectors during seq2sparse (with L2 norm since I would be using Cosine/Tanimoto Similarity in the RowSimilarityJob). But since the first step in the RowSimilarity job is normalizing the vectors I don't have to do that in seq2sparse; doing that would not impact the final outcome anyways. This has been a very useful thread and thanks for all your help. ________________________________ From: Sebastian Schelter <[EMAIL PROTECTED]> To: Vicky <[EMAIL PROTECTED]> Cc: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]> Sent: Thursday, February 2, 2012 8:08 AM Subject: Re: Question on RowSimilarityJob The heart of RowSimilarityJob is simple cooccurrence counting (with some tweaks and optimizations). I'll give a simplified example: We have a 3x3 matrix (could represent items rated by users, docs with word counts e.g.) row1: column1, column2 row2: column1, column3 row3: column1, column2 In the first MapReduce pass, we created an inverted index over the columns (which means we transpose the matrix) column1: row1, row2, row3 column2: row1, row3 column3: row2 In the second MapReduce pass, we the mapper emits all coocurring row pairs per column: for column1: (row1,row2) (row1,row3) (row2,row1) (row2,row3) (row3,row1) (row3,row2) for column2: (row1,row3) (row3,row1) for column3 there's nothing to emit. The reducer receives all cooccurrences for a particular row and simply needs to sum them up: for row1: (row1,row2) (row1,row3) (row1,row3) row1,row2 -> 1 cooccurrence row1,row3 -> 2 cooccurrences for row2: (row2,row1) (row2,row3) row2,row1 -> 1 cooccurrence row2,row3 -> 1 cooccurrence for row3: (row3,row1) (row3,row2) (row3,row1) row3,row1 -> 2 cooccurrences row3,row2 -> 1 cooccurrence Hope that helps with understanding the approach. --sebastian On 02.02.2012 12:47, Vicky wrote: > Sebastian, > when you said, >>> The computation should be executed in a way that only rows that have at least one non-zero value in the same dimension (column) are compared > > > if there are three columns in the document (columnA,columnB,columnC,columnD). > > Doc1: ColumnA, columnB, columnC,columnD > Doc2:ColumnA,columnB,columnC,columnD > > Will comparison be like Doc1:columnA against Doc2:columnA ; Doc1:columnB against Doc2:columnB, Doc1:columnC against Doc2:columnC and likewise? > > > > > ----- Original Message ----- > From: Sebastian Schelter <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Cc: > Sent: Wednesday, February 1, 2012 6:06 AM > Subject: Re: Question on RowSimilarityJob > > Here's a brief description of RowSimilarityJob: > > The goal is to compute all pairwise similarities between the rows of a > sparse matrix A. > > The computation should be executed in a way that only rows that have at > least one non-zero value in the same dimension (column) are compared. We > need this to avoid a quadratic number of pairwise comparisons. > Furthermore we should be able to 'embed' arbitrary similarity measures > and we should always be able to use a combiner in all MapReduce steps. > > The computation is executed using three MapReduce passes: > > In the first step, the rows of A are preprocessed via > VectorSimilarityMeasure.normalize() (they could e.g. be binarized or > scaled to unit-length), a single number for each row of A is computed > via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed. > > The second steps operates on the rows of A' (the columns of A). The > mapper sums up all pairwise cooccurrences using > VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so > called 'stripes' pattern). The reducers sums up all cooccurrence vectors > for one row and uses the similarity measure and the precomputed numbers > from step one to compute all similarities via > VectorSimilarityMeasure.similarity(). > > The third step ensures that only the top k similar rows per row are kept. +
Suneel Marthi 2012-02-02, 15:51
-
Re: Question on RowSimilarityJobDan Brickley 2012-02-02, 12:59
On 1 February 2012 12:06, Sebastian Schelter <[EMAIL PROTECTED]> wrote:
> Here's a brief description of RowSimilarityJob: > > The goal is to compute all pairwise similarities between the rows of a > sparse matrix A. > > The computation should be executed in a way that only rows that have at > least one non-zero value in the same dimension (column) are compared. We > need this to avoid a quadratic number of pairwise comparisons. > Furthermore we should be able to 'embed' arbitrary similarity measures > and we should always be able to use a combiner in all MapReduce steps. > > The computation is executed using three MapReduce passes: > > In the first step, the rows of A are preprocessed via > VectorSimilarityMeasure.normalize() (they could e.g. be binarized or > scaled to unit-length), a single number for each row of A is computed > via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed. > > The second steps operates on the rows of A' (the columns of A). The > mapper sums up all pairwise cooccurrences using > VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so > called 'stripes' pattern). The reducers sums up all cooccurrence vectors > for one row and uses the similarity measure and the precomputed numbers > from step one to compute all similarities via > VectorSimilarityMeasure.similarity(). > > The third step ensures that only the top k similar rows per row are kept. > > It's hard to see from the code but actually the job performs the matrix > multiplication AA' via outer products with a modified (similarity > measure specific) dot product. Thanks for this. I've taken liberty of dropping it in the Wiki at https://cwiki.apache.org/confluence/display/MAHOUT/RowSimilarityJob (and linking this thread). Hope that's ok... cheers, Dan +
Dan Brickley 2012-02-02, 12:59
-
Re: Question on RowSimilarityJobSebastian Schelter 2012-02-02, 13:08
Great! Thank you!
On 02.02.2012 13:59, Dan Brickley wrote: > On 1 February 2012 12:06, Sebastian Schelter <[EMAIL PROTECTED]> wrote: >> Here's a brief description of RowSimilarityJob: >> >> The goal is to compute all pairwise similarities between the rows of a >> sparse matrix A. >> >> The computation should be executed in a way that only rows that have at >> least one non-zero value in the same dimension (column) are compared. We >> need this to avoid a quadratic number of pairwise comparisons. >> Furthermore we should be able to 'embed' arbitrary similarity measures >> and we should always be able to use a combiner in all MapReduce steps. >> >> The computation is executed using three MapReduce passes: >> >> In the first step, the rows of A are preprocessed via >> VectorSimilarityMeasure.normalize() (they could e.g. be binarized or >> scaled to unit-length), a single number for each row of A is computed >> via VectorSimilarityMeasure.norm() (e.g. L1 norm) and A' is formed. >> >> The second steps operates on the rows of A' (the columns of A). The >> mapper sums up all pairwise cooccurrences using >> VectorSimilarityMeasure.aggregate() (as vectors, thereby using the so >> called 'stripes' pattern). The reducers sums up all cooccurrence vectors >> for one row and uses the similarity measure and the precomputed numbers >> from step one to compute all similarities via >> VectorSimilarityMeasure.similarity(). >> >> The third step ensures that only the top k similar rows per row are kept. >> >> It's hard to see from the code but actually the job performs the matrix >> multiplication AA' via outer products with a modified (similarity >> measure specific) dot product. > > Thanks for this. I've taken liberty of dropping it in the Wiki at > https://cwiki.apache.org/confluence/display/MAHOUT/RowSimilarityJob > (and linking this thread). Hope that's ok... > > cheers, > > Dan +
Sebastian Schelter 2012-02-02, 13:08
-
Re: Question on RowSimilarityJobSuneel Marthi 2012-01-20, 22:02
Thanks for the reply Sebastian.
For the task I am working on - 'determine similar documents in a collection' - would RowSimilarity be the right approach? ________________________________ From: Sebastian Schelter <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Friday, January 20, 2012 12:58 PM Subject: Re: Question on RowSimilarityJob Hi, 'maxSimilaritiesPerRow' denotes the maximum number of similar rows (documents in your use case) to keep per document. 'excludeSelfSimilarity' means that rows (documents) should not be compared to themselves. Sry for the lack of documentation, RowSimilarityJob was originally only an internal job for the recommendation code. I'll try to add something on the wiki in the next days. --sebastian On 20.01.2012 17:38, Suneel Marthi wrote: > I am working on determining document similarity of a corpus I am working with using RowSimilarity. > > Questions:- > > a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean? > b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ . > > c) Also do we have any docs on RowIdJob ? > > Thanks and Regards, > Suneel > +
Suneel Marthi 2012-01-20, 22:02
-
Re: Question on RowSimilarityJobLance Norskog 2012-01-20, 22:48
This can be done as a recommender or with clustering. As a
recommender, you would say, "find similar documents". In clustering, you can take one document and find its neighbors in the cluster graph. How much data do you have? An important point here is that most of Mahout's recommender algorithms are in-memory programs. There is only one Hadoop-based recommender. There are several Hadoop-based clustering algorithms. On Fri, Jan 20, 2012 at 2:02 PM, Suneel Marthi <[EMAIL PROTECTED]> wrote: > Thanks for the reply Sebastian. > > > For the task I am working on - 'determine similar documents in a collection' - would RowSimilarity be the right approach? > > > > > > > ________________________________ > From: Sebastian Schelter <[EMAIL PROTECTED]> > To: [EMAIL PROTECTED] > Sent: Friday, January 20, 2012 12:58 PM > Subject: Re: Question on RowSimilarityJob > > Hi, > > 'maxSimilaritiesPerRow' denotes the maximum number of similar rows > (documents in your use case) to keep per document. > 'excludeSelfSimilarity' means that rows (documents) should not be > compared to themselves. > > Sry for the lack of documentation, RowSimilarityJob was originally only > an internal job for the recommendation code. I'll try to add something > on the wiki in the next days. > > --sebastian > > > On 20.01.2012 17:38, Suneel Marthi wrote: >> I am working on determining document similarity of a corpus I am working with using RowSimilarity. >> >> Questions:- >> >> a) What do the parameters - 'maxSimilaritiesPerRow' and 'excludeSelfSimilarity' mean? >> b) Are there any docs available on RowSimilarityJob available, this is the best I could find on Sebastian's blog - http://ssc.io/rowsimilarityjob-on-steroids/ . >> >> c) Also do we have any docs on RowIdJob ? >> >> Thanks and Regards, >> Suneel >> -- Lance Norskog [EMAIL PROTECTED] +
Lance Norskog 2012-01-20, 22:48
|