-
Generating a Document Similarity Matrix
Kris Jack 2010-06-08, 13:38

Hi everyone,

I currently use lucene's moreLikeThis function through solr to find documents that are related to one another. A single call, however, takes around 4 seconds to complete and I would like to reduce this. I got to thinking that I might be able to use Mahout to generate a document similarity matrix offline that could then be looked up in real time for serving. Is this a reasonable use of Mahout? If so, what functions will generate a document similarity matrix? Also, I would like to be able to keep the text processing advantages provided through lucene so it would help if I could still use my lucene index. If not, then could you recommend any alternative solutions please?

Many thanks,
Kris
-
Re: Generating a Document Similarity Matrix
Olivier Grisel 2010-06-08, 14:31

How many documents do you have in your index? Have you tried to tweak the MoreLikeThis parameters? (I don't know if it's possible using the solr interface; I use it directly through the lucene java API.)

For instance, you can trade off recall for speed by decreasing the number of terms to use in the query, and trade recall for precision and speed by increasing the percentage of terms that should match.

You could also use Mahout's implementation of SVD to build low dimensional semantic vectors representing your documents (a.k.a. Latent Semantic Indexing) and then index those transformed frequency vectors in a dedicated lucene index (or document field, provided you name the resulting terms with something that does not match real life terms present in other fields). However, using standard SVD will probably result in dense (as opposed to sparse) low dimensional semantic vectors. I don't think lucene's lookup performance is good with dense frequency vectors even though the number of terms is greatly reduced by SVD. Hence it would probably be better to either threshold each semantic vector to its top 100 absolute values before indexing (probably the simpler solution) or use a sparsifying, penalty-constrained variant of SVD / LSI. You should have a look at the literature on sparse coding or sparse dictionary learning, Sparse-PCA and more generally L1 penalty regression methods such as the Lasso and LARS. I don't know of any library for sparse semantic coding of documents that works automatically with lucene. Probably some non-trivial coding is needed there.

Another alternative is finding low dimensional (64 or 32 components) dense codes, binary thresholding them, storing the resulting integer code in the DB or the lucene index, and then building smart exact match queries to find all documents lying in the Hamming ball of size 1 or 2 around the reference document's binary code. But I think this approach, while promising for web scale document collections, is even more experimental and requires very good low dimensional encoders (I don't think linear models such as SVD are good enough for reducing sparse 10e6 component vectors to dense 64 component vectors; non linear encoders such as Stacked Restricted Boltzmann Machines are probably a better choice).

In any case let us know about your results, I am really interested in practical yet scalable solutions to this problem.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
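To make the Hamming ball idea above a little more concrete, here is a minimal sketch in plain Java. It is only an illustration under assumptions: the class and method names (HammingBall, binarize, neighbours) are made up for this example, the sign-based binarization is just one possible thresholding rule, and nothing here is part of lucene or Mahout. It enumerates every code within distance 1 or 2 of a reference code, which is the set of exact-match terms a query would have to look up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not a lucene or Mahout class.
final class HammingBall {

    /** Binarize a dense vector (up to 64 components): bit i is set when component i is positive. */
    static long binarize(double[] v) {
        long code = 0L;
        for (int i = 0; i < v.length && i < 64; i++) {
            if (v[i] > 0.0) {
                code |= (1L << i);
            }
        }
        return code;
    }

    /** All codes within Hamming distance <= radius (radius 0, 1 or 2) of code, over the low `bits` bits. */
    static List<Long> neighbours(long code, int bits, int radius) {
        List<Long> out = new ArrayList<>();
        out.add(code); // distance 0: the code itself
        if (radius >= 1) {
            for (int i = 0; i < bits; i++) {
                out.add(code ^ (1L << i)); // flip exactly one bit
            }
        }
        if (radius >= 2) {
            for (int i = 0; i < bits; i++) {
                for (int j = i + 1; j < bits; j++) {
                    out.add(code ^ (1L << i) ^ (1L << j)); // flip exactly two bits
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long code = binarize(new double[] {0.3, -1.2, 0.8, -0.1});
        System.out.println(Long.toBinaryString(code));     // prints 101
        System.out.println(neighbours(code, 4, 2).size()); // prints 11 (1 + 4 + 6)
    }
}
```

For a 64-bit code the ball of radius 2 contains 1 + 64 + 2016 = 2081 codes, so a single lookup expands into roughly two thousand exact-match terms.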
-
Re: Generating a Document Similarity Matrix
Kris Jack 2010-06-08, 15:11

Hi Olivier,

Thanks for your suggestions. I have over 10 million documents and they have quite a lot of meta-data associated with them including rather large text fields. It is possible to tweak the moreLikeThis function from solr. I have tried changing the parameters (http://wiki.apache.org/solr/MoreLikeThis) but am not managing to get results in under 300ms without sacrificing the quality of the results too much.

I suspect that there would be gains to be made from reducing the dimensionality of the feature vectors before indexing with lucene so I may give that a try. I'll keep you posted if I come up with other solutions.

Thanks,
Kris

Dr Kris Jack, http://www.mendeley.com/profiles/kris-jack/
-
Re: Generating a Document Similarity Matrix
Jake Mannix 2010-06-08, 16:26

Hi Kris,

If you generate a full document-document similarity matrix offline, and then make sure to sparsify the rows (trim off all similarities below a threshold, or only take the top N for each row, etc.), then encoding these values directly in the index would indeed allow for *superfast* MoreLikeThis functionality, because you've already computed all of the similar results offline.

The only downside is that it won't apply to newly indexed documents. If your indexing setup is such that you don't fold in new documents live, but do so in batch, then this should be fine.

An alternative is to use something like a Locality Sensitive Hash (something one of my co-workers is writing up a nice implementation of now, and I'm going to get him to contribute it once it's fully tested), to reduce the search space (as a lucene Filter) and speed up the query.

-jake
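As a rough illustration of what a Locality Sensitive Hash does, here is a small random-hyperplane sketch in plain Java. It is not the implementation mentioned above and it is not a lucene Filter; the class name RandomHyperplaneLsh, the bit count and the seed are assumptions made for this example. Each bit of the signature records on which side of a random hyperplane the document vector falls, so vectors with a small angle between them tend to share most bits, and the signature can be used to narrow the candidate set before scoring.

```java
import java.util.Random;

// Hypothetical sketch of random-hyperplane LSH; not taken from Mahout or lucene.
final class RandomHyperplaneLsh {
    private final double[][] planes; // one random normal vector per signature bit

    RandomHyperplaneLsh(int numBits, int dim, long seed) {
        if (numBits < 1 || numBits > 31) {
            throw new IllegalArgumentException("numBits must be between 1 and 31");
        }
        Random rnd = new Random(seed); // fixed seed so signatures are reproducible
        planes = new double[numBits][dim];
        for (double[] plane : planes) {
            for (int j = 0; j < dim; j++) {
                plane[j] = rnd.nextGaussian();
            }
        }
    }

    /** Bit b is set when the vector lies on the non-negative side of hyperplane b. */
    int signature(double[] v) {
        int sig = 0;
        for (int b = 0; b < planes.length; b++) {
            double dot = 0.0;
            for (int j = 0; j < v.length; j++) {
                dot += planes[b][j] * v[j];
            }
            if (dot >= 0.0) {
                sig |= (1 << b);
            }
        }
        return sig;
    }

    public static void main(String[] args) {
        RandomHyperplaneLsh lsh = new RandomHyperplaneLsh(16, 3, 42L);
        int a = lsh.signature(new double[] {1.0, 2.0, 0.5});
        int b = lsh.signature(new double[] {2.0, 4.0, 1.0}); // same direction, twice the length
        System.out.println(a == b); // prints true: the signature depends only on direction
    }
}
```

Documents whose signatures agree on all (or most) bits would then be the only ones passed to the more expensive similarity computation.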
-
Re: Generating a Document Similarity Matrix
Kris Jack 2010-06-08, 16:44

Hi Jake,

Thanks for that. The first solution that you suggest is more like what I was imagining.

Please excuse me, I'm new to Mahout and don't know how to use it to generate the full document-document similarity matrix. I would rather not have to re-implement the moreLikeThis algorithm that, although rather straightforward, may take time for a newbie to MapReduce like me. Could you guide me a little in finding the relevant Mahout code for generating the matrix or is it not really designed for that?

For the moment, I would be happy to have an off-line batch version working. Also, it is desirable to take advantage of the text processing features that I have already configured using solr, so I would prefer to read in the feature vectors for the documents from a lucene index, as I am doing at present (e.g. http://lucene.grantingersoll.com/2010/02/16/trijug-intro-to-mahout-slides-and-demo-examples/ ).

Thanks,
Kris

Dr Kris Jack, http://www.mendeley.com/profiles/kris-jack/
-
Re: Generating a Document Similarity Matrix
Kris Jack 2010-06-08, 16:59

Sorry, copied the previous link from the wrong tab :/ Meant to be https://cwiki.apache.org/MAHOUT/creating-vectors-from-text.html for reading in lucene vectors.

Dr Kris Jack, http://www.mendeley.com/profiles/kris-jack/
-
Re: Generating a Document Similarity Matrix
Jake Mannix 2010-06-08, 17:06

Hi Kris,

So you already know how to make a sparse feature matrix out of your Solr index, based on Grant's instructions? Once you have that matrix loaded into HDFS, then the following Java code should create a document-document similarity matrix:

//
String inputPath = "/path/to/matrix/on/hdfs";
String tmpPath = "/tmp/matrixmultiplyspace";
int numDocuments = // whatever your numDocuments is
int numTerms = // total number of terms in the matrix

// one row per document, one column per term
DistributedRowMatrix text = new DistributedRowMatrix(inputPath, tmpPath, numDocuments, numTerms);
JobConf conf = new JobConf("similarity job");
text.configure(conf);

DistributedRowMatrix transpose = text.transpose();

DistributedRowMatrix similarity = transpose.times(transpose);
System.out.println("Similarity matrix lives: " + similarity.getRowPath());
//

Now, the rows of this similarity are going to be way too dense, so you'll want to write a small map-reduce job (well, no reduce is necessary) to run over this matrix and trim out all the unuseful entries of each row, but that shouldn't be too hard to do.

Of course, to do it really efficiently, that functionality could be folded into the reducer of the matrix multiply job, and done in the same pass over the data as that one.

-jake
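The row-trimming step described above can be sketched as follows. This is plain Java operating on one row held in memory, not a Hadoop mapper and not a Mahout API; the class name RowTrimmer, the map-based row representation and the parameters (selfIndex, n, minScore) are assumptions made for illustration. In a map-only job, the same logic would be applied to each row of the similarity matrix independently.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-row logic for sparsifying a similarity matrix.
final class RowTrimmer {

    /**
     * Keeps at most n entries of a similarity row, highest score first.
     * Drops the document's similarity to itself and anything below minScore.
     */
    static Map<Integer, Double> topN(Map<Integer, Double> row, int selfIndex, int n, double minScore) {
        List<Map.Entry<Integer, Double>> candidates = new ArrayList<>();
        for (Map.Entry<Integer, Double> e : row.entrySet()) {
            if (e.getKey() != selfIndex && e.getValue() >= minScore) {
                candidates.add(e);
            }
        }
        // sort by descending similarity
        candidates.sort((a, b) -> Double.compare(b.getValue(), a.getValue()));

        Map<Integer, Double> kept = new LinkedHashMap<>(); // preserves the sorted order
        for (int i = 0; i < candidates.size() && i < n; i++) {
            kept.put(candidates.get(i).getKey(), candidates.get(i).getValue());
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<Integer, Double> row = new LinkedHashMap<>();
        row.put(0, 1.0); // similarity of document 0 with itself
        row.put(1, 0.9);
        row.put(2, 0.1);
        row.put(3, 0.5);
        row.put(4, 0.7);
        System.out.println(topN(row, 0, 2, 0.2)); // prints {1=0.9, 4=0.7}
    }
}
```

Combining a score threshold with a top-N cap bounds the size of every stored row, which is what keeps the per-document lookup cheap at serving time.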
-
Re: Generating a Document Similarity Matrix
Kris Jack 2010-06-09, 12:11

Hi,

Thanks very much for the code. In implementing it, I got a little stuck on specifying the JobConf's similarity job -

JobConf conf = new JobConf("similarity job");

I assume that I should define here how I would like two vectors to be compared with one another? Please do correct me if that's wrong. If so, however, could you point me to any examples of what this code should look like (e.g. cosine similarity)? I'm sure that these kinds of similarity jobs must already exist in Mahout but being new to both Mahout and MapReduce, I'm not sure where to find them.

Thanks,
Kris

Dr Kris Jack, http://www.mendeley.com/profiles/kris-jack/
-
Re: Generating a Document Similarity MatrixJake Mannix 2010-06-09, 17:31
On Wed, Jun 9, 2010 at 5:11 AM, Kris Jack <[EMAIL PROTECTED]> wrote:
> Hi,
>
> Thanks very much for the code. In implementing it, I got a little stuck on
> specifying the JobConf's similarity job -
>
> JobConf conf = new JobConf("similarity job");
>
> I assume that I should define here how I would like two vectors to be
> compared with one another? Please do correct me if that's wrong. If so,
> however, could you point me to any examples of what this code should look
> like (e.g. cosine similarity)? I'm sure that these kinds of similarity jobs
> must already exist in Mahout but being new to both Mahout and MapReduce, I'm
> not sure where to find them.

In the sample I mentioned (using sparse matrix multiplication), you don't get to choose the similarity - if the input vectors are unit-length normalized, then the computation is cosine similarity. You would have to write your own map-reduce job to do a different one.

-jake

> Thanks,
> Kris
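To illustrate the point about unit-length normalization, here is a minimal sketch in plain Java. It is not Mahout code: the class name CosineSketch and the sparse-vector representation (a map from term index to weight) are assumptions made only for this example. It shows that once two vectors are scaled to unit length, their plain dot product, which is what each cell of the matrix product holds, is their cosine similarity.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper, not part of Mahout: sparse vectors are maps from term index to weight.
public class CosineSketch {

    /** Returns a copy of v scaled to unit Euclidean length (an all-zero vector is copied unchanged). */
    public static Map<Integer, Double> normalize(Map<Integer, Double> v) {
        double sumOfSquares = 0.0;
        for (double x : v.values()) {
            sumOfSquares += x * x;
        }
        double norm = Math.sqrt(sumOfSquares);
        Map<Integer, Double> out = new HashMap<>();
        for (Map.Entry<Integer, Double> e : v.entrySet()) {
            out.put(e.getKey(), norm == 0.0 ? e.getValue() : e.getValue() / norm);
        }
        return out;
    }

    /** Sparse dot product: only indices present in both vectors contribute. */
    public static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : a.entrySet()) {
            Double other = b.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    public static void main(String[] args) {
        Map<Integer, Double> doc1 = new HashMap<>();
        doc1.put(0, 3.0);
        doc1.put(1, 4.0);
        Map<Integer, Double> doc2 = new HashMap<>();
        doc2.put(0, 4.0);
        doc2.put(1, 3.0);
        // Dot product of the normalized vectors equals the cosine of the angle between the originals.
        System.out.println(dot(normalize(doc1), normalize(doc2)));
    }
}
```

In the matrix setting this means normalizing every row of the document-term matrix before the multiplication; a different measure would need its own job, as noted above.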
-
Re: Generating a Document Similarity Matrix
Kris Jack 2010-06-09, 17:44
Hi Jake,

Thanks for that. I'll give it a try with cosine similarity first off and as I get more experience I'll try and implement some other similarity methods.

Kris

--
Dr Kris Jack, http://www.mendeley.com/profiles/kris-jack/
-
Re: Generating a Document Similarity Matrix
Kris Jack 2010-06-15, 16:32
Hi,

I have gone through the matrix code quite thoroughly and have managed to get an implementation of document similarity working. Like you said, however, the rows of the similarity matrix are very dense and take a while to calculate.

I have made the original document-term matrix more sparse by removing terms that have a document frequency of > 1% and now I would like to trim out row entries that do not have high similarity values. As you suggested, I would like to keep only the top x entries by modifying the reducer of the matrix multiply job.

I was wondering if there was an interesting way to do this with the current mahout code such as requesting that the Vector accumulator returns only elements that have values greater than a given threshold, sorting the vector by value rather than key, or something else?

Thanks,
Kris

2010/6/8 Jake Mannix <[EMAIL PROTECTED]>

> Now, the rows of this similarity are going to be way too dense, so
> you'll want to write a small map-reduce job (well, no reduce is necessary)
> to run over this matrix and trim out all the unuseful entries of each
> row, but that shouldn't be too hard to do.
>
> Of course, to do it really efficiently, that functionality could be folded
> into the reducer of the matrix multiply job, and done in the same pass over
> the data as that one.

--
Dr Kris Jack, http://www.mendeley.com/profiles/kris-jack/
-
Re: Generating a Document Similarity Matrix
Ted Dunning 2010-06-15, 17:00
Thresholds are generally dangerous. It is usually preferable to specify the sparseness you want (1%, 0.2%, whatever), sort the results in descending score order using Hadoop's built-in capabilities and just drop the rest.

On Tue, Jun 15, 2010 at 9:32 AM, Kris Jack <[EMAIL PROTECTED]> wrote:

> I was wondering if there was an interesting way to do this with the current
> mahout code such as requesting that the Vector accumulator returns only
> elements that have values greater than a given threshold, sorting the vector
> by value rather than key, or something else?
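As a rough illustration of keeping a fixed number of entries per row instead of applying a score threshold, here is a minimal sketch in plain Java. It is an in-memory stand-in rather than the Hadoop sort Ted refers to, and the class name TopNSketch and the row representation (a map from document id to similarity score) are assumptions made only for this example. A bounded min-heap keeps the n best scores seen so far, so a dense row never has to be sorted in full.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Mahout or Hadoop.
public class TopNSketch {

    /** Returns the n highest-scoring entries of a similarity row, best first. */
    public static List<Map.Entry<Integer, Double>> topN(Map<Integer, Double> row, int n) {
        Comparator<Map.Entry<Integer, Double>> byScore =
            (x, y) -> Double.compare(x.getValue(), y.getValue());
        // Min-heap on score: the weakest of the current candidates sits at the head.
        PriorityQueue<Map.Entry<Integer, Double>> heap = new PriorityQueue<>(byScore);
        for (Map.Entry<Integer, Double> e : row.entrySet()) {
            heap.add(new AbstractMap.SimpleImmutableEntry<>(e));
            if (heap.size() > n) {
                heap.poll(); // drop the weakest candidate
            }
        }
        List<Map.Entry<Integer, Double>> result = new ArrayList<>(heap);
        result.sort(byScore.reversed()); // descending score order
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Double> row = new HashMap<>();
        row.put(1, 0.9);
        row.put(2, 0.1);
        row.put(3, 0.5);
        row.put(4, 0.7);
        for (Map.Entry<Integer, Double> e : topN(row, 2)) {
            System.out.println(e.getKey() + "," + e.getValue());
        }
    }
}
```

Choosing n (or a target sparseness such as 1% of the row length) fixes the size of every output row in advance, which a score threshold cannot guarantee.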
-
Re: Generating a Document Similarity Matrix
Kris Jack 2010-06-18, 16:46
Thanks Ted,

I got that working. Unfortunately, the matrix multiplication job is taking far longer than I hoped. With just over 10 million documents, 10 mappers and 10 reducers, I can't get it to complete the job in under 48 hours.

Perhaps you have an idea for speeding it up? I have already been quite ruthless with making the vectors sparse. I did not include terms that appeared in over 1% of the corpus and only kept terms that appeared at least 50 times. Is it normal that the matrix multiplication map reduce task should take so long to process with this quantity of data and resources available or do you think that my system is not configured properly?

Thanks,
Kris
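The term pruning described here can be sketched as follows in plain Java. This is only an illustration: the class name DfPruneSketch, the representation of a document as a set of term strings, and the reading of "appeared at least 50 times" as a minimum document frequency (rather than a total occurrence count) are all assumptions, not Kris's actual code.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical helper, not part of Mahout or Lucene.
public class DfPruneSketch {

    /**
     * Returns the terms whose document frequency is at least minDf and at most
     * maxDfFraction of the corpus size.
     */
    public static Set<String> keptTerms(List<Set<String>> docs, int minDf, double maxDfFraction) {
        // Count, for every term, the number of documents that contain it.
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docs) {
            for (String term : doc) {
                df.merge(term, 1, Integer::sum);
            }
        }
        double maxDf = maxDfFraction * docs.size();
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> e : df.entrySet()) {
            if (e.getValue() >= minDf && e.getValue() <= maxDf) {
                kept.add(e.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Set<String>> docs = Arrays.asList(
            new HashSet<>(Arrays.asList("a", "b")),
            new HashSet<>(Arrays.asList("a", "c")),
            new HashSet<>(Arrays.asList("a", "b")),
            new HashSet<>(Arrays.asList("a")));
        // Toy limits; the values mentioned in the message would be minDf = 50 and maxDfFraction = 0.01.
        System.out.println(keptTerms(docs, 2, 0.5));
    }
}
```

Dropping very frequent terms matters most for the multiplication cost, because a term that occurs in d documents contributes on the order of d squared document pairs.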
-
Re: Generating a Document Similarity Matrix
Sebastian Schelter 2010-06-18, 16:51
Hi Kris,

maybe you want to give the patch from https://issues.apache.org/jira/browse/MAHOUT-418 a try? I have not tested it with larger data yet, but I would be happy to get some feedback for it and maybe it helps you with your use case.

-sebastian
-
Re: Generating a Document Similarity Matrix
Kris Jack 2010-06-18, 16:54
Thanks Sebastian, I'll give it a try!

--
Dr Kris Jack, http://www.mendeley.com/profiles/kris-jack/
-
Re: Generating a Document Similarity Matrix
Kris Jack 2010-06-28, 15:18
Hi,

I am now using the version of org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that Sebastian has written and has been added to the trunk. Thanks again for that! I can generate an output file that should contain a list of documents with their top 100 most similar documents. I am having problems, however, in converting the output file into a readable format using mahout's vectordump:

$ ./mahout vectordump --seqFile similarRows --output results.out --printKey
no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
Input Path: /home/kris/similarRows
Exception in thread "main" java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:59)
    at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:101)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1930)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1830)
    at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1876)
    at org.apache.mahout.utils.vectors.SequenceFileVectorIterable$SeqFileIterator.hasNext(SequenceFileVectorIterable.java:77)
    at org.apache.mahout.utils.vectors.VectorDumper.main(VectorDumper.java:138)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:174)

What is this doing that takes up so much memory? A file is produced with 37,952 readable rows but I'm expecting more like 500,000 results, since I have this number of documents. Should I be using something else to read the output file of the RowSimilarityJob?

Thanks,
Kris

--
Dr Kris Jack, http://www.mendeley.com/profiles/kris-jack/
-
Re: Generating a Document Similarity Matrix
Sebastian Schelter 2010-06-28, 20:15
Hi Kris,

Unfortunately I'm not familiar with the VectorDumper code (and a quick look didn't help either), so I can't help you with the OutOfMemoryError.

It is possible that only 37,952 results are found for an input of 500,000 vectors; it really depends on the actual data. If you're sure that there should be more results, you could provide me with a sample input file and I'll try to find out why there aren't more results.

I wrote a small class for you that dumps the output file of the job to the console (I tested it with the output of my unit tests); maybe that can help us find the source of the problem.

-sebastian

// Package names below are those of the Mahout trunk at the time of writing;
// adjust them if your version differs.
import java.util.Iterator;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.cf.taste.hadoop.TasteHadoopUtils;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.math.Vector.Element;
import org.apache.mahout.math.VectorWritable;

public class MatrixReader extends AbstractJob {

  public static void main(String[] args) throws Exception {
    ToolRunner.run(new MatrixReader(), args);
  }

  @Override
  public int run(String[] args) throws Exception {

    addInputOption();

    Map<String,String> parsedArgs = parseArguments(args);
    if (parsedArgs == null) {
      return -1;
    }

    Configuration conf = getConf();
    FileSystem fs = FileSystem.get(conf);

    // Note: only the first part file of the job output is read.
    Path vectorFile = fs.listStatus(getInputPath(),
        TasteHadoopUtils.PARTS_FILTER)[0].getPath();

    SequenceFile.Reader reader = null;
    try {
      reader = new SequenceFile.Reader(fs, vectorFile, conf);
      IntWritable key = new IntWritable();
      VectorWritable value = new VectorWritable();

      // Print one line per row: "row: index,value;index,value;..."
      while (reader.next(key, value)) {
        System.out.print(key.get() + ": ");
        Iterator<Element> elementsIterator = value.get().iterateNonZero();
        String separator = "";
        while (elementsIterator.hasNext()) {
          Element element = elementsIterator.next();
          System.out.print(separator + element.index() + "," + element.get());
          separator = ";";
        }
        System.out.print("\n");
      }
    } finally {
      // Guard against a failed open, which would leave reader null.
      if (reader != null) {
        reader.close();
      }
    }
    return 0;
  }
}
-
Re: Generating a Document Similarity Matrix
Kris Jack 2010-06-29, 09:25
Hi Sebastian,

You really are very kind! I have taken your code and run it to print out the contents of the output file. There are indeed only 37,952 results so that gives me more confidence in the vector dumper. I'm not sure why there was a memory problem though, seeing as it seems to have output the results correctly. Now I just have to match them up with my original lucene ids and see how it is performing. I'll keep you posted with the results.

Thanks,
Kris

--
Dr Kris Jack, http://www.mendeley.com/profiles/kris-jack/
-
Re: Generating a Document Similarity Matrix
Sebastian Schelter 2010-06-29, 09:28
Hi Kris,

I'm glad I could help you and it's really cool that you are testing my patches on real data. I'm looking forward to hearing more!

-sebastian
-
Re: Generating a Document Similarity Matrix
Kris Jack 2010-07-02, 16:22
Hi Sebastian,

I am currently using your code with NamedVectors in my input. In the output, however, the names seem to be missing. Would there be a way to include them?

Thanks,
Kris

2010/6/29 Sebastian Schelter <[EMAIL PROTECTED]>

> Hi Kris,
>
> I'm glad I could help you and it's really cool that you are testing my
> patches on real data. I'm looking forward to hearing more!
>
> -sebastian
>
> On 29.06.2010 11:25, Kris Jack wrote:
> > Hi Sebastian,
> >
> > You really are very kind! I have taken your code and run it to print out
> > the contents of the output file. There are indeed only 37,952 results so
> > that gives me more confidence in the vector dumper. I'm not sure why there
> > was a memory problem though, seeing as it seems to have output the results
> > correctly. Now I just have to match them up with my original lucene ids and
> > see how it is performing. I'll keep you posted with the results.
> >
> > Thanks,
> > Kris
> >
> > 2010/6/28 Sebastian Schelter <[EMAIL PROTECTED]>
> >
> >> Hi Kris,
> >>
> >> Unfortunately I'm not familiar with the VectorDumper code (and a quick
> >> look didn't help either), so I can't help you with the OutOfMemoryError.
> >>
> >> It could be possible that only 37,952 results are found for an input of
> >> 500,000 vectors, it really depends on the actual data. If you're sure
> >> that there should be more results, you could provide me with a sample
> >> input file and I'll try to find out why there aren't more results.
> >>
> >> I wrote a small class for you that dumps the output file of the job to
> >> the console, (I tested it with the output of my unit-tests), maybe that
> >> can help us find the source of the problem.
> >>
> >> -sebastian
> >>
> >> public class MatrixReader extends AbstractJob {
> >>
> >>   public static void main(String[] args) throws Exception {
> >>     ToolRunner.run(new MatrixReader(), args);
> >>   }
> >>
> >>   @Override
> >>   public int run(String[] args) throws Exception {
> >>
> >>     addInputOption();
> >>
> >>     Map<String,String> parsedArgs = parseArguments(args);
> >>     if (parsedArgs == null) {
> >>       return -1;
> >>     }
> >>
> >>     Configuration conf = getConf();
> >>     FileSystem fs = FileSystem.get(conf);
> >>
> >>     Path vectorFile = fs.listStatus(getInputPath(),
> >>         TasteHadoopUtils.PARTS_FILTER)[0].getPath();
> >>
> >>     SequenceFile.Reader reader = null;
> >>     try {
> >>       reader = new SequenceFile.Reader(fs, vectorFile, conf);
> >>       IntWritable key = new IntWritable();
> >>       VectorWritable value = new VectorWritable();
> >>
> >>       while (reader.next(key, value)) {
> >>         int row = key.get();
> >>         System.out.print(String.valueOf(key.get()) + ": ");
> >>         Iterator<Element> elementsIterator =
> >>             value.get().iterateNonZero();
> >>         String separator = "";
> >>         while (elementsIterator.hasNext()) {
> >>           Element element = elementsIterator.next();
> >>           System.out.print(separator + String.valueOf(element.index()) +
> >>               "," + String.valueOf(element.get()));
> >>           separator = ";";
> >>         }
> >>         System.out.print("\n");
> >>       }
> >>     } finally {
> >>       reader.close();
> >>     }
> >>     return 0;
> >>   }
> >> }
> >>
> >> On 28.06.2010 17:18, Kris Jack wrote:
> >>> Hi,
> >>>
> >>> I am now using the version of
> >>> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that Sebastian has
> >>> written and has been added to the trunk. Thanks again for that! I can
> >>> generate an output file that should contain a list of documents with their
> >>> top 100 most similar documents. I am having problems, however, in
> >>> converting the output file into a readable format using mahout's vectordump:
> >>>
> >>> $ ./mahout vectordump --seqFile similarRows --output results.out --printKey
> >>> no HADOOP_CONF_DIR or HADOOP_HOME set, running locally
> >>> Input Path: /home/kris/similarRows

--
Dr Kris Jack,
http://www.mendeley.com/profiles/kris-jack/
Kris Jack 2010-07-02, 16:22
-
Re: Generating a Document Similarity Matrix
Sebastian Schelter 2010-07-02, 19:33
Hi Kris,

I think the best way would be to manually join the names to the result after executing the job.

--sebastian

On 02.07.2010 18:22, Kris Jack wrote:
> Hi Sebastian,
>
> I am currently using your code with NamedVectors in my input. In the
> output, however, the names seem to be missing. Would there be a way to
> include them?
>
> Thanks,
> Kris
>
> 2010/6/29 Sebastian Schelter <[EMAIL PROTECTED]>
>
>> Hi Kris,
>>
>> I'm glad I could help you and it's really cool that you are testing my
>> patches on real data. I'm looking forward to hearing more!
>>
>> -sebastian
>>
>> On 29.06.2010 11:25, Kris Jack wrote:
>>> Hi Sebastian,
>>>
>>> You really are very kind! I have taken your code and run it to print out
>>> the contents of the output file. There are indeed only 37,952 results so
>>> that gives me more confidence in the vector dumper. I'm not sure why there
>>> was a memory problem though, seeing as it seems to have output the results
>>> correctly. Now I just have to match them up with my original lucene ids and
>>> see how it is performing. I'll keep you posted with the results.
>>>
>>> Thanks,
>>> Kris
>>>
>>> 2010/6/28 Sebastian Schelter <[EMAIL PROTECTED]>
>>>
>>>> Hi Kris,
>>>>
>>>> Unfortunately I'm not familiar with the VectorDumper code (and a quick
>>>> look didn't help either), so I can't help you with the OutOfMemoryError.
>>>>
>>>> It could be possible that only 37,952 results are found for an input of
>>>> 500,000 vectors, it really depends on the actual data. If you're sure
>>>> that there should be more results, you could provide me with a sample
>>>> input file and I'll try to find out why there aren't more results.
>>>>
>>>> I wrote a small class for you that dumps the output file of the job to
>>>> the console, (I tested it with the output of my unit-tests), maybe that
>>>> can help us find the source of the problem.
>>>>
>>>> -sebastian
>>>>
>>>> public class MatrixReader extends AbstractJob {
>>>>
>>>>   public static void main(String[] args) throws Exception {
>>>>     ToolRunner.run(new MatrixReader(), args);
>>>>   }
>>>>
>>>>   @Override
>>>>   public int run(String[] args) throws Exception {
>>>>
>>>>     addInputOption();
>>>>
>>>>     Map<String,String> parsedArgs = parseArguments(args);
>>>>     if (parsedArgs == null) {
>>>>       return -1;
>>>>     }
>>>>
>>>>     Configuration conf = getConf();
>>>>     FileSystem fs = FileSystem.get(conf);
>>>>
>>>>     Path vectorFile = fs.listStatus(getInputPath(),
>>>>         TasteHadoopUtils.PARTS_FILTER)[0].getPath();
>>>>
>>>>     SequenceFile.Reader reader = null;
>>>>     try {
>>>>       reader = new SequenceFile.Reader(fs, vectorFile, conf);
>>>>       IntWritable key = new IntWritable();
>>>>       VectorWritable value = new VectorWritable();
>>>>
>>>>       while (reader.next(key, value)) {
>>>>         int row = key.get();
>>>>         System.out.print(String.valueOf(key.get()) + ": ");
>>>>         Iterator<Element> elementsIterator =
>>>>             value.get().iterateNonZero();
>>>>         String separator = "";
>>>>         while (elementsIterator.hasNext()) {
>>>>           Element element = elementsIterator.next();
>>>>           System.out.print(separator + String.valueOf(element.index()) +
>>>>               "," + String.valueOf(element.get()));
>>>>           separator = ";";
>>>>         }
>>>>         System.out.print("\n");
>>>>       }
>>>>     } finally {
>>>>       reader.close();
>>>>     }
>>>>     return 0;
>>>>   }
>>>> }
>>>>
>>>> On 28.06.2010 17:18, Kris Jack wrote:
>>>>> Hi,
>>>>>
>>>>> I am now using the version of
>>>>> org.apache.mahout.math.hadoop.similarity.RowSimilarityJob that Sebastian has
>>>>> written and has been added to the trunk. Thanks again for that! I can
>>>>> generate an output file that should contain a list of documents with
Sebastian Schelter 2010-07-02, 19:33
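
Sebastian's suggestion of joining the names back onto the result after the job has run could look roughly like the sketch below. It assumes the similarity output has already been read into memory as row index -> (neighbour index -> similarity), and that a separate index -> name table was kept from the NamedVector input. The class `NameJoinSketch` and its method are hypothetical helpers for illustration, not part of Mahout.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical helper (not a Mahout class): re-attaches names to row indices.
final class NameJoinSketch {

  /**
   * similarRows: row index -> (neighbour index -> similarity), e.g. parsed
   *              from the dumped job output.
   * names:       row index -> name carried by the NamedVector on input
   *              (assumed to have been saved separately before the job ran).
   */
  static List<String> joinNames(Map<Integer, Map<Integer, Double>> similarRows,
                                Map<Integer, String> names) {
    List<String> lines = new ArrayList<>();
    for (Map.Entry<Integer, Map<Integer, Double>> row : similarRows.entrySet()) {
      StringBuilder line = new StringBuilder(nameOf(names, row.getKey())).append(": ");
      String separator = "";
      for (Map.Entry<Integer, Double> e : row.getValue().entrySet()) {
        line.append(separator)
            .append(nameOf(names, e.getKey()))
            .append(',')
            .append(e.getValue());
        separator = ";";
      }
      lines.add(line.toString());
    }
    return lines;
  }

  // Falls back to the numeric index when no name is known for it.
  private static String nameOf(Map<Integer, String> names, int index) {
    String name = names.get(index);
    return name != null ? name : String.valueOf(index);
  }
}
```

The output keeps the same `key: index,value;index,value` layout that the MatrixReader class prints, only with names substituted where they are known.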
-
Re: Generating a Document Similarity Matrix
Olivier Grisel 2010-06-08, 22:56
2010/6/8 Jake Mannix <[EMAIL PROTECTED]>:
> Hi Kris,
>
> If you generate a full document-document similarity matrix offline, and
> then make sure to sparsify the rows (trim off all similarities below a
> threshold, or only take the top N for each row, etc...). Then encoding
> these values directly in the index would indeed allow for *superfast*
> MoreLikeThis functionality, because you've already computed all
> of the similar results offline.

For 1e6 documents it might not be reasonable to generate the complete document-document similarity matrix: 1e12 components => a couple of terabytes of similarity values just to find the top N afterwards. Sorting a terabyte of data can be fast when you have a datacenter like Yahoo's or Google's, but might not be reasonable when you just have a CMS running on a couple of servers :)

Trimming off low similarities should happen before starting to write the rows to the hard drive.

--
Olivier
http://twitter.com/ogrisel - http://github.com/ogrisel
Olivier Grisel 2010-06-08, 22:56
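
The trimming discussed above (drop similarities below a threshold, keep only the top N per row, and do so before a row is written out) can be sketched with a bounded min-heap. This is only an illustration under assumed in-memory types; `TopNSketch` and its parameters are hypothetical names, not Mahout code.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper: trims one similarity row before it is written anywhere.
final class TopNSketch {

  /** Keeps at most n entries with similarity >= threshold, strongest first. */
  static List<Map.Entry<Integer, Double>> topN(Map<Integer, Double> row,
                                               int n, double threshold) {
    Comparator<Map.Entry<Integer, Double>> bySimilarity =
        (a, b) -> Double.compare(a.getValue(), b.getValue());

    // Min-heap: the weakest candidate kept so far sits at the head.
    PriorityQueue<Map.Entry<Integer, Double>> heap = new PriorityQueue<>(bySimilarity);

    for (Map.Entry<Integer, Double> e : row.entrySet()) {
      if (e.getValue() < threshold) {
        continue; // below the cut-off, never stored
      }
      heap.add(new AbstractMap.SimpleImmutableEntry<>(e.getKey(), e.getValue()));
      if (heap.size() > n) {
        heap.poll(); // evict the weakest, so memory stays bounded by n
      }
    }

    List<Map.Entry<Integer, Double>> kept = new ArrayList<>(heap);
    kept.sort(bySimilarity.reversed());
    return kept;
  }
}
```

Because the heap never holds more than n entries, the full row never has to be sorted or stored, which is the point of trimming before writing.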
-
Re: Generating a Document Similarity Matrix
Jake Mannix 2010-06-08, 23:16
On Tue, Jun 8, 2010 at 3:56 PM, Olivier Grisel <[EMAIL PROTECTED]> wrote:

> 2010/6/8 Jake Mannix <[EMAIL PROTECTED]>:
> > Hi Kris,
> >
> > If you generate a full document-document similarity matrix offline, and
> > then make sure to sparsify the rows (trim off all similarities below a
> > threshold, or only take the top N for each row, etc...). Then encoding
> > these values directly in the index would indeed allow for *superfast*
> > MoreLikeThis functionality, because you've already computed all
> > of the similar results offline.
>
> For 1e6 documents it might not be reasonable to generate the complete
> document-document similarity matrix: 1e12 components => a couple of
> terabytes of similarity values just to find the top N
> afterwards:

Nope, this isn't what happens in what I described: when you take a sparseDocumentMatrix.transpose().times(itself), the scaling does not go as N^2*M, with N^2 outputs. The calculation is sparse, only computing the entries which are nonzero. If you pre-sparsify the documents a little (remove all terms which occur in more than 1% of all documents, or something like that), this sparse calculation is even faster: it scales as sum_{i=1...N}(k_i)^2, where k_i is the number of nonzero elements in document i. If all documents were the same length (k), then this scales as N*k^2, and the total number of nonzero entries in the output is far less than N^2 if k << N,M.

Getting rid of the common terms (even *lots* of them) beforehand is still a very good idea.

-jake
Jake Mannix 2010-06-08, 23:16
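
A small sketch of the scaling argument above, under stated assumptions: the product is accumulated as a sum of outer products of sparse vectors, so the number of multiplications is the sum of the squared nonzero counts of the vectors being iterated. In this sketch those vectors are per-term postings lists (document -> weight), which is one way to arrive at document-document entries; `SparsePairwiseSketch` is a hypothetical name and this is not how any particular Mahout job is implemented.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of sparse pairwise accumulation and its cost.
final class SparsePairwiseSketch {

  /**
   * postings: one sparse vector per term, mapping document index -> weight.
   * out:      receives document -> (document -> accumulated dot product);
   *           only pairs that share at least one term ever get an entry.
   * Returns the number of multiplications performed, which equals the sum
   * over all sparse vectors of (number of nonzero entries)^2.
   */
  static long accumulate(List<Map<Integer, Double>> postings,
                         Map<Integer, Map<Integer, Double>> out) {
    long multiplications = 0;
    for (Map<Integer, Double> posting : postings) {
      for (Map.Entry<Integer, Double> a : posting.entrySet()) {
        Map<Integer, Double> row = out.computeIfAbsent(a.getKey(), k -> new HashMap<>());
        for (Map.Entry<Integer, Double> b : posting.entrySet()) {
          row.merge(b.getKey(), a.getValue() * b.getValue(), Double::sum);
          multiplications++;
        }
      }
    }
    return multiplications;
  }
}
```

With two sparse vectors of 2 and 3 nonzero entries, the count is 2^2 + 3^2 = 13, independent of how many documents exist in total, and pairs with no shared nonzero index never appear in the output. Removing very common terms shrinks the longest vectors, which is where the squared term hurts most.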
-
Re: Generating a Document Similarity Matrix
Sebastian Schelter 2010-06-08, 22:39
Hi Kris,

actually the code to compute the item-to-item similarities in the collaborative filtering part of Mahout (which at first glance seems to be a totally different problem from yours) is based on a paper that deals with computing the pairwise similarity of text documents in a very simple way. Maybe that could be helpful to you:

Elsayed et al: Pairwise Document Similarity in Large Collections with MapReduce
http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf

-sebastian

2010/6/8 Kris Jack <[EMAIL PROTECTED]>

> Hi everyone,
>
> I currently use lucene's moreLikeThis function through solr to find
> documents that are related to one another. A single call, however, takes
> around 4 seconds to complete and I would like to reduce this. I got to
> thinking that I might be able to use Mahout to generate a document
> similarity matrix offline that could then be looked-up in real time for
> serving. Is this a reasonable use of Mahout? If so, what functions will
> generate a document similarity matrix? Also, I would like to be able to
> keep the text processing advantages provided through lucene so it would
> help if I could still use my lucene index. If not, then could you recommend
> any alternative solutions please?
>
> Many thanks,
> Kris
Sebastian Schelter 2010-06-08, 22:39
-
Re: Generating a Document Similarity Matrix
Jake Mannix 2010-06-08, 22:52
The code in mahout CF is doing that? I don't think that's right, we don't
do anything that fancy right now, do we Sean? -jake On Tue, Jun 8, 2010 at 3:39 PM, Sebastian Schelter <[EMAIL PROTECTED]>wrote: > Hi Kris, > > actually the code to compute the item-to-item similarities in the > collaborative filtering part of mahout (which at the first look seems to be > a totally different problem than yours) is based on a paper that deals with > computing the pairwise similarity of text documents in a very simple way. > Maybe that could be helpful to you: > > Elsayed et al: Pairwise Document Similarity in Large Collections with > MapReduce > > http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf > < > http://www.umiacs.umd.edu/%7Ejimmylin/publications/Elsayed_etal_ACL2008_short.pdf > > > > -sebastian > > > 2010/6/8 Kris Jack <[EMAIL PROTECTED]> > > > Hi everyone, > > > > I currently use lucene's moreLikeThis function through solr to find > > documents that are related to one another. A single call, however, takes > > around 4 seconds to complete and I would like to reduce this. I got to > > thinking that I might be able to use Mahout to generate a document > > similarity matrix offline that could then be looked-up in real time for > > serving. Is this a reasonable use of Mahout? If so, what functions will > > generate a document similarity matrix? Also, I would like to be able to > > keep the text processing advantages provided through lucene so it would > > help > > if I could still use my lucene index. If not, then could you recommend > any > > alternative solutions please? > > > > Many thanks, > > Kris > > > +
Jake Mannix 2010-06-08, 22:52
-
Re: Generating a Document Similarity Matrix
Sean Owen 2010-06-08, 23:08
Sort of, there is a separate job to compute all item-item similarities
under a variety of metrics. This is what Sebastian wrote. It's not used in the co-occurrence recommender (but could be -- vaguely a to-do here.) But sure if you're willing to think of a doc as an "item vector" of "preferences" from "words" then this works fine to compute doc similarity under these metrics. On Wed, Jun 9, 2010 at 12:52 AM, Jake Mannix <[EMAIL PROTECTED]> wrote: > The code in mahout CF is doing that? I don't think that's right, we don't > do anything that fancy right now, do we Sean? > > -jake > > On Tue, Jun 8, 2010 at 3:39 PM, Sebastian Schelter > <[EMAIL PROTECTED]>wrote: > >> Hi Kris, >> >> actually the code to compute the item-to-item similarities in the >> collaborative filtering part of mahout (which at the first look seems to be >> a totally different problem than yours) is based on a paper that deals with >> computing the pairwise similarity of text documents in a very simple way. >> Maybe that could be helpful to you: >> >> Elsayed et al: Pairwise Document Similarity in Large Collections with >> MapReduce >> >> http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf >> < >> http://www.umiacs.umd.edu/%7Ejimmylin/publications/Elsayed_etal_ACL2008_short.pdf >> > >> >> -sebastian >> >> >> 2010/6/8 Kris Jack <[EMAIL PROTECTED]> >> >> > Hi everyone, >> > >> > I currently use lucene's moreLikeThis function through solr to find >> > documents that are related to one another. A single call, however, takes >> > around 4 seconds to complete and I would like to reduce this. I got to >> > thinking that I might be able to use Mahout to generate a document >> > similarity matrix offline that could then be looked-up in real time for >> > serving. Is this a reasonable use of Mahout? If so, what functions will >> > generate a document similarity matrix? Also, I would like to be able to >> > keep the text processing advantages provided through lucene so it would >> > help >> > if I could still use my lucene index. 
If not, then could you recommend >> any >> > alternative solutions please? >> > >> > Many thanks, >> > Kris >> > >> > +
Sean Owen 2010-06-08, 23:08
-
Re: Generating a Document Similarity Matrix
Sebastian Schelter 2010-06-08, 23:21
I did not wanna say you can use the item-item-similarity code from CF for
computing the document similarities, I just wanted to point out that these problems are closely related and that the paper which the CF code is based on is dealing with the computation of pairwise document similarities and could therefore be helpful. -sebastian 2010/6/9 Jake Mannix <[EMAIL PROTECTED]> > The code in mahout CF is doing that? I don't think that's right, we don't > do anything that fancy right now, do we Sean? > > -jake > > On Tue, Jun 8, 2010 at 3:39 PM, Sebastian Schelter > <[EMAIL PROTECTED]>wrote: > > > Hi Kris, > > > > actually the code to compute the item-to-item similarities in the > > collaborative filtering part of mahout (which at the first look seems to > be > > a totally different problem than yours) is based on a paper that deals > with > > computing the pairwise similarity of text documents in a very simple way. > > Maybe that could be helpful to you: > > > > Elsayed et al: Pairwise Document Similarity in Large Collections with > > MapReduce > > > > > http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf > > < > > > http://www.umiacs.umd.edu/%7Ejimmylin/publications/Elsayed_etal_ACL2008_short.pdf > > > > > > > -sebastian > > > > > > 2010/6/8 Kris Jack <[EMAIL PROTECTED]> > > > > > Hi everyone, > > > > > > I currently use lucene's moreLikeThis function through solr to find > > > documents that are related to one another. A single call, however, > takes > > > around 4 seconds to complete and I would like to reduce this. I got to > > > thinking that I might be able to use Mahout to generate a document > > > similarity matrix offline that could then be looked-up in real time for > > > serving. Is this a reasonable use of Mahout? If so, what functions > will > > > generate a document similarity matrix? Also, I would like to be able > to > > > keep the text processing advantages provided through lucene so it would > > > help > > > if I could still use my lucene index. 
If not, then could you recommend > > any > > > alternative solutions please? > > > > > > Many thanks, > > > Kris > > > > > > +
Sebastian Schelter 2010-06-08, 23:21
-
Re: Generating a Document Similarity Matrix
Jake Mannix 2010-06-08, 23:33
Ah yes. I would love for us to have an implementation of that pairwise
similarity code. It would be useful for lots of things in Mahout, yes! -jake On Tue, Jun 8, 2010 at 4:21 PM, Sebastian Schelter <[EMAIL PROTECTED]>wrote: > I did not wanna say you can use the item-item-similarity code from CF for > computing the document similarities, I just wanted to point out that these > problems are closely related and that the paper which the CF code is based > on is dealing with the computation of pairwise document similarities and > could therefore be helpful. > > -sebastian > > 2010/6/9 Jake Mannix <[EMAIL PROTECTED]> > > > The code in mahout CF is doing that? I don't think that's right, we > don't > > do anything that fancy right now, do we Sean? > > > > -jake > > > > On Tue, Jun 8, 2010 at 3:39 PM, Sebastian Schelter > > <[EMAIL PROTECTED]>wrote: > > > > > Hi Kris, > > > > > > actually the code to compute the item-to-item similarities in the > > > collaborative filtering part of mahout (which at the first look seems > to > > be > > > a totally different problem than yours) is based on a paper that deals > > with > > > computing the pairwise similarity of text documents in a very simple > way. > > > Maybe that could be helpful to you: > > > > > > Elsayed et al: Pairwise Document Similarity in Large Collections with > > > MapReduce > > > > > > > > > http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf > > > < > > > > > > http://www.umiacs.umd.edu/%7Ejimmylin/publications/Elsayed_etal_ACL2008_short.pdf > > > > > > > > > > -sebastian > > > > > > > > > 2010/6/8 Kris Jack <[EMAIL PROTECTED]> > > > > > > > Hi everyone, > > > > > > > > I currently use lucene's moreLikeThis function through solr to find > > > > documents that are related to one another. A single call, however, > > takes > > > > around 4 seconds to complete and I would like to reduce this. 
I got > to > > > > thinking that I might be able to use Mahout to generate a document > > > > similarity matrix offline that could then be looked-up in real time > for > > > > serving. Is this a reasonable use of Mahout? If so, what functions > > will > > > > generate a document similarity matrix? Also, I would like to be able > > to > > > > keep the text processing advantages provided through lucene so it > would > > > > help > > > > if I could still use my lucene index. If not, then could you > recommend > > > any > > > > alternative solutions please? > > > > > > > > Many thanks, > > > > Kris > > > > > > > > > > +
Jake Mannix 2010-06-08, 23:33
-
Re: Generating a Document Similarity Matrix
Sebastian Schelter 2010-06-08, 23:45
The relation between these two problems (document similarity and item
similarity in CF) is exactly like Sean pointed out: In the paper a document is a vector of term frequencies and the paper shows how to compute the pairwise similarities between those. To use this for collaborative filtering you actually just have to replace the document with an item which is a vector of user preferences. It shouldn't be too hard to make this work on a DistributedRowMatrix too, I think. You already mentioned you wanna have it that way some time in MAHOUT-362 :) -sebastian 2010/6/9 Jake Mannix <[EMAIL PROTECTED]> > Ah yes. I would love for us to have an implementation of that pairwise > similarity > code. It would be useful for lots of things in Mahout, yes! > > -jake > > On Tue, Jun 8, 2010 at 4:21 PM, Sebastian Schelter > <[EMAIL PROTECTED]>wrote: > > > I did not wanna say you can use the item-item-similarity code from CF for > > computing the document similarities, I just wanted to point out that > these > > problems are closely related and that the paper which the CF code is > based > > on is dealing with the computation of pairwise document similarities and > > could therefore be helpful. > > > > -sebastian > > > > 2010/6/9 Jake Mannix <[EMAIL PROTECTED]> > > > > > The code in mahout CF is doing that? I don't think that's right, we > > don't > > > do anything that fancy right now, do we Sean? > > > > > > -jake > > > > > > On Tue, Jun 8, 2010 at 3:39 PM, Sebastian Schelter > > > <[EMAIL PROTECTED]>wrote: > > > > > > > Hi Kris, > > > > > > > > actually the code to compute the item-to-item similarities in the > > > > collaborative filtering part of mahout (which at the first look seems > > to > > > be > > > > a totally different problem than yours) is based on a paper that > deals > > > with > > > > computing the pairwise similarity of text documents in a very simple > > way. 
> > > > Maybe that could be helpful to you: > > > > > > > > Elsayed et al: Pairwise Document Similarity in Large Collections with > > > > MapReduce > > > > > > > > > > > > > > http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf > > > > < > > > > > > > > > > http://www.umiacs.umd.edu/%7Ejimmylin/publications/Elsayed_etal_ACL2008_short.pdf > > > > > > > > > > > > > -sebastian > > > > > > > > > > > > 2010/6/8 Kris Jack <[EMAIL PROTECTED]> > > > > > > > > > Hi everyone, > > > > > > > > > > I currently use lucene's moreLikeThis function through solr to find > > > > > documents that are related to one another. A single call, however, > > > takes > > > > > around 4 seconds to complete and I would like to reduce this. I > got > > to > > > > > thinking that I might be able to use Mahout to generate a document > > > > > similarity matrix offline that could then be looked-up in real time > > for > > > > > serving. Is this a reasonable use of Mahout? If so, what > functions > > > will > > > > > generate a document similarity matrix? Also, I would like to be > able > > > to > > > > > keep the text processing advantages provided through lucene so it > > would > > > > > help > > > > > if I could still use my lucene index. If not, then could you > > recommend > > > > any > > > > > alternative solutions please? > > > > > > > > > > Many thanks, > > > > > Kris > > > > > > > > > > > > > > > +
Sebastian Schelter 2010-06-08, 23:45
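
To illustrate the point that the same computation serves both problems, here is a minimal cosine-similarity sketch over sparse vectors. Whether a vector holds term frequencies for a document or user preferences for an item makes no difference to the code. `CosineSketch` and the map-based vector representation are assumptions for illustration and are not Mahout's Vector classes.

```java
import java.util.Map;

// Hypothetical illustration: a sparse vector is a map from dimension to value.
final class CosineSketch {

  /** Dot product that only touches dimensions present in the smaller vector. */
  static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
    Map<Integer, Double> small = a.size() <= b.size() ? a : b;
    Map<Integer, Double> large = (small == a) ? b : a;
    double sum = 0.0;
    for (Map.Entry<Integer, Double> e : small.entrySet()) {
      Double other = large.get(e.getKey());
      if (other != null) {
        sum += e.getValue() * other;
      }
    }
    return sum;
  }

  /** Cosine similarity; defined as 0 when either vector has zero length. */
  static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
    double norms = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
    return norms == 0.0 ? 0.0 : dot(a, b) / norms;
  }
}
```

For documents the dimensions would be term indices, for items they would be user indices; the caller decides, and the similarity code stays the same.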
-
Re: Generating a Document Similarity Matrix
Jake Mannix 2010-06-08, 23:53
On Tue, Jun 8, 2010 at 4:45 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote:

> The relation between these two problems (document similarity and item
> similarity in CF) is exactly like Sean pointed out: In the paper a document
> is a vector of term frequencies and the paper shows how to compute the
> pairwise similarities between those. To use this for collaborative filtering
> you actually just have to replace the document with an item which is a
> vector of user preferences.

Yep, a vector is a vector is a vector. (And when you're me, even if you are *not* a vector, you might be a vector. ;) )

> It shouldn't be too hard to make this work on a DistributedRowMatrix too, I
> think. You already mentioned you wanna have it that way some time
> in MAHOUT-362 :)

Well indeed I did!

-jake
Jake Mannix 2010-06-08, 23:53
-
Re: Generating a Document Similarity Matrix
Sebastian Schelter 2010-06-09, 10:23
I could try to make the similarity computation work on the rows of a DistributedRowMatrix with several metrics (similar to o.a.m.cf.taste.hadoop.similarity.item.ItemSimilarityJob) and would concentrate on implementing it as a mathematical operation not specific to any domain. So it would be left up to the users to convert their documents to vectors and maybe do things like stemming or stopword removal to reduce the computation overhead when they use this job for text documents.

I could only start working on that two weeks from now, though. Tell me if that's welcome and I'll go and create a jira issue :)

-sebastian

2010/6/9 Jake Mannix <[EMAIL PROTECTED]>

> On Tue, Jun 8, 2010 at 4:45 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote:
>
> > The relation between these two problems (document similarity and item
> > similarity in CF) is exactly like Sean pointed out: In the paper a document
> > is a vector of term frequencies and the paper shows how to compute the
> > pairwise similarities between those. To use this for collaborative filtering
> > you actually just have to replace the document with an item which is a
> > vector of user preferences.
>
> Yep, a vector is a vector is a vector. (And when you're me, even if you
> are *not* a vector, you might be a vector. ;) )
>
> > It shouldn't be too hard to make this work on a DistributedRowMatrix too, I
> > think. You already mentioned you wanna have it that way some time
> > in MAHOUT-362 :)
>
> Well indeed I did!
>
> -jake
Sebastian Schelter 2010-06-09, 10:23
-
Re: Generating a Document Similarity Matrix
Kris Jack 2010-06-09, 17:15
Hi Sebastian,
Thanks for the reference. I had a look through the paper and it's certainly very relevant to the problem that I'm trying to solve. Do you think the CF functionality could be co-opted to output such document similarities as it stands or will it require modification? If it can be used straight off, say to give the top 25 most related documents for each document, then how would you suggest that I go about this? Thanks, Kris 2010/6/8 Sebastian Schelter <[EMAIL PROTECTED]> > Hi Kris, > > actually the code to compute the item-to-item similarities in the > collaborative filtering part of mahout (which at the first look seems to be > a totally different problem than yours) is based on a paper that deals with > computing the pairwise similarity of text documents in a very simple way. > Maybe that could be helpful to you: > > Elsayed et al: Pairwise Document Similarity in Large Collections with > MapReduce > > http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf<http://www.umiacs.umd.edu/%7Ejimmylin/publications/Elsayed_etal_ACL2008_short.pdf> > < > http://www.umiacs.umd.edu/%7Ejimmylin/publications/Elsayed_etal_ACL2008_short.pdf > > > > -sebastian > > > 2010/6/8 Kris Jack <[EMAIL PROTECTED]> > > > Hi everyone, > > > > I currently use lucene's moreLikeThis function through solr to find > > documents that are related to one another. A single call, however, takes > > around 4 seconds to complete and I would like to reduce this. I got to > > thinking that I might be able to use Mahout to generate a document > > similarity matrix offline that could then be looked-up in real time for > > serving. Is this a reasonable use of Mahout? If so, what functions will > > generate a document similarity matrix? Also, I would like to be able to > > keep the text processing advantages provided through lucene so it would > > help > > if I could still use my lucene index. If not, then could you recommend > any > > alternative solutions please? 
> > > > Many thanks, > > Kris > > > -- Dr Kris Jack, http://www.mendeley.com/profiles/kris-jack/ +
Kris Jack 2010-06-09, 17:15
-
Re: Generating a Document Similarity Matrix
Sebastian Schelter 2010-06-09, 17:56
The ItemSimilarityJob cannot be directly used as it's not working on a
DistributedRowMatrix but on data structures unique to collaborative filtering, so if you ask me I'd say that a separate job would be required. If you wanna give it a try, a good starting point to get an idea how the computation of the pairwise cosine similarities works could be to take a look at the example in the comment of org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityJob (starting at line 59). Just think of items as documents and users as terms. -sebastian 2010/6/9 Kris Jack <[EMAIL PROTECTED]> > Hi Sebastion, > > Thanks for the reference. I had a look through the paper and it's > certainly > very relevant to the problem that I'm trying to solve. Do you think the CF > functionality could be co-opted to output such document similarities as it > stands or will it require modification? If it can be used straight off, > say > to give the top 25 most related documents for each document, then how would > you suggest that I go about this? > > Thanks, > Kris > > > > 2010/6/8 Sebastian Schelter <[EMAIL PROTECTED]> > > > Hi Kris, > > > > actually the code to compute the item-to-item similarities in the > > collaborative filtering part of mahout (which at the first look seems to > be > > a totally different problem than yours) is based on a paper that deals > with > > computing the pairwise similarity of text documents in a very simple way. 
> > Maybe that could be helpful to you: > > > > Elsayed et al: Pairwise Document Similarity in Large Collections with > > MapReduce > > > > > http://www.umiacs.umd.edu/~jimmylin/publications/Elsayed_etal_ACL2008_short.pdf > < > http://www.umiacs.umd.edu/%7Ejimmylin/publications/Elsayed_etal_ACL2008_short.pdf > > > > < > > > http://www.umiacs.umd.edu/%7Ejimmylin/publications/Elsayed_etal_ACL2008_short.pdf > > > > > > > -sebastian > > > > > > 2010/6/8 Kris Jack <[EMAIL PROTECTED]> > > > > > Hi everyone, > > > > > > I currently use lucene's moreLikeThis function through solr to find > > > documents that are related to one another. A single call, however, > takes > > > around 4 seconds to complete and I would like to reduce this. I got to > > > thinking that I might be able to use Mahout to generate a document > > > similarity matrix offline that could then be looked-up in real time for > > > serving. Is this a reasonable use of Mahout? If so, what functions > will > > > generate a document similarity matrix? Also, I would like to be able > to > > > keep the text processing advantages provided through lucene so it would > > > help > > > if I could still use my lucene index. If not, then could you recommend > > any > > > alternative solutions please? > > > > > > Many thanks, > > > Kris > > > > > > > > > -- > Dr Kris Jack, > http://www.mendeley.com/profiles/kris-jack/ > +
Sebastian Schelter 2010-06-09, 17:56
-
Re: Generating a Document Similarity Matrix
Sean Owen 2010-06-09, 17:58
Well I'm not sure they're unique, they're just vectors. Would that not be the best neutral representation for things like this?

What was the comment about keying by ints vs longs earlier? If unifying that helps bring things closer together I can look at it, if I can understand the issue.

On Wed, Jun 9, 2010 at 6:56 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote:
> The ItemSimilarityJob cannot be directly used as it's not working on a
> DistributedRowMatrix but on data structures unique to collaborative
> filtering, so if you ask me I'd say that a separate job would be required.
Sean Owen 2010-06-09, 17:58
-
Re: Generating a Document Similarity Matrix
Jake Mannix 2010-06-09, 18:14
The ItemSimilarityJob actually uses implementations of the Vector class hierarchy? I think that's the issue - if the on-disk and in-mapper representations are never Vectors, then they won't interoperate with any of the matrix operations...

And yeah, keying on ints is necessary for now, unless we want to make a new matrix type (at least for distributed matrices) which keys on longs (which actually might be a good idea: now that we're using VInt and VLong, the disk space and network usage should not be adversely affected - just the in-memory representation).

In fact, the more I play with this, the more I see that the distributed matrices really are different beasts than their in-memory baby cousins (some operations just don't make sense, and others are way inefficient, and yet others have sneaky tricks which need to be represented differently).

If DistributedRowMatrix (and relatives) is really going to be generalizable and useful, we're going to need to allow the types to be configurable - key on ints or longs, have values be vectors keyed on ints or longs, and even have entries be either float / double / boolean.

-jake

On Wed, Jun 9, 2010 at 10:58 AM, Sean Owen <[EMAIL PROTECTED]> wrote:

> Well I'm not sure they're unique, they're just vectors. Would that not
> be the best neutral representation for things like this?
>
> What was the comment about keying by ints vs longs earlier? If
> unifying that helps bring things closer together I can look at it, if
> I can understand the issue.
>
> On Wed, Jun 9, 2010 at 6:56 PM, Sebastian Schelter <[EMAIL PROTECTED]> wrote:
> > The ItemSimilarityJob cannot be directly used as it's not working on a
> > DistributedRowMatrix but on data structures unique to collaborative
> > filtering, so if you ask me I'd say that a separate job would be required.
-
Re: Generating a Document Similarity Matrix
Sean Owen 2010-06-09, 18:25
On Wed, Jun 9, 2010 at 7:14 PM, Jake Mannix <[EMAIL PROTECTED]> wrote:
> The ItemSimilarityJob actually uses implementations of the Vector
> class hierarchy? I think that's the issue - if the on-disk and in-mapper
> representations are never Vectors, then they won't interoperate with
> any of the matrix operations...

Yes they are Vectors.

> And yeah, keying on ints is necessary for now, unless we want to
> make a new matrix type (at least for distributed matrices) which
> keys on longs (which actually might be a good idea: now that
> we're using VInt and VLong, the disk space and network usage
> should not be adversely affected - just the in-memory
> representation).

Oh I see. Well that's not a problem. Already, IDs have to be mapped to ints to be used as dimensions in a Vector. So in most cases things are keyed by these int pseudo-IDs. That's OK too.

A matrix is a bunch of vectors -- at least, that's a nice structure for a SequenceFile. Row (or col) ID mapped to row (column) vector.

Is that not what other jobs are using? What's the better alternative we could think about converging on?
-
Re: Generating a Document Similarity Matrix
Jake Mannix 2010-06-09, 18:33
On Wed, Jun 9, 2010 at 11:25 AM, Sean Owen <[EMAIL PROTECTED]> wrote:
> On Wed, Jun 9, 2010 at 7:14 PM, Jake Mannix <[EMAIL PROTECTED]> wrote:
> > The ItemSimilarityJob actually uses implementations of the Vector
> > class hierarchy? I think that's the issue - if the on-disk and in-mapper
> > representations are never Vectors, then they won't interoperate with
> > any of the matrix operations...
>
> Yes they are Vectors.

Oh, I guess I missed that. Which step/phase of the ItemSimilarity job uses these, on trunk currently? I don't see any mappers which take in int, vector pairs...

> Oh I see. Well that's not a problem. Already, IDs have to be mapped to
> ints to be used as dimensions in a Vector. So in most cases things are
> keyed by these int pseudo-IDs. That's OK too.
>
> A matrix is a bunch of vectors -- at least, that's a nice structure
> for a SequenceFile. Row (or col) ID mapped to row (column) vector.
>
> is that not what other jobs are using?
> what's the better alternative we could think about converging on.

Yes, as long as the *on HDFS* representation is a SequenceFile<IntWritable,VectorWritable>, we can interoperate. Or now that you've moved on to VIntWritable, I should migrate the distributed matrix stuff to do the same.

And any Mapper<IntWritable,VectorWritable, KOUT, VOUT> subclasses are reusable and would reduce replicated work as well...

-jake
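To make the "row ID mapped to row vector" layout concrete, here is a minimal in-memory sketch in plain Java. It assumes a `Map<Integer, Map<Integer, Double>>` as a stand-in for a SequenceFile<IntWritable,VectorWritable>, and all class and method names are hypothetical; nothing here is Mahout or Hadoop API. Because every row arrives as an (int ID, sparse vector) pair, the same code can serve any producer of such pairs, which is the reuse argument made above. The all-pairs loop is for illustration only and would not scale the way a distributed job has to.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: cosine similarity between int-keyed sparse row vectors. */
final class RowSimilaritySketch {

    /** Sparse dot product; iterates over the smaller vector and probes the larger one. */
    static double dot(Map<Integer, Double> a, Map<Integer, Double> b) {
        Map<Integer, Double> small = a.size() <= b.size() ? a : b;
        Map<Integer, Double> large = (small == a) ? b : a;
        double sum = 0.0;
        for (Map.Entry<Integer, Double> e : small.entrySet()) {
            Double other = large.get(e.getKey());
            if (other != null) {
                sum += e.getValue() * other;
            }
        }
        return sum;
    }

    /** Cosine similarity; defined as 0 when either vector has zero length. */
    static double cosine(Map<Integer, Double> a, Map<Integer, Double> b) {
        double denominator = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        return denominator == 0.0 ? 0.0 : dot(a, b) / denominator;
    }

    /** Builds a row-keyed similarity matrix, storing only non-zero off-diagonal entries. */
    static Map<Integer, Map<Integer, Double>> similarityMatrix(
            Map<Integer, Map<Integer, Double>> rows) {
        Map<Integer, Map<Integer, Double>> result = new TreeMap<>();
        for (Map.Entry<Integer, Map<Integer, Double>> i : rows.entrySet()) {
            Map<Integer, Double> similarities = new TreeMap<>();
            for (Map.Entry<Integer, Map<Integer, Double>> j : rows.entrySet()) {
                if (i.getKey().equals(j.getKey())) {
                    continue; // skip the diagonal
                }
                double similarity = cosine(i.getValue(), j.getValue());
                if (similarity != 0.0) {
                    similarities.put(j.getKey(), similarity);
                }
            }
            result.put(i.getKey(), similarities);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Integer, Map<Integer, Double>> docs = new TreeMap<>();
        docs.put(0, Map.of(0, 1.0, 1, 1.0)); // document 0: terms 0 and 1
        docs.put(1, Map.of(0, 2.0, 1, 2.0)); // document 1: same direction as document 0
        docs.put(2, Map.of(2, 3.0));         // document 2: shares no terms
        System.out.println(similarityMatrix(docs));
    }
}
```

The output has the same shape as the input (row ID mapped to a sparse vector), so a result like this could itself be fed to other row-oriented operations.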
-
Re: Generating a Document Similarity Matrix
Sean Owen 2010-06-09, 18:36
Nope, I'm dreaming. These jobs do use custom output formats. I hadn't really looked closely either. (Everything else uses vectors.) Now I imagine there is some reason, but yeah, it would be much better to operate in terms of vectors if possible.

Sebastian, is there a reason Vectors couldn't be used?
-
Re: Generating a Document Similarity Matrix
Sebastian Schelter 2010-06-09, 18:50
Actually there's no real reason why vectors couldn't be used, except that the CF data structures use longs as keys and floats as values, as opposed to ints and doubles on the vector side. But at first look I think we could certainly migrate that to use vectors.

-sebastian
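The mismatch described here (long keys and float values on the collaborative filtering side, int indices and double values on the vector side) can be bridged with a small conversion step. The sketch below is one possible approach under stated assumptions: a hypothetical in-memory dictionary assigns dense int indices to long IDs in first-seen order, and plain Java maps stand in for both data structures. It is not the mapping Mahout actually uses.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: converts long-keyed float preferences to an int-keyed double vector. */
final class PreferenceToVectorSketch {

    /** Dictionary from original long IDs to dense int indices, in first-seen order. */
    private final Map<Long, Integer> indexOf = new LinkedHashMap<>();

    /** Returns the int index for an ID, assigning the next free index on first sight. */
    int indexFor(long id) {
        Integer existing = indexOf.get(id);
        if (existing != null) {
            return existing;
        }
        int next = indexOf.size();
        indexOf.put(id, next);
        return next;
    }

    /** Re-keys each entry by its int index and widens the float value to a double. */
    Map<Integer, Double> toVector(Map<Long, Float> preferences) {
        Map<Integer, Double> vector = new TreeMap<>();
        for (Map.Entry<Long, Float> e : preferences.entrySet()) {
            vector.put(indexFor(e.getKey()), e.getValue().doubleValue());
        }
        return vector;
    }

    public static void main(String[] args) {
        PreferenceToVectorSketch converter = new PreferenceToVectorSketch();
        Map<Long, Float> preferences = new LinkedHashMap<>();
        preferences.put(5_000_000_000L, 4.5f); // an ID too large for an int
        preferences.put(42L, 2.0f);
        System.out.println(converter.toVector(preferences));
    }
}
```

Widening a float to a double is exact, so no precision is lost in this direction. The dictionary would have to be kept alongside the output so that int indices can be translated back to the original long IDs, and a single in-memory map like this one is only workable when the ID set fits on one machine.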
|