Pat Ferrel 2012-03-20, 16:28
How do you map the output of RowSimilarity to documents? What I really need is to create an association of
doc1 --> docn, docm, doci, etc.
The output of RowSimilarity looks like
rowid --> vector of rowids : distances
for example:
Key: 0: Value: {14458:0.2966480826934176,11399:0.30290014772966095, 12793:0.22009858979452146,3275:0.1871791030103281, 14613:0.3534278632679437,4411:0.2516380602790199, 17520:0.3139731583634198,13611:0.18968888212315968, 14354:0.17673965754661425,0:1.0000000000000004}
It would be nice to use the same keys that seq2sparse outputs (in my case named vectors), so that file names would appear in the output in place of row IDs. Creating my association would then be trivial.
Have I missed a dictionary containing rowid to docid(name) mapping?
Suneel Marthi 2012-03-20, 17:41
docIndex is your answer.
Suneel Marthi 2012-03-20, 18:52
I should have been more elaborate in my previous reply. The RowId job creates a matrix of type <IntWritable, VectorWritable> and a docIndex of type <IntWritable, Text>.
docIndex is a map from each rowId to the key generated by seq2sparse.
What you need to do is join the output of RowSimilarity with docIndex to get the format you are looking for. Hope that helps.
Suneel
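The join described above can be sketched in a few lines. This is only an illustration under stated assumptions: both SequenceFiles are taken to be already loaded into plain Python dicts, the file names in `doc_index` are invented, and `join_with_doc_index` is a hypothetical helper, not part of Mahout.

```python
# Hypothetical in-memory stand-ins for the two SequenceFiles.
# similarity_rows: rowid -> {rowid: score}, the shape RowSimilarity emits.
similarity_rows = {
    0: {
        14458: 0.2966480826934176,
        11399: 0.30290014772966095,
        0: 1.0000000000000004,
    },
}

# doc_index: rowid -> document key. These file names are made up.
doc_index = {
    0: "/docs/a.txt",
    11399: "/docs/b.txt",
    14458: "/docs/c.txt",
}


def join_with_doc_index(similarity_rows, doc_index, drop_self=True):
    """Replace row IDs with document keys, highest score first."""
    joined = {}
    for row_id, neighbours in similarity_rows.items():
        # Sort neighbours by score, descending.
        ranked = sorted(neighbours.items(), key=lambda kv: kv[1], reverse=True)
        joined[doc_index[row_id]] = [
            (doc_index[other_id], score)
            for other_id, score in ranked
            # Optionally skip the row's entry for itself.
            if not (drop_self and other_id == row_id)
        ]
    return joined


associations = join_with_doc_index(similarity_rows, doc_index)
for doc, similar in associations.items():
    print(doc, "-->", similar)
```

For data too large to hold in memory, the same lookup would have to be done as a distributed join; the dict version only shows the shape of the result, doc1 --> docn, docm, doci.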
Pat Ferrel 2012-05-31, 02:22
What value does RowSimilarityJob produce to describe similarity? The paper that describes how the algorithm is implemented doesn't describe the various similarity values returned by Mahout. It seems to focus on cooccurrences.
For SIMILARITY_COSINE, is the value the cosine or 1 - cosine?
Is the value calculated independently, after cooccurrence has determined which docs are similar?
The code is very difficult to read so a little help would be appreciated.
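The sample output earlier in this thread offers a hint, though not a confirmation: row 0 scores 1.0000000000000004 against itself. The cosine of a vector with itself is 1 (up to floating-point rounding), whereas 1 - cosine would be 0. This assumes that output was produced with SIMILARITY_COSINE, which the thread does not state. The sketch below uses made-up term weights and a plain-Python `cosine_similarity` helper, not Mahout's own code, to show the two extremes.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up term weights for illustration only.
doc = [0.2, 0.0, 0.7, 0.1]
other = [0.0, 0.5, 0.0, 0.0]  # shares no terms with doc

self_sim = cosine_similarity(doc, doc)        # close to 1
disjoint_sim = cosine_similarity(doc, other)  # exactly 0

print("self:", self_sim, "as distance:", 1 - self_sim)
print("disjoint:", disjoint_sim, "as distance:", 1 - disjoint_sim)
```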