Mahout, mail # user - How to find the k most similar docs


Re: How to find the k most similar docs
Suneel Marthi 2012-03-05, 19:48
Pat,

Your input to RowSimilarityJob seems to be the tfidf-vectors directory, which contains <Text, VectorWritable> pairs.

Before executing RowSimilarityJob, you need to run RowIdJob, which creates a matrix of <IntWritable, VectorWritable>. This matrix should be the input to RowSimilarityJob.

Also, judging from your command, you seem to be missing the --tempDir argument; you will need that too.
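
To make the re-keying step concrete, here is a minimal plain-Java sketch of the idea behind it: each text-keyed vector gets a consecutive integer row id, and a separate index remembers which document each id came from. This is only an illustration, not Mahout's code. The class RowIdSketch, the method assignRowIds, and the use of java.util maps and double[] in place of SequenceFiles, IntWritable and VectorWritable are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class RowIdSketch {

    /** Integer-keyed rows plus an index from row id back to the original text key. */
    static final class Result {
        final Map<Integer, double[]> matrix = new LinkedHashMap<>();
        final Map<Integer, String> docIndex = new LinkedHashMap<>();
    }

    /** Assigns consecutive row ids (0, 1, 2, ...) in the input's iteration order. */
    static Result assignRowIds(Map<String, double[]> textKeyedVectors) {
        Result result = new Result();
        int rowId = 0;
        for (Map.Entry<String, double[]> entry : textKeyedVectors.entrySet()) {
            result.matrix.put(rowId, entry.getValue());   // stand-in for <IntWritable, VectorWritable>
            result.docIndex.put(rowId, entry.getKey());   // lets you map results back to documents
            rowId++;
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical document names and weights, for illustration only.
        Map<String, double[]> tfidf = new LinkedHashMap<>();
        tfidf.put("/docs/a.txt", new double[] {0.0, 1.2, 0.4});
        tfidf.put("/docs/b.txt", new double[] {0.7, 0.0, 0.4});

        Result r = assignRowIds(tfidf);
        for (Map.Entry<Integer, String> e : r.docIndex.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The index matters afterwards: the similarity output refers to rows by integer id, so you need the id-to-document mapping to turn the k most similar rows back into document names.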

Suneel
________________________________
 From: Sebastian Schelter <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Sent: Monday, March 5, 2012 2:32 PM
Subject: Re: How to find the k most similar docs
 
That's the problem:

org.apache.hadoop.io.Text cannot be
   cast to org.apache.hadoop.io.IntWritable

RowSimilarityJob expects <IntWritable,VectorWritable> as input, but it seems
you supply <Text,VectorWritable>.
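
As a rough analogy for why this only shows up at run time, consider the plain-Java sketch below. Code written for integer keys receives whatever key type the input actually contains, and the cast fails when that turns out to be text. KeyCastSketch, readRowId and acceptsKey are hypothetical names; Integer and String stand in for IntWritable and Text, and no Hadoop classes are involved.

```java
public class KeyCastSketch {

    /** Stand-in for a mapper that was written for integer row keys. */
    static int readRowId(Object key) {
        return (Integer) key; // throws ClassCastException if key is not an Integer
    }

    /** Returns true if the key can be consumed as an integer row id. */
    static boolean acceptsKey(Object key) {
        try {
            readRowId(key);
            return true;
        } catch (ClassCastException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(acceptsKey(42));            // integer key, like the RowIdJob output
        System.out.println(acceptsKey("/docs/a.txt")); // text key, like the tfidf-vectors input
    }
}
```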

--sebastian

On 05.03.2012 20:29, Pat Ferrel wrote:
> org.apache.hadoop.io.Text cannot be
>    cast to org.apache.hadoop.io.IntWritable