Mahout, mail # user - can't get <point-id, cluster-id> thru "-p"


Re: can't get <point-id, cluster-id> thru "-p"
Baoqiang Cao 2012-03-14, 22:18
Thanks a lot. Maybe I'm missing something right in front of my teary
eyes this Wednesday afternoon, but I believe my inputs are equivalent
to yours:

mahout clusterdump -s /mahout/kmeans/clusters-15-final -d
/mahout/sparse/dictionary.file-0 -dt sequencefile   -p /mahout/points

The cluster files after 15 iterations are in
/mahout/kmeans/clusters-15-final. /mahout/points is a directory I
created beforehand. On screen, the output is something like
"VL-1721020{n=186 c=[...". There are just no output files under that
directory.

Any help, please?

On Wed, Mar 14, 2012 at 2:13 PM, Pat Ferrel <[EMAIL PROTECTED]> wrote:
> The -p parameter is an input. You should pass in the clusterPoints/
> directory that was generated by the cluster driver you used.
>
> My use of fkmeans might be an example:
>
>   mahout fkmeans -i wikipedia-vectors/tfidf-vectors/ -c
>   wikipedia-fkmeans-centroids -o wikipedia-fkmeans-clusters -k 100 -m
>   2 -ow -x 10 -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>
> This will create wikipedia-fkmeans-clusters/clusteredPoints/part-m-00000
> which is the file with the clustered points. I then did a clusterdump
>
>   mahout clusterdump -s
>   wikipedia-fkmeans-clusters/clusters/clusters-1/part-r-00000 -p
>   wikipedia-fkmeans-clusters/clusteredPoints/ -d
>   wikipedia-fkmeans-clusters/dictionary.file-0 -dt sequencefile -dm
>   org.apache.mahout.common.distance.CosineDistanceMeasure
>
> This will output to the screen. Use -o to specify an output file.
>
> Good advice for any user of mahout is to read the output of the help very
> carefully. IMHO it is very easy to misunderstand the parameters, inputs, and
> outputs. I think I only understand about 10%. Try:
>
>   mahout fkmeans --help
>
>
>
> On 3/14/12 10:52 AM, Baoqiang Cao wrote:
>>
>> Hi,
>>
>> Very sorry for such a trivial question, but I've run out of luck. I'm
>> trying to see which points (by point ID) belong to which cluster center.
>> Here is what I did:
>>
>> mahout clusterdump -s /mahout/kmeans/clusters-15-final -d
>> /mahout/sparse/dictionary.file-0 -dt sequencefile   -p /mahout/points > out
>>
>> The onscreen output is:
>>
>> 12/03/14 12:39:52 INFO common.AbstractJob: Command line arguments:
>> {--dictionary=/mahout/sparse/dictionary.file-0,
>> --dictionaryType=sequencefile,
>>
>> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>> --endPhase=2147483647, --outputFormat=TEXT,
>> --pointsDir=/mahout/points,
>> --seqFileDir=/mahout/kmeans/clusters-15-final, --startPhase=0,
>> --tempDir=temp}
>> 12/03/14 12:39:55 WARN snappy.LoadSnappy: Snappy native library is
>> available
>> 12/03/14 12:39:55 INFO util.NativeCodeLoader: Loaded the native-hadoop
>> library
>> 12/03/14 12:39:55 INFO snappy.LoadSnappy: Snappy native library loaded
>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>> 12/03/14 12:39:55 INFO compress.CodecPool: Got brand-new decompressor
>> 12/03/14 12:42:07 INFO clustering.ClusterDumper: Wrote 5188 clusters
>> 12/03/14 12:42:07 INFO driver.MahoutDriver: Program took 135276 ms
>> (Minutes: 2.2546)
>>
>>
>> There is nothing under "/mahout/points". Any help on why and how?
>>
>> Thanks in advance.
>> Baoqiang
>>
>
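
Putting the reply above together with the paths from the original
question, the corrected call might look like the sketch below. It assumes
(the thread does not say so) that the k-means job wrote to /mahout/kmeans
and was run with -cl, so that a clusteredPoints directory sits next to
clusters-15-final. The output file name clusterdump.txt is made up for
illustration. The script only prints the command; nothing runs until the
last line is uncommented.

```shell
#!/bin/sh
# Sketch only: -p is an INPUT to clusterdump. It should point at the
# clusteredPoints/ directory written by the clustering job, not at an
# empty directory created by hand.

# From the original question.
CLUSTERS=/mahout/kmeans/clusters-15-final
DICT=/mahout/sparse/dictionary.file-0

# ASSUMPTION: kmeans was run with "-o /mahout/kmeans -cl", which would
# leave the point-to-cluster assignments here. Adjust to your layout.
POINTS=/mahout/kmeans/clusteredPoints

# ASSUMPTION: hypothetical local file name for the dump (-o, per the reply).
OUT=clusterdump.txt

# Build the command as a string so it can be inspected before running.
build_clusterdump_cmd() {
  echo "mahout clusterdump -s $CLUSTERS -p $POINTS -d $DICT -dt sequencefile -o $OUT"
}

CMD=$(build_clusterdump_cmd)
echo "$CMD"

# Uncomment to actually run it once the paths have been checked:
# $CMD
```

Before running it, listing the points directory (for example with
hadoop fs -ls) is a quick way to confirm that it really contains part
files; an empty directory will not produce any point assignments in the dump.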