Mahout, mail # user - Problem using SNAPSHOT kmeans


Re: Problem using SNAPSHOT kmeans
Jeff Eastman 2012-06-04, 21:22
This is the new ClusterIterator k-means implementation, and you may
indeed have found a corner case. Take a look at my logic in the preceding
message and let's see if there is a fix we can try.

On 6/4/12 4:40 PM, Pat Ferrel wrote:
> Hmm, switched back to mahout 0.6 and the same command line produced
> the expected results with the same data. No error. Can't find anything
> on JIRA.
>
> Is anyone else using kmeans from the trunk on real data?
>
> On 6/4/12 9:05 AM, Pat Ferrel wrote:
>> Running kmeans from the CLI on several trunk versions, I get an error I
>> don't understand. When the job died,
>> b3/canopy-centroids/clusters-0-final contained the random-seeds file
>> generated by the kmeans driver, and b3/kmeans-clusters/clusters-0
>> had several part files, but b3/kmeans-clusters/clusters-1 was empty.
>> When I look through the code from the trace, it doesn't make much sense.
>>
>> Command line:
>> mahout kmeans
>>   -i b3/vectors/tfidf-vectors/
>>   -k 20
>>   -c b3/canopy-centroids/clusters-0-final
>>   -cl
>>   -o b3/kmeans-clusters
>>   -ow
>>   -cd 0.01
>>   -x 30
>>   -dm org.apache.mahout.common.distance.CosineDistanceMeasure
>>
>> Error:
>> 12/06/04 07:55:03 INFO common.AbstractJob: Command line arguments:
>> {--clustering=null,
>> --clusters=[b3/canopy-centroids/clusters-0-final],
>> --convergenceDelta=[0.01],
>> --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure],
>> --endPhase=[2147483647], --input=[b3/vectors/tfidf-vectors/],
>> --maxIter=[30], --method=[mapreduce], --numClusters=[20],
>> --output=[b3/kmeans-clusters], --overwrite=null, --startPhase=[0],
>> --tempDir=[temp]}
>> 2012-06-04 07:55:03.752 java[67308:1903] Unable to load realm info
>> from SCDynamicStore
>> 12/06/04 07:55:03 INFO common.HadoopUtil: Deleting
>> b3/canopy-centroids/clusters-0-final
>> 12/06/04 07:55:04 WARN util.NativeCodeLoader: Unable to load
>> native-hadoop library for your platform... using builtin-java classes
>> where applicable
>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new compressor
>> 12/06/04 07:55:04 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors
>> to b3/canopy-centroids/clusters-0-final/part-randomSeed
>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: Input:
>> b3/vectors/tfidf-vectors Clusters In:
>> b3/canopy-centroids/clusters-0-final/part-randomSeed Out:
>> b3/kmeans-clusters Distance:
>> org.apache.mahout.common.distance.CosineDistanceMeasure
>> 12/06/04 07:55:04 INFO kmeans.KMeansDriver: convergence: 0.01 max
>> Iterations: 30 num Reduce Tasks:
>> org.apache.mahout.math.VectorWritable Input Vectors: {}
>> 12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new decompressor
>> Cluster Iterator running iteration 1 over priorPath:
>> b3/kmeans-clusters/clusters-0
>> 12/06/04 07:55:05 INFO input.FileInputFormat: Total input paths to
>> process : 1
>> 12/06/04 07:55:05 INFO mapred.JobClient: Running job: job_local_0001
>> 12/06/04 07:55:06 INFO mapred.MapTask: io.sort.mb = 100
>> 12/06/04 07:55:08 INFO mapred.MapTask: data buffer = 79691776/99614720
>> 12/06/04 07:55:08 INFO mapred.MapTask: record buffer = 262144/327680
>> 12/06/04 07:55:08 INFO mapred.JobClient:  map 0% reduce 0%
>> 12/06/04 07:55:09 WARN mapred.LocalJobRunner: job_local_0001
>> org.apache.mahout.math.IndexException: Index -1 is outside allowable
>> range of [0,20)
>>     at
>> org.apache.mahout.math.AbstractVector.set(AbstractVector.java:439)
>>     at
>> org.apache.mahout.clustering.iterator.AbstractClusteringPolicy.select(AbstractClusteringPolicy.java:44)
>>     at
>> org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:52)
>>     at
>> org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
>>     at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
>>     at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
>>     at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
>>     at
>> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
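
The trace shows an index of -1 being passed to a vector's set() from the cluster selection step. The standalone sketch below shows one way such an index could arise. It assumes the selection step picks the most probable cluster with an argmax and that every probability for some input vector is NaN. It does not use Mahout: the class ArgMaxCornerCase and its methods maxValueIndex and select are hypothetical names for illustration, and the real cause in the trunk code may be different.

```java
import java.util.Arrays;

public class ArgMaxCornerCase {

    // Hypothetical argmax (not Mahout's code): returns the index of the
    // largest value, or -1 if no element ever compares greater than the
    // running maximum. Any comparison involving NaN is false, so an
    // all-NaN array (or an empty one) yields -1.
    static int maxValueIndex(double[] values) {
        int best = -1;
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > max) {
                max = values[i];
                best = i;
            }
        }
        return best;
    }

    // Hypothetical selection step: puts weight 1.0 on the most probable
    // cluster. The guard turns the corner case into a descriptive error
    // instead of an out-of-range index.
    static double[] select(double[] probabilities) {
        int index = maxValueIndex(probabilities);
        if (index < 0) {
            throw new IllegalArgumentException(
                "No selectable cluster for probabilities " + Arrays.toString(probabilities));
        }
        double[] weights = new double[probabilities.length];
        weights[index] = 1.0;
        return weights;
    }

    public static void main(String[] args) {
        // Normal case: the second cluster is the most probable.
        System.out.println(maxValueIndex(new double[] {0.1, 0.7, 0.2}));
        // Corner case: no comparison succeeds, so the index stays at -1.
        System.out.println(maxValueIndex(new double[] {Double.NaN, Double.NaN, Double.NaN}));
    }
}
```

If something like this is what is happening, logging the probability vector just before the selection step should show whether it contains NaN values for the input that fails.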