Mahout, mail # dev - Re: [jira] [Issue Comment Edited] (MAHOUT-504) Kmeans clustering error


Lance Norskog 2012-02-16, 03:46
Nobody reads the docs. If the program itself can describe the user's
mistake, instead of just barfing, it should. This is a case of
Passive-Aggressive Error Reporting.
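
To make that concrete, here is a minimal sketch of the kind of check I
mean. It is not Mahout code: the class and method names are hypothetical,
it looks at the local filesystem with java.nio rather than going through
Hadoop's FileSystem, and it assumes that any entry in the -c directory
counts as "initial clusters". The idea is only that the driver could
notice "no -k and nothing in -c" up front and say so.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

// Hypothetical helper, not part of Mahout.
class InitialClusterCheck {

    /** True if the directory exists and holds at least one entry (assumed to be clusters). */
    static boolean hasInitialClusters(Path clustersDir) {
        if (!Files.isDirectory(clustersDir)) {
            return false;
        }
        try (Stream<Path> entries = Files.list(clustersDir)) {
            return entries.findAny().isPresent();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /**
     * Returns null when the arguments look usable, otherwise a message
     * that names the likely mistake. A null k means -k was not given.
     */
    static String explainMissingClusters(Path clustersDir, Integer k) {
        if (k != null || hasInitialClusters(clustersDir)) {
            return null;
        }
        return "No initial clusters found in " + clustersDir + ". "
            + "Either pass -k <number of clusters> to sample them from the input, "
            + "or put your own initial clusters in the -c directory.";
    }

    public static void main(String[] args) {
        // Defaults mirror the paths from the report; the second argument stands in for -k.
        Path clusters = Paths.get(args.length > 0 ? args[0] : "./examples/bin/work/clusters");
        Integer k = args.length > 1 ? Integer.valueOf(args[1]) : null;

        String problem = explainMissingClusters(clusters, k);
        if (problem != null) {
            System.err.println(problem);
            System.exit(1);
        }
        System.out.println("Arguments look usable.");
    }
}
```

Run before the first iteration, a check like this would fail fast with a
message about -k and -c instead of letting the map task fail.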

On Wed, Feb 15, 2012 at 7:20 AM, Jeff Eastman
<[EMAIL PROTECTED]> wrote:
> The error message describes what the algorithm can see: that there are no
> initial clusters. The wiki documentation seems reasonably clear on the use
> of -k (https://cwiki.apache.org/confluence/display/MAHOUT/K-Means+Clustering)
> to obtain them by sampling the input dataset; otherwise -c needs to contain
> clusters produced by the user.
>
>
> On 2/14/12 8:04 PM, Lance Norskog wrote:
>>
>> Could the error message describe the user's mistake?
>>
>> On Tue, Feb 14, 2012 at 9:16 AM, Jeff Eastman
>> <[EMAIL PROTECTED]>  wrote:
>>>
>>> +1 bingo. K-Means is expecting you to provide the prior cluster centers
>>> in -c. If you want it to sample from your input data, you need to add the
>>> -k option to tell it how many you want. This has been a constant part of
>>> the API for some time, hence 0.4, 0.5 and 0.6 will all give the same error
>>> if you overlook this argument.
>>>
>>>
>>>
>>> On 2/14/12 8:56 AM, Suneel Marthi wrote:
>>>>
>>>> You are not specifying the number of clusters that need to be generated.
>>>> Try running again with a -k <number of clusters> option. You also need
>>>> to pass -cl to specify that clustering should be done.
>>>>
>>>> For example:
>>>>
>>>> ./bin/mahout kmeans -i
>>>> ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c
>>>> ./examples/bin/work/clusters -o ./examples/bin/work/reuters-kmeans -x
>>>> 10  -ow -k 20 -cl
>>>>
>>>>
>>>>
>>>> ________________________________
>>>>  From: qiang xu (Issue Comment Edited) (JIRA)<[EMAIL PROTECTED]>
>>>> To: [EMAIL PROTECTED]
>>>> Sent: Tuesday, February 14, 2012 10:48 AM
>>>> Subject: [jira] [Issue Comment Edited] (MAHOUT-504) Kmeans clustering
>>>> error
>>>>
>>>>
>>>>     [
>>>>
>>>> https://issues.apache.org/jira/browse/MAHOUT-504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13207675#comment-13207675
>>>> ]
>>>>
>>>> qiang xu edited comment on MAHOUT-504 at 2/14/12 3:46 PM:
>>>> ----------------------------------------------------------
>>>>
>>>> This problem still exists in Mahout 0.5 and 0.6:
>>>> ./bin/mahout kmeans -i
>>>> ./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/ -c
>>>> ./examples/bin/work/clusters -o ./examples/bin/work/reuters-kmeans -x 10
>>>>  -ow
>>>> Running on hadoop, using HADOOP_HOME=/data/hadoop_cluster/hadoop-0.20.2/
>>>> HADOOP_CONF_DIR=/data/hadoop_cluster/hadoop-0.20.2/conf/
>>>> 12/02/14 20:56:03 INFO common.AbstractJob: Command line arguments:
>>>> {--clusters=./examples/bin/work/clusters, --convergenceDelta=0.5,
>>>>
>>>> --distanceMeasure=org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure,
>>>> --endPhase=2147483647,
>>>> --input=./examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors/,
>>>> --maxIter=10, --method=mapreduce,
>>>> --output=./examples/bin/work/reuters-kmeans, --overwrite=null,
>>>> --startPhase=0, --tempDir=temp}
>>>> 12/02/14 20:56:03 INFO kmeans.KMeansDriver: Input:
>>>> examples/bin/work/reuters-out-seqdir-sparse/tfidf-vectors Clusters In:
>>>> examples/bin/work/clusters Out: examples/bin/work/reuters-kmeans
>>>> Distance:
>>>> org.apache.mahout.common.distance.SquaredEuclideanDistanceMeasure
>>>> 12/02/14 20:56:03 INFO kmeans.KMeansDriver: convergence: 0.5 max
>>>> Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable
>>>> Input
>>>> Vectors: {}
>>>> 12/02/14 20:56:03 INFO kmeans.KMeansDriver: K-Means Iteration 1
>>>> 12/02/14 20:56:05 INFO input.FileInputFormat: Total input paths to process : 1
>>>> 12/02/14 20:56:06 INFO mapred.JobClient: Running job:
>>>> job_201202131515_0122
>>>> 12/02/14 20:56:07 INFO mapred.JobClient:  map 0% reduce 0%
>>>> 12/02/14 20:56:16 INFO mapred.JobClient: Task Id :
>>>> attempt_201202131515_0122_m_000000_0, Status : FAILED

Lance Norskog
[EMAIL PROTECTED]