Mahout, mail # user - Mahout-279/kmeans++


Re: Mahout-279/kmeans++
Ted Dunning 2012-08-22, 17:18
Just an off thought, do you have duplicate input points?
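
A minimal sketch of one way to check for that, assuming the input points can be loaded into memory as plain double[] arrays. The class and method names are hypothetical and not part of Mahout; points are compared by exact value equality only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not Mahout API: finds points that occur more than once.
class DuplicatePointCheck {

    // Returns each point that occurs at least twice, mapped to its occurrence count.
    static Map<List<Double>, Integer> duplicateCounts(List<double[]> points) {
        Map<List<Double>, Integer> counts = new LinkedHashMap<>();
        for (double[] p : points) {
            // double[] has identity-based equals/hashCode, so box into a List as the key.
            List<Double> key = new ArrayList<>(p.length);
            for (double v : p) {
                key.add(v);
            }
            counts.merge(key, 1, Integer::sum);
        }
        // Drop points that occur only once.
        counts.values().removeIf(c -> c < 2);
        return counts;
    }

    public static void main(String[] args) {
        List<double[]> points = Arrays.asList(
                new double[] {1.0, 2.0},
                new double[] {3.0, 4.0},
                new double[] {1.0, 2.0});
        duplicateCounts(points).forEach((point, n) ->
                System.out.println(point + " occurs " + n + " times"));
    }
}
```

If the resulting map is non-empty, the input contains duplicates. For data too large to hold in memory, the same counting idea would have to be done as a distributed job instead.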

On Wed, Aug 22, 2012 at 10:00 AM, Whitmore, Mattie <[EMAIL PROTECTED]> wrote:

> ... I have also verified, by running canopy multiple times with v0.5 and
> v0.7, that there is a consistent discrepancy between the two clustering
> versions. The max/min number of vectors in a cluster is 19192158/215 with
> v0.5 and 921998/5 with v0.7. They need not be the same, since I am using
> canopy clustering to find initial centroids, but I would expect them to
> have the same sum, which they do not (45901885 vs 1599154).
>
> Here is the method I am running:
>
> public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
>         throws IOException, InterruptedException, ClassNotFoundException,
>                InstantiationException, IllegalAccessException {
>
>     Configuration conf = new Configuration();
>
>     DistanceMeasure measure = new EuclideanDistanceMeasure();
>
>     Path vectorsFolder = new Path(outputDir, "vectors");
>     Path clusterCenters = new Path(outputDir + "-canopy/centriods");
>     Path clusterOutput = new Path(outputDir + "-canopy/clusters");
>
>     // create canopies instead of initial vectors
>     CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
>             Double.parseDouble(T), Double.parseDouble(T), false, 0, false);
>
>     // kmeans cluster operation
>     KMeansDriver.run(conf, vectorsFolder,
>             new Path(clusterCenters, "clusters-0-final/part-r-00000"),
>             clusterOutput, measure, 0.01,
>             Integer.parseInt(itMax), true, 0.0, false);
>
>     // post-process by putting completed clusters into their own files
>     ClusterOutputPostProcessorDriver.run(clusterOutput,
>             new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
> }
>
> What do you think?
>
> On another but related note: Is there a plan to have a method -- say
> ClusterOutputPostProcessorDriver -- which when run outputs the vectors
> within clusters as well as a separate folder containing pruned outliers?
>
> Thanks!
>
> Mattie
>
> -----Original Message-----
> From: Paritosh Ranjan [mailto:[EMAIL PROTECTED]]
> Sent: Friday, August 17, 2012 12:16 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Mahout-279/kmeans++
>
> The clustering algorithm has also changed internally. So, expect the
> results to be different (and better).
>
> I can think of one reason for this behavior. Maybe lots of clusters
> have only one vector in them, and, AFAIK, clusterdumper will not
> output any cluster with a single vector.
> So, I think it's clusterdumper that is doing the invisible "pruning"
> (by not outputting clusters with single vectors).
>
> Can you cross check the output once with ClusterOutputPostProcessorDriver?
>
> No, no tool can output the pruned vectors. The only way to see all
> vectors assigned to any cluster is to set clusterClassificationThreshold
> to 0.
>
> If you still face the problem, then please provide the parameters with
> which you are calling kmeans.
>
> Regarding "I should also mention I have vectors which are exactly the
> same (even their names), perhaps they are the ones being pruned, is that
> possible? "
>
> The name of the vector has nothing to do with clustering, I am not sure
> whether it will have any effect when clusterdumper is in action. So,
> crosschecking with ClusterOutputPostProcessorDriver will answer this.
>
> Good luck.
> Paritosh
>
> On 17-08-2012 21:07, Whitmore, Mattie wrote:
> > Sure, I have a dataset which I wish to cluster using Kmeans.  Previously
> > (v0.5) when I did a clusterdump the total number of vectors within the
> > resultant clusters was the same as the total number fed to the algorithm.
> > I wish this to be the case when clustering with v0.7.  The only change in