Re: Judging the quality of clustering
Jeff Eastman 2012-05-17, 21:33
Hi Pat,
I don't have a good answer here. Evidently, something in CDbw has become broken and you are the first to notice. When I run TestCDbwEvaluator, the values for k-means and fuzzy-k are clearly incorrect. The values for Canopy, MeanShift and Dirichlet are not so obviously incorrect but I remain suspicious. Something must have become broken in the recent clustering refactoring.

From the method CDbwEvaluator.invalidCluster comment (used to enable pruning):

  * Return if the cluster is valid. Valid clusters must have more than 2 representative points,
  * and at least one of them must be different than the cluster center. This is because the
  * representative points extraction will duplicate the cluster center if it is empty.

Oddly enough, inspection of the test log indicates that only k-means and fuzzy-k are not pruning clusters. Clearly some more investigation is needed. I will take a look at it tomorrow. In the mean time if you develop any additional insight please do share it with us.

Thanks,
Jeff

On 5/17/12 3:53 PM, Pat Ferrel wrote:
> I built a tool that iterates through a list of values for k on the
> same data and spits out the CDbw and ClusterEvaluator results each time.
>
> When the evaluator or CDbw prunes a cluster, how do I interpret that?
> They seem to throw out the same clusters on a given run. Also CDbw
> always returns an inter-cluster density of 0?
>
> On 5/17/12 5:58 AM, Jeff Eastman wrote:
>> Yes, that is the paper I used to implement CDbw. I've tried it a few
>> times along with the simpler ClusterEvaluator metrics I took from
>> Mahout In Action and they look to be reasonable - see the tests -
>> though I have no way to judge their absolute values. Anything you can
>> contribute in this area would be most welcome. Perhaps a wiki page?
>>
>> On 5/16/12 1:14 PM, Pat Ferrel wrote:
>>> The reference was in the code for
>>> http://www.db-net.aueb.gr/index.php/corporate/content/download/227/833/file/HV_poster2002.pdf
>>>
>>> On 5/16/12 9:56 AM, Pat Ferrel wrote:
>>>> Thanks, I've been looking at that. Is there a description of how to
>>>> interpret those values? An academic paper maybe? The intra-cluster
>>>> distance intuitively seems to correspond to something like
>>>> cohesion. I don't get the intuition behind inter-cluster distances
>>>> but Ted thinks they are the most important.
>>>>
>>>> On 5/16/12 7:32 AM, Jeff Eastman wrote:
>>>>> Mahout has a ClusterEvaluator and a CDbwEvaluator that compute
>>>>> some quality metrics (inter-cluster distance,
>>>>> intra-cluster-distance, ...) that you may find useful. Both
>>>>> calculate a set of representative points from the clustering
>>>>> output and compute the (n^2) metrics over these points rather than
>>>>> all of the points in each cluster.
>>>>>
>>>>> On 5/15/12 4:46 PM, Pat Ferrel wrote:
>>>>>> So many questions about best k, how to choose t1 and t2, how much
>>>>>> help is dimensional reduction would have clear answers if we had
>>>>>> a way to judge the quality of clusters.
>>>>>>
>>>>>> Various methods were discussed here for a time:
>>>>>> http://www.lucidimagination.com/search/document/dab8c1f3c3addcfe/validating_clustering_output
>>>>>>
>>>>>> Has there been any work on building a measure of quality?
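
For readers trying to interpret a pruned cluster: the rule quoted in this thread from the CDbwEvaluator.invalidCluster comment can be read as the check sketched below. This is only an illustration of the rule as worded in that comment, not Mahout's code. The class name ClusterValidity, the methods isValid and prune, the use of plain double[] points, and the exact-equality comparison against the center are all assumptions made for the example; the real evaluator may compare points differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, NOT Mahout's CDbwEvaluator. It only mirrors the rule
// stated in the quoted invalidCluster comment.
final class ClusterValidity {

  // A cluster is kept when it has more than 2 representative points and at
  // least one of them differs from the cluster center (assumed here to mean
  // exact coordinate inequality).
  static boolean isValid(double[] center, List<double[]> representativePoints) {
    if (representativePoints.size() <= 2) {
      return false;
    }
    for (double[] point : representativePoints) {
      if (!Arrays.equals(point, center)) {
        return true;
      }
    }
    // Every representative point is a copy of the center.
    return false;
  }

  // Returns the indices of the clusters that survive pruning.
  static List<Integer> prune(List<double[]> centers,
                             List<List<double[]>> representativePoints) {
    List<Integer> kept = new ArrayList<>();
    for (int i = 0; i < centers.size(); i++) {
      if (isValid(centers.get(i), representativePoints.get(i))) {
        kept.add(i);
      }
    }
    return kept;
  }
}
```

Under this reading, a cluster whose representative points are all copies of its center (which the comment says happens when the cluster is empty) is dropped before the metrics are computed, so a pruned cluster most likely had no usable points assigned to it.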