Pat Ferrel
2012-07-09, 00:44
Ted Dunning
2012-07-09, 01:07
Lance Norskog
2012-07-09, 03:40
Ted Dunning
2012-07-09, 06:39
Lance Norskog
2012-07-09, 07:34
Jeff Eastman
2012-07-09, 14:41
Pat Ferrel
2012-07-09, 16:26
Ted Dunning
2012-07-09, 16:32
Pat Ferrel
2012-07-09, 16:59
Ted Dunning
2012-07-09, 17:05
Pat Ferrel
2012-07-09, 17:06
Ted Dunning
2012-07-09, 17:16
Pat Ferrel
2012-07-11, 17:21
Ted Dunning
2012-07-11, 17:36
Pat Ferrel
2012-07-11, 18:06
Jeff Eastman
2012-07-11, 18:46
Pat Ferrel
2012-07-11, 19:40
Pat Ferrel
2012-07-13, 23:15
-
Cluster Evaluation 0.8 style
Pat Ferrel 2012-07-09, 00:44
To use something like kmeans on any large and changing data set, it seems a requirement that there be some means of evaluating the quality of clusters at different scales. The usual eyeballing breaks down quickly.

Trying to use the cluster evaluators in Mahout with kmeans as the clustering method and cosine as the distance measure has proven problematic. The method is to iterate through the data using different ks, performing the evaluation at each point. What I find is that certain values are almost always in error. The intra-cluster density from ClusterEvaluator is almost always NaN. The CDbw inter-cluster density is almost always 0. I have also seen several cases where CDbw fails to return any results but have not tracked down why yet.

Given that the data for either evaluator is usually incomplete, these methods are not very useful. Is Mahout dropping the evaluators? Is the general wisdom that they are not particularly useful? Should a newer method be pursued? This seems a fairly important question to me; am I missing something?

Raw data for a sample crawl is given below:
-
Re: Cluster Evaluation 0.8 style
Ted Dunning 2012-07-09, 01:07
I can't comment on the existing evaluators, but for me the only real measure that I care about is average distance to nearest cluster for new or held-out data. I will be building something of this sort for the clustering part of the knn code I have been working on.
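A minimal sketch of the held-out measure described in this message, assuming cosine distance (as in the original question) and plain Python lists for vectors. The function names are hypothetical and none of this is Mahout code; it only illustrates the idea of averaging each held-out point's distance to its nearest centroid.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length vectors.

    Convention chosen for this sketch: if either vector is all zeros,
    the distance is defined as 1.0 instead of dividing by zero.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def mean_distance_to_nearest(centroids, held_out, distance=cosine_distance):
    """Average, over held-out points, of the distance to the closest centroid."""
    if not centroids or not held_out:
        raise ValueError("need at least one centroid and one held-out point")
    total = 0.0
    for point in held_out:
        total += min(distance(point, c) for c in centroids)
    return total / len(held_out)
```

Computed for each candidate k on the same held-out set, a lower value suggests the centroids describe unseen data better. The value will generally keep falling as k grows, so it is more useful for comparing runs than as an absolute score.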
-
Re: Cluster Evaluation 0.8 style
Lance Norskog 2012-07-09, 03:40
Are there any measures of self-similarity?
-
Re: Cluster Evaluation 0.8 style
Ted Dunning 2012-07-09, 06:39
What do you mean by self similarity? Power law size scaling? Or that two successive clusterings get nearly the same answer?
-
Re: Cluster Evaluation 0.8 style
Lance Norskog 2012-07-09, 07:34
Power law size scaling.
-
Re: Cluster Evaluation 0.8 style
Jeff Eastman 2012-07-09, 14:41
Hi Pat,

The ClusterEvaluator implements the clustering metrics from Mahout in Action and the CDbw from the published paper. Both work with somewhat contrived examples in tests, but it would be really desirable to get them working with real data. To my knowledge you are the first person to try to use them. Please open a JIRA and add whatever test data you can that helps illustrate the problems you are seeing.

Jeff
-
Re: Cluster Evaluation 0.8 style
Pat Ferrel 2012-07-09, 16:26
Can you rephrase that question? I do a rowsimilarity measure for the docs excluding self-similarity, but I doubt that is what you are asking. Are you asking if I do a similarity calc on clusters? I'm planning to find clusters that are similar using their centroids. This is to create a sort of graph clustering model mixing different clustering scales (different ks), but I'd like to have a way to discard poor quality clusters from the calc.

On 7/8/12 8:40 PM, Lance Norskog wrote:
> Are there any measures of self-similarity?
-
Re: Cluster Evaluation 0.8 style
Ted Dunning 2012-07-09, 16:32
Power law scaling is very rare to observe directly in k-means clusters because the algorithm tends to force them to be the same physical size.

Bayesian non-parametric clustering algorithms can show some scaling effects, but it is very difficult to see very many clusters, so it is very difficult to demonstrate self-similar scaling over a very large size range.

If you want to try, just produce a Zipf plot (plot size rank versus size on log-log axes). Look for linearity.

On Mon, Jul 9, 2012 at 12:34 AM, Lance Norskog wrote:
> Power law size scaling.
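The Zipf plot suggested in this message can be sketched as follows, assuming the only input is a list of cluster sizes (point counts). The function names are hypothetical. Instead of drawing the plot, the sketch fits a least-squares line to the log-log points and reports the slope and R², which is one rough way to "look for linearity".

```python
import math


def zipf_points(cluster_sizes):
    """Return (log rank, log size) pairs, largest cluster first (rank 1)."""
    sizes = sorted((s for s in cluster_sizes if s > 0), reverse=True)
    return [(math.log(rank), math.log(size))
            for rank, size in enumerate(sizes, start=1)]


def fit_line(points):
    """Ordinary least-squares fit; returns (slope, intercept, r_squared)."""
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    syy = sum((y - mean_y) ** 2 for _, y in points)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # If all sizes are equal the line is flat and fits exactly.
    r_squared = 1.0 if syy == 0.0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r_squared
```

An R² close to 1 over a wide range of ranks would be consistent with power law scaling. With only a handful of clusters, as the message notes, the range is too narrow for the fit to mean much.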
-
Re: Cluster Evaluation 0.8 style
Pat Ferrel 2012-07-09, 16:59
https://issues.apache.org/jira/browse/MAHOUT-1020 was closed for a reason unrelated to the primary issue. I'll clean it up, reopen it, and attach some data. I seem to recall the same issues with reuters. Not sure if you can include it as a test resource in the build, but it is real data.

As to why people haven't gotten to using this in Mahout, I can only say that they may have, and stopped when they got results they didn't understand.

Scale is one of Mahout's primary benefits. And at scale you require evaluators or you ignore quality. I don't know of another choice, hence my original question. I can only assume that people who cluster at large scale are currently not concerned with quality? Or could it be that people are not clustering large data sets because they have no way to judge quality? Seems like this issue has a self-fulfilling nature.

On 7/9/12 7:41 AM, Jeff Eastman wrote:
> To my knowledge you are the first person to try to use them. Please open a JIRA and add whatever test data you can that helps illustrate the problems you are seeing.
-
Re: Cluster Evaluation 0.8 style
Ted Dunning 2012-07-09, 17:05
There hasn't been much of a use case for clustering up to now. Also, our clustering is dead slow, which discourages use.

On Mon, Jul 9, 2012 at 9:59 AM, Pat Ferrel wrote:
> Scale is one of Mahout's primary benefits. And at scale you require evaluators or you ignore quality. I don't know of another choice, hence my original question. I can only assume that people who cluster at large scale are currently not concerned with quality? Or could it be that people are not clustering large data sets because they have no way to judge quality? Seems like this issue has a self-fulfilling nature.
-
Re: Cluster Evaluation 0.8 style
Pat Ferrel 2012-07-09, 17:06
Sorry, I'm not following this shorthand. Are you asking if the term weights of each centroid follow a power law, like they are supposed to?

On 7/9/12 12:34 AM, Lance Norskog wrote:
> Power law size scaling.
-
Re: Cluster Evaluation 0.8 style
Ted Dunning 2012-07-09, 17:16
I think that he means cluster sizes rather than term weights.

For text, term frequencies follow an approximate power law.
-
Re: Cluster Evaluation 0.8 style
Pat Ferrel 2012-07-11, 17:21
The average distance to the nearest cluster measures overall clumpiness found at a particular scale but does not address the cohesiveness of any particular clump. In any real world data set some clusters will be cohesive and some not. This happens for at least two reasons: some data does not clump, and there are multiple scales for clumpiness. This is an important distinction, I believe, and implies the need for a per-cluster cohesiveness evaluation.

It was my understanding that the ClusterEvaluator included an attempt to provide this measure with intra-cluster density per cluster, though it looks like that output has been removed?

On 7/8/12 6:07 PM, Ted Dunning wrote:
> I can't comment on the existing evaluators, but for me the only real measure that I care about is average distance to nearest cluster for new or held-out data.
-
Re: Cluster Evaluation 0.8 style
Ted Dunning 2012-07-11, 17:36
With k-means algorithms, you don't find out much about clumpiness because large clusters in the data will get multiple k-means clusters.
-
Re: Cluster Evaluation 0.8 style
Pat Ferrel 2012-07-11, 18:06
No argument there, and that is exactly one of my points. Real data often clusters at multiple scales. Using kmeans to find this involves calculating clusters at several scales and evaluating the results for each scale factor (k)--on average. However, I think that this will always create some bad/non-cohesive clusters (at any scale), and it would be nice to have a way to throw these out or at least flag them. Wouldn't some measure of the distribution of points in each cluster give us a way to detect every cluster's cohesiveness?

BTW I imagine there are more elegant ways to cluster at multiple scales, perhaps even all at once, but I haven't found one and would welcome enlightenment. Blindly running hierarchical clustering is not a fair answer since it has the same problems mentioned above.
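One possible reading of "some measure of the distribution of points in each cluster" is sketched below: the mean and spread of member-to-centroid distances per cluster, with a threshold used to flag loose clusters. The function names, the input layout (a dict of cluster id to centroid and members), and the threshold are all assumptions made for illustration; this is not how ClusterEvaluator or CDbw computes anything.

```python
import math
import statistics


def cosine_distance(a, b):
    """Return 1 - cosine similarity; 1.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def cluster_cohesion(centroid, members):
    """Return (mean, population std dev) of member-to-centroid distances."""
    dists = [cosine_distance(centroid, m) for m in members]
    return statistics.fmean(dists), statistics.pstdev(dists)


def flag_loose_clusters(clusters, max_mean_distance):
    """Return ids of clusters that are empty or whose mean distance is too high.

    `clusters` maps a cluster id to a (centroid, members) pair.
    """
    flagged = []
    for cluster_id, (centroid, members) in clusters.items():
        if not members:
            flagged.append(cluster_id)
            continue
        mean_dist, _ = cluster_cohesion(centroid, members)
        if mean_dist > max_mean_distance:
            flagged.append(cluster_id)
    return flagged
```

The threshold would have to be chosen per data set, for example relative to the distribution of mean distances across all clusters at a given k, since an absolute cutoff is unlikely to transfer between corpora.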
-
Re: Cluster Evaluation 0.8 style
Jeff Eastman 2012-07-11, 18:46

The ClusterEvaluator has methods for both inter-cluster density and intra-cluster density. The former computes the density using the cluster centers, while the latter uses a set of representative points extracted from the clustered points. This reduces the computational overhead of calculating a density from all of the points from each cluster.

The unit test uses synthetic data and produces reasonable looking results afaict. Have you had negative experiences with that?

On 7/11/12 1:21 PM, Pat Ferrel wrote:
> ...
>
> It was my understanding that the ClusterEvaluator included an attempt
> to provide this measure with intra-cluster density per cluster though
> it looks like that output has been removed?
>
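To illustrate the idea of computing a density from a few representative points rather than from every clustered point, here is a minimal sketch. It is not taken from the Mahout source and the formula is an assumption made for illustration: `RepPointDensity` and its methods are hypothetical, and the "density" is simply the mean pairwise cosine distance among a cluster's representative points. The sketch also shows one way a NaN can appear in this kind of calculation: a cluster with fewer than two points has no pairs, so the mean is 0.0 / 0, and a naive average then propagates it. Whether that is the cause of the NaN values reported in this thread is not established here.

```java
import java.util.List;

/** Hypothetical illustration, not Mahout's ClusterEvaluator. */
final class RepPointDensity {

    /** Cosine distance 1 - cos(a, b). Returns 1.0 for a zero vector, a choice made for this sketch. */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        if (normA == 0.0 || normB == 0.0) {
            return 1.0;
        }
        return 1.0 - dot / Math.sqrt(normA * normB);
    }

    /** Mean pairwise distance; evaluates 0.0 / 0 = NaN when fewer than two points are given. */
    static double meanPairwiseDistance(List<double[]> points) {
        double sum = 0.0;
        int pairs = 0;
        for (int i = 0; i < points.size(); i++) {
            for (int j = i + 1; j < points.size(); j++) {
                sum += cosineDistance(points.get(i), points.get(j));
                pairs++;
            }
        }
        return sum / pairs;
    }

    /** Naive average over clusters: a single NaN cluster makes the whole result NaN. */
    static double naiveAverage(List<List<double[]>> clusters) {
        double total = 0.0;
        for (List<double[]> cluster : clusters) {
            total += meanPairwiseDistance(cluster);
        }
        return total / clusters.size();
    }

    /** Average that skips NaN clusters; returns NaN only if no cluster is usable. */
    static double averageSkippingNaN(List<List<double[]>> clusters) {
        double total = 0.0;
        int used = 0;
        for (List<double[]> cluster : clusters) {
            double value = meanPairwiseDistance(cluster);
            if (!Double.isNaN(value)) {
                total += value;
                used++;
            }
        }
        return used == 0 ? Double.NaN : total / used;
    }
}
```

With r representative points per cluster the inner loop visits r(r-1)/2 pairs regardless of how many points were assigned to the cluster, which is the overhead reduction described above.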
-
Re: Cluster Evaluation 0.8 style
Pat Ferrel 2012-07-11, 19:40

As I've said before this issue is still a problem.
https://issues.apache.org/jira/browse/MAHOUT-1020?focusedCommentId=13409696#comment-13409696
This should be reopened and I sent you a link to get my data (only 8G, good luck!)

My confusion with the per cluster density measure is because in 0.8 an output file is required for clusterdump but the per cluster density measure is not written to it. It's in the INFO output to STDOUT. When I run a bunch of these the STDOUT is lost so I'll have to modify my scripts or update my KFinder code. I'd vote to include it in the output file in the future.

The only problem I've seen with the per cluster Intra-cluster density is that I sometimes get a lot of pruned clusters and the Intra-Cluster Density is not calculated for them. I think we've discussed this in the past.

12/07/11 12:22:12 INFO evaluation.ClusterEvaluator: Intra-Cluster Density[766] = 0.6243875150474454

I really would like to get this stuff working and am willing to provide whatever help you need if you are in a position to work on it. I have 0.8-SNAPSHOT building but am inexperienced at debugging in this kind of large data situation, though willing to learn. If you'd like me to try something out just point me in the right direction.

I'm also happy to test Ted's inter-cluster stuff.

On 7/11/12 11:46 AM, Jeff Eastman wrote:
> The ClusterEvaluator has methods for both inter-cluster density and
> intra-cluster density. The former computes the density using the
> cluster centers, while the latter uses a set of representative points
> extracted from the clustered points. This reduces the computational
> overhead of calculating a density from all of the points from each
> cluster.
>
> The unit test uses synthetic data and produces reasonable looking
> results afaict. Have you had negative experiences with that?
>
> On 7/11/12 1:21 PM, Pat Ferrel wrote:
>> ...
>>
>> It was my understanding that the ClusterEvaluator included an attempt
>> to provide this measure with intra-cluster density per cluster though
>> it looks like that output has been removed?
>>
>
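Until the per-cluster values are written to the output file, one workaround is to capture STDOUT and pull the values out of the INFO lines. The sketch below is an assumption-laden illustration, not part of Mahout: `DensityLogParser` is a hypothetical name, and the pattern is based only on the single log line quoted in this message, so other lines may be formatted differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper, not part of Mahout: extracts per-cluster densities from captured log lines. */
final class DensityLogParser {

    // Matches e.g. "Intra-Cluster Density[766] = 0.6243875150474454" anywhere in a line.
    // The format is inferred from the one example line in this thread.
    private static final Pattern LINE =
        Pattern.compile("Intra-Cluster Density\\[(\\d+)\\]\\s*=\\s*(NaN|[-+0-9.eE]+)");

    /** Returns cluster id -> density in order of appearance; lines without a match are ignored. */
    static Map<Integer, Double> parse(Iterable<String> lines) {
        Map<Integer, Double> densities = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.find()) {
                // Double.parseDouble accepts the literal "NaN" as well as decimal numbers.
                densities.put(Integer.parseInt(m.group(1)), Double.parseDouble(m.group(2)));
            }
        }
        return densities;
    }
}
```

Feeding it the lines of a redirected log file (for example via `java.nio.file.Files.readAllLines`) gives a map that a script can write out next to the clusterdump output. Clusters that were pruned simply do not appear in the map.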
-
Re: Cluster Evaluation 0.8 style
Pat Ferrel 2012-07-13, 23:15

The user list? Seems like JIRA would be a better place to discuss what files I need to send but OK.

From the inputs to the ClusterEvaluator class I'll send:

1. conf.set(RepresentativePointsDriver.DISTANCE_MEASURE_KEY, dm.getClass().getName());
   ---> org.apache.mahout.common.distance.CosineDistanceMeasure
   I guess you can just make a note of this
2. conf.set(RepresentativePointsDriver.STATE_IN_KEY, "tmp/representative/representativePoints-" + numIters);
   ---> representativePoints-5/*
   Here 5 is the maxiter value used internally in clusterdump
3. ClusterEvaluator ce = new ClusterEvaluator(conf, finalClusters);
   ---> clusters-27-final/*
   The final clusters dir of the k = 500 run.

I can't upload more than 10M to JIRA and this is 22M so here is a webdav URL once again:
http://cloud.occamsmachete.com/public.php?service=files&token=ceae2302d5ef6a55737b5e48aaafe45a3eddc389&file=/cluster-eval.tar.gz

I hope I got it right this time. I don't think there is a cluster evaluator driver so I'll throw something together to double check it myself.

Thanks,
Pat

On 7/13/12 1:40 PM, Jeff Eastman wrote:
> The rep-points tar you sent doesn't look right. I was expecting a
> directory of representativePoints-i where i is the number of
> iterations you used to run the RepresentativePointsDriver. Each
> iteration will add a single point to the evolving list of
> representative points for each cluster.
>
> And, next time you send clusters, please don't send the
> clusteredPoints. All I need is the clusters-n-final directory and the
> directory with the last representativePoints parts.
>
> Finally, can we please do this on the list so it is searchable by
> others? You can also upload the relevant files to the JIRA so we know
> what we are dealing with.
>
> Jeff
>
>
> On 7/13/12 3:58 PM, Pat Ferrel wrote:
>> OK but I can't find it. It doesn't seem to be listed on the "mahout"
>> CL help. Maybe there's some way to tell the script to execute an
>> arbitrary driver?
>>
>> Anyway I just wrote a few lines to execute it and sent you a link to
>> the output.
>>
>> On 7/13/12 12:40 PM, Jeff Eastman wrote:
>>> Sure there is.
>>>
>>> On 7/13/12 12:36 PM, Pat Ferrel wrote:
>>>> So there is no command line way to run RepresentativePointsDriver?
>>>> I'll have to hack up something, might be more than a minute...
>>>>
>>>> On 7/13/12 9:06 AM, Pat Ferrel wrote:
>>>>> OK, didn't know there was a RepresentativePointsDriver. Give me a
>>>>> few minutes.
>>>>>
>>>>> On 7/13/12 9:04 AM, Jeff Eastman wrote:
>>>>>> Hi Pat,
>>>>>>
>>>>>> You will need to run the RepresentativePointsDriver to extract a
>>>>>> set of representative points for your clusters. It expects a -i
>>>>>> input directory full of clusters (your final directory), a -cp
>>>>>> directory full of clustered points, an -o output directory for
>>>>>> the representative points, a distance measure, number of
>>>>>> iterations, etc.
>>>>>>
>>>>>> The cluster dumper does this for you but it is not done by the
>>>>>> respective clustering algorithms.
>>>>>>
>>>>>> With this data we can run the various evaluators on a consistent
>>>>>> and much smaller set of points to debug them further.
>>>>>>
>>>>>> Jeff
>>>>>>
>>>>>>
>>>>>> On 7/11/12 4:43 PM, Pat Ferrel wrote:
>>>>>>> D'oh... True that.
>>>>>>>
>>>>>>> This has the final cluster part and the clusteredPoints dir. Are
>>>>>>> "representative points" taken from clusteredPoints? Anyway let
>>>>>>> me know if this is not what you need.
>>>>>>>
>>>>>>> https://issues.apache.org/jira/browse/MAHOUT-1045
>>>>>>>
>>>>>>> clusters:                   500
>>>>>>> CDbw Inter-Cluster Density: 0
>>>>>>> CDbw Intra-Cluster Density: 1050.07236806084
>>>>>>> CDbw Separation:            187792.321370176
>>>>>>> CDbw Validity Index:        1.97E+08
>>>>>>> Inter-cluster Density:      0.928988162001239
>>>>>>> Intra-cluster Density:      NaN
>>>>>>>
>>>>>>> http://cloud.occamsmachete.com/public.php?service=files&token=5c527cbef78c26ea8c729a3b07f45de87011cb16&file=/4000-clusters-eval.tar.gz