Mahout, mail # user - kmeans not returning k clusters


Re: kmeans not returning k clusters
Jeff Eastman 2012-05-09, 19:24
Does this cluster reduction happen when you prime k-means with canopy?
Can you first adjust T1==T2 to get about 200 canopies and feed that to
k-means? How wide are your term vectors? Have you tried other distance
measures?

If anybody else out there is experiencing similar problems, please chime in.

Jeff
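
The two ideas discussed in this thread (seeds that sit too close together can leave a cluster empty, and canopy priming spaces the seeds out) can be illustrated with a small Python sketch. This is not Mahout code. The helper names `dist`, `assign`, and `canopy_centers` and the toy data are hypothetical, the canopy pass is a simplified single-pass version with T1 == T2, and only one k-means assignment step is shown rather than the full iteration.

```python
import math

def dist(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def assign(points, centers):
    """Count the points won by each center (ties go to the lowest index)."""
    counts = [0] * len(centers)
    for p in points:
        nearest = min(range(len(centers)), key=lambda i: dist(p, centers[i]))
        counts[nearest] += 1
    return counts

def canopy_centers(points, t):
    """Simplified canopy pass with T1 == T2 == t (an assumption, not Mahout's
    implementation): a point starts a new canopy only if it lies at least t
    away from every center chosen so far."""
    centers = []
    for p in points:
        if all(dist(p, c) >= t for c in centers):
            centers.append(p)
    return centers

# Toy data: two tight groups (hypothetical, not the data from the thread).
points = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
          (5.0, 5.0), (5.1, 5.0), (5.0, 5.1)]

# Three seeds, two of them identical: the duplicate never wins a point.
crowded = [(0.0, 0.0), (0.0, 0.0), (5.0, 5.0)]
print(assign(points, crowded))   # [3, 0, 3]

# Canopy seeds with t = 1.0 are pairwise at least t apart.
seeds = canopy_centers(points, 1.0)
print(seeds)                     # [(0.0, 0.0), (5.0, 5.0)]
print(assign(points, seeds))     # [3, 3]
```

In the crowded case k = 3 was requested but only two clusters receive points, which matches the symptom reported in the thread. Shrinking `t` yields more canopies and growing it yields fewer, which is the knob Jeff suggests tuning to reach about 200 seeds.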

On 5/9/12 1:07 PM, Pat Ferrel wrote:
> That's what I'm doing now. Random seeding is not really the best way to
> do kmeans. However, my results are repeatable as far as I've gone. And
> canopy wants to generate a much larger set of clusters, with a wide
> range of T1 and T2 for this data set, so the theory that the data does
> not support 30 clusters seems unlikely, although they may be a fair
> distance apart.
>
> I've tried several times with several random seeds, so the "seeds
> are too close" theory doesn't seem likely.
> Given that canopy wants to generate more clusters, the "doesn't support
> k = 30" theory doesn't seem likely either.
>
> I'm not saying that there is a real problem here but when I noticed it
> I had 16,000 documents and was asking for 200 clusters and got 38. If
> there is some good reason for this it would be nice to find it and
> report it to the user. The "good reason" might be very helpful in the
> analysis. Or it could be a bug.
>
> At least it's out there in case others are seeing lost clusters.
>
> On 5/9/12 7:49 AM, Jeff Eastman wrote:
>> Paritosh is correct in his analysis. K-means can work itself into a
>> situation where there are some empty clusters if the initial cluster
>> centers are too closely spaced or if the data really doesn't support
>> k clusters. This is because it assigns each vector to the most likely
>> (closest) cluster. If two prior clusters are very close together this
>> can cause one of them to become empty.
>>
>> Have you tried priming k-means with canopy instead of the random
>> sampler?
>>
>> On 5/9/12 10:35 AM, Pat Ferrel wrote:
>>> I suspect you are right, Paritosh. I ran kmeans with random seeds
>>> several times on the supplied data set and always got 28 rather than
>>> 30 clusters. I don't care so much about the number but it might mean
>>> that some clusters are thrown out and without looking you couldn't
>>> tell if they were important ones or not. Just upping k to 32 doesn't
>>> really work if you still get some thrown out.
>>>
>>> At least I think the issue is repeatable with this data.
>>>
>>> On 5/9/12 1:14 AM, Paritosh Ranjan wrote:
>>>> Printouts of Mahout vectors show only the non-zero elements.
>>>> So the centers are not empty; rather, they are zero.
>>>>
>>>> Prima facie, I suspect that you are getting a lot of empty clusters.
>>>> This might be occurring due to the combination of distance measure,
>>>> convergence threshold and distances between vectors.
>>>> Can you try to analyze and change/play around with these parameters?
>>>>
>>>> I will try to look into how the Random Cluster Initialization is
>>>> working. I will log a JIRA issue if I find a problem. However, I do
>>>> not expect a problem in the cluster initialization part.
>>>>
>>>> On 09-05-2012 03:21, Danfeng Li wrote:
>>>>> I got the same issue. What I found is that many of the initial
>>>>> centers are empty, and the final number of clusters is decided by
>>>>> the number of nonempty centers.
>>>>>
>>>>> Here are some examples from my case:
>>>>>
>>>>> ...
>>>>> CL-34358205{n=0 c=[] r=[]}
>>>>> CL-34358207{n=0 c=[] r=[]}
>>>>> CL-34358209{n=0 c=[] r=[]}
>>>>> CL-34358213{n=0 c=[0:1.000] r=[]}
>>>>> CL-34358215{n=0 c=[] r=[]}
>>>>> CL-34358216{n=0 c=[] r=[]}
>>>>> CL-34358217{n=0 c=[] r=[]}
>>>>> CL-34358220{n=0 c=[] r=[]}
>>>>> CL-34358221{n=0 c=[] r=[]}
>>>>> CL-34358222{n=0 c=[] r=[]}
>>>>> CL-34358223{n=0 c=[] r=[]}
>>>>> CL-34358224{n=0 c=[] r=[]}
>>>>> CL-34358227{n=0 c=[0:1.000] r=[]}
>>>>> CL-34358228{n=0 c=[] r=[]}
>>>>> CL-34358229{n=0 c=[] r=[]}
>>>>> ...
>>>>>
>>>>> Could there be a bug in initialization?
>>>>>
>>>>> Thanks.
>>>>> Dan
>>>>>
>>>>> -----Original Message-----