Mahout, mail # user - canopy cluster size


Re: canopy cluster size
Jeff Eastman 2012-03-13, 22:01
No, Canopy only uses a single reducer, so what's happening is many
mappers are munching your data in parallel and then the poor little
reducer has to combine them all. It is slow going and a problem with
Canopy that I don't know how to fix. It is complicated by the fact that
all the canopy centers become very dense vectors in this process,
consuming memory and CPU. You might play with the t3 and t4 parameters, which
set different T1/T2 values for the reduce step. That could improve
reducer performance.

Suggest you try k-means. With it you can specify the number of clusters
you want and use that many reducers to improve scalability.
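
To make the T2 behaviour discussed in this thread concrete, here is a minimal single-machine sketch in Python. It is not Mahout's implementation: the function names and the toy data are made up for illustration, the centers stay fixed at the first point that opens each canopy, and T1 is left out because it only affects which points get assigned to a canopy, not how many canopies are created.

```python
import math


def cosine_distance(a, b):
    """Cosine distance, 1 - cos(angle); stays in [0, 1] for non-negative vectors such as tf-idf."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def canopy_centers(points, t2, distance=cosine_distance):
    """Single pass over the points (simplified, hypothetical helper).

    A point is "strongly bound" if it lies within t2 of an existing center.
    Only points that are not strongly bound open a new canopy.
    """
    centers = []
    for p in points:
        strongly_bound = any(distance(p, c) < t2 for c in centers)
        if not strongly_bound:
            centers.append(p)
    return centers


# Toy 2-D "documents" (illustrative data, not taken from the thread).
docs = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9), (1.0, 1.0)]

for t2 in (1.5, 0.5, 0.1):
    print(t2, len(canopy_centers(docs, t2)))
```

On this toy data, a T2 larger than the maximum possible distance collapses everything into one canopy, and shrinking T2 produces more canopies. That is the same effect as a T2 of 1.5 or more applied to cosine distances, which never exceed 1 for tf-idf vectors.
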
On 3/13/12 2:51 PM, Baoqiang Cao wrote:
> Thanks Jeff!
>
> After posting the email, I did try CosineDistance. The problem is that
> the reducer part takes too long; it almost stops. The T2 values I tried
> with Cosine are 0.8, 0.5, 0.2, 0.1, 0.08, and 0.0008. In every case, the
> reducer quickly passed 67%, then progressed very, very slowly; for
> example, it takes several minutes to finish 1% more.
>
> Is there something wrong with my data?
>
> Best
> Baoqiang
>
>
> On Tue, Mar 13, 2012 at 3:08 PM, Jeff Eastman
> <[EMAIL PROTECTED]>  wrote:
>> EuclideanDistance is not a great choice for document clustering, especially
>> with a lot of terms. Suggest you try CosineDistance which will give you all
>> distances between 0 and 1. If you still end up with only one canopy it is
>> because T2 is too large. T1 has no effect upon the number of canopies
>> produced. Once you make T2 small enough you should see more canopies.
>>
>> You might also try k-means, sampling maybe k=50 initial clusters from your
>> dataset. Then you can tune k to see how that affects your clusters.
>>
>>
>>
>>
>> On 3/13/12 12:44 PM, Baoqiang Cao wrote:
>>> Hi,
>>>
>>> I'm trying to use canopy clustering on about 2 million documents. What I
>>> did is:
>>>
>>> mahout seq2sparse -i /mahout/input/chris_ce_20120120 -o
>>> /mahout/sparse/test -wt tfidf -nr 100 -x 50 -md 34000 -n 2 -s 5
>>>
>>> And canopy clustering:
>>>
>>> mahout canopy -i /mahout/sparse/test/tfidf-vectors -o
>>> /mahout/canopy-clusters/test -dm
>>> org.apache.mahout.common.distance.EuclideanDistanceMeasure -t1 20 -t2
>>> 1.5 -ow -cl
>>>
>>> And finally:
>>>
>>> mahout clusterdump -s /mahout/canopy-clusters/test/clusters-0-final
>>> -dt sequencefile  -o foo
>>>
>>> In "foo", there is only one line starting with "C-0{n=100 c=[",
>>> regardless of the t1 and t2 values I used.
>>>
>>> I tried "-t1 2000 -t2 1500", ..., "-t1 2 -t2 1.5". There is always one line in
>>> the final output from clusterdump. I'm not expecting a single cluster;
>>> can anyone help me find out why I got only one cluster?
>>>
>>> Thanks.
>>> Baoqiang
>>>
>>>
>