Mahout, mail # user - canopy cluster size


Re: canopy cluster size
Baoqiang Cao 2012-03-14, 13:30
Much appreciated!

That helped a lot in clarifying canopy for me. After all these adventures,
I guess kmeans is the inevitable solution for my problem. Ironically,
I went to canopy in the hope of getting better results out of kmeans.

Thanks again.

Baoqiang
On Tue, Mar 13, 2012 at 5:01 PM, Jeff Eastman
<[EMAIL PROTECTED]> wrote:
> No, Canopy only uses a single reducer, so what's happening is many mappers
> are munching your data in parallel and then the poor little reducer has to
> combine them all. It is slow going and a problem with Canopy that I don't
> know how to fix. It is complicated by the fact that all the canopy centers
> become very dense vectors in this process, consuming memory and cpu. You
> might play with t3 and t4 parameters which set different T1/2 values for the
> reduce step. That could improve reducer performance.
>
> Suggest you try k-means. With it you can specify the number of clusters you
> want and use that many reducers to improve scalability.
>
>
>
> On 3/13/12 2:51 PM, Baoqiang Cao wrote:
>>
>> Thanks Jeff!
>>
>> After posting the email, I did try CosineDistance. The problem is that
>> the reducer part takes too long; it almost stops. The T2 values I tried
>> with Cosine are 0.8, 0.5, 0.2, 0.1, 0.08, and 0.0008. In every case, the
>> reducer quickly passed 67% and then progressed very, very slowly; for
>> example, it takes several minutes to finish 1% more.
>>
>> Is something wrong with my data?
>>
>> Best
>> Baoqiang
>>
>>
>> On Tue, Mar 13, 2012 at 3:08 PM, Jeff Eastman
>> <[EMAIL PROTECTED]>  wrote:
>>>
>>> EuclideanDistance is not a great choice for document clustering,
>>> especially with a lot of terms. Suggest you try CosineDistance which
>>> will give you all distances between 0 and 1. If you still end up with
>>> only one canopy it is because T2 is too large. T1 has no effect upon
>>> the number of canopies produced. Once you make T2 small enough you
>>> should see more canopies.
>>>
>>> You might also try k-means, sampling maybe k=50 initial clusters from
>>> your dataset. Then you can tune k to see how that affects your clusters.
>>>
>>>
>>>
>>>
>>> On 3/13/12 12:44 PM, Baoqiang Cao wrote:
>>>>
>>>> Hi,
>>>>
>>>> I'm trying to use canopy clustering on about 2 million documents. What I
>>>> did is:
>>>>
>>>> mahout seq2sparse -i /mahout/input/chris_ce_20120120 -o
>>>> /mahout/sparse/test -wt tfidf -nr 100 -x 50 -md 34000 -n 2 -s 5
>>>>
>>>> And canopy clustering:
>>>>
>>>> mahout canopy -i /mahout/sparse/test/tfidf-vectors -o
>>>> /mahout/canopy-clusters/test -dm
>>>> org.apache.mahout.common.distance.EuclideanDistanceMeasure -t1 20 -t2
>>>> 1.5 -ow -cl
>>>>
>>>> And finally:
>>>>
>>>> mahout clusterdump -s /mahout/canopy-clusters/test/clusters-0-final
>>>> -dt sequencefile  -o foo
>>>>
>>>> In "foo", there is only one line, starting with "C-0{n=100 c=[",
>>>> regardless of the t1 and t2 values I used.
>>>>
>>>> I tried "-t1 2000 -t2 1500", ..., "-t1 2 -t2 1.5". There is always one
>>>> line in the final output from clusterdump. I was not expecting a single
>>>> cluster. Can anyone help me find out why I got only one cluster?
>>>>
>>>> Thanks.
>>>> Baoqiang
>>>>
>>>>
>>
>
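
For readers following the thread, here is a rough Python sketch of the single-pass canopy idea, meant only to illustrate Jeff's point that T2, not T1, decides how many canopies you get. It is not Mahout's code: the function names (`normalize`, `euclidean`, `cosine_distance`, `canopies`) and the five toy vectors are made up for illustration. The toy vectors are non-negative and scaled to unit L2 length, on the assumption that `-n 2` in the seq2sparse call leaves the TF-IDF vectors L2-normalized. Under that assumption no two vectors can be more than sqrt(2) (about 1.414) apart in Euclidean distance, so any T2 of 1.5 or more would put everything in one canopy.

```python
import math

def normalize(v):
    """Scale a vector to unit L2 length (assumed effect of seq2sparse -n 2)."""
    norm = math.sqrt(sum(x * x for x in v))
    return tuple(x / norm for x in v)

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    # 1 - cosine similarity; stays within [0, 1] for non-negative vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms

def canopies(points, t1, t2, distance):
    """Single-pass canopy sketch: returns a list of (center, members) pairs."""
    result = []
    for p in points:
        strongly_bound = False
        for center, members in result:
            d = distance(center, p)
            if d < t1:
                members.append(p)       # loosely covered by this canopy
            if d < t2:
                strongly_bound = True   # too close to start a new canopy
        if not strongly_bound:
            result.append((p, [p]))     # p becomes a new canopy center
    return result

# Hypothetical toy data: three rough "topics" in a 3-term vocabulary.
docs = [normalize(v) for v in [
    (1.0, 0.0, 0.0), (0.9, 0.1, 0.0),
    (0.0, 1.0, 0.0), (0.0, 0.9, 0.1),
    (0.0, 0.0, 1.0),
]]

one_euclid = canopies(docs, t1=2.0, t2=1.5, distance=euclidean)
three_cos = canopies(docs, t1=0.8, t2=0.5, distance=cosine_distance)
three_cos_big_t1 = canopies(docs, t1=2.0, t2=0.5, distance=cosine_distance)
five_cos = canopies(docs, t1=0.8, t2=0.001, distance=cosine_distance)

print(len(one_euclid), len(three_cos), len(three_cos_big_t1), len(five_cos))
```

On this toy data, changing T1 alters only which points are listed as members, while shrinking T2 is what raises the canopy count. That says nothing about the reducer slowdown discussed above, which comes from how the MapReduce job merges the mapper output.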