Mahout, mail # user - canopy cluster size


Re: canopy cluster size
Baoqiang Cao 2012-03-14, 17:03
Very good points! I'm going to give Dirichlet a try.
Thanks as always.
Baoqiang

On Wed, Mar 14, 2012 at 8:52 AM, Jeff Eastman
<[EMAIL PROTECTED]> wrote:
> YW, you might also try Dirichlet with a DistanceMeasureClusterDistribution
> on a CosineDistanceMeasure. See DirichletClusterer or the wiki for an
> explanation of why this might also be an attractive approach. With enough
> initial models (maybe -k=50 or 100 in your case) it is essentially
> non-parametric. You can also use k reducers with Dirichlet (also k-means,
> btw) to improve scalability. See TestL1ModelClustering for an example of
> this approach.
>
>
> On 3/14/12 7:30 AM, Baoqiang Cao wrote:
>>
>> Appreciate!
>>
>> It helps a lot in clarifying canopy for me. After all these adventures,
>> I guess kmeans is the inevitable solution for my problem. Ironically,
>> I went to canopy in hope of getting better results out of kmeans.
>>
>> Thanks again.
>>
>> Baoqiang
>>
>>
>> On Tue, Mar 13, 2012 at 5:01 PM, Jeff Eastman
>> <[EMAIL PROTECTED]>  wrote:
>>>
>>> No, Canopy only uses a single reducer, so what's happening is many mappers
>>> are munching your data in parallel and then the poor little reducer has to
>>> combine them all. It is slow going and a problem with Canopy that I don't
>>> know how to fix. It is complicated by the fact that all the canopy centers
>>> become very dense vectors in this process, consuming memory and CPU. You
>>> might play with the t3 and t4 parameters, which set different T1/T2 values
>>> for the reduce step. That could improve reducer performance.
>>>
>>> Suggest you try k-means. With it you can specify the number of clusters you
>>> want and use that many reducers to improve scalability.
>>>
>>>
>>>
>>> On 3/13/12 2:51 PM, Baoqiang Cao wrote:
>>>>
>>>> Thanks Jeff!
>>>>
>>>> After posting the email, I did try CosineDistance. The problem is that
>>>> the reducer part takes too long; it almost stops. The T2 values I tried
>>>> with Cosine are 0.8, 0.5, 0.2, 0.1, 0.08, and 0.0008. In every case, the
>>>> reducer quickly passed 67% and then progressed very slowly; for
>>>> example, it takes several minutes to finish 1% more.
>>>>
>>>> Is there something wrong with my data?
>>>>
>>>> Best
>>>> Baoqiang
>>>>
>>>>
>>>> On Tue, Mar 13, 2012 at 3:08 PM, Jeff Eastman
>>>> <[EMAIL PROTECTED]>    wrote:
>>>>>
>>>>> EuclideanDistance is not a great choice for document clustering, especially
>>>>> with a lot of terms. Suggest you try CosineDistance, which will give you all
>>>>> distances between 0 and 1. If you still end up with only one canopy, it is
>>>>> because T2 is too large. T1 has no effect upon the number of canopies
>>>>> produced. Once you make T2 small enough you should see more canopies.
>>>>>
>>>>> You might also try k-means, sampling maybe k=50 initial clusters from your
>>>>> dataset. Then you can tune k to see how that affects your clusters.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 3/13/12 12:44 PM, Baoqiang Cao wrote:
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to use canopy clustering on about 2 million documents. What I
>>>>>> did is:
>>>>>>
>>>>>> mahout seq2sparse -i /mahout/input/chris_ce_20120120 -o
>>>>>> /mahout/sparse/test -wt tfidf -nr 100 -x 50 -md 34000 -n 2 -s 5
>>>>>>
>>>>>> And canopy clustering:
>>>>>>
>>>>>> mahout canopy -i /mahout/sparse/test/tfidf-vectors -o
>>>>>> /mahout/canopy-clusters/test -dm
>>>>>> org.apache.mahout.common.distance.EuclideanDistanceMeasure -t1 20 -t2
>>>>>> 1.5 -ow -cl
>>>>>>
>>>>>> And finally:
>>>>>>
>>>>>> mahout clusterdump -s /mahout/canopy-clusters/test/clusters-0-final
>>>>>> -dt sequencefile  -o foo
>>>>>>
>>>>>> In "foo", there is only one line starting with "C-0{n=100 c=[",
>>>>>> regardless of the t1 and t2 values I used.
>>>>>>
>>>>>> I tried "-t1 2000 -t2 1500", ..., "-t1 2 -t2 1.5". There is always one
>>>>>> line in the final output from clusterdump. I'm expecting more than a
>>>>>> single cluster. Can anyone help me find out why I get only one cluster?
>>>>>>
>>>>>> Thanks.
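
Editor's sketch (not part of the original thread): the advice that T2, not T1, controls the number of canopies can be illustrated with a small Python toy. Everything in it is hypothetical and written for illustration only: the helper names (`l2_normalize`, `canopies`, `make_docs`), the synthetic data, and the simplified single-pass canopy loop, which is not Mahout's implementation. It also assumes that `-n 2` in the seq2sparse command requests L2 normalization. Under that assumption, non-negative unit-length vectors can never be farther apart than sqrt(2), about 1.414, in Euclidean distance, so any T2 of 1.5 or more would sweep every document into the first canopy. That is one possible explanation for the single cluster; the thread does not confirm it.

```python
import math
import random


def l2_normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)


def canopies(points, distance, t1, t2):
    """Simplified single-pass canopy selection (illustrative, not Mahout's code)."""
    remaining = list(points)
    result = []
    while remaining:
        center = remaining[0]
        # T1 only decides which points are listed as loose members of a canopy.
        members = [p for p in remaining if distance(center, p) < t1]
        result.append((center, members))
        # Only T2 removes points from further consideration, so only T2
        # controls how many canopies are created.
        remaining = [p for p in remaining[1:] if distance(center, p) >= t2]
    return result


def make_docs(groups=3, per_group=20, terms_per_group=10, seed=42):
    """Synthetic non-negative 'documents'; each group uses its own disjoint terms."""
    rng = random.Random(seed)
    dim = groups * terms_per_group
    docs = []
    for g in range(groups):
        for _ in range(per_group):
            vec = [0.0] * dim
            for j in range(terms_per_group):
                vec[g * terms_per_group + j] = rng.uniform(0.1, 1.0)
            docs.append(l2_normalize(vec))
    return docs


docs = make_docs()
for t1, t2 in [(20, 1.5), (2, 1.5), (20, 1.0), (2, 1.0)]:
    print("euclidean t1=%s t2=%s canopies=%d"
          % (t1, t2, len(canopies(docs, euclidean, t1, t2))))
print("cosine t1=0.8 t2=0.5 canopies=%d"
      % len(canopies(docs, cosine_distance, 0.8, 0.5)))
```

In this toy, both Euclidean runs with T2 = 1.5 yield a single canopy whatever T1 is, while T2 = 1.0 (or a cosine T2 of 0.5) separates at least the three disjoint term groups. Changing T1 alone never changes the count.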