Baoqiang Cao 2012-03-13, 18:44
Hi,
I'm trying to use canopy clustering on about 2 million documents. What I did is:
mahout seq2sparse -i /mahout/input/chris_ce_20120120 -o /mahout/sparse/test -wt tfidf -nr 100 -x 50 -md 34000 -n 2 -s 5
And canopy clustering:
mahout canopy -i /mahout/sparse/test/tfidf-vectors -o /mahout/canopy-clusters/test -dm org.apache.mahout.common.distance.EuclideanDistanceMeasure -t1 20 -t2 1.5 -ow -cl
And finally:
mahout clusterdump -s /mahout/canopy-clusters/test/clusters-0-final -dt sequencefile -o foo
In "foo", there is only one line, starting with "C-0{n=100 c=[", regardless of the t1 and t2 values I used.
I tried "-t1 2000 -t2 1500", ..., "-t1 2 -t2 1.5". There is always just one line in the final output from clusterdump. I expected more than a single cluster. Can anyone help me find out why I get only one?
Thanks. Baoqiang
-
Re: canopy cluster size
Jeff Eastman 2012-03-13, 20:08
EuclideanDistance is not a great choice for document clustering, especially with a lot of terms. Suggest you try CosineDistance which will give you all distances between 0 and 1. If you still end up with only one canopy it is because T2 is too large. T1 has no effect upon the number of canopies produced. Once you make T2 small enough you should see more canopies.
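To illustrate the point about T2, here is a minimal Python sketch of canopy generation. It is a simplified stand-in, not Mahout's implementation: the function names and the toy vectors are made up for the example. It also assumes the vectors are L2-normalized (which is what `-n 2` in the seq2sparse command appears to request) and that TF-IDF weights are non-negative. Under those assumptions the Euclidean distance between any two documents cannot exceed sqrt(2), about 1.414, so any T2 of 1.5 or more would put every document into the first canopy.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def canopies(points, t1, t2, distance):
    """Simplified canopy generation (illustrative, not Mahout's code).

    A point joins every canopy whose center is closer than t1, but it
    starts a new canopy only if no existing center is closer than t2.
    """
    result = []  # list of (center, members)
    for p in points:
        strongly_bound = False
        for center, members in result:
            d = distance(p, center)
            if d < t1:
                members.append(p)
            if d < t2:
                strongly_bound = True
        if not strongly_bound:
            result.append((p, [p]))
    return result

# Toy unit-length, non-negative "document" vectors (hypothetical data).
docs = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.6, 0.8, 0.0)]

print(len(canopies(docs, 20, 1.5, euclidean)))         # T2 above sqrt(2)
print(len(canopies(docs, 20, 1.0, euclidean)))         # smaller T2
print(len(canopies(docs, 2, 1.0, euclidean)))          # same T2, different T1
print(len(canopies(docs, 0.9, 0.5, cosine_distance)))  # cosine distance
```

In this sketch, changing T1 changes only which points are attached to each canopy, while the number of canopies depends on T2 alone.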
You might also try k-means, sampling maybe k=50 initial clusters from your dataset. Then you can tune k to see how that affects your clusters.
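The k-means suggestion can be sketched in a few lines of Python. This is only an illustration of the idea (sample k initial centers from the data, then iterate), not Mahout's k-means driver. The function names, the toy data, and the choice of cosine distance on normalized vectors are assumptions made for the example.

```python
import math
import random

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n else list(v)

def cosine_distance(a, b):
    # Both vectors are assumed to be unit length already.
    return 1.0 - sum(x * y for x, y in zip(a, b))

def kmeans(points, k, iterations=10, seed=0):
    """Toy k-means with cosine distance and k centers sampled from the data."""
    rng = random.Random(seed)
    points = [normalize(p) for p in points]
    centers = rng.sample(points, k)  # sample k initial clusters from the dataset
    assignment = []
    for _ in range(iterations):
        # Assign each point to its nearest center.
        assignment = [
            min(range(k), key=lambda j: cosine_distance(p, centers[j]))
            for p in points
        ]
        # Recompute each center as the normalized mean of its members.
        for j in range(k):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centers[j] = normalize(mean)
    return centers, assignment

# Hypothetical data: two obvious groups of directions.
data = [(1, 0.1), (1, 0.2), (0.9, 0.1), (0.1, 1), (0.2, 1), (0.1, 0.9)]
centers, assignment = kmeans(data, k=2)
print(assignment)
```

Rerunning with different values of k is the tuning step mentioned above. On real data the result depends on which points are sampled as initial centers.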
-
Re: canopy cluster size
Baoqiang Cao 2012-03-13, 20:51
Thanks Jeff!
After posting the email, I did try CosineDistance. The problem is that the reducer takes too long; it almost stops. The T2 values I tried with Cosine are 0.8, 0.5, 0.2, 0.1, 0.08, and 0.0008. In every case, the reducer quickly passed 67% and then progressed very slowly; for example, it took several minutes to finish 1% more.
Is something wrong with my data?
Best, Baoqiang
-
Re: canopy cluster size
Jeff Eastman 2012-03-13, 22:01
No, Canopy only uses a single reducer, so what's happening is many mappers are munching your data in parallel and then the poor little reducer has to combine them all. It is slow going and a problem with Canopy that I don't know how to fix. It is complicated by the fact that all the canopy centers become very dense vectors in this process, consuming memory and CPU. You might play with the t3 and t4 parameters, which set different T1 and T2 values for the reduce step. That could improve reducer performance.
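A rough Python sketch of that two-stage structure follows. It is a simplification under stated assumptions, not Mahout's code: points are plain numbers, the distance is an absolute difference, mappers emit their seed points as centers, and `t2` and `t4` are the hypothetical names for the mapper and reducer thresholds.

```python
def canopy_centers(points, threshold):
    """Keep a point as a new center only if no existing center is closer
    than the threshold (1-D points, absolute difference as distance)."""
    centers = []
    for p in points:
        if all(abs(p - c) >= threshold for c in centers):
            centers.append(p)
    return centers

def map_reduce_canopy(splits, t2, t4):
    # "Map" phase: every split is clustered independently with t2.
    mapper_output = [c for split in splits for c in canopy_centers(split, t2)]
    # "Reduce" phase: one reducer re-clusters all mapper centers with t4.
    return mapper_output, canopy_centers(mapper_output, t4)

# Two hypothetical input splits.
splits = [[0.0, 0.1, 0.5, 0.9], [0.05, 0.45, 1.0]]

mapper_output, final = map_reduce_canopy(splits, t2=0.2, t4=0.2)
print(len(mapper_output), final)

_, coarser = map_reduce_canopy(splits, t2=0.2, t4=0.6)
print(coarser)
```

The single reducer has to compare every mapper center against every center it has kept so far, so a small T2 in the mappers means a lot of work in the reduce step. A larger reducer threshold keeps fewer centers and therefore does fewer comparisons.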
Suggest you try k-means. With it you can specify the number of clusters you want and use that many reducers to improve scalability.
-
Re: canopy cluster size
Baoqiang Cao 2012-03-14, 13:30
Much appreciated!
It helps a lot in clarifying canopy for me. After all these adventures, I guess kmeans is the inevitable solution for my problem. Ironically, I went to canopy in the hope of getting better results out of kmeans.
Thanks again.
Baoqiang
-
Re: canopy cluster size
Jeff Eastman 2012-03-14, 13:52
You're welcome. You might also try Dirichlet with a DistanceMeasureClusterDistribution on a CosineDistanceMeasure. See DirichletClusterer or the wiki for an explanation of why this might also be an attractive approach. With enough initial models (maybe -k=50 or 100 in your case) it is essentially non-parametric. You can also use k reducers with Dirichlet (and with k-means, btw) to improve scalability. See TestL1ModelClustering for an example of this approach.
-
Re: canopy cluster size
Baoqiang Cao 2012-03-14, 17:03
Very good points! I'm going to give Dirichlet a try. Thanks as always. Baoqiang