-
Dirichlet clustering woes ...Timothy Potter 2011-02-24, 22:18
My colleague Szymon and I have been working on Mahout-588 and hoped to include Dirichlet in our clustering benchmarks, but unfortunately we have not had much success. So we're reaching out to the community to see whether anyone else has been successful with somewhat large-scale Dirichlet clustering.

Specifically, we have 6,077,604 sparse TFIDF vectors generated from the Apache Mail Archives.

Using vectors with 40K dimensions on a 5-node cluster, the job runs nicely until map 100% and reduce 92%, and then it virtually stops. It takes 3 min to reach 93%, 7 min to reach 94%, 23 min to reach 95%, 1:12 to reach 96%, and after another 4 h nothing. The CPUs on the nodes run at almost 100% and the full 6GB is in use.

So then we tried vectors with 20K dimensions and were able to complete 1 iteration after 7 hrs 32 mins. The last 3% of the reduce took about 1 h per percent. I had 4 working nodes (+1 namenode), Xmx2500, and the max number of reducers set to 1.

The job args we're using are:

    bin/mahout dirichlet \
      -i /asf-mail-archives/mahout-0.4/tfidf-vectors/ \
      -o /asf-mail-archives/mahout-0.4/dirichlet/ \
      -a0 1.0 \
      -x 10 \
      --distanceMeasure org.apache.mahout.common.distance.CosineDistanceMeasure \
      -k 60

We're still studying the code to diagnose this ourselves, but also wanted to get some feedback.

Kind regards,
Timothy Potter
[EMAIL PROTECTED]
-
Re: Dirichlet clustering woes ...Ted Dunning 2011-02-24, 22:25
Do you have any stats about how many clusters there are and whether a vast number of points are being assigned to a single cluster?

I am a little surprised at your results since the Dirichlet clustering doesn't have any tall poles (that I know of). Every point is compared to every cluster and contributes to every cluster. As such, stragglers shouldn't be a big deal.

Did you check the usual suspects with respect to swapping and GC?
-
Re: Dirichlet clustering woes ...Timothy Potter 2011-02-24, 23:01
Thanks Ted, good to know about not having any "tall poles". I'll need to dig into it a bit more to answer your first question, but at least that gives me something to look for.
-
RE: Dirichlet clustering woes ...Jeff Eastman 2011-02-24, 23:09
I'm surprised too. It looks like you are creating 60 clusters which is completely reasonable. During map processing, each point is compared to each cluster to generate its pdf() and the point is assigned to one of the clusters using a multinomial over all the pdfs. If you have many points assigned to one of your clusters by this process then the copy-merge step could take a while to build the reducer input for that cluster. How many mappers are being created from your dataset?
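The map-side assignment described above can be sketched roughly as follows. This is a minimal Python illustration, not Mahout's code: `assign_cluster` and the pdf values are hypothetical, and the only assumption is that each point is drawn into one cluster with probability proportional to that cluster's pdf value.

```python
import random

def assign_cluster(pdfs, rng):
    """Draw one cluster index with probability proportional to its pdf value."""
    total = sum(pdfs)
    if total <= 0.0:
        # Degenerate case (all pdfs zero): fall back to a uniform draw.
        return rng.randrange(len(pdfs))
    threshold = rng.random() * total
    cumulative = 0.0
    for index, value in enumerate(pdfs):
        cumulative += value
        if threshold < cumulative:
            return index
    return len(pdfs) - 1  # guard against floating-point round-off

rng = random.Random(42)
# Hypothetical pdf values for one point against three clusters; the first
# cluster's model dominates, as it might with a badly placed prior.
pdfs = [0.90, 0.05, 0.05]
counts = [0, 0, 0]
for _ in range(10_000):
    counts[assign_cluster(pdfs, rng)] += 1
print(counts)
```

When one pdf dominates for most points, nearly every draw lands in the same cluster, which is the situation where a single reducer ends up with most of the input.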
The reducers then accumulate the posterior statistics for one or more clusters. You can try increasing the number of reducers (up to k), which can help with this step. Again, if most of your points are being assigned to a single cluster, that reducer will be bogged down observing them all. Also, since the models accumulate Gaussian statistics to compute mean and std posterior values, these values will tend to become denser as many vectors are summed, and this can drive up memory consumption during the reduce step.

You might try increasing the value of -k to spread the vectors over more clusters. Adjusting the value of -a0 could also cause input vectors to be more evenly distributed over the initial prior clusters (which have random center vectors). For text, you might find that the L1Model with a CosineDistanceMeasure could work better than the default NormalModelDistribution.

You are breaking new ground here. I've run Dirichlet over Reuters and it seemed to work ok at that scale.

Jeff
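The point about summed statistics becoming denser can be illustrated with a small Python sketch. Sparse vectors are modeled here as plain dicts, and the 100 non-zero terms per vector is an assumed figure, not one from the thread; `add_sparse` and `random_sparse_vector` are hypothetical helpers.

```python
import random

def add_sparse(accumulator, vector):
    """Add a sparse vector (dict of index -> value) into the accumulator in place."""
    for index, value in vector.items():
        accumulator[index] = accumulator.get(index, 0.0) + value
    return accumulator

def random_sparse_vector(dimensions, nonzeros, rng):
    """Build a hypothetical sparse vector with `nonzeros` random non-zero entries."""
    indices = rng.sample(range(dimensions), nonzeros)
    return {index: rng.random() + 0.1 for index in indices}

rng = random.Random(0)
dimensions = 40_000      # matches the 40K-dimension run in the thread
nonzeros = 100           # assumed terms per message; not from the thread
running_sum = {}
fill = []                # fraction of dimensions that are non-zero in the sum
for _ in range(2_000):
    add_sparse(running_sum, random_sparse_vector(dimensions, nonzeros, rng))
    fill.append(len(running_sum) / dimensions)
print(f"fill after 1 vector: {fill[0]:.4f}, after 2000: {fill[-1]:.4f}")
```

Each input vector is very sparse, but the running sum fills in quickly, so a reducer that observes a large cluster may effectively be holding dense vectors.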
-
Re: Dirichlet clustering woes ...Ted Dunning 2011-02-24, 23:21
Ahhh... This is likely the tall pole that I thought wasn't there.

On Thu, Feb 24, 2011 at 3:09 PM, Jeff Eastman <[EMAIL PROTECTED]> wrote:
> and the point is assigned to one of the clusters using a multinomial over
> all the pdfs.
-
Re: Dirichlet clustering woes ...Ted Dunning 2011-02-24, 23:23
We should probably have an option to down-sample large clusters to make the PDF computation faster.

On Thu, Feb 24, 2011 at 3:09 PM, Jeff Eastman <[EMAIL PROTECTED]> wrote:
> Again, if most of your points are being assigned to a single cluster that
> reducer will be bogged down observing them all.
-
Re: Dirichlet clustering woes ...Ted Dunning 2011-02-24, 23:23
60 x 40K = 2400K = 2GB.
How much memory does each reducer get? If it is significantly larger than 3GB, you should be fine.

On Thu, Feb 24, 2011 at 3:09 PM, Jeff Eastman <[EMAIL PROTECTED]> wrote:
> The reducers then accumulate the posterior statistics for one or more
> clusters. You can try increasing the number of reducers (up to k) which can
> help with this step. Again, if most of your points are being assigned to a
> single cluster that reducer will be bogged down observing them all. Also,
> since the models accumulate Gaussian statistics to compute mean and std
> posterior values these values will tend to become denser as many vectors are
> summed and this can drive up memory consumption during the reduce step.
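A back-of-envelope version of this sizing in Python. The 8-byte element size and the two statistics per model are assumptions for illustration, not figures from the thread.

```python
clusters = 60             # -k from the job arguments
dimensions = 40_000       # 40K-dimension TFIDF vectors
bytes_per_value = 8       # assumption: dense double-precision storage
statistics_per_model = 2  # assumption: one mean and one std vector per cluster

values = clusters * dimensions
dense_bytes = values * bytes_per_value * statistics_per_model
print(f"{values:,} values per statistic")
print(f"{dense_bytes / 1e6:.1f} MB for dense mean and std vectors")
```

Under these assumptions the raw arrays alone come to tens of megabytes, well below the gigabyte figure above, so this should be read as a lower bound that ignores JVM object overhead, any additional accumulators, and buffered reducer input.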
-
Re: Dirichlet clustering woes ...Timothy Potter 2011-02-24, 23:31
I'm re-running it right now on a 4-node cluster of EC2 xlarge instances with 3 reducers / node and 4GB max heap per child ... none are swapping and all have load avg around 3 ... will post results once I have them.

Intuitively, your comment about all points being assigned to one cluster makes sense because we get through the map tasks and all the reducers except one very quickly ... and then it bogs down.

Thanks!
-
Re: Dirichlet clustering woes ...Ted Dunning 2011-02-24, 23:47
This sounds like a classic case of a monster cluster.

On Thu, Feb 24, 2011 at 3:31 PM, Timothy Potter <[EMAIL PROTECTED]> wrote:
> Intuitively, your comment about all points being assigned to one cluster
> makes sense because we get through the map tasks and all the reducers
> except one very quickly ... and then it bogs down.
-
RE: Dirichlet clustering woes ...Jeff Eastman 2011-02-25, 00:02
It indicates the prior cluster centers (as initialized by the ModelDistribution) and std are waaaay off target.
-
Re: Dirichlet clustering woes ...Timothy Potter 2011-02-25, 17:40
Quick update -- making some progress with this by increasing -a0 to 10 instead of 1 ... The first iteration completed successfully in 1 hr 8 mins.

I had 72 map tasks and 12 reducers; the reducers completed at roughly the same time.

However, I'm not out of the woods yet, as the map tasks seem pretty bogged down in Iteration 2. The number of vectors per cluster from Iteration 1 is included below.

I also want to try the L1Model as suggested by Jeff.

Any tips on where I can learn more about why raising -a0 to 10 caused the input vectors to be more evenly distributed over the initial prior clusters?

Thanks for your help.

Distribution of vectors per cluster after 1 Dirichlet iteration:

ID       Num Vecs
:C-0:    621236
:C-1:    502712
:C-5:    397233
:C-2:    396496
:C-3:    369936
:C-4:    361496
:C-6:    290305
:C-7:    277959
:C-9:    277152
:C-8:    248298
:C-12:   194878
:C-10:   192341
:C-11:   180626
:C-13:   149143
:C-14:   136651
:C-15:   125184
:C-17:   115815
:C-16:   107250
:C-18:   106541
:C-19:   92748
:C-21:   80788
:C-20:   72520
:C-24:   68924
:C-23:   66936
:C-22:   64589
:C-25:   60714
:C-26:   59370
:C-27:   47513
:C-28:   34267
:C-29:   33357
:C-30:   32002
:C-31:   30125
:C-32:   28909
:C-33:   24937
:C-36:   23991
:C-35:   22988
:C-38:   17363
:C-34:   16684
:C-37:   15835
:C-40:   13528
:C-39:   11476
:C-42:   11118
:C-44:   10630
:C-41:   9611
:C-43:   8736
:C-46:   8707
:C-45:   8371
:C-47:   7570
:C-49:   5138
:C-48:   4979
:C-50:   4378
:C-53:   4288
:C-51:   4001
:C-52:   3727
:C-54:   3146
:C-55:   2730
:C-56:   2528
:C-58:   2401
:C-57:   2098
:C-59:   1964
-
Re: Dirichlet clustering woes ...Ted Dunning 2011-02-25, 18:03
If you plot these you will see an exponential distribution of cluster size that fits

    exp(13.23 - 0.09483 * x$cluster)

It is mildly interesting that this isn't a power law, but you have the same take-away. The second pass and later passes are going to have a problem with key skew.
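The fit quoted above can be approximated with an ordinary least-squares line through ln(size) versus cluster ID. This is only a sketch: the thread does not say how the original fit was computed, so the coefficients may differ slightly, and `fit_log_linear` is a hypothetical helper. The sizes are the per-cluster counts reported after iteration 1, reordered by cluster ID.

```python
import math

# Cluster sizes after iteration 1, indexed by cluster ID (C-0 .. C-59).
sizes = [
    621236, 502712, 396496, 369936, 361496, 397233, 290305, 277959, 248298, 277152,
    192341, 180626, 194878, 149143, 136651, 125184, 107250, 115815, 106541, 92748,
    72520, 80788, 64589, 66936, 68924, 60714, 59370, 47513, 34267, 33357,
    32002, 30125, 28909, 24937, 16684, 22988, 23991, 15835, 17363, 11476,
    13528, 9611, 11118, 8736, 10630, 8371, 8707, 7570, 4979, 5138,
    4378, 4001, 3727, 4288, 3146, 2730, 2528, 2098, 2401, 1964,
]

def fit_log_linear(values):
    """Least-squares fit of ln(value) against index; returns (intercept, slope)."""
    n = len(values)
    xs = range(n)
    ys = [math.log(v) for v in values]
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

intercept, slope = fit_log_linear(sizes)
largest_share = max(sizes) / sum(sizes)
even_share = 1 / len(sizes)
print(f"size ~ exp({intercept:.2f} + {slope:.5f} * cluster)")
print(f"largest cluster holds {largest_share:.1%} of assigned vectors "
      f"(an even split would be {even_share:.1%})")
```

Comparing the largest cluster's share with an even split gives a quick measure of the key skew that the reducers for later passes would see.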
-
RE: Dirichlet clustering woes ...Jeff Eastman 2011-02-25, 19:24
Not sure of the subtleties of the Dirichlet distribution but rDirichlet in UncommonDistributions adds the alpha0 value to the total counts when it samples from the Beta distribution. In the first iteration, when the total counts are zero, it increases the probability of choosing a new cluster. During subsequent iterations, it is completely overshadowed by the total counts.
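The effect described above can be sketched in Python. This is a stick-breaking style draw written from the description in this message, not a copy of Mahout's rDirichlet; `sample_mixture_weights` and `mean_first_weight` are hypothetical names, and the exact Beta parameters are an assumption.

```python
import random

def sample_mixture_weights(counts, alpha0, rng):
    """Stick-breaking style draw of mixture weights from per-cluster counts."""
    remaining = float(sum(counts))
    stick = 1.0
    weights = []
    for count in counts:
        remaining -= count
        # alpha0 is added to the counts still to come; when every count is
        # zero the draw reduces to Beta(1, alpha0).
        fraction = rng.betavariate(1.0 + count, alpha0 + max(remaining, 0.0))
        weight = fraction * stick
        weights.append(weight)
        stick -= weight
    return weights

def mean_first_weight(counts, alpha0, draws=2_000, seed=1):
    """Average weight given to the first cluster over repeated draws."""
    rng = random.Random(seed)
    total = sum(sample_mixture_weights(counts, alpha0, rng)[0] for _ in range(draws))
    return total / draws

empty = [0] * 60       # first iteration: no observations yet
busy = [1_000] * 60    # later iteration: counts dominate alpha0
for alpha0 in (1.0, 10.0):
    print(alpha0, mean_first_weight(empty, alpha0), mean_first_weight(busy, alpha0))
```

In this sketch, with all counts at zero, the first cluster's expected weight is 1/(1 + alpha0) and each later cluster's expected weight shrinks by a factor of alpha0/(1 + alpha0), so a larger alpha0 spreads the initial mass over more clusters. Once the counts are large, changing alpha0 from 1 to 10 barely moves the weights. Whether the real implementation behaves exactly this way is not confirmed here.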