Help regarding ClusterOutputPostProcessor
gaurav redkar 2012-01-06, 11:48
Hello,
When I ran the ClusterOutputPostProcessor on synthetic_control_data in mapreduce mode, I observed that one directory also contained points belonging to 2 other clusters. The directories for those 2 clusters were not created, because their "part-*" files were empty and the function "movePartFilesToRespectiveDirectories()" was not able to create the directories to put them into. I have converted the sequence file containing the points belonging to those 3 clusters into a text file (by changing the output format to TextOutputFormat). Kindly find the attached part-file, which can be viewed.
Any suggestions as to why this might be happening?
Note: The program runs fine in sequential mode.
Thanks.
Re: Help regarding ClusterOutputPostProcessor
Lance Norskog 2012-01-06, 12:34
Apache mail throws away all attachments.
If you think that this is a bug, please file a JIRA. If you can change ClusterOutputPostProcessorTest to test for this scenario, please contribute it. With such a test it is possible to single-step map-reduce jobs inside your IDE, which helps because these directory manipulation problems are sometimes hard to find.
Lance
Re: Help regarding ClusterOutputPostProcessor
Paritosh Ranjan 2012-01-06, 12:42
ClusterOutputPostProcessorDriver has options to run either sequentially or in a mapreduce way.
If the clustering was done sequentially, then ClusterOutputPostProcessor should be run sequentially, and if the clustering was done in a mapreduce way, then run the ClusterOutputPostProcessor with option mapreduce=true.
If you have already tried this, and it's still not working, then filing a bug (as Lance mentioned) would be appropriate.
Re: Help regarding ClusterOutputPostProcessor
gaurav redkar 2012-01-25, 09:41
Hello,
I was able to rectify the aforementioned problem after I implemented a custom partitioner instead of using the default hash partitioner. I have another issue though. After running the post processor, the number of points that each cluster contains does not match the number of points each cluster should contain as stated by clusterdumper.
MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}
MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}
The n mentioned in clusters-n-final against each cluster is different from the number of points actually contained in the directory for each cluster. Any idea why this is happening?
PS: The dataset on which I tested the algorithm has 1000 records with 200 attributes per record. I can share the dataset that I have used if needed.
Thanks,
Gaurav
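For readers wondering why swapping the partitioner could matter, here is a minimal sketch in plain Python, not the actual Mahout or Hadoop code. It assumes the reducer key is an integer cluster id whose hash is the id itself, and it mimics the formula Hadoop's default HashPartitioner uses, (hash & Integer.MAX_VALUE) % numReduceTasks. The ids 287 and 145 come from the cluster names in this thread; the id 5, the reducer count of 3, and the helper names are made up for illustration.

```python
from collections import defaultdict

INT_MAX = 0x7FFFFFFF  # Java's Integer.MAX_VALUE


def hash_partition(cluster_id, num_reducers):
    # Same arithmetic as Hadoop's default HashPartitioner, assuming
    # (for this sketch only) that the key's hash is the id itself.
    return (cluster_id & INT_MAX) % num_reducers


def make_index_partition(cluster_ids):
    # Hypothetical custom partitioner: give every distinct cluster id
    # its own slot, so no two clusters share a reducer when there are
    # at least as many reducers as clusters.
    index = {cid: i for i, cid in enumerate(sorted(set(cluster_ids)))}

    def partition(cluster_id, num_reducers):
        return index[cluster_id] % num_reducers

    return partition


def group_by_partition(cluster_ids, partition, num_reducers):
    # Which cluster ids end up in which part file (one per reducer).
    parts = defaultdict(set)
    for cid in cluster_ids:
        parts[partition(cid, num_reducers)].add(cid)
    return {p: sorted(parts.get(p, ())) for p in range(num_reducers)}


cluster_ids = [287, 145, 5]  # 5 is an invented id for illustration
hashed = group_by_partition(cluster_ids, hash_partition, 3)
indexed = group_by_partition(cluster_ids, make_index_partition(cluster_ids), 3)

print("hash partitioner :", hashed)
print("index partitioner:", indexed)
```

With the hash formula, two ids share one partition and another partition stays empty, which is consistent with the empty "part-*" files and the mixed directory reported earlier in the thread. Whether this is exactly what happened in the original job is an assumption.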
Re: Help regarding ClusterOutputPostProcessor
Jeff Eastman 2012-01-25, 15:21
Mean Shift accumulates the pointIds of every point assigned to a cluster, so I would expect n= to be correct in the cluster dumper output. Most likely the postprocessor is misbehaving. Please create a JIRA and attach your dataset and we will take a look at it.
It would also be useful for you to include the exact CLI commands which you used to reproduce this problem.
Re: Help regarding ClusterOutputPostProcessor
gaurav redkar 2012-01-31, 04:25
Hello. As Jeff suggested, I created a JIRA issue. Kindly check out MAHOUT-966 <https://issues.apache.org/jira/browse/MAHOUT-966> and share your inputs.
Thanks,
Gaurav
Re: Help regarding ClusterOutputPostProcessor
praneet mhatre 2012-04-26, 22:10
Hi,
I had a look at the JIRA and it looks like the issue is still unresolved. I wanted to know if the suggestion that the postprocessor may be at fault has been verified.
I am using Dirichlet clustering for a project of mine and I also noticed the mismatch between the number of points actually present in the cluster and the value of n. I was wondering if the clusteredPoints directory contains the correct point assignment and if I could just use that for the purpose of my project.
Thanks!
--
Praneet Mhatre
Graduate Student
Donald Bren School of ICS
University of California, Irvine
Re: Help regarding ClusterOutputPostProcessor
Paritosh Ranjan 2012-04-27, 08:16
To answer:
"I was wondering if the clusteredPoints directory contains the correct point assignment and if I could just use that for the purpose of my project."
I would say "Yes".
If you read the comments in the issue, you will find that
"The number of members printed by the clusterdumper code match the number of points generated by the ClusterOutputPostProcessor for each cluster. Sadly this number does not match the value 'n' for that cluster in the clusterdumper implementation."
So, the bug is most probably in the value of "n". Other people have faced it too: http://comments.gmane.org/gmane.comp.apache.mahout.user/10906
So, go ahead with the clusteredPoints.
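To check the mismatch on your own output, something like the following sketch may help. It is plain Python rather than a Mahout tool, and it assumes the clusterdumper summary lines look like the "MSV-287{ n=90 c=[...}" lines quoted earlier in this thread. The observed counts are invented for illustration; in practice they would come from counting the points in each cluster's directory. The function names are hypothetical.

```python
import re

# Matches the start of a clusterdumper summary line such as
# "MSV-287{ n=90 c=[0.05195, ..." (format taken from this thread).
SUMMARY = re.compile(r"^\s*(?P<cluster>[A-Za-z]+-\d+)\{\s*n=(?P<n>\d+)")


def reported_sizes(lines):
    # Map cluster name -> n as reported by clusterdumper.
    sizes = {}
    for line in lines:
        match = SUMMARY.match(line)
        if match:
            sizes[match.group("cluster")] = int(match.group("n"))
    return sizes


def mismatches(reported, observed):
    # Clusters whose reported n differs from the observed point count,
    # as cluster -> (reported, observed).
    return {
        cluster: (n, observed.get(cluster, 0))
        for cluster, n in reported.items()
        if observed.get(cluster, 0) != n
    }


dump_lines = [
    "MSV-287{ n=90 c=[0.05195, 0.05675, 0.07151, 0.05713, 0.06946,...}",
    "MSV-145{ n=90 c=[0.93685, 0.93071, 0.93641, 0.94629, 0.94409,..}",
]
# Invented counts, standing in for the number of points actually found
# in each cluster's directory.
observed_counts = {"MSV-287": 90, "MSV-145": 75}

reported = reported_sizes(dump_lines)
print(mismatches(reported, observed_counts))
```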
Re: Help regarding ClusterOutputPostProcessor
Jeff Eastman 2012-04-27, 19:36
I think the answer to this question lies in how Dirichlet works: During each iteration, all points are assigned to clusters based upon a probabilistic assignment using a multinomial sampling of the cluster pdfs times a Dirichlet distribution mixture (see DirichletClusteringPolicy.select() for exact details). The value of "n" in each cluster in a clusters-i directory is the number of points that were assigned to it during the i-th iteration. If you are running the postprocessor over the last iteration, then "n" would be the number of points assigned to it during the last iteration only.
OTOH, when the ClusterOutputPostProcessor computes cluster assignments for each vector, it assigns only the cluster with the maximum pdf (the most likely cluster). Since each point is likely to be assigned to several of the possible clusters during the iterations, it is not likely that "n" will ever agree with the COP assignment.
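The difference between the two assignment rules can be seen in a toy sketch. This is plain Python and a deliberate simplification, not what DirichletClusteringPolicy.select() actually does: each point is given a fixed, made-up weight vector over two clusters, one pass draws a cluster at random in proportion to those weights (standing in for the sampled assignment that produces "n"), and another pass always picks the heaviest cluster (standing in for the postprocessor's maximum-pdf rule). All names and numbers are assumptions for illustration.

```python
import random


def sample_assignment(weights, rng):
    # Draw one cluster index with probability proportional to its weight.
    return rng.choices(range(len(weights)), weights=weights, k=1)[0]


def most_likely_assignment(weights):
    # Always pick the cluster with the largest weight.
    return max(range(len(weights)), key=lambda i: weights[i])


def count_assignments(points, assign, num_clusters):
    # Number of points sent to each cluster under the given rule.
    counts = [0] * num_clusters
    for weights in points:
        counts[assign(weights)] += 1
    return counts


rng = random.Random(0)  # fixed seed so reruns give the same draw

# 100 invented points: half lean towards cluster 0, half lean slightly
# towards cluster 1.
points = [[0.6, 0.4]] * 50 + [[0.45, 0.55]] * 50

sampled_n = count_assignments(points, lambda w: sample_assignment(w, rng), 2)
argmax_n = count_assignments(points, most_likely_assignment, 2)

print("sampled counts (like n)      :", sampled_n)
print("max-weight counts (like COP) :", argmax_n)
```

Both rules account for every point, but only the max-weight rule is guaranteed to split these points 50/50. The sampled counts depend on the random draw, so there is no reason to expect them to line up with the max-weight counts.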
Re: Help regarding ClusterOutputPostProcessor
praneet mhatre 2012-04-29, 21:53
Great, that helps! I'll just go ahead with the output file then and see what kind of results I get.
Thank you!
--
Praneet Mhatre
Graduate Student
Donald Bren School of ICS
University of California, Irvine