How to use clusterpp? Tharindu Mathew 2012-02-17, 08:48
Hi,

I'm trying to reproduce https://issues.apache.org/jira/browse/MAHOUT-966

When executing clusterpp, I get output such as this:

$bin/hadoop fs -cat /user/mackie/output/ppclusters/part-r-00999
SEQorg.apache.hadoop.io.Text%org.apache.mahout.math.VectorWritable_䪖?g???8?-??

Is this normal? I thought I would get some human-readable output when this was used. I tried searching around but couldn't find any documentation regarding clusterpp.

--
Regards,
Tharindu
blog: http://mackiemathew.com/
-
Re: How to use clusterpp? Paritosh Ranjan 2012-02-17, 09:06
Check this out: https://cwiki.apache.org/MAHOUT/top-down-clustering.html

It explains how to use clusterpp.

You will not get a human-readable version. The output is written as a SequenceFile, a binary key-value format that is not human readable. You will have to iterate over it, read each key and value, and print them to a text file or the console.

Look into the package org.apache.mahout.common.iterator.sequencefile. It contains utility classes that can help you iterate through SequenceFile files.
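To illustrate why `hadoop fs -cat` shows the two class names followed by unreadable bytes, here is a minimal Python sketch that parses the start of a SequenceFile header: the `SEQ` magic bytes, a version byte, and the length-prefixed key and value class names. It is not Mahout or Hadoop code. The function names are hypothetical, the sample header is built by hand, and the version byte 6 and the single-byte length prefix are assumptions that should be checked against the Hadoop version in use.

```python
import io


def read_sequencefile_header(stream):
    """Return (version, key_class, value_class) from the start of a SequenceFile."""
    magic = stream.read(3)
    if magic != b"SEQ":
        raise ValueError("missing SEQ magic bytes; not a SequenceFile")
    version = stream.read(1)[0]

    def read_short_string():
        # Assumption: the length prefix is a single byte, which holds for
        # class names shorter than 128 bytes. Longer strings use a
        # multi-byte variable-length integer that this sketch rejects.
        length = stream.read(1)[0]
        if length > 127:
            raise ValueError("multi-byte length prefix not supported here")
        return stream.read(length).decode("utf-8")

    key_class = read_short_string()
    value_class = read_short_string()
    return version, key_class, value_class


def short_string(name):
    """Encode a class name as one length byte followed by its UTF-8 bytes."""
    data = name.encode("utf-8")
    return bytes([len(data)]) + data


# Hand-built sample header (hypothetical data; version byte 6 is an assumption).
sample = (
    b"SEQ"
    + bytes([6])
    + short_string("org.apache.hadoop.io.Text")
    + short_string("org.apache.mahout.math.VectorWritable")
)

print(read_sequencefile_header(io.BytesIO(sample)))
# The value class name is 37 bytes long, and byte 37 is "%" in ASCII.
print(chr(len("org.apache.mahout.math.VectorWritable")))
```

Under these assumptions, the `%` between the two class names in the `-cat` output is simply the length byte of the value class name, which is consistent with the file being a binary container rather than corrupted text.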
-
Re: How to use clusterpp? gaurav redkar 2012-02-17, 09:41
If that is the only thing contained in the part-r-* file, then the reducer responsible for writing that part-r-* file did not receive any input records. This happens because the program uses the default hash partitioner, which sometimes maps records belonging to different clusters to the same reducer, leaving some reducers without any input records.

The simplest and quickest way to view the contents of the part-r-* files is to change the output format of the job from SequenceFileOutputFormat to TextOutputFormat and comment out the line where the program calls movePartFilesToRespectiveDirectories(), since that function expects the part-r-* files to be in SequenceFile format. This way you will get all the part files in human-readable format.

You can later modify movePartFilesToRespectiveDirectories() to move the part-r-* files to their respective directories.

Hope this helps.
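The effect of hash partitioning described above can be sketched in a few lines of Python. This is a toy simulation, not the Mahout job: the cluster IDs are hypothetical, and `java_style_hash` only mimics a Java-style 31-multiplier hash over bytes with 32-bit wraparound, which is an assumption about how the real keys are hashed. The partition step (clear the sign bit, then take the modulus by the number of reducers) has the usual shape of a hash partitioner.

```python
from collections import defaultdict


def java_style_hash(data: bytes) -> int:
    """Toy 31-multiplier hash over bytes, wrapped to 32 bits."""
    h = 1
    for b in data:
        signed = b - 256 if b > 127 else b  # Java bytes are signed
        h = (31 * h + signed) & 0xFFFFFFFF  # wrap like a 32-bit int
    return h


def partition(key: str, num_reducers: int) -> int:
    """Clear the sign bit, then take the modulus by the reducer count."""
    return (java_style_hash(key.encode("utf-8")) & 0x7FFFFFFF) % num_reducers


def assign(cluster_ids, num_reducers):
    """Group cluster IDs by reducer and list the reducers that get nothing."""
    buckets = defaultdict(list)
    for cid in cluster_ids:
        buckets[partition(cid, num_reducers)].append(cid)
    empty = [r for r in range(num_reducers) if r not in buckets]
    return dict(buckets), empty


# Hypothetical cluster IDs; one reducer per cluster is requested.
cluster_ids = ["3", "17", "42", "256", "999"]
buckets, empty = assign(cluster_ids, num_reducers=5)
print(buckets)  # which cluster IDs each reducer received
print(empty)    # reducers that received no records
```

With these five IDs and five reducers, two reducers share clusters and two reducers receive nothing, so the number of reducers matching the number of clusters does not guarantee one cluster per part file. A reducer with no input would still produce a part file holding only the header, which matches the output shown in the original question.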
-
Re: How to use clusterpp? Tharindu Mathew 2012-02-17, 12:24
OffTopic: How would I contribute a documentation patch?
-
Re: How to use clusterpp? Paritosh Ranjan 2012-02-17, 12:25
Try logging in and updating.
-
Re: How to use clusterpp? Tharindu Mathew 2012-02-17, 13:09
Or I can just use the cluster dump tool right...?
-
Re: How to use clusterpp? Jeff Eastman 2012-02-17, 13:15
For human-readable output, yes.
-
Re: How to use clusterpp? Tharindu Mathew 2012-02-17, 17:07
Hi,

Thanks for the replies, everyone. I'm just getting the hang of things and appreciate the tolerance for all the dumb questions.

Gaurav, a small question: you run the clustering and then you run the cluster post processor. I ran the cluster dumper on the initial clusteredPoints, and I get output in which every cluster has n=1, r=[] and a very large centroid. So from what I understand, the clustering algorithm is run again. For the output you've shown in the JIRA issue, on which part did you run the clustering again? (I have 1000 clusters shown.)

I'm asking so I can verify that I've run things correctly and that I'm generating the same output.