Re: [jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator
Jeff Eastman 2012-03-15, 20:06
+1 Paritosh, this is exactly what I envisioned. And I also like your
idea of first converting them all to use ClusterWritable. Go for it!

On 3/15/12 10:42 AM, Paritosh Ranjan wrote:
> I saw the code and my understanding of the new implementation is:
> a) K-Means, Fuzzy K-Means and Dirichlet will use ClusterIterator and write
> IntWritable, ClusterWritable in buildClusters phase ( Instead of
> Kluster, SoftCluster and DirichletCluster )
> b) Canopy and MeanShift will NOT use ClusterIterator but will emit
> IntWritable, ClusterWritable ( Instead of Canopy and MeanShiftCanopy )
>
> There are tools ( ClusterDumper and ClusterEvaluator ) which expect
> <Cluster> when they read from the output file after clustering ( ~
> buildCluster phase ).
>
> KMeans is expecting Canopy and KCluster, but will get ClusterWritable.
>
> So, everything needs to be in sync ( i.e. ClusterWritable )
>
> I propose to wrap everything in ClusterWritable first, as everything
> is a Cluster ( eg. DirichletCluster, SoftCluster, Kluster, Canopy and
> MeanShiftCanopy ). This will remove the inconsistency without much
> chaos. Once ClusterWritable is uniformly used, then refactor all
> algorithms.
>
> I am also not against making ClusterDumper unavailable for a week or
> so since we have ClusterOutputPostProcessor now.
>
> Is my understanding correct? If not, please help me understand it.
> If yes, which way do you propose to refactor?
>
> On 15-03-2012 19:24, Jeff Eastman wrote:
>> Yes, that was my point below. It may, in fact, be impossible to
>> implement and commit them independently since so much of Mahout
>> clustering depends upon the Cluster sequenceFile. You may be able to
>> get part way by moving the Canopy mods into the kmeans issue, but
>> then the cluster dumper and evaluator will not work with kmeans.
>>
>> Ideas?
>>
>> On 3/14/12 10:15 PM, Paritosh Ranjan wrote:
>>> Thanks Jeff. One question, are "Use ClusterIterator" tasks dependent
>>> on "Modify Canopy etc to use ClusterWritable" task ?
>>> I am assuming that all subtasks in MAHOUT-933
>>> <https://issues.apache.org/jira/browse/MAHOUT-933> are independent
>>> of each other and the order to pick them does not matter. Am I correct?
>>>
>>> On 15-03-2012 09:23, Jeff Eastman wrote:
>>>> Sure Paritosh, go ahead and take a crack at it. I am moving from CO
>>>> to PA for the next few weeks and won't be able to do much coding
>>>> during that period. I suspect you will also need to modify Canopy
>>>> to emit ClusterWritable and also the RandomSeedGenerator.
>>>>
>>>> Smooth sailing,
>>>> Jeff
>>>>
>>>> On 3/14/12 8:28 PM, Paritosh Ranjan (Commented) (JIRA) wrote:
>>>>> [
>>>>> https://issues.apache.org/jira/browse/MAHOUT-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229840#comment-13229840
>>>>> ]
>>>>>
>>>>> Paritosh Ranjan commented on MAHOUT-988:
>>>>> ----------------------------------------
>>>>>
>>>>> Jeff, I would like to work on this issue (or MAHOUT-989, or
>>>>> MAHOUT-990). Can I? I might also need some help ( at least the
>>>>> first patch review ).
>>>>>
>>>>>
>>>>>> Convert K-means buildClusters to use new ClusterIterator
>>>>>> --------------------------------------------------------
>>>>>>
>>>>>> Key: MAHOUT-988
>>>>>> URL: https://issues.apache.org/jira/browse/MAHOUT-988
>>>>>> Project: Mahout
>>>>>> Issue Type: Sub-task
>>>>>> Components: Clustering
>>>>>> Affects Versions: 0.6
>>>>>> Reporter: Jeff Eastman
>>>>>> Assignee: Jeff Eastman
>>>>>> Fix For: 0.7
>>>>>>
>>>>>>
>>>>>> Refactor the current K-means implementation to use the
>>>>>> ClusterIterator/Classifier implementation. This will replace the
>>>>>> mapper, combiner, reducer, clusterer and many unit tests but will
>>>>>> not modify the other driver APIs, thus retaining compatibility
>>>>>> with existing CLI.