Mahout, mail # dev - Re: [jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator


Re: [jira] [Commented] (MAHOUT-988) Convert K-means buildClusters to use new ClusterIterator
Jeff Eastman 2012-03-15, 20:06
+1 Paritosh, this is exactly what I envisioned. And I also like your
idea of first converting them all to use ClusterWritable. Go for it!

On 3/15/12 10:42 AM, Paritosh Ranjan wrote:
> I saw the code and my understanding of the new implementation is:
> a) K-Means, Fuzzy K-Means and Dirichlet will use ClusterIterator and write
> IntWritable, ClusterWritable in the buildClusters phase ( instead of
> Kluster, SoftCluster and DirichletCluster )
> b) Canopy and MeanShift will NOT use ClusterIterator but will emit
> IntWritable, ClusterWritable ( Instead of Canopy and MeanShiftCanopy )
>
> There are tools ( ClusterDumper and ClusterEvaluator ) which expect
> <Cluster> when they read from the output file after clustering ( ~
> buildClusters phase ).
>
> KMeans is expecting Canopy and Kluster, but will get ClusterWritable.
>
> So, everything needs to be in sync ( i.e. ClusterWritable )
>
> I propose to wrap everything in ClusterWritable first, as everything
> is a Cluster ( eg. DirichletCluster, SoftCluster, Kluster, Canopy and
> MeanShiftCanopy ). This will remove the inconsistency without much
> chaos. Once ClusterWritable is uniformly used, then refactor all
> algorithms.
>
> I am also not against making ClusterDumper unavailable for a week or
> so since we have ClusterOutputPostProcessor now.
>
> Is my understanding correct? If not, please help me understand it.
> If yes, which way do you propose to refactor?
>
> On 15-03-2012 19:24, Jeff Eastman wrote:
>> Yes, that was my point below. It may, in fact, be impossible to
>> implement and commit them independently since so much of Mahout
>> clustering depends upon the Cluster sequenceFile. You may be able to
>> get part way by moving the Canopy mods into the kmeans issue, but
>> then the cluster dumper and evaluator will not work with kmeans.
>>
>> Ideas?
>>
>> On 3/14/12 10:15 PM, Paritosh Ranjan wrote:
>>> Thanks Jeff. One question, are "Use ClusterIterator" tasks dependent
>>> on "Modify Canopy etc to use ClusterWritable" task ?
>>> I am assuming that all subtasks in MAHOUT-933
>>> <https://issues.apache.org/jira/browse/MAHOUT-933> are independent
>>> of each other and the order to pick them does not matter. Am I correct?
>>>
>>> On 15-03-2012 09:23, Jeff Eastman wrote:
>>>> Sure Paritosh, go ahead and take a crack at it. I am moving from CO
>>>> to PA for the next few weeks and won't be able to do much coding
>>>> during that period. I suspect you will also need to modify Canopy
>>>> to emit ClusterWritable and also the RandomSeedGenerator.
>>>>
>>>> Smooth sailing,
>>>> Jeff
>>>>
>>>> On 3/14/12 8:28 PM, Paritosh Ranjan (Commented) (JIRA) wrote:
>>>>>      [
>>>>> https://issues.apache.org/jira/browse/MAHOUT-988?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13229840#comment-13229840
>>>>> ]
>>>>>
>>>>> Paritosh Ranjan commented on MAHOUT-988:
>>>>> ----------------------------------------
>>>>>
>>>>> Jeff, I would like to work on this issue (or MAHOUT-989, or
>>>>> MAHOUT-990). Can I? I might also need some help ( at least the
>>>>> first patch review ).
>>>>>
>>>>>
>>>>>> Convert K-means buildClusters to use new ClusterIterator
>>>>>> --------------------------------------------------------
>>>>>>
>>>>>>                  Key: MAHOUT-988
>>>>>>                  URL:
>>>>>> https://issues.apache.org/jira/browse/MAHOUT-988
>>>>>>              Project: Mahout
>>>>>>           Issue Type: Sub-task
>>>>>>           Components: Clustering
>>>>>>     Affects Versions: 0.6
>>>>>>             Reporter: Jeff Eastman
>>>>>>             Assignee: Jeff Eastman
>>>>>>              Fix For: 0.7
>>>>>>
>>>>>>
>>>>>> Refactor the current K-means implementation to use the
>>>>>> ClusterIterator/Classifier implementation. This will replace the
>>>>>> mapper, combiner, reducer, clusterer and many unit tests but will
>>>>>> not modify the other driver APIs, thus retaining compatibility
>>>>>> with existing CLI.
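
The "wrap everything in ClusterWritable first" proposal quoted in this thread can be sketched roughly as follows. This is only an illustration under assumptions: SimpleCluster, BaseCluster, KMeansLikeCluster, CanopyLikeCluster and ClusterWrapper are hypothetical stand-ins, not Mahout's actual classes, and plain java.io.DataOutput/DataInput is used here in place of Hadoop's Writable machinery. The point it tries to show is that once every algorithm emits the same wrapper type, a reader such as a dumper or evaluator needs to know only the wrapper, not each concrete cluster class.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

public class ClusterWritableSketch {

    /** Hypothetical minimal cluster contract; NOT Mahout's Cluster interface. */
    interface SimpleCluster {
        int getId();
        double[] getCenter();
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Shared id/center state and serialization for the toy cluster types. */
    abstract static class BaseCluster implements SimpleCluster {
        private int id;
        private double[] center = new double[0];

        BaseCluster() { }

        BaseCluster(int id, double[] center) {
            this.id = id;
            this.center = center.clone();
        }

        public int getId() { return id; }

        public double[] getCenter() { return center.clone(); }

        public void write(DataOutput out) throws IOException {
            out.writeInt(id);
            out.writeInt(center.length);
            for (double v : center) {
                out.writeDouble(v);
            }
        }

        public void readFields(DataInput in) throws IOException {
            id = in.readInt();
            center = new double[in.readInt()];
            for (int i = 0; i < center.length; i++) {
                center[i] = in.readDouble();
            }
        }
    }

    /** Two hypothetical concrete types standing in for different algorithms. */
    static class KMeansLikeCluster extends BaseCluster {
        KMeansLikeCluster() { }
        KMeansLikeCluster(int id, double[] center) { super(id, center); }
    }

    static class CanopyLikeCluster extends BaseCluster {
        CanopyLikeCluster() { }
        CanopyLikeCluster(int id, double[] center) { super(id, center); }
    }

    /** Single wrapper type: records the concrete class name, then the payload. */
    static class ClusterWrapper {
        private SimpleCluster value;

        ClusterWrapper() { }
        ClusterWrapper(SimpleCluster value) { this.value = value; }

        SimpleCluster getValue() { return value; }

        void write(DataOutput out) throws IOException {
            out.writeUTF(value.getClass().getName());
            value.write(out);
        }

        void readFields(DataInput in) throws IOException {
            String className = in.readUTF();
            try {
                value = (SimpleCluster) Class.forName(className)
                        .getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IOException("Cannot instantiate " + className, e);
            }
            value.readFields(in);
        }
    }

    /** Serializes a cluster through the wrapper and reads it back. */
    static SimpleCluster roundTrip(SimpleCluster cluster) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new ClusterWrapper(cluster).write(new DataOutputStream(bytes));
            ClusterWrapper restored = new ClusterWrapper();
            restored.readFields(
                    new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return restored.getValue();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        SimpleCluster restored = roundTrip(new CanopyLikeCluster(7, new double[] {1.0, 2.5}));
        System.out.println(restored.getClass().getSimpleName() + " id=" + restored.getId());
    }
}
```

In this sketch the reader side never names KMeansLikeCluster or CanopyLikeCluster directly; it only calls readFields on the wrapper, which is the kind of consistency the thread is after before the individual algorithms are refactored.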