-
Helping out with the .7 release
Saikat Kanjilal 2012-02-22, 04:57

Hi Guys,

I am interested in helping out with the clustering component of Mahout. I looked through the unresolved clustering JIRA items below and was wondering whether there is a specific one that would be good to start with:

https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+MAHOUT+AND+resolution+%3D+Unresolved+AND+component+%3D+Clustering+ORDER+BY+priority+DESC&mode=hide

I was initially thinking of working on Mahout-930 or Mahout-931, but could work on others if needed.

Best Regards
-
Re: Helping out with the .7 release
Paritosh Ranjan 2012-02-22, 05:27

We are creating clustering-as-classification components, which will help in moving clustering out. Once the component is ready, the clustering algorithms will need refactoring. The clustering-as-classification component and the outlier removal component have been created.

Most of it is committed, and the rest is available as a patch. See https://issues.apache.org/jira/browse/MAHOUT-929

If you apply the latest patch available on Mahout-929, you can see everything that is available now.

If you want, you can help with the test case for ClusterClassificationMapper, which is available in the patch.
-
RE: Helping out with the .7 release
Saikat Kanjilal 2012-02-22, 05:46

Hi Paritosh,

Yes, creating the test case would be a great first start. However, are there other tasks you guys need help with that I can do before the test creation? I will sync trunk and start reading through the code in the meantime.

Regards
-
Re: Helping out with the .7 release
Jeff Eastman 2012-02-22, 15:56

Hi Saikat,

I agree with Paritosh that a great place to begin would be to write some unit tests. This will familiarize you with the code base and help us a lot with our 0.7 housekeeping release. The new clustering classification components are going to unify many, but not all, of the existing clustering algorithms to reduce their complexity by factoring out duplication and streamlining their integration into semi-supervised classification engines.

Please feel free to post any questions you may have in reading through this code. This is a major refactoring effort and we will need all the help we can get. Thanks for the offer,

Jeff
-
Re: Helping out with the .7 release
Jake Mannix 2012-02-22, 16:55

So I haven't looked super-carefully at the clustering refactoring work. Can someone give a little overview of what the plan is?

The NewLDA stuff is technically in "clustering" and generally works by taking in SeqFile<IW,VW> documents as the training corpus, and spits out two things: a SeqFile<IW,VW> of a "model" (keyed on topicId, one vector per topic) and a SeqFile<IW,VW> of "classifications" (keyed on docId, one vector over the topic space for projection onto each topic dimension).

This is similar to how SVD clustering/decomposition works, but with L1-normed outputs instead of L2.

But this seems very different from all of the structures in the rest of clustering.

-jake
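To make the L1 versus L2 remark concrete, here is a minimal sketch in plain Python. It does not use Mahout or SequenceFiles; the dictionaries, the keys and the numbers are made up for illustration, and the two helper functions are hypothetical names. It only shows the shape of the two outputs described above (one vector per topicId, one vector per docId) and what the two normalizations do.

```python
import math

def l1_normalize(vec):
    """Scale a vector so its absolute values sum to 1 (LDA-style weights)."""
    total = sum(abs(x) for x in vec)
    return [x / total for x in vec] if total else list(vec)

def l2_normalize(vec):
    """Scale a vector to unit Euclidean length (SVD-style projection)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

# Hypothetical "model": topicId -> vector over the term space.
model = {
    0: l1_normalize([4.0, 1.0, 0.0]),
    1: l1_normalize([0.0, 1.0, 3.0]),
}

# Hypothetical "classifications": docId -> vector over the topic space.
classifications = {
    "doc-a": l1_normalize([3.0, 1.0]),
    "doc-b": l1_normalize([1.0, 1.0]),
}
```

With L1 normalization each document vector reads as a set of topic proportions that sum to one, whereas an L2-normalized vector has unit length but its entries do not sum to one in general.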
-
Re: Helping out with the .7 release
Jeff Eastman 2012-02-22, 18:32

This refactoring is focused on some of the iterative clustering algorithms which, in each iteration, load a prior set of clusters (e.g. clusters-0) and process each input vector against them to produce a posterior set of clusters (e.g. clusters-1) for the next iteration. This will result in k-Means, fuzzyK and Dirichlet being collapsed into a ClusterIterator iterating over a ClusterClassifier using a ClusteringPolicy. You can see these classes in o.a.m.clustering. They are a work in progress, but the in-memory, sequential-from-sequenceFiles and k-means MR modes work in tests and can be demonstrated in the DisplayXX examples which employ them.

Paritosh has also been building a ClusterClassificationDriver (o.a.m.clustering.classify) which we want to use to factor all of the redundant cluster-data implementations (-cl option) out of the respective cluster drivers. This will affect Canopy in addition to the above algorithms.

An imagined benefit of this refactoring comes from the fact that ClusterClassifier extends AbstractVectorClassifier and implements OnlineLearner. We think this means that a posterior set of trained Clusters can be used as a component classifier in a semi-supervised classifier implementation. I suppose we will need to demonstrate this before we go too much further in the refactoring, but Ted, at least, seems to approve of this integration approach between supervised classification and clustering (unsupervised classification). I don't think it has had a lot of other eyeballs on it.

I don't think LDA fits into this subset of clustering algorithms, and neither do Canopy and MeanShift. As you note, it does not produce Clusters, but I'd be interested in your reactions to the above.

Jeff
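The prior-to-posterior loop described above can be sketched as follows. This is a plain-Python illustration of the pattern, not the Mahout ClusterIterator, ClusterClassifier or ClusteringPolicy classes; the function names, the inverse-distance score and the two policies are assumptions chosen only to show how one loop can serve both hard (k-means-like) and soft (fuzzy-like) assignment.

```python
import math

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify(point, centers):
    """Score a point against every prior cluster (closer means higher)."""
    return [1.0 / (1.0 + distance(point, c)) for c in centers]

def hard_policy(scores):
    """k-means-like policy: all weight goes to the best-scoring cluster."""
    best = scores.index(max(scores))
    return [1.0 if i == best else 0.0 for i in range(len(scores))]

def soft_policy(scores):
    """Fuzzy-like policy: weight proportional to the score."""
    total = sum(scores)
    return [s / total for s in scores]

def iterate(points, prior, policy, num_iterations):
    """Each pass turns the prior centers into posterior centers."""
    centers = [list(c) for c in prior]
    dims = len(centers[0])
    for _ in range(num_iterations):
        sums = [[0.0] * dims for _ in centers]
        weights = [0.0] * len(centers)
        for p in points:
            for k, w in enumerate(policy(classify(p, centers))):
                weights[k] += w
                for d in range(dims):
                    sums[k][d] += w * p[d]
        # A cluster that observed nothing keeps its prior center.
        centers = [
            [s / weights[k] for s in sums[k]] if weights[k] > 0 else centers[k]
            for k in range(len(centers))
        ]
    return centers

points = [[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]]
posterior = iterate(points, [[0.0, 0.0], [10.0, 10.0]], hard_policy, 2)
```

Swapping hard_policy for soft_policy changes the algorithm's behavior without touching the loop, which is the kind of duplication the refactoring aims to factor out.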
-
Re: Helping out with the .7 release
Jake Mannix 2012-02-22, 18:58

On Wed, Feb 22, 2012 at 10:32 AM, Jeff Eastman wrote:
> I don't think LDA fits into this subset of clustering algorithms as also
> do not Canopy and MeanShift. As you note, it does not produce Clusters but
> I'd be interested in your reactions to the above.

So LDA lives in o.a.m.clustering, and does actually produce what you could *call* clusters: it assigns fuzzy weighted cluster_ids (called topic_ids in LDA) to training data, in much the same way that fuzzy-kmeans does. It also produces things which *act* like a "ClusterClassifier", and while this is unsupervised, once you extend to Labeled LDA (saving that merge from my GitHub fork until 0.8 "new features"), it's also a supervised classifier.

I'm not necessarily saying that LDA (and Canopy, and SVD) *must* merge to use the same API, but if we're doing work to unify these things so they talk the same language, seeing what the end goal is (maybe not reached in this round of refactoring) would help inform the process of how we do this next step.

I can state my reasons for liking going with simple vectors both for the "classifier" and the "cluster": the disk format is the same as our input data, so when people write utils to hook this up to Pig (and Scalding, Cascalog, Hive, etc.), they don't need to write utils for handling a new data type, and even algorithms that take this input can run over the *outputs*. E.g. you generate a set of "clusters" in LDA which are topics. Each one is a vector over input features, so this collection of vectors can be fed, with *no change*, into another clustering algorithm, like KMeans, to find which topics are most like each other. [Maybe a contrived example, but there may be better ones: make a tree / hierarchy based on topics as inputs instead of documents as inputs, to see if there is a nice tree structure to your topic model!] When your outputs are a bunch of custom Cluster thingees, you can't interoperate with everything else (regression, vector-based recommenders, etc.) without more work.

-jake
-
Re: Helping out with the .7 release
Jeff Eastman 2012-02-22, 19:18

On 2/22/12 11:58 AM, Jake Mannix wrote:
> I'm not necessarily saying that LDA (and Canopy, and SVD) *must* merge to
> use the same API, but if we're doing work to unify these things so they
> talk the same language, seeing what the end goal is (maybe not reached in
> this round of refactoring) would help inform the process of how we do this
> next step.

I agree this is a want and not a must, but if there are things we can do early in this refactoring to improve the prospects of further merging down the road, then I am +1 on that.

> I can state my reasons for liking going with simple vectors both for the
> "classifier" and the "cluster": disk format is the same as our input data,
> so when people write utils to hook this up to Pig (and Scalding, Cascalog,
> Hive, etc etc), you don't need to write utils for handling new data type,
> and even algorithms that take this input can run over the *outputs*: e.g.
> you generate a set of "clusters" in LDA which are topics - each one is a
> vector over input features, so this collection of vectors can be fed in,
> with *no change* into another clustering algorithm, like KMeans, to find
> which topics are most like each other [maybe contrived example, but there
> may be better ones: make a tree / hierarchy based on topics as inputs
> instead of documents as inputs, to see if there is a nice tree structure to
> your topic model! When your outputs are a bunch of custom Cluster
> thingees, you can't interoperate with everything else (regression,
> vector-based recommenders, etc) without more work.
>
> -jake

Currently, AbstractClusters are represented by two vectors, a center (mean) and a radius (std), and two doubles, numObservations and totalObservations. There are also some observation statistics used between mappers and reducers, but these can be null when iterations are done. While these are overkill for the DistanceMeasureClusters used in most of the early clustering algorithms, they are critical for the GaussianClusters used by Dirichlet. The DMClusters also have a DistanceMeasure, which is needed for classification via pdf(). It also allows for cool-looking ellipses to be drawn around the cluster centers in the DisplayXX examples.

I agree having a single "cluster" representation is a good goal. Does it have to be a single vector? Doesn't that require some implicit way to measure distance or pdf?
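As a rough illustration of why a cluster carries more than one vector, the sketch below models a cluster with a center, a per-dimension radius and the two observation counts mentioned above. It is plain Python, not the Mahout AbstractCluster class: the class name, the running sums s0, s1 and s2, the independent-Gaussian pdf and the minimum-radius floor are all assumptions made for this example.

```python
import math

class SketchCluster:
    """Hypothetical cluster: center (mean), radius (std) and two counts."""

    def __init__(self, center):
        self.center = list(center)
        self.radius = [0.0] * len(center)
        self.num_observations = 0.0    # seen in the latest iteration
        self.total_observations = 0.0  # seen over all iterations
        self._reset_stats()

    def _reset_stats(self):
        # Running statistics; only needed while an iteration is in flight.
        self.s0 = 0.0
        self.s1 = [0.0] * len(self.center)
        self.s2 = [0.0] * len(self.center)

    def observe(self, x, weight=1.0):
        self.s0 += weight
        for d, v in enumerate(x):
            self.s1[d] += weight * v
            self.s2[d] += weight * v * v

    def compute_parameters(self):
        """Turn the accumulated statistics into posterior parameters."""
        if self.s0 == 0:
            return
        self.num_observations = self.s0
        self.total_observations += self.s0
        self.center = [s / self.s0 for s in self.s1]
        self.radius = [
            math.sqrt(max(0.0, s2 / self.s0 - c * c))
            for s2, c in zip(self.s2, self.center)
        ]
        self._reset_stats()

    def pdf(self, x, min_radius=1e-9):
        """Density of x under independent Gaussians, one per dimension."""
        density = 1.0
        for v, c, r in zip(x, self.center, self.radius):
            r = max(r, min_radius)  # assumed floor to avoid dividing by zero
            density *= math.exp(-0.5 * ((v - c) / r) ** 2) / (r * math.sqrt(2.0 * math.pi))
        return density

cluster = SketchCluster([0.0, 0.0])
cluster.observe([0.0, 0.0])
cluster.observe([2.0, 2.0])
cluster.compute_parameters()
```

A single center vector would be enough for a distance-based assignment, but the pdf above needs the radius as well, which is the point of the question about whether one vector can suffice.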
-
Re: Helping out with the .7 release
Ted Dunning 2012-02-22, 20:42

I would also like to see if we can put an all reduce implementation into this effort. The idea is that we can use a map-only job that does all iteration internally. I think that this could result in more than an order of magnitude speed up for our clustering codes. It could also provide similar benefits for the nascent parallel classifier training work.

This seems to be a cleanup of a long-standing wart in our code, but it is reasonable that others may feel differently.
-
Re: Helping out with the .7 release
Jeff Eastman 2012-02-23, 00:01

Hey Ted,

Could you elaborate on this approach? I don't grok how an "all reduce implementation" can be done with a "map-only job", or how a mapper could do "all iteration[s] internally".

I've just gotten the ClusterIterator working in MR mode and it does what I thought we'd been talking about earlier: in each iteration, each mapper loads all the prior clusters and then iterates through all its input points, training each of the prior clusters in the process. Then, in the cleanup() method, all the trained clusters are sent to the reducers keyed by their model indexes. This eliminates the need for a combiner and means each reducer only has to merge n mappers' worth of trained clusters into a posterior trained cluster before it is output. If numReducers == k then the current reduce-step overloads should disappear.

The secret to this implementation is to allow clusters to observe other clusters in addition to observing vectors, thereby accumulating all of those clusters' observation statistics before recomputing posterior parameters.
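The mapper and reducer roles described above can be sketched without Hadoop as two plain Python functions. This is an illustration under stated assumptions, not the Mahout implementation: the function names are hypothetical, assignment is nearest-center only, and the per-cluster statistics are reduced to a count and a sum, which is enough to recompute a mean.

```python
def train_partition(points, prior_centers):
    """Mapper side: train the priors on one input split, emit once at the end.

    Returns {cluster index: (count, per-dimension sum)}, which stands in for
    the trained clusters written out from cleanup().
    """
    dims = len(prior_centers[0])
    stats = {}
    for p in points:
        k = min(
            range(len(prior_centers)),
            key=lambda i: sum((a - b) ** 2 for a, b in zip(p, prior_centers[i])),
        )
        count, total = stats.get(k, (0.0, [0.0] * dims))
        stats[k] = (count + 1.0, [t + v for t, v in zip(total, p)])
    return stats

def merge_cluster(partials):
    """Reducer side: one cluster "observes" the partial clusters of all mappers."""
    count = sum(c for c, _ in partials)
    dims = len(partials[0][1])
    total = [sum(t[d] for _, t in partials) for d in range(dims)]
    return [t / count for t in total]

priors = [[0.0], [10.0]]
split_1 = [[0.0], [2.0], [10.0]]
split_2 = [[4.0], [12.0]]
emitted = [train_partition(split_1, priors), train_partition(split_2, priors)]

# Group by cluster index, as the shuffle would, then merge per key.
posterior = [
    merge_cluster([e[k] for e in emitted if k in e]) for k in range(len(priors))
]
```

Because each mapper emits at most one record per cluster, the reducer for a given key merges at most one partial per mapper, which is why no combiner is needed in this scheme.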
-
Re: Helping out with the .7 releaseTed Dunning 2012-02-23, 04:23
All reduce is a non map reduce primitive stolen from mpi. It is used, for example, in vw to accumulate gradient information without additional Map reduce iterations.
The all reduce operation works by building a tree of all tasks. A state is sent up the tree from the leaves. Each internal node adds together its children's states and adds in its own. At the root we have the combination of all states, and that result is sent back down the tree.
In practice, all mappers iterate through their input slice and do an all reduce. Then they reset their input and repeat. Commonly the root node will include a termination flag to signal convergence.
The effect is that iterations don't require spawning a new MapReduce job, and thus we save considerable time at each step. Indeed, if the input can fit into memory, we can gain even more speed. With in-memory operation we may get a speedup of two orders of magnitude. With data too large to fit in memory, gains will be more modest.
On Feb 22, 2012, at 4:01 PM, Jeff Eastman <[EMAIL PROTECTED]> wrote:
> Hey Ted,
>
> Could you elaborate on this approach? I don't grok how an "all reduce implementation" can be done with a "map-only job", or how a mapper could do "all iteration[s] internally".
>
> I've just gotten the ClusterIterator working in MR mode and it does what I thought we'd been talking about earlier: In each iteration, each mapper loads all the prior clusters and then iterates through all its input points, training each of the prior clusters in the process. Then, in the cleanup() method, all the trained clusters are sent to the reducers keyed by their model indexes. This eliminates the need for a combiner and means each reducer only has to merge n-mappers worth of trained clusters into a posterior trained cluster before it is output. If numReducers == k then the current reduce-step overloads should disappear.
>
> The secret to this implementation is to allow clusters to observe other clusters in addition to observing vectors, thereby accumulating all of those clusters' observation statistics before recomputing posterior parameters.
>
> On 2/22/12 1:42 PM, Ted Dunning wrote:
>> I would also like to see if we can put an all reduce implementation into this effort. The idea is that we can use a map only job that does all iteration internally. I think that this could result in more than an order of magnitude speed up for our clustering codes. It could also provide similar benefits for the nascent parallel classifier training work.
>>
>> This seems to be a cleanup of a long standing wart in our code but it is reasonable that others may feel differently.
>>
>> Sent from my iPhone
>>
>> On Feb 22, 2012, at 10:32 AM, Jeff Eastman<[EMAIL PROTECTED]> wrote:
>>
>>> This refactoring is focused on some of the iterative clustering algorithms which, in each iteration, load a prior set of clusters (e.g. clusters-0) and process each input vector against them to produce a posterior set of clusters (e.g. clusters-1) for the next iteration. This will result in k-Means, fuzzyK and Dirichlet being collapsed into a ClusterIterator iterating over a ClusterClassifier using a ClusteringPolicy. You can see these classes in o.a.m.clustering. They are a work in progress but in-memory, sequential from sequenceFiles and k-means MR work in tests and can be demonstrated in the DisplayXX examples which employ them.
>>>
>>> Paritosh has also been building a ClusterClassificationDriver (o.a.m.clustering.classify) which we want to use to factor all of the redundant cluster-data implementations (-cl option) out of the respective cluster drivers. This will affect Canopy in addition to the above algorithms.
>>>
>>> An imagined benefit of this refactoring comes from the fact that ClusterClassifier extends AbstractVectorClassifier and implements OnlineLearner. We think this means that a posterior set of trained Clusters can be used as a component classifier in a semi-supervised classifier implementation. I suppose we will need to demonstrate this before we go too much further in the refactoring but Ted, at least, seems to approve of this integration approach between supervised classification and clustering (unsupervised classification). I don't think it has had a lot of other eyeballs on it.
+
Ted Dunning 2012-02-23, 04:23
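To make the description in the message above concrete, here is a minimal single-process sketch of one all reduce round over a tree of tasks. It is an illustration only, not code from vw or Mahout: the names TreeAllReduceSketch, Node, reduceUp, broadcastDown and allReduce are hypothetical, the state is assumed to be a plain double array combined by element-wise addition, and a real implementation would pass states between tasks over the network rather than through method calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical single-process illustration of a tree all reduce (not vw or Mahout code).
class TreeAllReduceSketch {

  /** One task in the tree; holds a local state vector, for example a partial gradient. */
  static final class Node {
    final double[] state;
    final List<Node> children = new ArrayList<>();

    Node(double... state) {
      this.state = state.clone();
    }

    /** Upward pass: returns this node's state plus the states of all its descendants. */
    double[] reduceUp() {
      double[] total = state.clone();
      for (Node child : children) {
        double[] childTotal = child.reduceUp();
        for (int i = 0; i < total.length; i++) {
          total[i] += childTotal[i];
        }
      }
      return total;
    }

    /** Downward pass: every node in this subtree overwrites its state with the combined result. */
    void broadcastDown(double[] combined) {
      System.arraycopy(combined, 0, state, 0, state.length);
      for (Node child : children) {
        child.broadcastDown(combined);
      }
    }
  }

  /** Runs one all reduce round: sum up to the root, then send the sum back down. */
  static double[] allReduce(Node root) {
    double[] combined = root.reduceUp();
    root.broadcastDown(combined);
    return combined;
  }

  public static void main(String[] args) {
    Node root = new Node(1.0, 10.0);
    Node left = new Node(2.0, 20.0);
    Node right = new Node(3.0, 30.0);
    Node leaf = new Node(4.0, 40.0);
    root.children.add(left);
    root.children.add(right);
    left.children.add(leaf);

    double[] combined = allReduce(root);
    System.out.println(Arrays.toString(combined));   // prints [10.0, 100.0]
    System.out.println(Arrays.toString(leaf.state)); // prints [10.0, 100.0]
  }
}
```

In an iterative job, each mapper would presumably run one such round per pass over its input slice, and a convergence flag could travel down the tree as one more entry of the combined state.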
-
Re: Helping out with the .7 release Jeff Eastman 2012-02-23, 05:01
Got any code that does this I could look at?
+
Jeff Eastman 2012-02-23, 05:01
-
Re: Helping out with the .7 release Ted Dunning 2012-02-23, 05:07
Only vw itself.
+
Ted Dunning 2012-02-23, 05:07
-
RE: Helping out with the .7 release Saikat Kanjilal 2012-02-22, 17:06
Jeff, I'm pretty excited to help out with this. As a starter, can you point me to where I should begin reading the code? I haven't looked too closely yet, but are there certain classes in the clustering area that this refactoring effort is centered on?
Regards
> Date: Wed, 22 Feb 2012 08:56:23 -0700
> From: [EMAIL PROTECTED]
> To: [EMAIL PROTECTED]
> Subject: Re: Helping out with the .7 release
>
> Hi Saikat,
>
> I agree with Paritosh, that a great place to begin would be to write some unit tests. This will familiarize you with the code base and help us a lot with our 0.7 housekeeping release. The new clustering classification components are going to unify many - but not all - of the existing clustering algorithms to reduce their complexity by factoring out duplication and streamlining their integration into semi-supervised classification engines.
>
> Please feel free to post any questions you may have in reading through this code. This is a major refactoring effort and we will need all the help we can get. Thanks for the offer,
>
> Jeff
>
> On 2/21/12 10:46 PM, Saikat Kanjilal wrote:
> > Hi Paritosh,Yes creating the test case would be a great first start, however are there other tasks you guys need help with before I can do before the test creation, I will sync trunk and start reading through the code in the meantime.Regards
> >
> >> Date: Wed, 22 Feb 2012 10:57:51 +0530
> >> From: [EMAIL PROTECTED]
> >> To: [EMAIL PROTECTED]
> >> Subject: Re: Helping out with the .7 release
> >>
> >> We are creating clustering as classification components which will help in moving clustering out. Once the component is ready, then the clustering algorithms would need refactoring.
> >> The clustering as classification component and the outlier removal component has been created.
> >>
> >> Most of it is committed, and rest is available as a patch. See
> >> https://issues.apache.org/jira/browse/MAHOUT-929
> >> If you will apply the latest patch available on Mahout-929 you can see all that is available now.
> >>
> >> If you want, you can help with the test case of ClusterClassificationMapper available in the patch.
> >>
> >> On 22-02-2012 10:27, Saikat Kanjilal wrote:
> >>> Hi Guys,I was interested in helping out with the clustering component of mahout, I looked through the JIRA items below and was wondering if there is a specific one that would be good to start with:
> >>>
> >>> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+MAHOUT+AND+resolution+%3D+Unresolved+AND+component+%3D+Clustering+ORDER+BY+priority+DESC&mode=hide
> >>>
> >>> I initially was thinking to work on Mahout-930 or Mahout-931 but could work on others if needed.
> >>> Best Regards
+
Saikat Kanjilal 2012-02-22, 17:06
-
Re: Helping out with the .7 release Jeff Eastman 2012-02-22, 23:25
Hi Saikat,
Glad you're excited. Paritosh offered one suggestion earlier in this thread. You could look at TestKmeansClustering for patterns you could use to test the ClusterClassificationMapper and Driver in MR mode. That should be straightforward, but please coordinate with Paritosh so you don't duplicate efforts.
Another place you might look into would be the KMeansDriver and MAHOUT-930. You could work on refactoring KMeansDriver to use the new ClusterClassificationDriver in MAHOUT-929. That would exercise both its sequential and MR options. It will be interesting to see how much code can be removed.
Finally, you could see if you can wrap your mind around the ClusterIterator and how it could be used for further refactoring of the KMeansDriver. See TestClusterClassifier for insight.
That enough reading and doing for now?
Jeff
+
Jeff Eastman 2012-02-22, 23:25
-
Re: Helping out with the .7 release Paritosh Ranjan 2012-02-23, 08:03
Saikat,
I have created https://issues.apache.org/jira/browse/MAHOUT-981 for refactoring KMeansDriver to use the new ClusterClassificationDriver.
You can provide your patches on this issue. For how to provide a patch, see https://cwiki.apache.org/MAHOUT/how-to-contribute.html#HowToContribute-Generatingapatch
Before the KMeans refactoring, we are expecting the ClusterClassificationMapperTest from you (for MAHOUT-929). That test case would complete the development of ClusterClassificationDriver, and then the refactoring can start.
Paritosh
+
Paritosh Ranjan 2012-02-23, 08:03
-
RE: Helping out with the .7 release Saikat Kanjilal 2012-02-23, 08:08
Thank you, I'll get started on this over the weekend.
+
Saikat Kanjilal 2012-02-23, 08:08
-
RE: Helping out with the .7 release Saikat Kanjilal 2012-02-24, 13:33
Paritosh/Jeff, before I begin writing the ClusterClassificationMapperTest I had a few questions; pardon my newbieness here:
1) I synced trunk and started building, and noticed that we have some errors in the tests. Is this OK? Let me know if I am missing something in getting the build going; I believe my build environment is set up correctly (Maven 2.2.1 with Java 6).
2) I was wondering if I should create a GitHub branch for the code and work off of that. I could then sync my changes into trunk when I'm done and go through the patching process. Do you guys see any issues with that?
3) For the ClusterClassificationMapperTest, can I get some more context? Should we call this ClusterClassificationDriverTest instead? Also, should the unit tests test all of the bulleted points in MAHOUT-929, or just pass parameters into the run method and test that by itself?
Regards
+
Saikat Kanjilal 2012-02-24, 13:33
-
Re: Helping out with the .7 release Paritosh Ranjan 2012-02-24, 17:23
Which environment are you using for development, Windows or Linux? The tests don't all pass on Windows with Cygwin (at least for me). I even arranged a separate Linux machine for myself, just to run the test cases (VirtualBox would also have worked; it's just a matter of choice).
The way you submit the patch is up to you.
ClusterClassification is used after the buildClusters phase of the clustering algorithms (for which the refactoring is being done). It will replace the clusterData phase. ClusterClassificationDriver can classify vectors either sequentially or via MapReduce. There is already a test case for the sequential one. The logic of classification via MapReduce is in the mapper, so it should be tested there.
Try writing a simple test first, something that can test whether the vectors were classified correctly. Look at ClusterClassificationDriverTest for assertions. Later on, we can add more scenarios to it.
Issue-centric discussions can also be held on JIRA; if you wish, you can create a JIRA account and comment on the issues there.
+
Paritosh Ranjan 2012-02-24, 17:23
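The "simple test first" advice in the message above can be sketched without any Mahout dependency. What follows is a minimal, hypothetical Java illustration, not code from the MAHOUT-929 patch: it assumes that classifying a vector means assigning it to the nearest cluster center by squared Euclidean distance, and every name in it (ClassificationCheckSketch, classify, classifyAll) is invented for illustration. A real ClusterClassificationMapperTest would drive the mapper itself and would probably borrow its assertions from ClusterClassificationDriverTest, as suggested in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, Mahout-free sketch: none of these names come from the MAHOUT-929 patch.
class ClassificationCheckSketch {

    // Squared Euclidean distance between two vectors of equal length.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return sum;
    }

    // Index of the closest center; ties go to the lowest index.
    // Assumption: "classified correctly" means "assigned to the nearest center".
    static int classify(double[] vector, double[][] centers) {
        int best = 0;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double distance = squaredDistance(vector, centers[i]);
            if (distance < bestDistance) {
                bestDistance = distance;
                best = i;
            }
        }
        return best;
    }

    // Groups the input vectors by assigned cluster index.
    static List<List<double[]>> classifyAll(double[][] vectors, double[][] centers) {
        List<List<double[]>> clusters = new ArrayList<List<double[]>>();
        for (int i = 0; i < centers.length; i++) {
            clusters.add(new ArrayList<double[]>());
        }
        for (double[] vector : vectors) {
            clusters.get(classify(vector, centers)).add(vector);
        }
        return clusters;
    }

    public static void main(String[] args) {
        // Two well-separated centers and five points whose expected cluster is obvious.
        double[][] centers = {{1, 1}, {9, 9}};
        double[][] vectors = {{1, 2}, {2, 1}, {8, 9}, {9, 8}, {10, 10}};

        List<List<double[]>> clusters = classifyAll(vectors, centers);

        // The check a first test would make: every vector landed in the expected cluster.
        if (clusters.get(0).size() != 2 || clusters.get(1).size() != 3) {
            throw new AssertionError("vectors were not classified as expected");
        }
        System.out.println("cluster sizes: " + clusters.get(0).size() + ", " + clusters.get(1).size());
    }
}
```

With well-separated centers the expected assignment is unambiguous, which keeps a first test readable; further scenarios can be layered on later, as the reply proposes.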
-
RE: Helping out with the .7 release Saikat Kanjilal 2012-02-22, 23:29
Yes, perfect. I'll look at those, begin reading there, and figure out next steps. Thanks again for your help in starting this effort. > Date: Wed, 22 Feb 2012 16:25:27 -0700 > From: [EMAIL PROTECTED] > To: [EMAIL PROTECTED] > Subject: Re: Helping out with the .7 release > > Hi Saikat, > > Glad you're excited. Paritosh offered one suggestion below. You could > look at TestKmeansClustering for patterns you could use to test the > ClusterClassificationMapper and Driver in MR mode. That should be > straightforward, but please coordinate with Paritosh so you don't > duplicate efforts. > > Another place you might look into would be the KMeansDriver and > MAHOUT-930. You could work on refactoring KMeansDriver to use the new > ClusterClassificationDriver in MAHOUT-929. That would exercise both its > sequential and MR options. It will be interesting to see how much code > can be removed. > > Finally, you could see if you can wrap your mind around the > ClusterIterator and how it could be used for further refactoring of the > KMeansDriver. See TestClusterClassifier for insight. > > That enough reading and doing for now? > Jeff > > On 2/22/12 10:06 AM, Saikat Kanjilal wrote: > > Jeff, I'm pretty excited to help out with this, so as a starter can you point me to where I should begin my readings of the code, I haven't looked too closely but are there certain classes in the clustering area where this refactoring effort is centered around. > > Regards > > > >> Date: Wed, 22 Feb 2012 08:56:23 -0700 > >> From: [EMAIL PROTECTED] > >> To: [EMAIL PROTECTED] > >> Subject: Re: Helping out with the .7 release > >> > >> Hi Saikat, > >> > >> I agree with Paritosh, that a great place to begin would be to write > >> some unit tests. This will familiarize you with the code base and help > >> us a lot with our 0.7 housekeeping release. 
The new clustering > >> classification components are going to unify many - but not all - of the > >> existing clustering algorithms to reduce their complexity by factoring > >> out duplication and streamlining their integration into semi-supervised > >> classification engines. > >> > >> Please feel free to post any questions you may have in reading through > >> this code. This is a major refactoring effort and we will need all the > >> help we can get. Thanks for the offer, > >> > >> Jeff > >> > >> On 2/21/12 10:46 PM, Saikat Kanjilal wrote: > >>> Hi Paritosh, Yes, creating the test case would be a great first start; however, are there other tasks you guys need help with that I can do before the test creation? I will sync trunk and start reading through the code in the meantime. Regards > >>> > >>>> Date: Wed, 22 Feb 2012 10:57:51 +0530 > >>>> From: [EMAIL PROTECTED] > >>>> To: [EMAIL PROTECTED] > >>>> Subject: Re: Helping out with the .7 release > >>>> > >>>> We are creating clustering as classification components which will help > >>>> in moving clustering out. Once the component is ready, then the > >>>> clustering algorithms would need refactoring. > >>>> The clustering as classification component and the outlier removal > >>>> component have been created. > >>>> > >>>> Most of it is committed, and rest is available as a patch. See > >>>> https://issues.apache.org/jira/browse/MAHOUT-929 > >>>> If you will apply the latest patch available on Mahout-929 you can see > >>>> all that is available now. > >>>> > >>>> If you want, you can help with the test case of > >>>> ClusterClassificationMapper available in the patch. 
> >>>> > >>>> On 22-02-2012 10:27, Saikat Kanjilal wrote: > >>>>> Hi Guys, I was interested in helping out with the clustering component of mahout, I looked through the JIRA items below and was wondering if there is a specific one that would be good to start with: > >>>>> > >>>>> https://issues.apache.org/jira/secure/IssueNavigator.jspa?reset=true&jqlQuery=project+%3D+MAHOUT+AND+resolution+%3D+Unresolved+AND+component+%3D+Clustering+ORDER+BY+priority+DESC&mode=hide > >>>>>
Saikat Kanjilal 2012-02-22, 23:29
|