Jeff Eastman
2011-04-12, 21:57
Ted Dunning
2011-04-12, 22:35
Jeff Eastman
2011-04-14, 01:24
Ted Dunning
2011-04-14, 02:54
Jeff Eastman
2011-04-14, 04:00
Jeff Eastman
2011-04-14, 04:24
Ted Dunning
2011-04-14, 04:46
Jeff Eastman
2011-04-14, 15:51
-
Converging Clustering and Classification
Jeff Eastman 2011-04-12, 21:57
Hi Ted,
We've been discussing this on and off and I'd like to pick up the thread again. Currently we have AbstractVectorClassifier (in pkg classifier) and VectorModelClassifier (in pkg clustering). This allows any set of Cluster Models (List<Model<VectorWritable>>) to function as a classifier. In your last email you indicated this as a step in the right direction. What else is needed?

One thought I've had is this: Most clustering algorithms - the older ones anyway - have static Driver methods "buildClusters()" and "clusterData()". Would it help with the convergence process if these were simply renamed to "trainClusters()" and "classifyData()" (or something similar) respectively? I know it took me a while to see the isomorphism between clustering and classification, so perhaps something simple like this would be an improvement.
-
Re: Converging Clustering and Classification
Ted Dunning 2011-04-12, 22:35
I will respond from the standpoint of an SGD partisan first.
What I think is needed next is some way to save clusterings as models that are interoperable with SGD models. That is, ModelSerializer.readBinary should return a usable classifier when applied to whatever the clustering algorithm saved. This makes clustering models as deployable as SGD models already are and abstracts away the origin of the clustering model.

It would be a kick if the clustering driver could be merged with the classification driver (if any) so that it would apply any model supported by ModelSerializer.readBinary to specified data.

Then as a token offering to the gods of inter-operability it would be kind of cool if the initial state of k-means or other clustering algorithms could also be such a serialized model. That would allow an SGD model to be the initial state for clustering which would give a vague kind of semi-supervised learning at little cost.
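The requirement that whatever a clustering run saves can be read back as a usable classifier amounts to a binary round trip. The following is a minimal sketch under stated assumptions: `CentroidModel`, `write`, `read` and `roundTrip` are hypothetical names built only on `java.io`, and nothing here reproduces the actual ModelSerializer or Writable signatures in Mahout.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical model: a set of centroids that can classify points after being reloaded.
final class CentroidModel {
    final double[][] centers;

    CentroidModel(double[][] centers) {
        this.centers = centers;
    }

    // Write the number of centers, the dimension, then every coordinate.
    void write(DataOutputStream out) throws IOException {
        out.writeInt(centers.length);
        out.writeInt(centers.length == 0 ? 0 : centers[0].length);
        for (double[] center : centers) {
            for (double value : center) {
                out.writeDouble(value);
            }
        }
    }

    // Read back exactly what write() produced.
    static CentroidModel read(DataInputStream in) throws IOException {
        int k = in.readInt();
        int dim = in.readInt();
        double[][] centers = new double[k][dim];
        for (int i = 0; i < k; i++) {
            for (int j = 0; j < dim; j++) {
                centers[i][j] = in.readDouble();
            }
        }
        return new CentroidModel(centers);
    }

    // Index of the nearest center by squared Euclidean distance (-1 if there are no centers).
    int classify(double[] point) {
        int best = -1;
        double bestDistance = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centers.length; i++) {
            double d = 0.0;
            for (int j = 0; j < point.length; j++) {
                double diff = point[j] - centers[i][j];
                d += diff * diff;
            }
            if (d < bestDistance) {
                bestDistance = d;
                best = i;
            }
        }
        return best;
    }

    // Save to an in-memory buffer and load again, as a stand-in for a file on disk.
    static CentroidModel roundTrip(CentroidModel model) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(bytes)) {
                model.write(out);
            }
            try (DataInputStream in =
                     new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return read(in);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In this sketch the reloaded object serves both purposes raised in the message: it can classify new points directly, and its centers could be handed to a clustering run as the initial state.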
-
FW: Converging Clustering and Classification
Jeff Eastman 2011-04-14, 01:24
Hey Ted,
I've been able to prototype a ClusterClassifier which, like VectorModelClassifier, extends AbstractVectorClassifier but which also implements OnlineLearner and Writable. This should work (it compiles) in the KMeansClusterer in place of Iterable<Cluster> in the sequential code using train(). I've also been able to add a unit test of it in ModelSerializerTest (it compiles too).

If this could be completed it would seem to allow kmeans, fuzzyk, dirichlet and maybe even meanshift cluster classifiers to be used with SGD. Going the other way (using a trained classifier as the prior of a clustering run) should also be possible though I haven't got it sorted out yet. The challenge would be to use AVC.classify() in the various clusterers or to extract initial centers for kmeans & fuzzyk. Dirichlet might be adaptable more directly since its models only have to produce the pi vector of pdfs.

Still lots of loose ends in all this. Certainly not for 0.5. Does any of this make sense?

From: Ted Dunning
Sent: Tuesday, April 12, 2011 3:58 PM
To: Jeff Eastman
Subject: Re: Converging Clustering and Classification

Cool.

On Tue, Apr 12, 2011 at 3:56 PM, Jeff Eastman wrote:

Ok, let me wrap my mind around that. I've almost got the token offering part since any Cluster can be used as the prior for kmeans, fuzzyK and dirichlet. A post processing step to serialize a set of clusters à la ModelSerializer shouldn't be out of the question either. I've got some time this weekend to tinker with it.
-
Re: FW: Converging Clustering and Classification
Ted Dunning 2011-04-14, 02:54
On Wed, Apr 13, 2011 at 6:24 PM, Jeff Eastman <[EMAIL PROTECTED]> wrote:
> I've been able to prototype a ClusterClassifier which, like
> VectorModelClassifier, extends AbstractVectorClassifier but which also
> implements OnlineLearner and Writable.

Implementing OnlineLearner is a surprise here. Have to think about it since the learning doesn't have a target variable.

> ... If this could be completed it would seem to allow kmeans, fuzzyk,
> dirichlet and maybe even meanshift cluster classifiers to be used with SGD.

Very cool.

> ... The challenge would be to use AVC.classify() in the various clusterers
> or to extract initial centers for kmeans & fuzzyk. Dirichlet might be
> adaptable more directly since its models only have to produce the pi vector
> of pdfs.

Yes. Dirichlet is the one where this makes sense.
-
Re: FW: Converging Clustering and Classification
Jeff Eastman 2011-04-14, 04:00
Lol, not too surprising considering the source. Here's how I got there:
- ClusterClassifier holds a "List<Cluster> models;" field as its only state just like VectorModelClassifier does
- Started with ModelSerializerTest since you suggested being compatible with ModelSerializer
- This tests OnlineLogisticRegression, CrossFoldLearner and AdaptiveLogisticRegression
- The first two are also subclasses of AbstractVectorClassifier just like ClusterClassifier
- The tests pass OLR and CFL learners to train(OnlineLearner) so it made sense for a CC to be an OL too
- The new CC.train(...) methods map to "models.get(actual).observe()" in Cluster.observe(V)
- CC.close() maps to cluster.computeParameters() for each model which computes the posterior cluster parameters
- Now the CC is ready for another iteration or to classify, etc.

So, the cluster iteration process starts with a prior List<Cluster> which is used to construct the ClusterClassifier. Then in each iteration each point is passed to CC.classify() and the maximum probability element index in the returned Vector is used to train() the CC. Since all the DistanceMeasureClusters contain their appropriate DistanceMeasure, the one with the maximum pdf() is the closest. This is just what kmeans already does, but done less efficiently (kmeans uses just the minimum distance, but pdf() = e^-distance so the closest cluster has the largest pdf()).

Finally, instead of passing in a List<Cluster> in the KMeansClusterer I can just carry around a CC which wraps it. Instead of serializing a List<Cluster> at the end of each iteration I can just serialize the CC. At the beginning of the next iteration, I just deserialize it and go.

It was so easy it surely must be wrong :)
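The iteration described in this message can be sketched in plain Java. `SketchCluster` and `SketchClusterClassifier` are hypothetical stand-ins, not the Mahout classes. The method names `observe`, `computeParameters`, `classify`, `train` and `close` follow the message, but the signatures are assumptions, Euclidean distance is assumed, and `classify` returns unnormalized pdf values, which does not change which element is the maximum.

```java
import java.util.List;

// Hypothetical cluster: a center plus running sums of the points observed this iteration.
final class SketchCluster {
    final double[] center;
    private final double[] sum;
    private int count;

    SketchCluster(double[] center) {
        this.center = center.clone();
        this.sum = new double[center.length];
    }

    // pdf() = e^-distance, so the nearest cluster has the largest pdf.
    double pdf(double[] point) {
        double squared = 0.0;
        for (int i = 0; i < point.length; i++) {
            double diff = point[i] - center[i];
            squared += diff * diff;
        }
        return Math.exp(-Math.sqrt(squared));
    }

    // Accumulate a point assigned to this cluster.
    void observe(double[] point) {
        for (int i = 0; i < point.length; i++) {
            sum[i] += point[i];
        }
        count++;
    }

    // Posterior center is the mean of the observed points; accumulators are then reset.
    // A cluster that observed nothing keeps its prior center.
    void computeParameters() {
        if (count > 0) {
            for (int i = 0; i < sum.length; i++) {
                center[i] = sum[i] / count;
                sum[i] = 0.0;
            }
            count = 0;
        }
    }
}

final class SketchClusterClassifier {
    private final List<SketchCluster> models;

    SketchClusterClassifier(List<SketchCluster> models) {
        this.models = models;
    }

    // One (unnormalized) pdf value per model.
    double[] classify(double[] point) {
        double[] pdfs = new double[models.size()];
        for (int i = 0; i < pdfs.length; i++) {
            pdfs[i] = models.get(i).pdf(point);
        }
        return pdfs;
    }

    // train(actual, point) maps to models.get(actual).observe(point).
    void train(int actual, double[] point) {
        models.get(actual).observe(point);
    }

    // close() computes the posterior parameters of every model.
    void close() {
        for (SketchCluster model : models) {
            model.computeParameters();
        }
    }

    // One k-means style iteration: classify, pick the maximum pdf, train, then close.
    void iterate(double[][] points) {
        for (double[] point : points) {
            double[] pdfs = classify(point);
            int best = 0;
            for (int i = 1; i < pdfs.length; i++) {
                if (pdfs[i] > pdfs[best]) {
                    best = i;
                }
            }
            train(best, point);
        }
        close();
    }
}
```

Calling `iterate` repeatedly on the same classifier gives successive iterations; because e^-distance decreases monotonically with distance, the assignment matches a nearest-center rule at the cost of an extra exponential per cluster.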
-
Re: FW: Converging Clustering and Classification
Jeff Eastman 2011-04-14, 04:24
If this isn't all a crock, it could potentially collapse kmeans, fuzzyk and Dirichlet into a single implementation too:

- Begin with a prior ClusterClassifier containing the appropriate sort of Cluster, in clusters-n
- For each input Vector, compute the pdf vector using CC.classify()
-- For kmeans, train the most likely model from the pdf vector
-- For Dirichlet, train the model selected by the multinomial of the pdf vector * mixture vector
-- For fuzzyk, train each model by its normalized pdf (would need a new classify method for this)
- Close the CC, computing all posterior model parameters
- Serialize the CC into clusters-n+1

Now that would really be cool.
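The three per-algorithm rules in this proposal differ only in how a pdf vector is turned into training weights, which can be sketched as a single function. The `TrainingPolicy` enum and `PolicySketch` class are hypothetical names; the thread describes the rules but does not name any such types, and the fuzzyk branch follows the message literally (normalized pdf) rather than any particular fuzzy k-means membership formula.

```java
import java.util.Random;

// Hypothetical policy names, one per algorithm mentioned in the message.
enum TrainingPolicy { KMEANS, DIRICHLET, FUZZYK }

final class PolicySketch {

    // Turns a pdf vector into one training weight per model.
    static double[] weights(TrainingPolicy policy, double[] pdf, double[] mixture, Random random) {
        double[] w = new double[pdf.length];
        switch (policy) {
            case KMEANS:
                // Train only the most likely model.
                w[argMax(pdf)] = 1.0;
                break;
            case DIRICHLET: {
                // Train the model drawn from the multinomial over pdf * mixture.
                double[] product = new double[pdf.length];
                for (int i = 0; i < pdf.length; i++) {
                    product[i] = pdf[i] * mixture[i];
                }
                w[sample(normalize(product), random)] = 1.0;
                break;
            }
            case FUZZYK:
                // Train every model in proportion to its normalized pdf.
                w = normalize(pdf);
                break;
        }
        return w;
    }

    // Scales the values to sum to 1; falls back to uniform if they sum to 0.
    static double[] normalize(double[] values) {
        double total = 0.0;
        for (double v : values) {
            total += v;
        }
        double[] result = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            result[i] = total > 0.0 ? values[i] / total : 1.0 / values.length;
        }
        return result;
    }

    // Index of the largest value (first one wins on ties).
    static int argMax(double[] values) {
        int best = 0;
        for (int i = 1; i < values.length; i++) {
            if (values[i] > values[best]) {
                best = i;
            }
        }
        return best;
    }

    // Draws an index with probability given by the normalized weights.
    static int sample(double[] probabilities, Random random) {
        double u = random.nextDouble();
        double cumulative = 0.0;
        for (int i = 0; i < probabilities.length; i++) {
            cumulative += probabilities[i];
            if (u < cumulative) {
                return i;
            }
        }
        return probabilities.length - 1; // guards against rounding error
    }
}
```

With weights in hand, the rest of the loop would be the same for all three algorithms: each model observes the point with its weight, then the classifier is closed and serialized.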
-
Re: FW: Converging Clustering and Classification
Ted Dunning 2011-04-14, 04:46
Yeah... this is what I had in mind when I said grand unified theory.
-
RE: FW: Converging Clustering and Classification
Jeff Eastman 2011-04-14, 15:51
Great, let me see what I can build this weekend as a separate universal clusterer using these ideas.