|
Vasil Vasilev
2011-01-18, 16:10
Jeff Eastman
2011-01-19, 15:50
Vasil Vasilev
2011-01-20, 08:27
Jeff Eastman
2011-01-20, 18:28
Vasil Vasilev
2011-01-21, 08:39
Ted Dunning
2011-01-21, 15:16
Lance Norskog
2011-01-22, 05:17
Vasil Vasilev
2011-01-24, 11:18
Lance Norskog
2011-01-25, 05:28
Vasil Vasilev
2011-01-25, 15:33
Lance Norskog
2011-01-26, 22:40
Vasil Vasilev
2011-01-27, 11:06
Vasil Vasilev
2011-02-18, 18:16
gaurav redkar
2011-11-09, 12:08
|
-
Meanshift clusteringVasil Vasilev 2011-01-18, 16:10
Hello Mahout developers,
Currently I am trying to get more in depth with the clustering algorithms - how they should be used and tuned. For this purpose I decided to learn from the source code of the different implementations. In this respect I have the following questions about the Meanshift algorithm (sorry if it may sound naive, but I am a novice in the area): 1. I noted that the way it is implemented is different from the straightforward approach that is described in the paper ( http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf). Later I learned from Jira MAHOUT-15 that this was made to enable parallelism. There I also noticed that T2 should be fixed to 1.0. In fact for me it seems that T2 should be correlated with the convergence delta parameter (which by default is 0.5) and should be slightly higher then it. Is my assumption correct? 2. With the current implementation the user has the option to select desired distance measure, but does not have the flexibility to select a kernel. The current approach results in a hard-coded conical kernel with radius T1 and no points outside T1 are considered in the path calculation of the canopy. Is it possible to slightly modify the algorithm (similar to the modification from kmeans to fuzzy kmeans) where weights are associated with a given point that would touch the canopy and these weights are drown from the kernel function. For example they could be drawn from a normal distribution? Do you think the possibility for kernel selection could impact positively the clustering with meanshift in some cases? Regards, Vasil
-
Re: Meanshift clusteringJeff Eastman 2011-01-19, 15:50
On 1/18/11 9:10 AM, Vasil Vasilev wrote:
> Hello Mahout developers, > > Currently I am trying to get more in depth with the clustering algorithms - > how they should be used and tuned. > For this purpose I decided to learn from the source code of the different > implementations. > In this respect I have the following questions about the Meanshift algorithm > (sorry if it may sound naive, but I am a novice in the area): > > 1. I noted that the way it is implemented is different from the > straightforward approach that is described in the paper ( > http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf). > Later I learned from Jira MAHOUT-15 that this was made to enable > parallelism. There I also noticed that T2 should be fixed to 1.0. > In fact for me it seems that T2 should be correlated with the convergence > delta parameter (which by default is 0.5) and should be slightly higher then > it. Is my assumption correct? I think the optimal values for T1 and T2 depend upon the distance measure chosen and the nature of the data itself. As this implementation is really just an iterative application of Canopy, I left both T parameters specifiable too. This is not exactly the same algorithm as Mean Shift in the paper but it seems to do amazingly well in some cases. > 2. With the current implementation the user has the option to select desired > distance measure, but does not have the flexibility to select a kernel. The > current approach results in a hard-coded conical kernel with radius T1 and > no points outside T1 are considered in the path calculation of the canopy. > Is it possible to slightly modify the algorithm (similar to the modification > from kmeans to fuzzy kmeans) where weights are associated with a given point > that would touch the canopy and these weights are drown from the kernel > function. For example they could be drawn from a normal distribution? Do you > think the possibility for kernel selection could impact positively the > clustering with meanshift in some cases? I don't know but it is intriguing. Why don't you try it? > Regards, Vasil >
-
Re: Meanshift clusteringVasil Vasilev 2011-01-20, 08:27
Hi Jeff,
I used the meanshift algorithm to cluster the synthetic control data (I produced slightly different type of vectorization than what was given in the examples) and it gave the best results compared to all other algorithms. So I am really amazed by what it can do! About (1): One interesting thing that I noted is the influence of T2. It turned out that if you do not select T2 correctly the algorithm may not converge. In my case I had the following situation. At the end (6-th iteration) there were 2 canopies left which were within each-other's T1 boundaries, but outside each-other's T2 boundaries. This led to the fact that they were touching each-other on every iteration and switching their centroids. As the distance between the centroids was larger then the convergence delta the algorithm never converged. However when I increased T2 to merge these 2 clusters the algorithm converged successfully. In this respect for me the following procedure worked: 1. Select T2 as small as possible (greater then convergence delta) 2. Increase it until the algorithm converges With this approach the algorithm determined also very nicely the correct number of clusters About (2): OK. I will try it and let you know about the result. On Wed, Jan 19, 2011 at 5:50 PM, Jeff Eastman <[EMAIL PROTECTED]>wrote: > On 1/18/11 9:10 AM, Vasil Vasilev wrote: > >> Hello Mahout developers, >> >> Currently I am trying to get more in depth with the clustering algorithms >> - >> how they should be used and tuned. >> For this purpose I decided to learn from the source code of the different >> implementations. >> In this respect I have the following questions about the Meanshift >> algorithm >> (sorry if it may sound naive, but I am a novice in the area): >> >> 1. I noted that the way it is implemented is different from the >> straightforward approach that is described in the paper ( >> >> http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf >> ). >> Later I learned from Jira MAHOUT-15 that this was made to enable >> parallelism. There I also noticed that T2 should be fixed to 1.0. >> In fact for me it seems that T2 should be correlated with the convergence >> delta parameter (which by default is 0.5) and should be slightly higher >> then >> it. Is my assumption correct? >> > I think the optimal values for T1 and T2 depend upon the distance measure > chosen and the nature of the data itself. As this implementation is really > just an iterative application of Canopy, I left both T parameters > specifiable too. This is not exactly the same algorithm as Mean Shift in the > paper but it seems to do amazingly well in some cases. > > 2. With the current implementation the user has the option to select >> desired >> distance measure, but does not have the flexibility to select a kernel. >> The >> current approach results in a hard-coded conical kernel with radius T1 and >> no points outside T1 are considered in the path calculation of the canopy. >> Is it possible to slightly modify the algorithm (similar to the >> modification >> from kmeans to fuzzy kmeans) where weights are associated with a given >> point >> that would touch the canopy and these weights are drown from the kernel >> function. For example they could be drawn from a normal distribution? Do >> you >> think the possibility for kernel selection could impact positively the >> clustering with meanshift in some cases? >> > I don't know but it is intriguing. Why don't you try it? > >> Regards, Vasil >> >> >
-
Re: Meanshift clusteringJeff Eastman 2011-01-20, 18:28
Hi Vasil,
Thanks for the great testimonial. Why don't you add your tuning suggestions to the wiki? How did your vectorization differ from the example code? If your synthetic control results are as good as you say they are you ought to post those results too. The current implementation has some scalability limitations - as does canopy - but you are not the first to write about its positive results. I'd be interested in collaborating on implementing kernel selection and it would be nice to be able to compare and contrast the algorithm with the original citation on the other points as well. This might be another page on the wiki as this implementation is novel afaict. I have a pretty big dataset (18 gb, compressed) and will try out mean shift on it when I get a chance. On 1/20/11 1:27 AM, Vasil Vasilev wrote: > Hi Jeff, > > I used the meanshift algorithm to cluster the synthetic control data (I > produced slightly different type of vectorization than what was given in the > examples) and it gave the best results compared to all other algorithms. So > I am really amazed by what it can do! > > About (1): One interesting thing that I noted is the influence of T2. It > turned out that if you do not select T2 correctly the algorithm may not > converge. In my case I had the following situation. At the end (6-th > iteration) there were 2 canopies left which were within each-other's T1 > boundaries, but outside each-other's T2 boundaries. This led to the fact > that they were touching each-other on every iteration and switching their > centroids. As the distance between the centroids was larger then the > convergence delta the algorithm never converged. However when I increased T2 > to merge these 2 clusters the algorithm converged successfully. > In this respect for me the following procedure worked: > 1. Select T2 as small as possible (greater then convergence delta) > 2. Increase it until the algorithm converges > With this approach the algorithm determined also very nicely the correct > number of clusters > > About (2): OK. I will try it and let you know about the result. > > > On Wed, Jan 19, 2011 at 5:50 PM, Jeff Eastman<[EMAIL PROTECTED]>wrote: > >> On 1/18/11 9:10 AM, Vasil Vasilev wrote: >> >>> Hello Mahout developers, >>> >>> Currently I am trying to get more in depth with the clustering algorithms >>> - >>> how they should be used and tuned. >>> For this purpose I decided to learn from the source code of the different >>> implementations. >>> In this respect I have the following questions about the Meanshift >>> algorithm >>> (sorry if it may sound naive, but I am a novice in the area): >>> >>> 1. I noted that the way it is implemented is different from the >>> straightforward approach that is described in the paper ( >>> >>> http://homepages.inf.ed.ac.uk/rbf/CVonline/LOCAL_COPIES/TUZEL1/MeanShift.pdf >>> ). >>> Later I learned from Jira MAHOUT-15 that this was made to enable >>> parallelism. There I also noticed that T2 should be fixed to 1.0. >>> In fact for me it seems that T2 should be correlated with the convergence >>> delta parameter (which by default is 0.5) and should be slightly higher >>> then >>> it. Is my assumption correct? >>> >> I think the optimal values for T1 and T2 depend upon the distance measure >> chosen and the nature of the data itself. As this implementation is really >> just an iterative application of Canopy, I left both T parameters >> specifiable too. This is not exactly the same algorithm as Mean Shift in the >> paper but it seems to do amazingly well in some cases. >> >> 2. With the current implementation the user has the option to select >>> desired >>> distance measure, but does not have the flexibility to select a kernel. >>> The >>> current approach results in a hard-coded conical kernel with radius T1 and >>> no points outside T1 are considered in the path calculation of the canopy. >>> Is it possible to slightly modify the algorithm (similar to the
-
Re: Meanshift clusteringVasil Vasilev 2011-01-21, 08:39
Hi Jeff,
Ok, I can do so - I can write about my experience on the wiki, or may be better - write it first on a document and then send it to you for a review. About the vectorization of the data: In the current example each value of the control signal is assigned to a separate dimension. Having in mind that this data is sampled from real-world measurements this should mean that the oscilations should be Gaussian distributed around the ideal line (for example ideal line could be straight line, or an oscilating line - depends on the type of signal). That's why I suppose Gaussian cluster with Dirichlet algorithm worked well in determining the number of clusters. Unfortunately the data in these clusters were very intermixed, i.e. I wouldn't say it could cluster the data correctly. That's why I decided to create another type of vectorization that will create a good separation of the data and will work well also with the distance measure based algorithms (like kmeans for example). For each sample line a produced a vector with 3 dimensions: dimension 1: Using linear regression with gradient descent algorithm I find what is the trend of the line, i.e. is it increasing, decreasing or straight line dimension 2: Knowing the approximating line (from the linear regression) I count how many times this line gets crossed by the original signal. This helps in separating the cyclic data from all the rest dimension 3: What is the biggest increase/decrease of a single signal line. This helps find shifts So to say - I put a semantics for the data that are to be clustered (I don't know if it is correct to do that, but I couldn't think of how an algorithm could cope with the task without such additional semantics) Also I developed a small swing application which visualizes the clustered signals and which helped me in playing with the algorithms. About the implementation of kernel selection in meanshift: I was wondering what is the best way to test the results. Because with the signals clustering the algorithm copes almost perfectly. I also didn't find much in the papers that were cited on the wiki and in the jira about what effects kernel has in different cases. Have you seen any papers about this? May be we could try also the algorithm on image data and see the difference visually? Regards, Vasil On Thu, Jan 20, 2011 at 8:28 PM, Jeff Eastman <[EMAIL PROTECTED]>wrote: > Hi Vasil, > > Thanks for the great testimonial. Why don't you add your tuning suggestions > to the wiki? How did your vectorization differ from the example code? If > your synthetic control results are as good as you say they are you ought to > post those results too. The current implementation has some scalability > limitations - as does canopy - but you are not the first to write about its > positive results. I'd be interested in collaborating on implementing kernel > selection and it would be nice to be able to compare and contrast the > algorithm with the original citation on the other points as well. This might > be another page on the wiki as this implementation is novel afaict. > > I have a pretty big dataset (18 gb, compressed) and will try out mean shift > on it when I get a chance. > > > On 1/20/11 1:27 AM, Vasil Vasilev wrote: > >> Hi Jeff, >> >> I used the meanshift algorithm to cluster the synthetic control data (I >> produced slightly different type of vectorization than what was given in >> the >> examples) and it gave the best results compared to all other algorithms. >> So >> I am really amazed by what it can do! >> >> About (1): One interesting thing that I noted is the influence of T2. It >> turned out that if you do not select T2 correctly the algorithm may not >> converge. In my case I had the following situation. At the end (6-th >> iteration) there were 2 canopies left which were within each-other's T1 >> boundaries, but outside each-other's T2 boundaries. This led to the fact
-
Re: Meanshift clusteringTed Dunning 2011-01-21, 15:16
On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <[EMAIL PROTECTED]> wrote:
> > dimension 1: Using linear regression with gradient descent algorithm I find > what is the trend of the line, i.e. is it increasing, decreasing or > straight > line > dimension 2: Knowing the approximating line (from the linear regression) I > count how many times this line gets crossed by the original signal. This > helps in separating the cyclic data from all the rest > dimension 3: What is the biggest increase/decrease of a single signal line. > This helps find shifts > > So to say - I put a semantics for the data that are to be clustered (I > don't > know if it is correct to do that, but I couldn't think of how an algorithm > could cope with the task without such additional semantics) > It is very common for feature extraction like this to be the key for data-mining projects. Such features are absolutely critical for most time series mining and are highly application dependent. One key aspect of your features is that they are shift invariant. > Also I developed a small swing application which visualizes the clustered > signals and which helped me in playing with the algorithms. > Great idea.
-
Re: Meanshift clusteringLance Norskog 2011-01-22, 05:17
Vasil-
Would you consider adding your estimation algorithm to this patch? https://issues.apache.org/jira/browse/MAHOUT-563 The estimator in there now is stupid- a real one would make the Canopy algorithms orders of magnitude more useful. Lance On Fri, Jan 21, 2011 at 7:16 AM, Ted Dunning <[EMAIL PROTECTED]> wrote: > On Fri, Jan 21, 2011 at 12:39 AM, Vasil Vasilev <[EMAIL PROTECTED]> wrote: > >> >> dimension 1: Using linear regression with gradient descent algorithm I find >> what is the trend of the line, i.e. is it increasing, decreasing or >> straight >> line >> dimension 2: Knowing the approximating line (from the linear regression) I >> count how many times this line gets crossed by the original signal. This >> helps in separating the cyclic data from all the rest >> dimension 3: What is the biggest increase/decrease of a single signal line. >> This helps find shifts >> >> So to say - I put a semantics for the data that are to be clustered (I >> don't >> know if it is correct to do that, but I couldn't think of how an algorithm >> could cope with the task without such additional semantics) >> > > It is very common for feature extraction like this to be the key for > data-mining projects. Such features are absolutely critical for most time > series mining and are highly application dependent. > > One key aspect of your features is that they are shift invariant. > > >> Also I developed a small swing application which visualizes the clustered >> signals and which helped me in playing with the algorithms. >> > > Great idea. > -- Lance Norskog [EMAIL PROTECTED]
-
Re: Meanshift clusteringVasil Vasilev 2011-01-24, 11:18
Hello,
In fact my estimation technique works only for the Meanshift algorithm, because in it the centers of the canopies are moving around. With pure Canopy clustering I don't think it will work. I spent some time trying to realize the idea of different kernel profiles with the meanshift clustering algorithm and here are the results: 1. I changed a little bit the original algorithm. Previously when one cluster "touched" other clusters its centroid was computed only based on the centroids of the clusters it touched, but not based on his centroid itself. Now befor calculating the shift I added one more line which makes the cluster observe his personal centroid: public boolean shiftToMean(MeanShiftCanopy canopy) { canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size()); canopy.computeConvergence(measure, convergenceDelta); canopy.computeParameters(); return canopy.isConverged(); } With this modification also the problem with convergence of the algorithm (that I described above) disappeared, although the number of iterations until convergence increased slightly. This change was necessary in order to introduce the other types of "soft" kernels. 2. I introduced IKernelProfile interface which has the methods: public double calculateValue(double distance, double h); public double calculateDerivativeValue(double distance, double h); 3. I created 2 implementations: TriangularKernelProfile with calculated value: @Override public double calculateDerivativeValue(double distance, double h) { return (distance < h) ? 1.0 : 0.0; } and NormalKernelProfile with calculated value: @Override public double calculateDerivativeValue(double distance, double h) { return UncommonDistributions.dNorm(distance, 0.0, h); } 4. I modified the core for merging canopies: public void mergeCanopy(MeanShiftCanopy aCanopy, Collection<MeanShiftCanopy> canopies) { MeanShiftCanopy closestCoveringCanopy = null; double closestNorm = Double.MAX_VALUE; for (MeanShiftCanopy canopy : canopies) { double norm = measure.distance(canopy.getCenter(), aCanopy.getCenter()); double weight = kernelProfile.calculateDerivativeValue(norm, t1); if(weight > 0.0) { aCanopy.touch(canopy, weight); } if (norm < t2 && ((closestCoveringCanopy == null) || (norm < closestNorm))) { closestNorm = norm; closestCoveringCanopy = canopy; } } if (closestCoveringCanopy == null) { canopies.add(aCanopy); } else { closestCoveringCanopy.merge(aCanopy); } } 5. I modified MeanShiftCanopy void touch(MeanShiftCanopy canopy, double weight) { canopy.observe(getCenter(), weight*((double)boundPoints.size())); observe(canopy.getCenter(), weight*((double)canopy.boundPoints.size())); } 6. I modified some other files which were necessary to compile the code and for the tests to pass With the so changed algorithm I had the following observations: 1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster content is more intermixed 2. As there is no convergence problem any more I found the following procedure for estimating T2 and convergence delta: - bind convergence delta = T2 / 2 - When you decrease T2 the number of iterations increases and the number of resulting clusters after convergence decreases - You come to a moment when you decrease T2 the number of iterations increases, but the number of resulting clusters remains unchanged. This is the point with the best value for T2 3. In case of Normal kernel what you give as T1 is in fact the standard deviation of a normal distribution, so the radius of the window will be T1^2 If you are interested I can send the code. Regards, Vasil On Sat, Jan 22, 2011 at 7:17 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: > Vasil- > > Would you consider adding your estimation algorithm to this patch?
-
Re: Meanshift clusteringLance Norskog 2011-01-25, 05:28
Sure, please. Please include hints about the various cases where it
works (or does not). Thanks! On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <[EMAIL PROTECTED]> wrote: > Hello, > > In fact my estimation technique works only for the Meanshift algorithm, > because in it the centers of the canopies are moving around. With pure > Canopy clustering I don't think it will work. > > I spent some time trying to realize the idea of different kernel profiles > with the meanshift clustering algorithm and here are the results: > 1. I changed a little bit the original algorithm. Previously when one > cluster "touched" other clusters its centroid was computed only based on the > centroids of the clusters it touched, but not based on his centroid itself. > Now befor calculating the shift I added one more line which makes the > cluster observe his personal centroid: > > public boolean shiftToMean(MeanShiftCanopy canopy) { > canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size()); > canopy.computeConvergence(measure, convergenceDelta); > canopy.computeParameters(); > return canopy.isConverged(); > } > > With this modification also the problem with convergence of the algorithm > (that I described above) disappeared, although the number of iterations > until convergence increased slightly. > This change was necessary in order to introduce the other types of "soft" > kernels. > > 2. I introduced IKernelProfile interface which has the methods: > public double calculateValue(double distance, double h); > public double calculateDerivativeValue(double distance, double h); > > 3. I created 2 implementations: > TriangularKernelProfile with calculated value: > @Override > public double calculateDerivativeValue(double distance, double h) { > return (distance < h) ? 1.0 : 0.0; > } > > and NormalKernelProfile with calculated value: > @Override > public double calculateDerivativeValue(double distance, double h) { > return UncommonDistributions.dNorm(distance, 0.0, h); > } > > 4. I modified the core for merging canopies: > > public void mergeCanopy(MeanShiftCanopy aCanopy, Collection<MeanShiftCanopy> > canopies) { > MeanShiftCanopy closestCoveringCanopy = null; > double closestNorm = Double.MAX_VALUE; > for (MeanShiftCanopy canopy : canopies) { > double norm = measure.distance(canopy.getCenter(), > aCanopy.getCenter()); > double weight = kernelProfile.calculateDerivativeValue(norm, t1); > if(weight > 0.0) > { > aCanopy.touch(canopy, weight); > } > if (norm < t2 && ((closestCoveringCanopy == null) || (norm < > closestNorm))) { > closestNorm = norm; > closestCoveringCanopy = canopy; > } > } > if (closestCoveringCanopy == null) { > canopies.add(aCanopy); > } else { > closestCoveringCanopy.merge(aCanopy); > } > } > > 5. I modified MeanShiftCanopy > void touch(MeanShiftCanopy canopy, double weight) { > canopy.observe(getCenter(), weight*((double)boundPoints.size())); > observe(canopy.getCenter(), weight*((double)canopy.boundPoints.size())); > } > > 6. I modified some other files which were necessary to compile the code and > for the tests to pass > > With the so changed algorithm I had the following observations: > > 1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster > content is more intermixed > 2. As there is no convergence problem any more I found the following > procedure for estimating T2 and convergence delta: > - bind convergence delta = T2 / 2 > - When you decrease T2 the number of iterations increases and the number of > resulting clusters after convergence decreases > - You come to a moment when you decrease T2 the number of iterations > increases, but the number of resulting clusters remains unchanged. This is > the point with the best value for T2 > 3. In case of Normal kernel what you give as T1 is in fact the standard Lance Norskog [EMAIL PROTECTED]
-
Re: Meanshift clusteringVasil Vasilev 2011-01-25, 15:33
It is added to https://issues.apache.org/jira/browse/MAHOUT-15
I hope this was the correct place to add it! On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: > Sure, please. Please include hints about the various cases where it > works (or does not). > > Thanks! > > On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <[EMAIL PROTECTED]> > wrote: > > Hello, > > > > In fact my estimation technique works only for the Meanshift algorithm, > > because in it the centers of the canopies are moving around. With pure > > Canopy clustering I don't think it will work. > > > > I spent some time trying to realize the idea of different kernel profiles > > with the meanshift clustering algorithm and here are the results: > > 1. I changed a little bit the original algorithm. Previously when one > > cluster "touched" other clusters its centroid was computed only based on > the > > centroids of the clusters it touched, but not based on his centroid > itself. > > Now befor calculating the shift I added one more line which makes the > > cluster observe his personal centroid: > > > > public boolean shiftToMean(MeanShiftCanopy canopy) { > > canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size()); > > canopy.computeConvergence(measure, convergenceDelta); > > canopy.computeParameters(); > > return canopy.isConverged(); > > } > > > > With this modification also the problem with convergence of the algorithm > > (that I described above) disappeared, although the number of iterations > > until convergence increased slightly. > > This change was necessary in order to introduce the other types of "soft" > > kernels. > > > > 2. I introduced IKernelProfile interface which has the methods: > > public double calculateValue(double distance, double h); > > public double calculateDerivativeValue(double distance, double h); > > > > 3. I created 2 implementations: > > TriangularKernelProfile with calculated value: > > @Override > > public double calculateDerivativeValue(double distance, double h) { > > return (distance < h) ? 1.0 : 0.0; > > } > > > > and NormalKernelProfile with calculated value: > > @Override > > public double calculateDerivativeValue(double distance, double h) { > > return UncommonDistributions.dNorm(distance, 0.0, h); > > } > > > > 4. I modified the core for merging canopies: > > > > public void mergeCanopy(MeanShiftCanopy aCanopy, > Collection<MeanShiftCanopy> > > canopies) { > > MeanShiftCanopy closestCoveringCanopy = null; > > double closestNorm = Double.MAX_VALUE; > > for (MeanShiftCanopy canopy : canopies) { > > double norm = measure.distance(canopy.getCenter(), > > aCanopy.getCenter()); > > double weight = kernelProfile.calculateDerivativeValue(norm, t1); > > if(weight > 0.0) > > { > > aCanopy.touch(canopy, weight); > > } > > if (norm < t2 && ((closestCoveringCanopy == null) || (norm < > > closestNorm))) { > > closestNorm = norm; > > closestCoveringCanopy = canopy; > > } > > } > > if (closestCoveringCanopy == null) { > > canopies.add(aCanopy); > > } else { > > closestCoveringCanopy.merge(aCanopy); > > } > > } > > > > 5. I modified MeanShiftCanopy > > void touch(MeanShiftCanopy canopy, double weight) { > > canopy.observe(getCenter(), weight*((double)boundPoints.size())); > > observe(canopy.getCenter(), > weight*((double)canopy.boundPoints.size())); > > } > > > > 6. I modified some other files which were necessary to compile the code > and > > for the tests to pass > > > > With the so changed algorithm I had the following observations: > > > > 1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster > > content is more intermixed > > 2. As there is no convergence problem any more I found the following > > procedure for estimating T2 and convergence delta: > > - bind convergence delta = T2 / 2
-
Re: Meanshift clusteringLance Norskog 2011-01-26, 22:40
No, please make a new JIRA.
Lance On Tue, Jan 25, 2011 at 7:33 AM, Vasil Vasilev <[EMAIL PROTECTED]> wrote: > It is added to https://issues.apache.org/jira/browse/MAHOUT-15 > > I hope this was the correct place to add it! > > On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: > >> Sure, please. Please include hints about the various cases where it >> works (or does not). >> >> Thanks! >> >> On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <[EMAIL PROTECTED]> >> wrote: >> > Hello, >> > >> > In fact my estimation technique works only for the Meanshift algorithm, >> > because in it the centers of the canopies are moving around. With pure >> > Canopy clustering I don't think it will work. >> > >> > I spent some time trying to realize the idea of different kernel profiles >> > with the meanshift clustering algorithm and here are the results: >> > 1. I changed a little bit the original algorithm. Previously when one >> > cluster "touched" other clusters its centroid was computed only based on >> the >> > centroids of the clusters it touched, but not based on his centroid >> itself. >> > Now befor calculating the shift I added one more line which makes the >> > cluster observe his personal centroid: >> > >> > public boolean shiftToMean(MeanShiftCanopy canopy) { >> > canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size()); >> > canopy.computeConvergence(measure, convergenceDelta); >> > canopy.computeParameters(); >> > return canopy.isConverged(); >> > } >> > >> > With this modification also the problem with convergence of the algorithm >> > (that I described above) disappeared, although the number of iterations >> > until convergence increased slightly. >> > This change was necessary in order to introduce the other types of "soft" >> > kernels. >> > >> > 2. I introduced IKernelProfile interface which has the methods: >> > public double calculateValue(double distance, double h); >> > public double calculateDerivativeValue(double distance, double h); >> > >> > 3. I created 2 implementations: >> > TriangularKernelProfile with calculated value: >> > @Override >> > public double calculateDerivativeValue(double distance, double h) { >> > return (distance < h) ? 1.0 : 0.0; >> > } >> > >> > and NormalKernelProfile with calculated value: >> > @Override >> > public double calculateDerivativeValue(double distance, double h) { >> > return UncommonDistributions.dNorm(distance, 0.0, h); >> > } >> > >> > 4. I modified the core for merging canopies: >> > >> > public void mergeCanopy(MeanShiftCanopy aCanopy, >> Collection<MeanShiftCanopy> >> > canopies) { >> > MeanShiftCanopy closestCoveringCanopy = null; >> > double closestNorm = Double.MAX_VALUE; >> > for (MeanShiftCanopy canopy : canopies) { >> > double norm = measure.distance(canopy.getCenter(), >> > aCanopy.getCenter()); >> > double weight = kernelProfile.calculateDerivativeValue(norm, t1); >> > if(weight > 0.0) >> > { >> > aCanopy.touch(canopy, weight); >> > } >> > if (norm < t2 && ((closestCoveringCanopy == null) || (norm < >> > closestNorm))) { >> > closestNorm = norm; >> > closestCoveringCanopy = canopy; >> > } >> > } >> > if (closestCoveringCanopy == null) { >> > canopies.add(aCanopy); >> > } else { >> > closestCoveringCanopy.merge(aCanopy); >> > } >> > } >> > >> > 5. I modified MeanShiftCanopy >> > void touch(MeanShiftCanopy canopy, double weight) { >> > canopy.observe(getCenter(), weight*((double)boundPoints.size())); >> > observe(canopy.getCenter(), >> weight*((double)canopy.boundPoints.size())); >> > } >> > >> > 6. I modified some other files which were necessary to compile the code >> and >> > for the tests to pass >> > >> > With the so changed algorithm I had the following observations: >> > >> > 1. Using NormalKernel "blurs" the cluster boundaries. I.e. the cluster Lance Norskog [EMAIL PROTECTED]
-
Re: Meanshift clusteringVasil Vasilev 2011-01-27, 11:06
I created: https://issues.apache.org/jira/browse/MAHOUT-597
Regards, Vasil On Thu, Jan 27, 2011 at 12:40 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: > No, please make a new JIRA. > > Lance > > On Tue, Jan 25, 2011 at 7:33 AM, Vasil Vasilev <[EMAIL PROTECTED]> > wrote: > > It is added to https://issues.apache.org/jira/browse/MAHOUT-15 > > > > I hope this was the correct place to add it! > > > > On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <[EMAIL PROTECTED]> > wrote: > > > >> Sure, please. Please include hints about the various cases where it > >> works (or does not). > >> > >> Thanks! > >> > >> On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <[EMAIL PROTECTED]> > >> wrote: > >> > Hello, > >> > > >> > In fact my estimation technique works only for the Meanshift > algorithm, > >> > because in it the centers of the canopies are moving around. With pure > >> > Canopy clustering I don't think it will work. > >> > > >> > I spent some time trying to realize the idea of different kernel > profiles > >> > with the meanshift clustering algorithm and here are the results: > >> > 1. I changed a little bit the original algorithm. Previously when one > >> > cluster "touched" other clusters its centroid was computed only based > on > >> the > >> > centroids of the clusters it touched, but not based on his centroid > >> itself. > >> > Now befor calculating the shift I added one more line which makes the > >> > cluster observe his personal centroid: > >> > > >> > public boolean shiftToMean(MeanShiftCanopy canopy) { > >> > canopy.observe(canopy.getCenter(), canopy.getBoundPoints().size()); > >> > canopy.computeConvergence(measure, convergenceDelta); > >> > canopy.computeParameters(); > >> > return canopy.isConverged(); > >> > } > >> > > >> > With this modification also the problem with convergence of the > algorithm > >> > (that I described above) disappeared, although the number of > iterations > >> > until convergence increased slightly. > >> > This change was necessary in order to introduce the other types of > "soft" > >> > kernels. > >> > > >> > 2. I introduced IKernelProfile interface which has the methods: > >> > public double calculateValue(double distance, double h); > >> > public double calculateDerivativeValue(double distance, double h); > >> > > >> > 3. I created 2 implementations: > >> > TriangularKernelProfile with calculated value: > >> > @Override > >> > public double calculateDerivativeValue(double distance, double h) { > >> > return (distance < h) ? 1.0 : 0.0; > >> > } > >> > > >> > and NormalKernelProfile with calculated value: > >> > @Override > >> > public double calculateDerivativeValue(double distance, double h) { > >> > return UncommonDistributions.dNorm(distance, 0.0, h); > >> > } > >> > > >> > 4. I modified the core for merging canopies: > >> > > >> > public void mergeCanopy(MeanShiftCanopy aCanopy, > >> Collection<MeanShiftCanopy> > >> > canopies) { > >> > MeanShiftCanopy closestCoveringCanopy = null; > >> > double closestNorm = Double.MAX_VALUE; > >> > for (MeanShiftCanopy canopy : canopies) { > >> > double norm = measure.distance(canopy.getCenter(), > >> > aCanopy.getCenter()); > >> > double weight = kernelProfile.calculateDerivativeValue(norm, t1); > >> > if(weight > 0.0) > >> > { > >> > aCanopy.touch(canopy, weight); > >> > } > >> > if (norm < t2 && ((closestCoveringCanopy == null) || (norm < > >> > closestNorm))) { > >> > closestNorm = norm; > >> > closestCoveringCanopy = canopy; > >> > } > >> > } > >> > if (closestCoveringCanopy == null) { > >> > canopies.add(aCanopy); > >> > } else { > >> > closestCoveringCanopy.merge(aCanopy); > >> > } > >> > } > >> > > >> > 5. I modified MeanShiftCanopy > >> > void touch(MeanShiftCanopy canopy, double weight) { > >> > canopy.observe(getCenter(), weight*((double)boundPoints.size()));
-
Re: Meanshift clusteringVasil Vasilev 2011-02-18, 18:16
Hi Mahout developers,
As suggested above, I wrote about my experience clustering the synthetic control data with the different algorithms. Can you tell me your opinion and suggestions? Also where is the best place to post it on the wiki? https://docs.google.com/leaf?id=0B8qKJ44gVr7QMDEyMTlkY2YtNDQ1OC00MmQ3LWFmZTUtOGIxNGZiODFhOGYx&hl=en Regards, Vasil On Thu, Jan 27, 2011 at 1:06 PM, Vasil Vasilev <[EMAIL PROTECTED]> wrote: > I created: https://issues.apache.org/jira/browse/MAHOUT-597 > > Regards, Vasil > > > On Thu, Jan 27, 2011 at 12:40 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: > >> No, please make a new JIRA. >> >> Lance >> >> On Tue, Jan 25, 2011 at 7:33 AM, Vasil Vasilev <[EMAIL PROTECTED]> >> wrote: >> > It is added to https://issues.apache.org/jira/browse/MAHOUT-15 >> > >> > I hope this was the correct place to add it! >> > >> > On Tue, Jan 25, 2011 at 7:28 AM, Lance Norskog <[EMAIL PROTECTED]> >> wrote: >> > >> >> Sure, please. Please include hints about the various cases where it >> >> works (or does not). >> >> >> >> Thanks! >> >> >> >> On Mon, Jan 24, 2011 at 3:18 AM, Vasil Vasilev <[EMAIL PROTECTED]> >> >> wrote: >> >> > Hello, >> >> > >> >> > In fact my estimation technique works only for the Meanshift >> algorithm, >> >> > because in it the centers of the canopies are moving around. With >> pure >> >> > Canopy clustering I don't think it will work. >> >> > >> >> > I spent some time trying to realize the idea of different kernel >> profiles >> >> > with the meanshift clustering algorithm and here are the results: >> >> > 1. I changed a little bit the original algorithm. Previously when one >> >> > cluster "touched" other clusters its centroid was computed only based >> on >> >> the >> >> > centroids of the clusters it touched, but not based on his centroid >> >> itself. >> >> > Now befor calculating the shift I added one more line which makes the >> >> > cluster observe his personal centroid: >> >> > >> >> > public boolean shiftToMean(MeanShiftCanopy canopy) { >> >> > canopy.observe(canopy.getCenter(), >> canopy.getBoundPoints().size()); >> >> > canopy.computeConvergence(measure, convergenceDelta); >> >> > canopy.computeParameters(); >> >> > return canopy.isConverged(); >> >> > } >> >> > >> >> > With this modification also the problem with convergence of the >> algorithm >> >> > (that I described above) disappeared, although the number of >> iterations >> >> > until convergence increased slightly. >> >> > This change was necessary in order to introduce the other types of >> "soft" >> >> > kernels. >> >> > >> >> > 2. I introduced IKernelProfile interface which has the methods: >> >> > public double calculateValue(double distance, double h); >> >> > public double calculateDerivativeValue(double distance, double h); >> >> > >> >> > 3. I created 2 implementations: >> >> > TriangularKernelProfile with calculated value: >> >> > @Override >> >> > public double calculateDerivativeValue(double distance, double h) >> { >> >> > return (distance < h) ? 1.0 : 0.0; >> >> > } >> >> > >> >> > and NormalKernelProfile with calculated value: >> >> > @Override >> >> > public double calculateDerivativeValue(double distance, double h) >> { >> >> > return UncommonDistributions.dNorm(distance, 0.0, h); >> >> > } >> >> > >> >> > 4. I modified the core for merging canopies: >> >> > >> >> > public void mergeCanopy(MeanShiftCanopy aCanopy, >> >> Collection<MeanShiftCanopy> >> >> > canopies) { >> >> > MeanShiftCanopy closestCoveringCanopy = null; >> >> > double closestNorm = Double.MAX_VALUE; >> >> > for (MeanShiftCanopy canopy : canopies) { >> >> > double norm = measure.distance(canopy.getCenter(), >> >> > aCanopy.getCenter()); >> >> > double weight = kernelProfile.calculateDerivativeValue(norm, >> t1); >> >> > if(weight > 0.0) >> >> > { >> >> > aCanopy.touch(canopy, weight); >> >> > }
-
meanshift clusteringgaurav redkar 2011-11-09, 12:08
Hi.. I am unable to identify where is the clusterPoints() function in the
MeanShiftCanopyClusterer.java file being called during the execution of Meanshift job. What i need to know is where are the files in clusteredPoints n clusters-* directory being written when we run the job on hadoop. buildclustersMR() creates the clusters-* directory for each iteration but i am unable to locate the code which actually writes to d part-r-* files . Any suggestions..?? Thanks |