Whitmore, Mattie
2012-08-15, 17:45
Ted Dunning
2012-08-15, 21:19
Whitmore, Mattie
2012-08-17, 14:36
Paritosh Ranjan
2012-08-17, 15:19
Whitmore, Mattie
2012-08-17, 15:37
Paritosh Ranjan
2012-08-17, 16:15
Whitmore, Mattie
2012-08-22, 17:00
Ted Dunning
2012-08-22, 17:18
Whitmore, Mattie
2012-08-22, 17:40
Ted Dunning
2012-08-22, 23:16
Paritosh Ranjan
2012-08-23, 01:09
Whitmore, Mattie
2012-08-23, 14:25
Paritosh Ranjan
2012-08-23, 16:33
Whitmore, Mattie
2012-08-29, 14:37
Ted Dunning
2012-08-29, 16:16
Whitmore, Mattie
2012-08-30, 16:53
Ted Dunning
2012-08-30, 18:48
Ted Dunning
2012-08-30, 18:51
Whitmore, Mattie
2012-08-30, 19:25
Ted Dunning
2012-08-30, 19:55
Ted Dunning
2012-08-30, 19:57
-
Mahout-279/kmeans++ | Whitmore, Mattie | 2012-08-15, 17:45
Hi!

I have been using RandomSeedGenerator, and was hoping it had a patch like that described in Mahout-279, since I want only 10 vectors out of a set of more than 100,000,000. I have been using canopy clustering for better results, but I still need to do a few passes of kmeans to determine my T, and the random seed does take a long time.

The comments say that you are working on a kmeans++. I searched around but couldn't confirm any more information about it. Is a scalable kmeans++ in the works? (I know research on the subject is quite new.)

Thanks!

Mattie Whitmore
Mathematician/IR&D Software Engineer
HARRIS Corporation - Advanced Information Solutions
301.837.5278
[EMAIL PROTECTED]
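
For readers following the question above: one standard way to draw a handful of seeds from a very large set in a single pass, holding only k items in memory, is reservoir sampling. The sketch below is plain Java and is not Mahout code; the class name ReservoirSeeds is hypothetical, and whether this matches the patch discussed in Mahout-279 is not confirmed by this thread.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

/** Hypothetical helper, not part of Mahout: picks k seeds in one pass over the data. */
final class ReservoirSeeds {
  private ReservoirSeeds() {}

  /**
   * Returns k items chosen uniformly at random from the stream (Algorithm R).
   * If the stream holds fewer than k items, all of them are returned.
   */
  static <T> List<T> sample(Iterator<T> stream, int k, Random rng) {
    if (k <= 0) {
      throw new IllegalArgumentException("k must be positive");
    }
    List<T> reservoir = new ArrayList<>(k);
    long seen = 0;
    while (stream.hasNext()) {
      T item = stream.next();
      seen++;
      if (reservoir.size() < k) {
        // The first k items fill the reservoir directly.
        reservoir.add(item);
      } else {
        // Keep the new item with probability k / seen, replacing a random slot.
        long j = (long) (rng.nextDouble() * seen);
        if (j < k) {
          reservoir.set((int) j, item);
        }
      }
    }
    return reservoir;
  }
}
```

Calling ReservoirSeeds.sample(vectors.iterator(), 10, new Random()) would return 10 candidate seeds after one scan, with memory use proportional to k rather than to the size of the data set.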
-
Re: Mahout-279/kmeans++ | Ted Dunning | 2012-08-15, 21:19
Mattie,

Would this help?

https://github.com/tdunning/knn/blob/master/src/main/java/org/apache/mahout/knn/cluster/BallKmeans.java

and

https://github.com/tdunning/knn/blob/master/docs/scaling-k-means/scaling-k-means.pdf
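
As background for the kmeans++ question, here is a minimal in-memory sketch of the classic k-means++ seeding rule: the first seed is picked uniformly, and each later seed is picked with probability proportional to its squared distance from the nearest seed chosen so far. It is not the BallKmeans code linked above and not a Mahout API; the class name KMeansPlusPlusSeeds and the use of double[] for points are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

/** Hypothetical in-memory illustration of k-means++ seeding; not Mahout code. */
final class KMeansPlusPlusSeeds {
  private KMeansPlusPlusSeeds() {}

  static double squaredDistance(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double d = a[i] - b[i];
      sum += d * d;
    }
    return sum;
  }

  /** Picks k seeds: the first uniformly, the rest with probability proportional to D^2. */
  static List<double[]> choose(List<double[]> points, int k, Random rng) {
    if (k <= 0 || k > points.size()) {
      throw new IllegalArgumentException("need 0 < k <= number of points");
    }
    List<double[]> seeds = new ArrayList<>(k);
    seeds.add(points.get(rng.nextInt(points.size())));
    // d2[i] = squared distance from point i to its nearest chosen seed.
    double[] d2 = new double[points.size()];
    Arrays.fill(d2, Double.POSITIVE_INFINITY);
    while (seeds.size() < k) {
      double[] last = seeds.get(seeds.size() - 1);
      double total = 0;
      for (int i = 0; i < d2.length; i++) {
        d2[i] = Math.min(d2[i], squaredDistance(points.get(i), last));
        total += d2[i];
      }
      if (total == 0) {
        // Every point coincides with a seed; fall back to a uniform choice.
        seeds.add(points.get(rng.nextInt(points.size())));
        continue;
      }
      // Walk the cumulative D^2 weights until the random target is passed.
      double target = rng.nextDouble() * total;
      int chosen = d2.length - 1;
      double cumulative = 0;
      for (int i = 0; i < d2.length; i++) {
        cumulative += d2[i];
        if (cumulative > target) {
          chosen = i;
          break;
        }
      }
      seeds.add(points.get(chosen));
    }
    return seeds;
  }
}
```

Note that this plain version scans the whole data set once per seed, which is one reason a scalable variant is a separate question for a set of 100,000,000 vectors.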
-
RE: Mahout-279/kmeans++ | Whitmore, Mattie | 2012-08-17, 14:36
Hi Ted,

Yes, this is great! I hope to start working with this algorithm in the next couple of weeks.

I have a question about the 0.7 implementation of kmeans and the clusterClassificationThreshold. I have this value set to zero, but the output still shows that about 1/3 of my data is not assigned to a cluster. Am I using this value incorrectly? I did a KMeansDriver.run with the 0.5 and 0.7 API, and had the data pruned despite clusterClassificationThreshold = 0.

Thanks,

Mattie
-
Re: Mahout-279/kmeans++ | Paritosh Ranjan | 2012-08-17, 15:19
clusterClassificationThreshold is for outlier removal, and this is the way it should be used.

Can you provide some more information about your job and the way you are calling it?

And if I look at the code, the vector should be clustered even if the pdf is 0. The method which decides whether the vector should be assigned to a particular cluster or not:

    /**
     * Decides whether the vector should be classified or not based on the max pdf
     * value of the clusters and threshold value.
     *
     * @return whether the vector should be classified or not.
     */
    private static boolean shouldClassify(Vector pdfPerCluster, Double clusterClassificationThreshold) {
      return pdfPerCluster.maxValue() >= clusterClassificationThreshold;
    }
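
To make the comparison in the quoted method concrete, here is a small stand-alone sketch that applies the same rule (keep the vector when the largest pdf is at least the threshold) to a plain double[] instead of Mahout's Vector. The class ThresholdCheck and its helper max are hypothetical names used only for this illustration.

```java
/** Hypothetical stand-in for the quoted method, using double[] instead of Mahout's Vector. */
final class ThresholdCheck {
  private ThresholdCheck() {}

  /** Largest entry of the array (NaN if any entry is NaN, by Math.max semantics). */
  static double max(double[] values) {
    double best = Double.NEGATIVE_INFINITY;
    for (double v : values) {
      best = Math.max(best, v);
    }
    return best;
  }

  /** Same comparison as the quoted shouldClassify: keep when max pdf >= threshold. */
  static boolean shouldClassify(double[] pdfPerCluster, double threshold) {
    return max(pdfPerCluster) >= threshold;
  }
}
```

With a threshold of 0, any vector whose pdf values are all 0 or larger passes the check, which matches the statement above that a pdf of 0 should still be clustered. In this plain-Java stand-in a NaN value would fail the comparison even at threshold 0; whether NaN values occur in the job discussed here is not known from the thread.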
-
RE: Mahout-279/kmeans++ | Whitmore, Mattie | 2012-08-17, 15:37
Sure. I have a dataset which I wish to cluster using Kmeans. Previously (v0.5), when I did a clusterdump, the total number of vectors within the resultant clusters was the same as the total number fed to the algorithm. I wish this to be the case when clustering with v0.7. The only change in the algorithm is clusterClassificationThreshold; I set this value to 0 so that it will in fact cluster all vectors in the dataset.

My logic here was that no vector should have a probability of less than 0 of being in some cluster, and therefore all vectors should cluster.

However, after running a clusterdump I find that vectors (1/3 roughly) have been pruned.

Is this a bug, or me just not understanding the new capabilities?

I should also mention I have vectors which are exactly the same (even their names), perhaps they are the ones being pruned, is that possible?

Another question, if I may: I will eventually want to use the pruning capabilities. Does the ClusterOutputPostProcessorDriver method (or a similar method) have the capability of outputting the pruned vectors into a folder?

Thanks! Please let me know if I'm still not being clear enough.

Mattie
-
Re: Mahout-279/kmeans++ | Paritosh Ranjan | 2012-08-17, 16:15
The clustering algorithm has also changed internally. So, expect the results to be different (and better).

I can think of one reason for this behavior. Maybe lots of clusters have only one vector inside them and, AFAIK, clusterdumper will not output any cluster with a single vector. So I think it's clusterdumper which is doing the invisible "pruning" (by not outputting clusters with single vectors).

Can you cross check the output once with ClusterOutputPostProcessorDriver?

No, no tool can output the pruned vectors. The only way to see all vectors assigned to any cluster is to set clusterClassificationThreshold to 0.

If you still face the problem, then please provide the parameters with which you are calling kmeans.

Regarding "I should also mention I have vectors which are exactly the same (even their names), perhaps they are the ones being pruned, is that possible?"

The name of the vector has nothing to do with clustering; I am not sure whether it will have any effect when clusterdumper is in action. So, crosschecking with ClusterOutputPostProcessorDriver will answer this.

Good luck.
Paritosh
-
RE: Mahout-279/kmeans++ | Whitmore, Mattie | 2012-08-22, 17:00
I did cross check with ClusterOutputPostProcessorDriver, and the files are filled with the same number of vectors that clusterdumper is counting.

I have also verified, by running canopy multiple times with 0.5 and 0.7, that there is a continual discrepancy between the two clustering versions. The max/min number of vectors in a cluster is 19192158/215 using 0.5 and 921998/5 using 0.7. They should not necessarily be the same, since I am using canopy clustering to find initial centroids; however, I would think they would have the same sum, which they do not (45901885 vs 1599154).

Here is the method I am running:

    public static void KmeansClusteringCanopy(String outputDir, String T, String itMax)
        throws IOException, InterruptedException, ClassNotFoundException,
               InstantiationException, IllegalAccessException {

      Configuration conf = new Configuration();

      DistanceMeasure measure = new EuclideanDistanceMeasure();

      Path vectorsFolder = new Path(outputDir, "vectors");
      Path clusterCenters = new Path(outputDir + "-canopy/centriods");
      Path clusterOutput = new Path(outputDir + "-canopy/clusters");

      // create canopies instead of initial vectors
      CanopyDriver.run(conf, vectorsFolder, clusterCenters, measure,
          Double.parseDouble(T), Double.parseDouble(T), false, 0, false);

      // kmeans cluster operation
      KMeansDriver.run(conf, vectorsFolder,
          new Path(clusterCenters, "clusters-0-final/part-r-00000"),
          clusterOutput, measure, 0.01, Integer.parseInt(itMax), true, 0.0, false);

      // post process by putting completed clusters into their own files.
      ClusterOutputPostProcessorDriver.run(clusterOutput,
          new Path(clusterOutput + "/CanopyClusterVectorFolders"), false);
    }

What do you think?

On another but related note: is there a plan to have a method -- say ClusterOutputPostProcessorDriver -- which when run outputs the vectors within clusters as well as a separate folder containing pruned outliers?

Thanks!

Mattie
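
The sanity check described above (cluster sizes should add up to the number of input vectors) can be sketched as follows. This assumes a cluster id has already been extracted for every clustered vector; how to read that from Mahout's output files is not shown, and the class ClusterCountCheck and its methods are hypothetical names.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: compares the sum of cluster sizes with the input count. */
final class ClusterCountCheck {
  private ClusterCountCheck() {}

  /** Counts how many vectors were assigned to each cluster id. */
  static Map<Integer, Long> sizes(List<Integer> assignedClusterIds) {
    Map<Integer, Long> counts = new HashMap<>();
    for (Integer id : assignedClusterIds) {
      counts.merge(id, 1L, Long::sum);
    }
    return counts;
  }

  /** Sum of all cluster sizes. */
  static long total(Map<Integer, Long> sizes) {
    long sum = 0;
    for (long size : sizes.values()) {
      sum += size;
    }
    return sum;
  }

  /** Number of input vectors that do not appear in any cluster. */
  static long missing(long inputCount, Map<Integer, Long> sizes) {
    return inputCount - total(sizes);
  }
}
```

A non-zero result from missing(...) means some input vectors never reached the clustered output, which is the kind of discrepancy reported in this message.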
-
Re: Mahout-279/kmeans++ | Ted Dunning | 2012-08-22, 17:18
Just an off thought, do you have duplicate input points?
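
One quick way to answer that question on a sample of the data is to count exact repeats. The sketch below is plain Java rather than Mahout code, keeps every distinct point in memory (so it suits a sample, not the full data set), and uses the hypothetical class name DuplicatePoints.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper: counts exact duplicate points in an in-memory sample. */
final class DuplicatePoints {
  private DuplicatePoints() {}

  /** Returns how many points are exact copies of an earlier point in the list. */
  static long countDuplicates(List<double[]> points) {
    Map<String, Integer> seen = new HashMap<>();
    long duplicates = 0;
    for (double[] p : points) {
      // Arrays.toString gives a value-based key; a double[] itself hashes by identity.
      String key = Arrays.toString(p);
      if (seen.merge(key, 1, Integer::sum) > 1) {
        duplicates++;
      }
    }
    return duplicates;
  }
}
```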
-
RE: Mahout-279/kmeans++ | Whitmore, Mattie | 2012-08-22, 17:40
Yes, I have data which is exactly the same. If I give every vector a distinct name (even though the data point is the same as other points in the set), will this keep the algorithm from dropping non-distinct vectors/data points (which is what I THINK, but have yet to verify, is going on)?

Thanks,

Mattie
-
Re: Mahout-279/kmeans++ | Ted Dunning | 2012-08-22, 23:16
One way to test this is to add a small amount of noise to all of your data points. This won't be easy from the command line, but is easy from Java.

You can do this, for instance:

    Vector v = // read data as a vector
    Vector u = new DenseVector(v.size()).assign(Functions.random());
    v.assign(u, Functions.plusMult(0.1));
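
The same idea can be tried without the Mahout classes. The sketch below adds scaled random noise to each coordinate of a plain double[], mirroring the v + 0.1 * u step in the snippet above. It assumes uniform noise in [0, 1), which may or may not match what Functions.random() produces; the class name Jitter is hypothetical.

```java
import java.util.Random;

/** Hypothetical helper: perturbs points so that exact duplicates become distinct. */
final class Jitter {
  private Jitter() {}

  /** Adds scale * U[0,1) noise to every coordinate of v, in place. */
  static void addNoise(double[] v, double scale, Random rng) {
    for (int i = 0; i < v.length; i++) {
      v[i] += scale * rng.nextDouble();
    }
  }
}
```

The scale should be small relative to the spread of the data, so that formerly identical points become distinct without changing which cluster they belong to.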
-
Re: Mahout-279/kmeans++ | Paritosh Ranjan | 2012-08-23, 01:09
Can you also try to provide distinct names to vectors and then cluster? It should not have any effect, but it would be good to know the behavior.
-
RE: Mahout-279/kmeans++ Whitmore, Mattie 2012-08-23, 14:25
Yes, unique names will be my next plan -- I just can't kick off that job until after the weekend. If this makes no difference I will also try the noise idea, and I'll follow up about both.
My next question is regarding clusterDump. Is there a way to run this in parallel? I have found some code to execute in java (the preferable method for me) but I would like the method to be faster and not in memory. Is this a possibility? Or in the works? Thanks!
-
Re: Mahout-279/kmeans++ Paritosh Ranjan 2012-08-23, 16:33
clusterDump works in memory, and there are no plans yet to make it distributed (or not in memory). See this: https://issues.apache.org/jira/browse/MAHOUT-940
clusterpp has an option for distributed processing, so you can process any amount of data with it.
-
RE: Mahout-279/kmeans++ Whitmore, Mattie 2012-08-29, 14:37
I re-ran the canopy-kmeans analytic, this time with unique names, and lost more points in the resulting clusters (total points in the clusters = 745490, vs. previously 1599154 for v0.7 and 45901885 for v0.5). The total number of data points fed into the algorithm is 53365862 -- so even v0.5 is missing 14% of the data.
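As a quick sanity check of the counts quoted above, the share of input points that never shows up in any output cluster can be computed directly. This is only a minimal sketch: the numbers are the ones reported in this message, while the class and method names are made up for illustration and are not part of Mahout.

```java
// Sketch only: the counts come from the message above; ClusterCoverage and
// percentMissing are hypothetical names used purely for illustration.
final class ClusterCoverage {

    /** Percentage of input points that did not end up in any output cluster. */
    static double percentMissing(long totalInput, long pointsInClusters) {
        return 100.0 * (totalInput - pointsInClusters) / totalInput;
    }

    public static void main(String[] args) {
        long totalInput = 53365862L;
        System.out.printf("v0.5:               %.1f%% missing%n",
                percentMissing(totalInput, 45901885L));
        System.out.printf("v0.7:               %.1f%% missing%n",
                percentMissing(totalInput, 1599154L));
        System.out.printf("v0.7, unique names: %.1f%% missing%n",
                percentMissing(totalInput, 745490L));
    }
}
```

The v0.5 run comes out at roughly 14% missing, matching the figure above, while both v0.7 runs account for only a few percent of the input.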
I'm thinking if I weight these dense vectors with a weight equal to the number of identical vectors in the set that could work -- Ball Kmeans seems to do this. Is this a correct interpretation of how to use weights in Ball Kmeans, and is Ball Kmeans ready enough to be used/tested? Thanks
-
Re: Mahout-279/kmeans++ Ted Dunning 2012-08-29, 16:16
Yes. The ball k-means implementation does use weights to indicate multiple
vectors. The implementation is definitely ready to test. I would be slightly surprised if it has absolutely zero issues, but your feedback on such issues would help them get fixed much sooner than others.
-
RE: Mahout-279/kmeans++ Whitmore, Mattie 2012-08-30, 16:53
I need to be using the matrices for BallKmeans. Can matrices be named? By this I mean can I assign a column of my matrix to be the "name" of each row?
Thanks!
-
Re: Mahout-279/kmeans++ Ted Dunning 2012-08-30, 18:48
The input to the BallKmeans is actually not a matrix. It is an
Iterable<MatrixSlice>. This can be a matrix since a matrix implements this. So one way to deal with this is to build your own Iterable and put NamedVectors into it. NamedVectors retain labels as you want.
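The "build your own Iterable" suggestion can be sketched in plain Java. The types below are stand-ins written for this example only: LabeledRow and LabeledRows are hypothetical names, and in Mahout the corresponding roles would be played by NamedVector and MatrixSlice as described above. The point is only that each row carries its label next to its numbers while the container stays iterable.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch only: LabeledRow and LabeledRows are hypothetical stand-ins,
// not Mahout classes and not the BallKmeans input type itself.

/** One data point: the label is kept next to the numbers, not among them. */
final class LabeledRow {
    final String name;
    final int index;
    final double[] values;

    LabeledRow(String name, int index, double[] values) {
        this.name = name;
        this.index = index;
        this.values = values;
    }
}

/** A custom Iterable that hands out labeled rows in insertion order. */
final class LabeledRows implements Iterable<LabeledRow> {
    private final List<LabeledRow> rows = new ArrayList<>();

    /** Appends a row; its index is its position in insertion order. */
    LabeledRows add(String name, double... values) {
        rows.add(new LabeledRow(name, rows.size(), values.clone()));
        return this;
    }

    @Override
    public Iterator<LabeledRow> iterator() {
        return rows.iterator();
    }
}
```

With this shape, two numerically identical points can still be told apart after clustering, for example `new LabeledRows().add("point-a", 1.0, 2.0).add("point-b", 1.0, 2.0)`.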
-
Re: Mahout-279/kmeans++ Ted Dunning 2012-08-30, 18:51
But columns aren't what I would expect you to want labeled. I think that
row labels might be nicer. Happily, each named vector has a name for the entire vector as well.
-
RE: Mahout-279/kmeans++ Whitmore, Mattie 2012-08-30, 19:25
I was thinking that one column would be the name for each row -- like a "name column" for each vector in a matrix. I probably mistyped somewhere in there :). Would the algorithm work better if it were given a matrix? I'm thinking of work done on extending matrix multiplication to tensor multiplication, I suppose. That is neither here nor there for this current project.
Thanks for the guidance!
-
Re: Mahout-279/kmeans++ Ted Dunning 2012-08-30, 19:55
The names are outside the vector or matrix data. Vectors and matrices
store numbers, not strings.

On Thu, Aug 30, 2012 at 3:25 PM, Whitmore, Mattie <[EMAIL PROTECTED]> wrote:
> I was thinking that one column would be the name for each row -- like a
> "name column" for each vector in a matrix. I probably mistyped somewhere
> in there :). Would the algorithm implement better as if given a matrix?
> I'm thinking of work done on extending matrix multiplication to tensor
> multiplication I suppose. That is neither here nor there for this current
> project.
>
> Thanks for the guidance!
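A minimal sketch of the point above, that a row's name lives next to the numeric data and not inside it as a "name column". This is plain Java and does not use Mahout; `LabeledRow` and `LabeledRowsDemo` are hypothetical names invented for illustration and only stand in for the idea behind the NamedVector suggestion made earlier in the thread.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a named vector. This is NOT Mahout's NamedVector class.
final class LabeledRow {
    final String name;      // label stored outside the numeric data
    final double[] values;  // numeric components only, no strings

    LabeledRow(String name, double[] values) {
        this.name = name;
        this.values = values.clone();
    }
}

class LabeledRowsDemo {
    // Pair each numeric row with its label instead of adding a "name column".
    static List<LabeledRow> label(String[] names, double[][] rows) {
        if (names.length != rows.length) {
            throw new IllegalArgumentException("need exactly one name per row");
        }
        List<LabeledRow> out = new ArrayList<>();
        for (int i = 0; i < rows.length; i++) {
            out.add(new LabeledRow(names[i], rows[i]));
        }
        return out;
    }

    public static void main(String[] args) {
        List<LabeledRow> rows = label(
            new String[] {"doc-a", "doc-b"},
            new double[][] {{1.0, 2.0}, {3.0, 4.0}});
        // Any Iterable of such rows can be handed to code that only iterates.
        for (LabeledRow r : rows) {
            System.out.println(r.name + " has " + r.values.length + " components");
        }
    }
}
```

Because a `List` is an `Iterable`, a collection built this way can be consumed by anything that only walks the rows in order, and the labels travel with the rows without ever becoming part of the numbers.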
-
Re: Mahout-279/kmeans++Ted Dunning 2012-08-30, 19:57
No. The algorithm works either way. The algorithm doesn't need the full
capabilities of a matrix since it just makes a few sequential passes through the data.

On Thu, Aug 30, 2012 at 3:25 PM, Whitmore, Mattie <[EMAIL PROTECTED]> wrote:
> Would the algorithm implement better as if given a matrix? I'm thinking of
> work done on extending matrix multiplication to tensor multiplication I
> suppose. That is neither here nor there for this current project.
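To illustrate why sequential passes do not require a full matrix, here is a rough sketch of two passes over an `Iterable<double[]>`. It is not the BallKmeans code and not a clustering algorithm; the class `SequentialPasses` and its methods are hypothetical and only show that iteration alone is enough for this kind of computation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration: everything here needs only Iterable, never random access.
class SequentialPasses {
    // Pass 1: component-wise mean of all rows.
    static double[] mean(Iterable<double[]> rows, int dim) {
        double[] sum = new double[dim];
        long n = 0;
        for (double[] r : rows) {
            for (int j = 0; j < dim; j++) {
                sum[j] += r[j];
            }
            n++;
        }
        if (n == 0) {
            throw new IllegalArgumentException("no rows");
        }
        for (int j = 0; j < dim; j++) {
            sum[j] /= n;
        }
        return sum;
    }

    // Pass 2: total squared Euclidean distance of all rows to a given center.
    static double totalSquaredDistance(Iterable<double[]> rows, double[] center) {
        double total = 0.0;
        for (double[] r : rows) {
            for (int j = 0; j < center.length; j++) {
                double d = r[j] - center[j];
                total += d * d;
            }
        }
        return total;
    }

    public static void main(String[] args) {
        List<double[]> rows = Arrays.asList(new double[] {0.0, 0.0}, new double[] {2.0, 2.0});
        double[] center = mean(rows, 2);
        System.out.println(Arrays.toString(center));
        System.out.println(totalSquaredDistance(rows, center));
    }
}
```

Neither method indexes into the data by row number or asks for a column, so the same code would work on rows streamed from disk as well as on an in-memory matrix.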