Pat Ferrel
2012-06-04, 16:05
Pat Ferrel
2012-06-04, 20:40
Jeff Eastman
2012-06-04, 21:19
Jeff Eastman
2012-06-04, 21:22
Pat Ferrel
2012-06-04, 22:07
Pat Ferrel
2012-06-05, 17:43
Pat Ferrel
2012-06-05, 18:48
Pat Ferrel
2012-06-05, 19:12
Jeff Eastman
2012-06-06, 13:53
Jeff Eastman
2012-06-06, 14:48
Robin Anil
2012-06-06, 14:51
Pat Ferrel
2012-06-06, 15:49
-
Problem using SNAPSHOT kmeans
Pat Ferrel
2012-06-04, 16:05

Using the CLI to kmeans from several trunk versions I get an error I don't understand. When the job died the b3/canopy-centroids/clusters-0-final contained the random-seeds file generated by the kmeans driver and the b3/kmeans-clusters/clusters-0 had several part files but b3/kmeans-clusters/clusters-1 was empty. When I look through the code from the trace it doesn't make much sense.

Command line:
mahout kmeans
    -i b3/vectors/tfidf-vectors/
    -k 20
    -c b3/canopy-centroids/clusters-0-final
    -cl
    -o b3/kmeans-clusters
    -ow
    -cd 0.01
    -x 30
    -dm org.apache.mahout.common.distance.CosineDistanceMeasure

Error:
12/06/04 07:55:03 INFO common.AbstractJob: Command line arguments: {--clustering=null, --clusters=[b3/canopy-centroids/clusters-0-final], --convergenceDelta=[0.01], --distanceMeasure=[org.apache.mahout.common.distance.CosineDistanceMeasure], --endPhase=[2147483647], --input=[b3/vectors/tfidf-vectors/], --maxIter=[30], --method=[mapreduce], --numClusters=[20], --output=[b3/kmeans-clusters], --overwrite=null, --startPhase=[0], --tempDir=[temp]}
2012-06-04 07:55:03.752 java[67308:1903] Unable to load realm info from SCDynamicStore
12/06/04 07:55:03 INFO common.HadoopUtil: Deleting b3/canopy-centroids/clusters-0-final
12/06/04 07:55:04 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new compressor
12/06/04 07:55:04 INFO kmeans.RandomSeedGenerator: Wrote 20 vectors to b3/canopy-centroids/clusters-0-final/part-randomSeed
12/06/04 07:55:04 INFO kmeans.KMeansDriver: Input: b3/vectors/tfidf-vectors Clusters In: b3/canopy-centroids/clusters-0-final/part-randomSeed Out: b3/kmeans-clusters Distance: org.apache.mahout.common.distance.CosineDistanceMeasure
12/06/04 07:55:04 INFO kmeans.KMeansDriver: convergence: 0.01 max Iterations: 30 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
12/06/04 07:55:04 INFO compress.CodecPool: Got brand-new decompressor
Cluster Iterator running iteration 1 over priorPath: b3/kmeans-clusters/clusters-0
12/06/04 07:55:05 INFO input.FileInputFormat: Total input paths to process : 1
12/06/04 07:55:05 INFO mapred.JobClient: Running job: job_local_0001
12/06/04 07:55:06 INFO mapred.MapTask: io.sort.mb = 100
12/06/04 07:55:08 INFO mapred.MapTask: data buffer = 79691776/99614720
12/06/04 07:55:08 INFO mapred.MapTask: record buffer = 262144/327680
12/06/04 07:55:08 INFO mapred.JobClient: map 0% reduce 0%
12/06/04 07:55:09 WARN mapred.LocalJobRunner: job_local_0001
org.apache.mahout.math.IndexException: Index -1 is outside allowable range of [0,20)
    at org.apache.mahout.math.AbstractVector.set(AbstractVector.java:439)
    at org.apache.mahout.clustering.iterator.AbstractClusteringPolicy.select(AbstractClusteringPolicy.java:44)
    at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:52)
    at org.apache.mahout.clustering.iterator.CIMapper.map(CIMapper.java:18)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:764)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:212)
12/06/04 07:55:09 INFO mapred.JobClient: Job complete: job_local_0001
12/06/04 07:55:09 INFO mapred.JobClient: Counters: 0
Exception in thread "main" java.lang.InterruptedException: Cluster Iteration 1 failed processing b3/kmeans-clusters/clusters-1
    at org.apache.mahout.clustering.iterator.ClusterIterator.iterateMR(ClusterIterator.java:186)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.buildClusters(KMeansDriver.java:229)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:149)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.run(KMeansDriver.java:108)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.mahout.clustering.kmeans.KMeansDriver.main(KMeansDriver.java:49)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68)
    at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139)
    at org.apache.mahout.driver.MahoutDriver.main(MahoutDriver.java:195)
-
Re: Problem using SNAPSHOT kmeans
Pat Ferrel
2012-06-04, 20:40

Hmm, switched back to mahout 0.6 and the same command line produced the expected results with the same data. No error. Can't find anything on JIRA.

Is anyone else using kmeans from the trunk on real data?
-
Re: Problem using SNAPSHOT kmeans
Jeff Eastman
2012-06-04, 21:19

It looks like the probabilities vector returned by AbstractClusteringPolicy.classify() has no non-zero elements. In this case, AbstractClusteringPolicy.select()'s call to AbstractVector.maxValueIndex() is returning -1 and that is causing the exception.

How could this happen? I'm not exactly sure, but consider that the probabilities vector is calculated in AbstractClusteringPolicy.classify() by calling DistanceMeasureCluster.pdf() on each of the prior clusters in b3/kmeans-clusters/clusters-0. With a CosineDistanceMeasure I don't see how this could ever return zero. Certainly, some of your vectors will match the prior cluster centers exactly (they were sampled from the input) and those values would return pdf==1. Even if the cosine distance was 1 the pdf would be 0.5.

Some things to try:
- Have you verified the contents of your input vectors actually have data in them?
- Can you run the cluster dumper on the b3/kmeans-clusters/clusters-0 contents?
- Is it possible to run the sequential version (-xm sequential)? If it is you could run it in a debugger to gain more insight.

Jeff
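Jeff's two data points (pdf==1 for an exact match, 0.5 at cosine distance 1) are consistent with a mapping of the form 1 / (1 + distance). The sketch below works through that arithmetic. It is only an illustration: PdfSketch and its methods are hypothetical names, the formula is inferred from the two values quoted in the message, and none of this is taken from the Mahout source.

```java
// Sketch only: PdfSketch is a hypothetical class, not part of Mahout.
final class PdfSketch {

    // Plain cosine distance: 1 - (a . b) / (|a| * |b|).
    // Assumes both arrays have the same length; no guard for zero-length vectors.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Assumed mapping, inferred from the thread: distance 0 -> 1.0, distance 1 -> 0.5.
    static double pdf(double distance) {
        return 1.0 / (1.0 + distance);
    }

    public static void main(String[] args) {
        double[] center = {3.0, 4.0};
        double[] samePoint = {3.0, 4.0};    // identical to the center
        double[] orthogonal = {-4.0, 3.0};  // dot product with the center is 0
        System.out.println("exact match pdf: " + pdf(cosineDistance(center, samePoint)));
        System.out.println("orthogonal pdf:  " + pdf(cosineDistance(center, orthogonal)));
    }
}
```

Under this assumed mapping any finite, non-negative distance gives a strictly positive pdf, which is why a vector of all-zero probabilities looked implausible at this point in the thread.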
-
Re: Problem using SNAPSHOT kmeans
Jeff Eastman
2012-06-04, 21:22

This is the new ClusterIterator k-means implementation and you may have indeed found a corner case. Take a look at my logic in my preceding message and let's see if there is a fix we can try.
-
Re: Problem using SNAPSHOT kmeans
Pat Ferrel
2012-06-04, 22:07

Some things to try:
- Have you verified the contents of your input vectors actually have data in them?
* YES, from the other email you know that the data works fine in 0.6
- Can you run the cluster dumper on the b3/kmeans-clusters/clusters-0 contents?
* YES, It is attached from trunk's clusterdump after the failure of kmeans, of course. A simple data set fortunately.
- Is it possible to run the sequential version (-xm sequential)? If it is you could run it in a debugger to gain more insight.
* YES, will report back.
-
Re: Problem using SNAPSHOT kmeans
Pat Ferrel
2012-06-05, 17:43

I'm not completely sure what I'm looking at but...

In iterateSeq on iteration #1 of processing vectors/tfidf-vectors it reads
vector = "https://farfetchers.com/category/collections/source/brice-berard:{"

it's a named vector where the url is the name, the value is "{", which looks wrong and when that is classified to get a probability it gets

probabilities = "{0:NaN,1:NaN,2:NaN,3:NaN,4:NaN,5:NaN,6:NaN,7:NaN,8:NaN,9:NaN,10:NaN,11:NaN,12:NaN,13:NaN,14:NaN,15:NaN,16:NaN,17:NaN,18:NaN,19:NaN}"

That causes the probabilities.maxValueIndex() = -1 and everything dies.

vector looks wrong, doesn't it? Truncated?

I went back to try the same on mahout 0.6 but iterateSeq does not get called though I used -xm sequential on both runs. I can't see kmeans-clusters/clusters-0 being created on mahout 0.6 either. Is that part of the refactoring?
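The chain described above (empty vector, all-NaN probabilities, maxValueIndex() of -1) can be reproduced in isolation. The following is a minimal sketch under stated assumptions: NaNArgmaxSketch and all of its methods are hypothetical stand-ins, the cosine distance has no zero-vector guard, the pdf is assumed to be 1 / (1 + distance), and the argmax is a generic "start at -1, update on strict greater-than" loop rather than Mahout's actual implementation.

```java
// Sketch only: hypothetical stand-ins, not Mahout's classes or method bodies.
final class NaNArgmaxSketch {

    // Unguarded cosine distance; for an all-zero vector this evaluates 0.0 / 0.0 = NaN.
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0.0;
        double normA = 0.0;
        double normB = 0.0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    // Assumed pdf mapping; NaN in gives NaN out.
    static double pdf(double distance) {
        return 1.0 / (1.0 + distance);
    }

    // One pdf value per cluster center.
    static double[] classify(double[] point, double[][] centers) {
        double[] probabilities = new double[centers.length];
        for (int i = 0; i < centers.length; i++) {
            probabilities[i] = pdf(cosineDistance(point, centers[i]));
        }
        return probabilities;
    }

    // Generic argmax: every comparison against NaN is false, so the index stays -1.
    static int maxValueIndex(double[] values) {
        int result = -1;
        double max = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < values.length; i++) {
            if (values[i] > max) {
                max = values[i];
                result = i;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] centers = {{1.0, 0.0, 0.0}, {0.0, 1.0, 0.0}};
        double[] emptyVector = {0.0, 0.0, 0.0};  // like the "{" vector in the thread
        double[] normalVector = {0.0, 2.0, 0.0};
        System.out.println("empty:  " + maxValueIndex(classify(emptyVector, centers)));
        System.out.println("normal: " + maxValueIndex(classify(normalVector, centers)));
    }
}
```

Using an index of -1 to set an element of a 20-element vector would then produce the kind of "Index -1 is outside allowable range of [0,20)" failure seen in the original trace.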
-
Re: Problem using SNAPSHOT kmeans
Pat Ferrel
2012-06-05, 18:48

Using seqdumper on the TFIDF vectors, that vector is indeed in the list
Key: https://farfetchers.com/category/collections/source/brice-berard: Value: https://farfetchers.com/category/collections/source/brice-berard:{

Looking in the seqfiles we find the document in part-00005 of 10 in no particular part of the file.
Key: https://farfetchers.com/category/collections/source/brice-berard: Value: ::Title::
Brice Berard | FarFetchers.com
Blog Posts

On the chance that this originates in seq2sparse I'll try changing options until the vector looks different, and try clustering again.
-
Re: Problem using SNAPSHOT kmeans
Pat Ferrel
2012-06-05, 19:12

I think I found the root but not sure what needs fixing.

I took out n-gram generation and the vector now looks like this:
Key: https://farfetchers.com/category/collections/source/brice-berard: Value: https://farfetchers.com/category/collections/source/brice-berard:{701:0.5484552974788475,1876:0.6020428878306935,3620:0.5802940184767269}

This works in clustering.

It doesn't seem like a malformed vector should crash clustering (it apparently doesn't in mahout 0.6) but it looks like something in seq2sparse's n-gram weighting does cause a malformed vector.

I'll file a JIRA
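Until the underlying cause is fixed, one possible workaround is to drop vectors with no usable weights before they reach clustering. The sketch below shows the idea on plain arrays. It is an assumption-laden illustration, not a Mahout API and not the fix adopted in the project: EmptyVectorFilterSketch, hasWeight and dropEmpty are hypothetical names, and a real job would apply the same test to its own vector type.

```java
// Sketch only: hypothetical helper, not part of Mahout or seq2sparse.
final class EmptyVectorFilterSketch {

    // True if the vector has at least one entry that is neither zero nor NaN.
    static boolean hasWeight(double[] vector) {
        for (double v : vector) {
            if (v != 0.0 && !Double.isNaN(v)) {
                return true;
            }
        }
        return false;
    }

    // Returns only the vectors that carry some weight, preserving order.
    static double[][] dropEmpty(double[][] vectors) {
        int kept = 0;
        for (double[] v : vectors) {
            if (hasWeight(v)) {
                kept++;
            }
        }
        double[][] result = new double[kept][];
        int next = 0;
        for (double[] v : vectors) {
            if (hasWeight(v)) {
                result[next++] = v;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] input = {
            {0.0, 0.0, 0.0},     // empty, like the brice-berard vector with n-grams enabled
            {0.55, 0.60, 0.58},  // a vector with real TF-IDF weights
            {}                   // no entries at all
        };
        System.out.println("vectors kept: " + dropEmpty(input).length);
    }
}
```

Filtering hides the symptom only; documents that produce empty vectors would still be missing from the clustering output, so the seq2sparse behavior remains worth reporting.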
-
Re: Problem using SNAPSHOT kmeans
Jeff Eastman 2012-06-06, 13:53
Yes, it looks like the input vectors are empty and this is the source of the error. I'm troubled, however, that empty vectors can have this impact on k-means. I'm going to write a unit test to see if I can duplicate this exception.

On 6/5/12 3:12 PM, Pat Ferrel wrote:
> I think I found the root but not sure what needs fixing.
>
> I took out n-gram generation and the vector now looks like this:
> Key: https://farfetchers.com/category/collections/source/brice-berard:
> Value:
> https://farfetchers.com/category/collections/source/brice-berard:{701:0.5484552974788475,1876:0.6020428878306935,3620:0.5802940184767269}
>
> This works in clustering.
>
> It doesn't seem like a malformed vector should crash clustering (it
> apparently doesn't in mahout 0.6) but it looks like something in
> seq2sparse's n-gram weighting does cause a malformed vector.
>
> I'll file a JIRA
-
Re: Problem using SNAPSHOT kmeans
Jeff Eastman 2012-06-06, 14:48
I was able to easily duplicate this exception by creating a Kluster with a zero center and requesting the pdf of a zero vector. This invokes CosineDistanceMeasure.distance() with two empty vectors, creating a corner case where the dotProduct and denominator are both zero. Thus the distance is NaN, which propagates to the probabilities vector as {NaN, NaN, ... NaN} and leads to the out-of-bounds exception in select() that you've observed.

The operative line in CosineDistanceMeasure is:

    return 1.0 - dotProduct / denominator;

... and the problem presents when both dotProduct and denominator are zero. It seems unreasonable for k-means to fail to cluster zero vectors in this case. Seems like in this case the distance ought to return 1. What do others think?

On 6/6/12 9:53 AM, Jeff Eastman wrote:
> Yes, it looks like the input vectors are empty and this is the source
> of the error. I'm troubled; however, that empty vectors can have this
> impact on k-means. I'm going to write a unit test to see if I can
> duplicate this exception.
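
The corner case described in this message can be shown with a small, self-contained sketch. It is not the Mahout source and not necessarily the fix that was committed: the class CosineDistanceSketch, its dense double[] vectors, and the guarded() method are hypothetical, and returning 1 for the zero-vector case simply follows the proposal made in this thread.

```java
// Hypothetical sketch, not Mahout's CosineDistanceMeasure: reproduces the
// 0/0 corner case and one possible guard, using plain dense arrays.
final class CosineDistanceSketch {

    static double dot(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += a[i] * b[i];
        }
        return sum;
    }

    // Same shape as the line quoted in the thread:
    // 1.0 - dotProduct / denominator, with no special cases.
    static double unguarded(double[] a, double[] b) {
        double dotProduct = dot(a, b);
        double denominator = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        return 1.0 - dotProduct / denominator; // 0.0 / 0.0 is NaN
    }

    // Guarded variant: when both terms are zero, return 1 as proposed
    // in the thread instead of letting NaN escape.
    static double guarded(double[] a, double[] b) {
        double dotProduct = dot(a, b);
        double denominator = Math.sqrt(dot(a, a)) * Math.sqrt(dot(b, b));
        if (denominator == 0.0 && dotProduct == 0.0) {
            return 1.0;
        }
        return 1.0 - dotProduct / denominator;
    }

    public static void main(String[] args) {
        double[] zero = {0.0, 0.0};
        double[] v = {3.0, 4.0};

        System.out.println(unguarded(zero, zero)); // prints NaN
        System.out.println(guarded(zero, zero));   // prints 1.0
        System.out.println(guarded(v, v));         // prints 0.0
    }
}
```

Which constant to return for zero vectors is a design choice. The point of the guard is only that distance() never hands NaN to the code that builds the probabilities vector.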
-
Re: Problem using SNAPSHOT kmeans
Robin Anil 2012-06-06, 14:51
yes
------
Robin Anil

On Wed, Jun 6, 2012 at 4:48 PM, Jeff Eastman <[EMAIL PROTECTED]> wrote:
> ... and the problem presents when both dotProduct and denominator are
> zero. It seems unreasonable for k-means to fail to cluster zero vectors in
> this case. Seems like in this case the distance ought to return 1.
-
Re: Problem using SNAPSHOT kmeans
Pat Ferrel 2012-06-06, 15:49
I'll pull and test on the original data when the fix gets to git.
On 6/6/12 7:51 AM, Robin Anil wrote:
> yes
> ------
> Robin Anil
>
> On Wed, Jun 6, 2012 at 4:48 PM, Jeff Eastman <[EMAIL PROTECTED]> wrote:
>
>> ... and the problem presents when both dotProduct and denominator are
>> zero. It seems unreasonable for k-means to fail to cluster zero vectors in
>> this case. Seems like in this case the distance ought to return 1.