-
Clustering Questions
Severance, Steve 2010-08-16, 18:15
Hi. I have a few questions. I am using Mahout to do KMeans clustering. I have found the process somewhat complex. Some of my questions may have been answered in JIRA tickets, but I did look before I wrote this.
1. It appears that the .job files contain the code that is actually needed to run. How do I build these? They don't seem to be built when I build Mahout with Maven.
2. The Mahout 0.3 tag line numbers don't seem to match the compiled jars. What revision number is 0.3 built from?
3. It looks like the format of the cluster files changed between 0.3 and 0.4. Is this true?
4. I was never able to get the cluster dumping tool to work. I wrote my own to export the clusters to Hive for analysis. Are there any plans for better Hive integration?
Thanks.
Steve
-
Re: Clustering Questions
Ted Dunning 2010-08-16, 18:19
On Mon, Aug 16, 2010 at 11:15 AM, Severance, Steve <[EMAIL PROTECTED]> wrote:
> 1. It appears that the .job files contain the code that is actually
> needed to run. How do I build these? They don't seem to be built when I
> build mahout with Maven.

Which version are you using? I recommend trunk for pretty much everything.

> 2. The Mahout 0.3 tag line numbers don't seem to match with the
> compiled jars. What revision number is 0.3 built from?

It should have been what was tagged. But, even so, I recommend using trunk.

> 3. It looks like the format of the cluster files changed between
> 0.3 and 0.4. Is this true?

Others can say for sure, but this is very likely. 0.4 is going to be a major change.

> 4. I was never able to get the Cluster dumping tool to work. I wrote my
> own to export the clusters to hive for analysis. Are there any plans for
> better Hive integration?

This has been substantially improved. Is there something that can be done to facilitate Hive integration without making Hive a dependency?
-
RE: Clustering Questions
Severance, Steve 2010-08-16, 18:24
Thanks Ted.
I will move my code to trunk and get it working.
Steve
-
RE: Clustering Questions
Severance, Steve 2010-08-16, 19:07
I updated to the current revision of trunk. It does not package correctly as some of the tests fail.

Failed tests:
    testStartParallelCounting(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest)
    testStartGroupingItems(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest)

Tests in error:
    testLoglikelihood(org.apache.mahout.math.hadoop.similarity.vector.DistributedLoglikelihoodVectorSimilarityTest)
    testKMeansWithCanopyClusterInput(org.apache.mahout.clustering.kmeans.TestKmeansClustering)
    testCompleteJob(org.apache.mahout.cf.taste.hadoop.item.RecommenderJobTest)
    testCompleteJobBoolean(org.apache.mahout.cf.taste.hadoop.item.RecommenderJobTest)
    testTanimoto(org.apache.mahout.math.hadoop.similarity.vector.DistributedTanimotoCoefficientVectorSimilarityTest)
    testStartParallelFPGrowth(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest)
    testCanopyMapperManhattan(org.apache.mahout.clustering.canopy.TestCanopyCreation)
    testCanopyMapperEuclidean(org.apache.mahout.clustering.canopy.TestCanopyCreation)
    testCanopyReducerManhattan(org.apache.mahout.clustering.canopy.TestCanopyCreation)
    testCanopyReducerEuclidean(org.apache.mahout.clustering.canopy.TestCanopyCreation)
    testCanopyGenManhattanMR(org.apache.mahout.clustering.canopy.TestCanopyCreation)
    testCanopyGenEuclideanMR(org.apache.mahout.clustering.canopy.TestCanopyCreation)
    testClusterMapperManhattan(org.apache.mahout.clustering.canopy.TestCanopyCreation)
    testClusterMapperEuclidean(org.apache.mahout.clustering.canopy.TestCanopyCreation)
    testClusteringManhattanMR(org.apache.mahout.clustering.canopy.TestCanopyCreation)
    testClusteringEuclideanMR(org.apache.mahout.clustering.canopy.TestCanopyCreation)
    testUserDefinedDistanceMeasure(org.apache.mahout.clustering.canopy.TestCanopyCreation)
    testCanopyEuclideanMRJob(org.apache.mahout.clustering.meanshift.TestMeanShift)
    testCompleteJob(org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityTest)
    testMaxSimilaritiesPerItem(org.apache.mahout.cf.taste.hadoop.similarity.item.ItemSimilarityTest)
    testRowWeightMapper(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
    testSimilarityReducer(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
    testSimilarityReducerSelfSimilarity(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
    testSmallSampleMatrix(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
    testLimitEntriesInSimilarityMatrix(org.apache.mahout.math.hadoop.similarity.TestRowSimilarityJob)
    testEvaluate(org.apache.mahout.ga.watchmaker.MahoutEvaluatorTest)
    testMaxHeapFPGrowth(org.apache.mahout.fpm.pfpgrowth.FPGrowthTest)
    testFuzzyKMeansMRJob(org.apache.mahout.clustering.fuzzykmeans.TestFuzzyKmeansClustering)
    testTranspose(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)
    testMatrixTimesVector(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)
    testMatrixTimesSquaredVector(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)
    testMatrixTimesMatrix(org.apache.mahout.math.hadoop.TestDistributedRowMatrix)
    testSelfTestBayes(org.apache.mahout.classifier.bayes.BayesClassifierSelfTest)
    testSelfTestCBayes(org.apache.mahout.classifier.bayes.BayesClassifierSelfTest)
    testDistributedLanczosSolver(org.apache.mahout.math.hadoop.decomposer.TestDistributedLanczosSolver)

I can provide any extra info needed. My other build of trunk, which was from August 4th, fails to run seq2sparse because the Lucene standard analyzer cannot be found. Which job file should contain this?
Thanks.
Steve
-
Re: Clustering Questions
Sean Owen 2010-08-16, 19:11
Hmm, these are all passing for me. Sounds like some quirk in your local setup. Under target/surefire-reports you will find complete logs from tests, which would probably reveal the nature of the problem.
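As a rough aid for the advice above, the following Python sketch lists which reports under target/surefire-reports record failures or errors. It assumes the plain-text reports carry a summary line of the form "Tests run: 3, Failures: 2, Errors: 1, Skipped: 0", as in the report quoted later in this thread. The function names `parse_summary` and `failing_reports` are made up for illustration and are not part of Maven or Mahout.

```python
import re
from pathlib import Path

# Assumed summary-line format, taken from the surefire report quoted in this thread.
SUMMARY = re.compile(
    r"Tests run: (\d+), Failures: (\d+), Errors: (\d+), Skipped: (\d+)"
)


def parse_summary(text):
    """Return (run, failures, errors, skipped) from a report's text, or None."""
    match = SUMMARY.search(text)
    return tuple(int(group) for group in match.groups()) if match else None


def failing_reports(root="."):
    """Yield (path, summary) for each .txt report with at least one failure or error."""
    pattern = "**/target/surefire-reports/*.txt"
    for path in sorted(Path(root).glob(pattern)):
        summary = parse_summary(path.read_text(errors="replace"))
        if summary and (summary[1] or summary[2]):
            yield path, summary


if __name__ == "__main__":
    # Run from the top of the checkout to scan every module's reports.
    for path, (run, failures, errors, skipped) in failing_reports():
        print(f"{path}: run={run} failures={failures} errors={errors}")
```

Run from the root of the source tree, this prints one line per failing test set; the full stack traces are still in the individual report files.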
-
Re: Clustering Questions
Ted Dunning 2010-08-16, 19:14
What platform (did you already say)?
On Mon, Aug 16, 2010 at 12:07 PM, Severance, Steve <[EMAIL PROTECTED]> wrote:
> I can provide any extra info needed.
-
RE: Clustering Questions
Severance, Steve 2010-08-16, 20:59
I am on Windows 7. Building through Cygwin. Here is one of the surefire reports.
Steve

-------------------------------------------------------------------------------
Test set: org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest
-------------------------------------------------------------------------------
Tests run: 3, Failures: 2, Errors: 1, Skipped: 0, Time elapsed: 4.157 sec <<< FAILURE!

testStartParallelCounting(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest) Time elapsed: 1.35 sec <<< FAILURE!
junit.framework.ComparisonFailure: null expected:<[[(B,6), (D,6), (A,5), (E,4), (C,3)]]> but was:<[[]]>
    at junit.framework.Assert.assertEquals(Assert.java:81)
    at junit.framework.Assert.assertEquals(Assert.java:87)
    at org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest.testStartParallelCounting(PFPGrowthTest.java:93)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at junit.framework.TestCase.runTest(TestCase.java:168)
    at junit.framework.TestCase.runBare(TestCase.java:134)
    at junit.framework.TestResult$1.protect(TestResult.java:110)
    at junit.framework.TestResult.runProtected(TestResult.java:128)
    at junit.framework.TestResult.run(TestResult.java:113)
    at junit.framework.TestCase.run(TestCase.java:124)
    at junit.framework.TestSuite.runTest(TestSuite.java:232)
    at junit.framework.TestSuite.run(TestSuite.java:227)
    at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
    at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:59)
    at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:115)
    at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:102)
    at org.apache.maven.surefire.Surefire.run(Surefire.java:180)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:350)
    at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1021)

testStartGroupingItems(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest) Time elapsed: 0.014 sec <<< FAILURE!
junit.framework.ComparisonFailure: null expected:<{[D=0, E=1, A=0, B=0, C=1]}> but was:<{[]}>
    at junit.framework.Assert.assertEquals(Assert.java:81)
    at junit.framework.Assert.assertEquals(Assert.java:87)
    at org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest.testStartGroupingItems(PFPGrowthTest.java:101)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at junit.framework.TestCase.runTest(TestCase.java:168)
    at junit.framework.TestCase.runBare(TestCase.java:134)
    at junit.framework.TestResult$1.protect(TestResult.java:110)
    at junit.framework.TestResult.runProtected(TestResult.java:128)
    at junit.framework.TestResult.run(TestResult.java:113)
    at junit.framework.TestCase.run(TestCase.java:124)
    at junit.framework.TestSuite.runTest(TestSuite.java:232)
    at junit.framework.TestSuite.run(TestSuite.java:227)
    at org.junit.internal.runners.JUnit38ClassRunner.run(JUnit38ClassRunner.java:83)
    at org.apache.maven.surefire.junit4.JUnit4TestSet.execute(JUnit4TestSet.java:59)
    at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.executeTestSet(AbstractDirectoryTestSuite.java:115)
    at org.apache.maven.surefire.suite.AbstractDirectoryTestSuite.execute(AbstractDirectoryTestSuite.java:102)
    at org.apache.maven.surefire.Surefire.run(Surefire.java:180)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at org.apache.maven.surefire.booter.SurefireBooter.runSuitesInProcess(SurefireBooter.java:350)
    at org.apache.maven.surefire.booter.SurefireBooter.main(SurefireBooter.java:1021)

testStartParallelFPGrowth(org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest) Time elapsed: 2.789 sec <<< ERROR!
org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/D:/apache/mahout/trunk/core/output/frequentpatterns/fpgrowth
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:224)
    at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
    at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:241)
    at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:885)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:779)
    at org.apache.hadoop.mapreduce.Job.submit(Job.java:432)
    at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:447)
    at org.apache.mahout.fpm.pfpgrowth.PFPGrowth.startAggregating(PFPGrowth.java:240)
    at org.apache.mahout.fpm.pfpgrowth.PFPGrowthTest.testStartParallelFPGrowth(PFPGrowthTest.java:110)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
    at java.lang.reflect.Method.invoke(Method.java:597)
    at junit.framework.TestCase.runTest(TestCase.java:168)
    at junit.framework.TestCase.runBare(TestCase.java:134)
    at junit.fram
-
Re: Clustering Questions
Drew Farris 2010-08-16, 21:22
On Mon, Aug 16, 2010 at 2:15 PM, Severance, Steve <[EMAIL PROTECTED]> wrote:
> 1. It appears that the .job files contain the code that is actually needed to run. How do I build these? They don't seem to be built when I build mahout with Maven.

'mvn clean install' will write the job files to */target/*.job -- example/target/mahout-examples-0.4-SNAPSHOT.job for example. If the unit tests are failing, the job files won't be built. You can do a build with unit tests disabled using 'mvn clean install -Pfastinstall'.

There is likely a problem running the unit tests that is specific to Windows 7. I know there have been reports of difficulties with the tests on Windows platforms previously.

There are some tips on the wiki page about building on Windows that might be useful:
https://cwiki.apache.org/confluence/display/MAHOUT/BuildingMahout

HTH,
Drew
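To confirm that a build actually produced job files, and to see which one packages a given class (for example the Lucene standard analyzer asked about earlier in the thread), a small Python sketch along these lines may help. It rests on assumptions not confirmed in this thread: that a .job file is an ordinary zip/jar archive, that dependencies may be bundled as nested jars, and that the analyzer's fully qualified name is org.apache.lucene.analysis.standard.StandardAnalyzer. The helper names `find_job_files` and `job_contains_class` are hypothetical.

```python
import io
import zipfile
from pathlib import Path


def find_job_files(root="."):
    """Return the *.job archives found directly under each module's target/ directory."""
    return sorted(Path(root).glob("*/target/*.job"))


def job_contains_class(job_path, class_name):
    """Check whether a class is packaged in a job archive, directly or in a nested jar."""
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(job_path) as job:
        names = job.namelist()
        if entry in names:
            return True
        # Assumption: dependencies may be bundled as jars nested inside the archive.
        for name in names:
            if name.endswith(".jar"):
                with zipfile.ZipFile(io.BytesIO(job.read(name))) as nested:
                    if entry in nested.namelist():
                        return True
    return False


if __name__ == "__main__":
    # Assumed fully qualified name of the Lucene analyzer class.
    wanted = "org.apache.lucene.analysis.standard.StandardAnalyzer"
    for job in find_job_files():
        print(job, job_contains_class(job, wanted))
```

An empty result from `find_job_files` after 'mvn clean install' would be consistent with the point above that failing unit tests prevent the job files from being built.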
-
Re: Clustering Questions
Robin Anil 2010-08-17, 02:03
Seems to me like a lack of memory error. Try increasing the heap size.
Hadoop is throwing an "out of mem" exception, which doesn't get propagated to the driver.
Robin
-
RE: Clustering Questions
Severance, Steve 2010-08-17, 02:07
I built everything on OSX and it works now.
Thanks.