-
Options in TrainClassifier.java
Joe Kumar 2010-09-19, 00:57
Gangadhar,

After running TrainClassifier again, the map task failed with the same exception, and I am fairly sure it is a disk space issue. As the map progressed, I watched my free disk space drop from 81GB. It reached 0 when the map task was almost 66% complete, and then the exception occurred. After the exception, another map task resumed at 33% and I had close to 15GB free (I guess the first map task freed up some space). I expect the free space to drop to zero again and the same exception to be thrown.

I am going to reduce country.txt to just 1 country, recreate wikipediainput, and run TrainClassifier. Will let you know how it goes.

Do we have any benchmarks or system requirements for running this example? Has anyone else run this example successfully? I would appreciate your inputs and thoughts.

Should we look at tuning the code to handle these situations? Any quick suggestions on where to start looking?

regards,
Joe.
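The disk-space observation above was made by watching free space by hand while the map task ran. A minimal sketch of automating that check is shown below. It is not part of Mahout or Hadoop; the function names, the sampling interval, and the warning threshold are assumptions chosen for illustration.

```python
import shutil
import time


def free_gib(path="."):
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


def watch_free_space(path=".", interval_s=30.0, samples=10, warn_below_gib=5.0):
    """Sample free space `samples` times, `interval_s` seconds apart.

    Prints one line per sample and returns the readings so they can be
    compared against the job's progress afterwards.
    """
    readings = []
    for _ in range(samples):
        gib = free_gib(path)
        readings.append(gib)
        # warn_below_gib is an arbitrary, illustrative threshold.
        flag = "  <-- low" if gib < warn_below_gib else ""
        print(f"{time.strftime('%H:%M:%S')}  {gib:8.2f} GiB free{flag}")
        time.sleep(interval_s)
    return readings
```

To use it, call `watch_free_space(path)` in a second terminal while the job runs, with `path` set to whichever directory the local Hadoop installation writes its intermediate data to. That location depends on the installation and is not stated in this thread.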
-
Re: Options in TrainClassifier.java
Joe Kumar 2010-09-19, 12:06
Gangadhar,

I modified $MAHOUT_HOME/examples/src/test/resources/country.txt to contain just 1 entry (spain), used WikipediaDatasetCreatorDriver to create the wikipediainput data set, and then ran TrainClassifier, and it worked. When I ran TestClassifier as below, I got blank results in the output.

    $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d wikipediainput -ng 3 -type bayes -source hdfs

    Summary
    -------------------------------------------------------
    Correctly Classified Instances : 0 ?%
    Incorrectly Classified Instances : 0 ?%
    Total Classified Instances : 0

    =======================================================
    Confusion Matrix
    -------------------------------------------------------
    a <--Classified as
    0 | 0 a = spain
    Default Category: unknown: 1

I am not sure if I am doing something wrong. I have to figure out why my output is so blank. I'll document these steps and mention country.txt in the wiki.

Question to all: should we have 2 country files?

1. country_full_list.txt - this is the existing list
2. country_sample_list.txt - a list with 2 or 3 countries

To get a flavor of the Wikipedia Bayes example, we can use country_sample_list.txt. When new people just want to try out the example, they can reference this file as a parameter. To run the example on a robust, scalable infrastructure, we could use country_full_list.txt.

Any thoughts?

regards
Joe.
-
Re: Options in TrainClassifier.java
Gangadhar Nittala 2010-09-20, 03:13
Joe,

I also tried reducing the number of countries in country.txt. That didn't help. In my case, I was monitoring the disk space and at no time did the free space reach 0%, so I am not sure disk space is the cause. To remove the dependency on the number of countries, I even tried using subjects.txt as the classification; that did not help either.

I think this problem is due to the type of data being processed, but I am not sure what I need to change to get the data processed successfully.

I guess the experienced folks on Mahout will be able to tell us what is missing.

Thank you
Gangadhar
-
Re: Options in TrainClassifier.java
Ted Dunning 2010-09-20, 03:25
I am watching these efforts with interest, but have been unable to contribute much to the process. I would encourage Joe and others to keep whittling this problem down so that we can understand what is causing it.

In the meantime, I think that the SGD classifiers are close to production quality. For problems with fewer than several million training examples, and especially problems with many sparse features, I think that these classifiers might be easier to get started with than the Naive Bayes classifiers. To make a virtue of a defect, the SGD-based classifiers do not use Hadoop for training. This makes deployment of a classification training workflow easier, but limits the total size of data that can be handled.

What would you guys need to get started with trying these alternative models?
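For readers unfamiliar with the idea, the following is a minimal sketch of what "SGD training on sparse features without Hadoop" can look like: binary logistic regression updated one example at a time, in a single process. This is not Mahout's implementation or API. The function names, learning rate, epoch count, and the four toy sentences are all assumptions made for illustration, and the sketch omits a bias term and regularization.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def bag_of_words(text):
    """Sparse feature vector: token -> count, stored as a dict."""
    counts = {}
    for token in text.lower().split():
        counts[token] = counts.get(token, 0.0) + 1.0
    return counts


def predict(weights, features):
    """Probability of the positive class for one sparse example."""
    z = sum(weights.get(name, 0.0) * value for name, value in features.items())
    return sigmoid(z)


def train_sgd(examples, epochs=20, learning_rate=0.5, seed=0):
    """Stochastic gradient ascent on the log-likelihood, one example per step."""
    rng = random.Random(seed)
    examples = list(examples)
    weights = {}
    for _ in range(epochs):
        rng.shuffle(examples)
        for features, label in examples:
            error = label - predict(weights, features)
            # Only the features present in this example are touched,
            # which is what keeps sparse data cheap to train on.
            for name, value in features.items():
                weights[name] = weights.get(name, 0.0) + learning_rate * error * value
    return weights


# Toy, made-up data: label 1 = "spain", label 0 = "france".
TRAINING = [
    ("madrid is the capital of spain", 1),
    ("barcelona is a city in spain", 1),
    ("paris is the capital of france", 0),
    ("lyon is a city in france", 0),
]
weights = train_sgd([(bag_of_words(text), label) for text, label in TRAINING])
print(predict(weights, bag_of_words("madrid spain")))
print(predict(weights, bag_of_words("paris france")))
```

Because every update reads and writes a single in-memory weight table, training needs no cluster, which matches the trade-off described above: simpler deployment, but the data has to be streamed through one machine.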
-
Re: Options in TrainClassifier.java
Joe Kumar 2010-09-20, 06:38
Hi Ted,

Sure, will keep digging.

About SGD, I don't have an idea of how it works. If there is some documentation, reference, or quick summary to read about it, that would be great. I just saw one reference at https://cwiki.apache.org/confluence/display/MAHOUT/Logistic+Regression.

I am assuming we should be able to create a model from Wikipedia articles and label the country of a new article. If so, could you please provide a note on how to do this? We already have the Wikipedia data being extracted for specific countries using WikipediaDatasetCreatorDriver. How do we go about training the classifier using SGD?

thanks for your help,
Joe.
-
Re: Options in TrainClassifier.java
Robin Anil 2010-09-20, 10:31
Hi Guys,

Sorry about not replying. I see two possible problems. First, you need at least 2 countries; otherwise there is no classification. Second, ngram = 3 is a bit too high. With Wikipedia this will result in a huge number of features. Why don't you try with ngram = 1 and see?

Robin
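The point about ngram = 3 producing far more features can be illustrated with a short sketch that counts distinct word n-grams in a sample sentence. It assumes that a gram size of n means "all word n-grams of sizes 1 through n"; that reading, the function names, and the sample sentence are assumptions for illustration and are not taken from the Mahout code.

```python
def word_ngrams(tokens, n):
    """All contiguous word n-grams of length exactly n."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def features_up_to(tokens, max_n):
    """Distinct features when every n-gram size from 1 to max_n is used."""
    features = set()
    for n in range(1, max_n + 1):
        features.update(word_ngrams(tokens, n))
    return features


SAMPLE = "the capital of spain is madrid and the capital of france is paris"
tokens = SAMPLE.split()

for max_n in (1, 2, 3):
    print(max_n, len(features_up_to(tokens, max_n)))
```

Even on this one sentence the feature set roughly triples between gram size 1 and gram size 3. On a corpus the size of Wikipedia the gap is much wider, because the single-word vocabulary levels off while new word combinations keep appearing, and every extra feature has to be written out and shuffled between the map and reduce phases.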
-
Re: Options in TrainClassifier.java
Joe Kumar 2010-09-20, 17:09
Robin,

Thanks for your tip. Will try it out and post updates.

reg
Joe.
-
Re: Options in TrainClassifier.java
Joe Kumar 2010-09-21, 02:30
Robin / Gangadhar,

With ngram as 1 and all the countries in country.txt, the model is created without any issues.

    $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job org.apache.mahout.classifier.bayes.TrainClassifier -ng 1 -i wikipediainput -o wikipediamodel -type bayes -source hdfs

Robin,
The default value of the ngram parameter is documented as 1, but it is set as a mandatory parameter in TrainClassifier. So I'll modify the code to set the default ngram to 1 and make it a non-mandatory parameter.

That aside, when I try to test the model, the summary is printed as below.

    Summary
    -------------------------------------------------------
    Correctly Classified Instances : 0 ?%
    Incorrectly Classified Instances : 0 ?%
    Total Classified Instances : 0

Need to figure out the reason.

Since TestClassifier has the same parameters and settings as TrainClassifier, can I modify it to set default values for ngram, classifierType and dataSource?

reg,
Joe.
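The change proposed above, an option that is documented with a default but still enforced as mandatory, is a general command-line parsing pattern. The sketch below shows it with Python's `argparse` rather than the Java option parser that TrainClassifier uses, so it illustrates the idea and is not the actual patch. The option names mirror the commands in this thread; the default values for `classifierType` and `dataSource` are assumptions, since the thread does not say what they should be.

```python
import argparse


def build_parser():
    """Parser where only input and output are mandatory."""
    parser = argparse.ArgumentParser(description="Illustrative trainer options")
    parser.add_argument("-i", "--input", required=True,
                        help="input directory")
    parser.add_argument("-o", "--output", required=True,
                        help="model output directory")
    # Optional, with the documented default of 1.
    parser.add_argument("-ng", "--gramSize", dest="gram_size", type=int, default=1,
                        help="n-gram size (default: 1)")
    # Assumed defaults, chosen only for this illustration.
    parser.add_argument("-type", "--classifierType", dest="classifier_type",
                        default="bayes", help="classifier type (default: bayes)")
    parser.add_argument("-source", "--dataSource", dest="data_source",
                        default="hdfs", help="data source (default: hdfs)")
    return parser


# Omitting -ng now falls back to the default instead of failing.
args = build_parser().parse_args(["-i", "wikipediainput", "-o", "wikipediamodel"])
print(args.gram_size, args.classifier_type, args.data_source)
```

Keeping the defaults in one parser definition also means the help text and the actual behavior cannot drift apart, which is the mismatch described in the message above.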
-
Re: Options in TrainClassifier.java
Gangadhar Nittala 2010-09-21, 03:13
Joe,

I will try with the ngram setting of 1 and let you know how it goes. Robin, the ngram parameter is used to check the number of subsequences of characters, isn't it? Or is it evaluated differently with respect to the Bayesian classifier?

Ted, as Joe mentioned, if you could point us to some information on SGD, we could try it and report the results back to the list.

Thank you
Gangadhar
Gangadhar Nittala 2010-09-21, 03:13
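Robin's remark that a gram size of 3 produces a huge number of features can be illustrated with a small sketch. It assumes the gram size counts consecutive word tokens rather than characters; that is only an assumption, since the thread does not settle the question. NGramSketch and its ngrams method are hypothetical names for illustration and are not Mahout code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;

final class NGramSketch {

  /** Returns every word n-gram of size 1..maxGram, tokens joined with '_'. Hypothetical helper. */
  static List<String> ngrams(String text, int maxGram) {
    List<String> grams = new ArrayList<>();
    String trimmed = text.trim().toLowerCase();
    if (trimmed.isEmpty()) {
      return grams; // no tokens, so no features
    }
    String[] tokens = trimmed.split("\\s+");
    for (int n = 1; n <= maxGram; n++) {
      // Slide a window of n tokens across the text.
      for (int start = 0; start + n <= tokens.length; start++) {
        grams.add(String.join("_", Arrays.copyOfRange(tokens, start, start + n)));
      }
    }
    return grams;
  }

  public static void main(String[] args) {
    String text = "madrid is the capital of spain and barcelona is the capital of catalonia";
    for (int gramSize = 1; gramSize <= 3; gramSize++) {
      int distinct = new HashSet<>(ngrams(text, gramSize)).size();
      System.out.println("gram size " + gramSize + ": " + distinct + " distinct features");
    }
  }
}
```

Even on a single sentence the number of distinct features grows with the gram size; over a Wikipedia dump the growth would be far larger, which is consistent with the advice to start with a gram size of 1.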
-
Re: Options in TrainClassifier.java
Gangadhar Nittala 2010-09-24, 02:43
Joe,
Can you let me know which command you used to test the classifier? With the ngrams set to 1, as suggested by Robin, I was able to train the classifier. The command:

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 1 --input wikipediainput10 --output wikipediamodel10 --classifierType bayes --dataSource hdfs

After this, as per the wiki, we need to get the data from HDFS. I did that:

<HADOOP_HOME>/bin/hadoop dfs -get wikipediainput10 wikipediainput10

After this, the classifier is to be tested:

$HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel10 -d wikipediainput10 -ng 1 -type bayes -source hdfs

When I run this, it runs for close to 2 hours and then errors out with a java.io.FileException saying that the logs_ is a directory in the wikipediainput10 folder. I am sorry I can't provide the stack trace right now, because I accidentally closed the terminal window before I could copy it. I will run this again and send the stack trace.

But if you can send me the steps that you followed after running the classifier, I can repeat those and see whether I am able to execute the classifier successfully.

Thank you
Gangadhar
-
Re: Options in TrainClassifier.java
Joe Kumar 2010-09-24, 12:44
Hi Gangadhar,
I ran TestClassifier with similar parameters. It didn't take me 2 hours, though.

I have documented the steps that worked for me at https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example
Can you please get the patch available at MAHOUT-509, apply it, and then try the steps in the wiki? Please let me know if you still face issues.

reg
Joe.
-
Re: Options in TrainClassifier.java
Gangadhar Nittala 2010-09-26, 13:28
Joe,
I am out of town this week and won't have access to my machine. I will check this during the weekend and get back to you. I will follow the steps in the wiki.

Thank you
-
Re: Options in TrainClassifier.java
Gangadhar Nittala 2010-10-07, 04:22
Joe / others,
I was finally able to test the changes that were done as part of MAHOUT-509 [ https://issues.apache.org/jira/browse/MAHOUT-509 ] and follow the instructions in the wiki for the Bayes example [ https://cwiki.apache.org/confluence/display/MAHOUT/Wikipedia+Bayes+Example ].

The instructions in the wiki work only if testclassifier.props has the values for the required options. Otherwise, the user needs to provide the values for the datasource, classifiertype and n-gram size on the command line. The testClassifier executed and printed a large matrix of values (though I still don't know how to interpret the results :) )

Also, I found a minor problem in TestClassifier.java, where Integer.parseInt is called on the command line option that is read. If there are any leading or trailing spaces in testclassifier.props, this results in a NumberFormatException. The attached patch trims the string before calling parseInt.

I have attached a patch which has the modified testclassifier.props and the parseInt fix. I think both of these belong to MAHOUT-509. If you think the wiki can be modified to include the parameters instead of having settings in a .props file (preferring clarity for the user over ease of use), then I can modify the wiki instructions and remove the .props file from the patch. The fix for TestClassifier.java, though, I think is required: it sanitizes the user input.

I am not sure what the preferred approach is for providing patches for a resolved issue. Should I create a new issue just for this, or would it be easier to add this patch to the existing issue? Please let me know and I shall create a new issue and attach the modified patch file to it.

Thank you
Gangadhar

p.s: I named the patch file with an underscore as the existing issue already has a MAHOUT-509.patch
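The whitespace problem described above can be reproduced in isolation. The following is a minimal sketch, not the contents of the patch: GramSizeParsing, parseGramSize and the gramSize property key are hypothetical names, and falling back to a default for a missing value is an added assumption.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Properties;

final class GramSizeParsing {

  /** Parses a gram size, tolerating surrounding whitespace. Hypothetical helper, not the actual patch. */
  static int parseGramSize(String raw, int defaultValue) {
    if (raw == null || raw.trim().isEmpty()) {
      return defaultValue; // assumption: fall back to a default when the value is missing
    }
    return Integer.parseInt(raw.trim());
  }

  public static void main(String[] args) throws IOException {
    Properties props = new Properties();
    // java.util.Properties keeps trailing whitespace in values, so this value is "1  ".
    props.load(new StringReader("gramSize=1  \n"));
    String raw = props.getProperty("gramSize");
    try {
      Integer.parseInt(raw); // fails because of the trailing spaces
    } catch (NumberFormatException e) {
      System.out.println("untrimmed parse failed: " + e.getMessage());
    }
    System.out.println("trimmed parse: " + parseGramSize(raw, 1));
  }
}
```

Trimming at the point of parsing keeps the .props file forgiving without changing how the option is declared.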
-
Re: Options in TrainClassifier.java
Ted Dunning 2010-10-07, 16:57
Can you attach the patch there? The mailing list strips attachments.

On Wed, Oct 6, 2010 at 9:22 PM, Gangadhar Nittala <[EMAIL PROTECTED]> wrote:

> I have attached a patch which has the modified testclassifier.props
> and the fix with the parseInt. I think both these belong to
> MAHOUT-509
-
Re: Options in TrainClassifier.java
Gangadhar Nittala 2010-10-07, 21:44
Ted,
I've added the patch MAHOUT-509_1.patch in Jira [ https://issues.apache.org/jira/browse/MAHOUT-509 ].

Thank you
-
Re: Options in TrainClassifier.java
Ted Dunning 2010-09-21, 05:41
There is a test program called TrainNewsGroups in org.apache.mahout.classifier.sgd in the examples module.

I would love to work with you to get better documentation pulled together.
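For readers who, like Joe, have not met SGD before, the idea behind an online logistic regression classifier can be sketched in plain Java: the model is updated one training example at a time, so no Hadoop job is involved. This is not Mahout's implementation and does not use the Mahout API; the class name, the learning rate, the number of passes and the toy data are all assumptions made for illustration.

```java
final class SgdSketch {
  private final double[] weights;
  private final double learningRate;

  SgdSketch(int numFeatures, double learningRate) {
    this.weights = new double[numFeatures];
    this.learningRate = learningRate;
  }

  /** Estimated probability that the example belongs to class 1. */
  double predict(double[] features) {
    double z = 0.0;
    for (int i = 0; i < weights.length; i++) {
      z += weights[i] * features[i];
    }
    return 1.0 / (1.0 + Math.exp(-z)); // logistic function
  }

  /** One stochastic gradient step on a single example (label is 0 or 1). */
  void train(double[] features, int label) {
    double error = label - predict(features);
    for (int i = 0; i < weights.length; i++) {
      weights[i] += learningRate * error * features[i];
    }
  }

  /** Trains on a tiny made-up data set: feature 0 is a bias term, feature 1 separates the classes. */
  static SgdSketch trainToyModel() {
    double[][] xs = {{1, 0.1}, {1, 0.3}, {1, 0.7}, {1, 0.9}};
    int[] ys = {0, 0, 1, 1};
    SgdSketch model = new SgdSketch(2, 0.5);
    for (int pass = 0; pass < 2000; pass++) {
      for (int i = 0; i < xs.length; i++) {
        model.train(xs[i], ys[i]);
      }
    }
    return model;
  }

  public static void main(String[] args) {
    SgdSketch model = trainToyModel();
    System.out.println("p(class 1 | x = 0.1) = " + model.predict(new double[] {1, 0.1}));
    System.out.println("p(class 1 | x = 0.9) = " + model.predict(new double[] {1, 0.9}));
  }
}
```

Because each update touches only one example, training can run in a single process, which fits the description of SGD as easier to deploy but limited in the total amount of data it can handle.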
-
Re: Options in TrainClassifier.java
Joe Kumar 2010-09-20, 05:14
Gangadhar,
Just to eliminate the usual suspects, I am using Mac OS X 10.5.8, Mahout 0.4 (revision 986659), Hadoop 0.20.2, 2 GB of memory for Hadoop, and 80 GB of free space.

I had issues with my namenode, so I formatted it using hadoop namenode -format.
$MAHOUT_HOME/examples/src/test/resources/country.txt had just 1 entry (spain). I haven't tried with multiple entries.

The commands that I executed:

$> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job org.apache.mahout.classifier.bayes.WikipediaXmlSplitter -d $MAHOUT_HOME/examples/temp/enwiki-latest-pages-articles10.xml -o wikipedia/chunks -c 64

$> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver -i wikipedia/chunks -o wikipediainput -c $MAHOUT_HOME/examples/src/test/resources/country.txt

$> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job org.apache.mahout.classifier.bayes.TrainClassifier -i wikipediainput -o wikipediamodel -type bayes -source hdfs

$> hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job org.apache.mahout.classifier.bayes.TestClassifier -m wikipediamodel -d wikipediainput -ng 3 -type bayes -source hdfs

Please try the above and let me know. We'll try to find out what is going wrong.

Reg,
Joe.
-
Re: Options in TrainClassifier.java
deneche abdelhakim 2010-09-20, 05:45
I don't know if it's related, but I remember getting a similar Exception one year ago when I was working on the implementation of Random Forests. In my case it was caused by SequenceFile.Sorter.merge(). I ended up writing my own merge function because I really didn't need to sort the output.
-
Options in TrainClassifier.java
Joe Kumar 2010-09-15, 04:56
Hi all,
As I was going through the wikipedia example, I encountered a situation with TrainClassifier wherein some of the options with default values are actually mandatory. The documentation / command line help says that

1. the default source (--datasource) is hdfs, but TrainClassifier has withRequired(true) while building the --datasource option. We check whether the dataSourceType is hbase and otherwise set it to hdfs, so ideally withRequired should be set to false.
2. the default --classifierType is bayes, but withRequired is set to true and we have code like

    if ("bayes".equalsIgnoreCase(classifierType)) {
      log.info("Training Bayes Classifier");
      trainNaiveBayes(inputPath, outputPath, params);
    } else if ("cbayes".equalsIgnoreCase(classifierType)) {
      log.info("Training Complementary Bayes Classifier");
      // setup the HDFS and copy the files there, then run the trainer
      trainCNaiveBayes(inputPath, outputPath, params);
    }

which should be changed to

    if ("cbayes".equalsIgnoreCase(classifierType)) {
      log.info("Training Complementary Bayes Classifier");
      trainCNaiveBayes(inputPath, outputPath, params);
    } else {
      log.info("Training Bayes Classifier");
      // setup the HDFS and copy the files there, then run the trainer
      trainNaiveBayes(inputPath, outputPath, params);
    }

Please let me know if this looks valid and I'll submit a patch for a JIRA issue.

reg
Joe.
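The proposed behavior, where an omitted or unrecognized option falls back to its documented default, can be sketched on its own. This is a minimal illustration and not the actual TrainClassifier code: TrainerDefaults, resolveClassifierType and resolveDataSource are hypothetical names, and the command-line option building with withRequired is not reproduced here.

```java
final class TrainerDefaults {

  /** Returns "cbayes" only when explicitly requested; anything else, including null, means "bayes". */
  static String resolveClassifierType(String requested) {
    return "cbayes".equalsIgnoreCase(requested) ? "cbayes" : "bayes";
  }

  /** Returns "hbase" only when explicitly requested; anything else, including null, means "hdfs". */
  static String resolveDataSource(String requested) {
    return "hbase".equalsIgnoreCase(requested) ? "hbase" : "hdfs";
  }

  public static void main(String[] args) {
    // A null argument stands for an option that was left off the command line.
    System.out.println(resolveClassifierType(null));
    System.out.println(resolveClassifierType("CBayes"));
    System.out.println(resolveDataSource(null));
    System.out.println(resolveDataSource("hbase"));
  }
}
```

Testing for the non-default value first means the default branch also covers a missing value, which is what allows the option to stop being mandatory.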
-
Re: Options in TrainClassifier.java
Robin Anil 2010-09-15, 05:10
On Wed, Sep 15, 2010 at 10:26 AM, Joe Kumar <[EMAIL PROTECTED]> wrote:

> Please let me know if this looks valid and I'll submit a patch for a JIRA
> issue.

+1, all valid. Go ahead and fix it, and in the cmdline flags write the default behavior in the flag description.
-
Re: Options in TrainClassifier.java Joe Kumar 2010-09-15, 05:16
Robin,

Sure, I'll submit a patch.

The command-line flag descriptions already specify the default behavior:

    --classifierType (-type) classifierType    Type of classifier: bayes|cbayes. Default: bayes
    --dataSource (-source) dataSource          Location of model: hdfs|hbase. Default Value: hdfs

So there is no change in the flag description.

reg,
Joe.
Joe Kumar 2010-09-15, 05:16
-
Re: Options in TrainClassifier.java Gangadhar Nittala 2010-09-16, 01:41
I ran into the issue that Joe mentioned about the command-line parameters. I just added the datasource to the command line, thus:

    $HADOOP_HOME/bin/hadoop jar $MAHOUT_HOME/examples/target/mahout-examples-0.4-SNAPSHOT.job \
      org.apache.mahout.classifier.bayes.TrainClassifier --gramSize 3 \
      --input wikipediainput10 --output wikipediamodel10 \
      --classifierType bayes --dataSource hdfs

On a related note, Joe, were you able to run TrainClassifier without any errors? When I tried this, the map-reduce job would always abort at 99%. I tried the example that was given in the wiki with both subjects and countries. I even reduced the list of countries in country.txt, assuming that was what was causing the issue. No matter what, the classifier task fails. The exception in the task log:

    10-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: bufstart = 41271492; bufend = 58259002; bufvoid = 99614720
    2010-09-14 08:25:27,026 INFO org.apache.hadoop.mapred.MapTask: kvstart = 196379; kvend = 130842; length = 327680
    2010-09-14 08:25:48,136 INFO org.apache.hadoop.mapred.MapTask: Finished spill 287
    2010-09-14 08:25:48,417 INFO org.apache.hadoop.mapred.MapTask: Starting flush of map output
    2010-09-14 08:26:00,386 INFO org.apache.hadoop.mapred.MapTask: Finished spill 288
    2010-09-14 08:26:08,765 WARN org.apache.hadoop.mapred.TaskTracker: Error running child
    org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_201009132133_0002/attempt_201009132133_0002_m_000001_3/output/file.out
            at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
            at org.apache.hadoop.fs.LocalDirAllocator.getLocalPathForWrite(LocalDirAllocator.java:124)
            at org.apache.hadoop.mapred.MapOutputFile.getOutputFileForWrite(MapOutputFile.java:61)
            at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1469)
            at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1154)
            at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:359)
            at org.apache.hadoop.mapred.MapTask.run(MapTask.java:307)
            at org.apache.hadoop.mapred.Child.main(Child.java:170)

I checked the Hadoop JIRA and this seems to be fixed already: https://issues.apache.org/jira/browse/HADOOP-4963. I am not sure what I am doing wrong. Any suggestions on what I need to change to get this fixed would be very helpful. I have been struggling with this for a while now.

Thank you
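As a rough aid for reading the log above, the spill size can be estimated from the buffer offsets. This is a back-of-the-envelope sketch, not Hadoop code. It assumes that bufstart/bufend and kvstart/kvend are positions in circular buffers of size bufvoid and length, that spill numbering starts at 0 (so "Finished spill 288" means 289 spills), and that every spill is about as large as the one logged. The class and method names are hypothetical. The stack trace shows the failure inside mergeParts, which asks for a local path for the merged file.out, so the estimate gives an idea of how much extra local space that merge may need while the spill files still exist.

```java
// Hypothetical helper, not part of Hadoop: estimates map-side spill volume
// from the values printed in the MapTask log lines.
final class SpillEstimate {
    // Distance from start to end in a circular buffer with the given capacity.
    static long circularSpan(long start, long end, long capacity) {
        return Math.floorMod(end - start, capacity);
    }

    // Very rough total, assuming every spill is as large as the sampled one.
    static long estimateTotalBytes(int spillCount, long bytesPerSpill) {
        return spillCount * bytesPerSpill;
    }

    public static void main(String[] args) {
        // Values copied from the task log quoted in this thread.
        long bytesInSpill = circularSpan(41271492L, 58259002L, 99614720L);
        long recordsInSpill = circularSpan(196379L, 130842L, 327680L);
        int spillCount = 289; // spills 0..288, assuming numbering starts at 0

        long totalBytes = estimateTotalBytes(spillCount, bytesInSpill);
        System.out.println("bytes in sampled spill:   " + bytesInSpill);
        System.out.println("records in sampled spill: " + recordsInSpill);
        System.out.println("estimated spilled bytes:  " + totalBytes);
    }
}
```

On-disk spill files can differ from the in-memory byte count, and a single log sample may not be representative, so the result is an order-of-magnitude figure for one map attempt rather than a measurement.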
Gangadhar Nittala 2010-09-16, 01:41
-
Re: Options in TrainClassifier.java Joe Kumar 2010-09-16, 02:20
Hi Gangadhar,

Right. I did the same to execute TrainClassifier, but since the default datasource is hdfs, we should not be required to provide this parameter.
I haven't finished running TrainClassifier yet. I'll do it tonight and let you know if I run into trouble.

reg,
Joe.
Joe Kumar 2010-09-16, 02:20
-
Re: Options in TrainClassifier.java Joe Kumar 2010-09-17, 03:34
Gangadhar,

After some system issues, I finally ran TrainClassifier. Almost 65% into the map job, I got the same error that you mentioned:

    INFO mapred.JobClient: Task Id : attempt_201009160819_0002_m_000000_0, Status : FAILED
    org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find any valid local directory for taskTracker/jobcache/job_201009160819_0002/attempt_201009160819_0002_m_000000_0/output/file.out
            at org.apache.hadoop.fs.LocalDirAllocator$AllocatorPerContext.getLocalPathForWrite(LocalDirAllocator.java:343)
    ...

I haven't yet analyzed the root cause or a solution, but I wanted to confirm that I am facing the same issue as you. I'll try to search / analyze and post more details.

reg,
Joe.
Joe Kumar 2010-09-17, 03:34
-
Re: Options in TrainClassifier.java Gangadhar Nittala 2010-09-18, 00:36
Thank you, Joe, for the confirmation. I am also checking the code to see what is causing this issue. Maybe others on the list will know what can cause it. I am guessing the root cause is not Mahout but something in Hadoop.
Gangadhar Nittala 2010-09-18, 00:36
-
Re: Options in TrainClassifier.java Joe Kumar 2010-09-18, 03:33
Gangadhar,

I couldn't find any concrete reason behind this error. Some people have reported that it happens only sporadically. Following some suggestions in this thread (http://www.mail-archive.com/[EMAIL PROTECTED]/msg09250.html), I have changed the location of the Hadoop tmp dir. I have also cleaned up some space on my laptop (it now has 81GB free) and have started the job again. I'm trying to see if freeing up space helps. I'll post any progress.

Has anyone else faced similar issues? I would appreciate feedback / thoughts.

reg
Joe.
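The exception message ("Could not find any valid local directory") suggests that no configured local directory could hold the file being written. A minimal sketch of that kind of check follows. It is an illustration only and not Hadoop's LocalDirAllocator code: the class name, the method name, and the 5 GB requirement used in main are all hypothetical. It can be pointed at the directory used as the Hadoop tmp dir to see whether the free space there is plausibly enough.

```java
import java.io.File;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not Hadoop code: picks the first candidate directory
// that exists, is writable, and reports enough usable space.
final class LocalDirCheck {
    static Optional<File> firstDirWithSpace(List<File> dirs, long requiredBytes) {
        for (File dir : dirs) {
            if (dir.isDirectory() && dir.canWrite() && dir.getUsableSpace() >= requiredBytes) {
                return Optional.of(dir);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Candidate directories: here only the JVM temp dir; substitute the
        // directories your Hadoop installation actually uses.
        List<File> dirs = Arrays.asList(new File(System.getProperty("java.io.tmpdir")));
        long requiredBytes = 5L * 1024 * 1024 * 1024; // hypothetical requirement: 5 GB

        String result = firstDirWithSpace(dirs, requiredBytes)
                .map(File::getPath)
                .orElse("no valid local directory");
        System.out.println(result);
    }
}
```

Free space measured before the job starts can be misleading, because the space is consumed while the map task runs; watching it during the job gives a better picture.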
Joe Kumar 2010-09-18, 03:33
-
Re: Options in TrainClassifier.java Gangadhar Nittala 2010-09-18, 16:36
Joe,

I don't think disk space is the problem, because I did have enough disk space (well, not 81GB, but around 40GB free). I will check whether the suggestions in the thread you mentioned make any difference. Will keep you posted.

Thank you
Gangadhar Nittala 2010-09-18, 16:36
|