-
PLEASE HELP! - MAHOUT CLASSIFICATION
Sam Cunningham 2011-12-09, 19:57
I really need help. I am working on a project in which a cron job collects RSS feeds from news sites (Reuters and Associated Press). I need to classify these news items by their content, just like the 20news example. The categories are business, entertainment, health, politics, scitech, and sports. I use half of the data for training and the other half for testing. Attached, please find the training, testing, and model files in compressed form. As you will see, when I test the model I get extremely good results for some topics (business, sports, and entertainment) but really bad results (almost 0%) for others (health, scitech, and politics). What's wrong?
What is more interesting is that I also get really bad results for the "health" topic when I test the classifier against the training data, the very dataset used to create the model. This is strange.
Please help.
Thank you,
Sam
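
A quick sanity check for a symptom like the one described above is to count how many documents each category has and to score accuracy per category rather than overall. The sketch below is not part of Mahout. It assumes a hypothetical input layout of one document per line in the form `label<TAB>text`, and the function names `count_labels` and `per_class_recall` are made up for illustration.

```python
from collections import Counter


def count_labels(lines):
    """Count documents per label.

    Assumption: each line looks like 'label<TAB>text'. Lines without a
    tab or with an empty label are skipped rather than counted.
    """
    counts = Counter()
    for line in lines:
        label, sep, _ = line.partition("\t")
        if sep and label.strip():
            counts[label.strip()] += 1
    return counts


def per_class_recall(pairs):
    """Return {label: fraction of that label's documents predicted correctly}.

    pairs is an iterable of (true_label, predicted_label) tuples.
    """
    total = Counter()
    correct = Counter()
    for truth, predicted in pairs:
        total[truth] += 1
        if truth == predicted:
            correct[truth] += 1
    return {label: correct[label] / total[label] for label in total}
```

If one category has far fewer documents than the others, or if its recall is near zero even on the training set, that points at the data or the label handling rather than at the held-out split.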
-
Re: PLEASE HELP! - MAHOUT CLASSIFICATION
Suneel Marthi 2011-12-09, 19:59
Which classifier are you running?
-
Re: PLEASE HELP! - MAHOUT CLASSIFICATION
Dmitriy Lyubimov 2011-12-09, 21:07
Sam, the list doesn't allow attachments.
-
Re: PLEASE HELP! - MAHOUT CLASSIFICATION
Sam Cunningham 2011-12-09, 23:37
Suneel Marthi <suneel_marthi <at> yahoo.com> writes:
> Which classifier are you running?

Hi Suneel,

I am running cbayes. Here are the command options for the trainer:
$MAHOUT_HOME/bin/mahout trainclassifier -i /user/sayhan/articles-train -o /user/sayhan/articles-model -type cbayes -ng 1 -source hdfs
I am running cbayes for testing the classifier as well.
Sam
-
Re: PLEASE HELP! - MAHOUT CLASSIFICATION
Sam Cunningham 2011-12-09, 23:40
Dmitriy Lyubimov <dlieu.7 <at> gmail.com> writes:
> Sam, the list doesn't allow attachments.

Hi Dmitriy,

Here is the link to the attachments, along with the same message content. Please let me know if you can't get the attachments.

Thank you for your help,

http://lucene.472066.n3.nabble.com/PLEASE-HELP-MAHOUT-CLASSIFICATION-td3573905.html

Sam
-
Re: PLEASE HELP! - MAHOUT CLASSIFICATION
Suneel Marthi 2011-12-09, 23:43
Hi Sam,
I am assuming that you are running the latest code from the Mahout 0.6 trunk.

Did you try running your dataset through the SGD classifier for both training and testing?

Suneel
-
Re: PLEASE HELP! - MAHOUT CLASSIFICATION
Sam Cunningham 2011-12-10, 03:03
Suneel Marthi <suneel_marthi <at> yahoo.com> writes:
> Hi Sam,
>
> I am assuming that you are running the latest code from the Mahout 0.6 trunk.
>
> Did you try running your dataset through SGD classifier for both training and testing?
>
> Suneel
Suneel,
I am running the Mahout v0.5 distribution, though I am not sure what difference that would make. I have run my dataset with bayes/cbayes only. I don't have any sample code for SGD or its command-line options. Is there an SGD example for the 20news dataset that I can follow (for training and testing)?
Sam
-
Re: PLEASE HELP! - MAHOUT CLASSIFICATION
Ted Dunning 2011-12-10, 08:20
a) run with trunk
b) see https://github.com/tdunning/Chapter-16
c) also see org.apache.mahout.classifier.sgd.TrainNewsGroups

Your training data is tiny. The bayes classifiers are designed for large data. Poor results are not very surprising at this data size.
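
For readers unfamiliar with the SGD approach suggested here, the following is a minimal sketch of multinomial logistic regression trained by stochastic gradient descent over hashed word features. It only illustrates the general technique and is not Mahout's implementation or the code in TrainNewsGroups. The function names, the feature count, the learning rate, and the use of CRC32 as the hash are all assumptions chosen for the example.

```python
import math
import random
import zlib


def hash_features(text, num_features=1024):
    """Map whitespace-separated tokens to a sparse {index: count} vector."""
    vec = {}
    for token in text.lower().split():
        # crc32 is used because it is stable across runs (illustrative choice).
        idx = zlib.crc32(token.encode("utf-8")) % num_features
        vec[idx] = vec.get(idx, 0.0) + 1.0
    return vec


def softmax(scores):
    """Turn raw scores into probabilities, shifted by the max for stability."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def train_sgd(docs, labels, num_features=1024, epochs=20, rate=0.1, seed=0):
    """Train one weight vector per label.

    docs is a list of (label, text) pairs; labels is the list of categories.
    """
    weights = [[0.0] * num_features for _ in labels]
    rng = random.Random(seed)
    examples = [(labels.index(lbl), hash_features(txt, num_features))
                for lbl, txt in docs]
    for _ in range(epochs):
        rng.shuffle(examples)
        for target, vec in examples:
            scores = [sum(w[i] * v for i, v in vec.items()) for w in weights]
            probs = softmax(scores)
            for k, w in enumerate(weights):
                # Gradient of the log loss: predicted probability minus target.
                grad = probs[k] - (1.0 if k == target else 0.0)
                for i, v in vec.items():
                    w[i] -= rate * grad * v
    return weights


def classify(text, weights, labels, num_features=1024):
    """Return the label whose weight vector scores the text highest."""
    vec = hash_features(text, num_features)
    scores = [sum(w[i] * v for i, v in vec.items()) for w in weights]
    return labels[scores.index(max(scores))]
```

Because the model is updated one document at a time, this style of training can make several passes over a small dataset, which is one reason it is often suggested when there is little training data.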
-
Re: PLEASE HELP! - MAHOUT CLASSIFICATION
Suneel Marthi 2011-12-10, 17:12
Sam,

Per Ted's email, please run with the trunk for your work.

Please look at Chapters 13 - 16 in the Mahout in Action book for sample code snippets for classifying 20 newsgroups with SGD.

There is presently no command-line option (none that I am aware of, though I could be wrong) for running the 20 newsgroups example with SGD. The only command-line tools for SGD, trainlogistic and runlogistic, expect the input files to be in CSV format, which is not what you have.

I have a sample program for classifying datasets (similar to the format you have) using SGD, which I can share with you later today.

Regards,
Suneel
-
Re: PLEASE HELP! - MAHOUT CLASSIFICATION
Suneel Marthi 2011-12-10, 17:13
Sorry, I stand corrected on the SGD command-line tools; please look at TrainNewsGroups as Ted suggests.