Mahout 0.6 Naive Bayes Accuracy
Dimitri Goldin 2012-03-27, 16:26
We were evaluating Mahout 0.6's Naive Bayes implementation using a
training set of 70,000 documents (we know that with this number of
documents distributed training does not yet make much sense).
During the tests we noticed that accuracy is around 80% on the
20newsgroups data, which is quite balanced (in the sense that
there are approximately the same number of documents per class). Most
documents tended to be classified as the class with the largest number
of documents.
Using our own data we only achieved an accuracy between ~35% and ~55%,
depending on the class sizes within the test sets.
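The bias toward the largest class is what one would expect from the class-prior term in Naive Bayes: with skewed training data, log P(c) gives the majority class a head start that sparse word evidence often cannot overcome. Below is a minimal multinomial Naive Bayes sketch in Python with made-up toy data (this is an illustration of the general effect, not Mahout's or Mallet's actual implementation):

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """Train multinomial Naive Bayes with Laplace smoothing.
    docs: list of (label, tokens) pairs."""
    class_counts = Counter(label for label, _ in docs)
    word_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in docs:
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total = sum(class_counts.values())
    model = {}
    for c in class_counts:
        prior = math.log(class_counts[c] / total)
        denom = sum(word_counts[c].values()) + len(vocab)
        likelihood = {w: math.log((word_counts[c][w] + 1) / denom)
                      for w in vocab}
        unseen = math.log(1 / denom)  # log-likelihood for unseen words
        model[c] = (prior, likelihood, unseen)
    return model

def classify(model, tokens):
    """Return the class with the highest log posterior score."""
    best, best_score = None, float("-inf")
    for c, (prior, likelihood, unseen) in model.items():
        score = prior + sum(likelihood.get(w, unseen) for w in tokens)
        if score > best_score:
            best, best_score = c, score
    return best

# Hypothetical 9:1 imbalanced training set; "common" occurs in both classes.
docs = [("a", ["x", "common"])] * 9 + [("b", ["y", "common"])]
model = train_nb(docs)

# A document containing only the shared word is pulled to the
# majority class by the log-prior:
print(classify(model, ["common"]))  # -> "a"
```

With enough class-specific evidence the likelihoods do win out (e.g. `classify(model, ["y", "y", "y"])` returns `"b"` here), which is consistent with accuracy improving once the training data is balanced.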
We also tested replacing the tokenization, which is currently performed
on tabs and spaces using Guava's Splitter class, with Lucene's
GermanAnalyzer. This gave us around 10 percentage points more accuracy
with balanced training data, resulting in ~89% accuracy.
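The gain is plausible because splitting only on tabs and spaces leaves case and punctuation attached to tokens, so surface variants of the same word are counted as distinct features. A crude Python sketch of the difference (the second function is only a stand-in for an analyzer; Lucene's GermanAnalyzer additionally does stopword removal and German stemming, which this omits):

```python
import re

def whitespace_tokens(text):
    # Roughly what splitting on tabs and spaces yields:
    # case and punctuation stay attached to the tokens.
    return [t for t in re.split(r"[ \t]+", text) if t]

def normalized_tokens(text):
    # Crude normalization: lowercase and keep only letter runs,
    # so "Haus," "HAUS" and "haus." all become the feature "haus".
    return re.findall(r"[a-zäöüß]+", text.lower())

text = "Das Haus,\tdas HAUS und das haus."
print(whitespace_tokens(text))
# -> ['Das', 'Haus,', 'das', 'HAUS', 'und', 'das', 'haus.']
print(normalized_tokens(text))
# -> ['das', 'haus', 'das', 'haus', 'und', 'das', 'haus']
```

With raw whitespace tokens, the three occurrences of "Haus" land in three different feature counts; after normalization they reinforce a single feature, which matters a lot for sparse per-class counts.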
Having tried Mallet's Naive Bayes implementation, we achieved ~95%
accuracy without having to balance the training data. Does anybody know
which implementation detail might cause this, or why balance seems to
influence Mahout's implementation so much more?
I also found the following thread from fall 2011, which seems to
describe a similar problem:
Unfortunately there was no follow-up to this, but maybe someone already
has an idea.
Thanks in advance,
Isabel Drost 2012-03-28, 07:10
Dimitri Goldin 2012-03-29, 10:28