Classifier Architecture
Robin Anil 2010-03-01, 21:33
I am kicking off this discussion on how we are going to integrate RF, NB, CNB, WINNOW, SGD, SVM. Phew!
Since I wrote NB and CNB, I will list (in subsequent emails, in blocks) what my assumptions were, how it integrates with HDFS and HBase, and how training, testing, online and batch classification are done.
From what I think right now:
- Apart from SGD, everything else is a batch trainer. SGD is the only pure online trainer cum classifier.
- NB/CNB was designed as a binary feature classifier (the paper it is based on dealt with text) and does multi-label classification as a simple score comparison.
- NB/CNB uses only tokens as features, so there is no need to convert text features to integers and back. Randomizers remove this limitation, if we go ahead with only that.
- NB/CNB uses the tokens as the row and column bytes when looking up in the HBase table.
- SVM and others use threshold values for each feature to decide the cutting plane.
- RF uses vectors to store labels and does not use the multi-label vector at the moment.

Questions:
- Interfaces (what are they going to look like?)
  - Trainer
  - Classifier (binary) (multi-label classification)
  - Test
- Ensemble: bagging, boosting?
- What is the basic storage interface everyone should use? Matrix? Then we can have an HDFS-backed matrix, an HBase-backed matrix, and an in-memory matrix.
- If basic storage could be different (a decision tree is not a matrix, for example), what is the fixed input/output format?
- How can we extend the test setup, such as the confusion matrix, to capture info from all classifiers?
- If we make some assumptions now, what will we do when classifiers like HMM and CRF come into the picture? They need more than just vectors; they also need the order of features.

Robin
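To make the "tokens as features" and "simple score comparison" points concrete, here is a rough sketch. It is not the Mahout NB/CNB code: it is plain multinomial Naive Bayes with add-one smoothing and no class priors, and the names `train_counts`, `score` and `classify` are made up for illustration. The nested dictionary keyed by label and raw token only loosely mirrors the row/column lookup described above.

```python
import math
from collections import defaultdict


def train_counts(labeled_docs):
    """Count token occurrences per label; keys stay as raw token strings."""
    counts = defaultdict(lambda: defaultdict(int))  # label -> token -> count
    for label, tokens in labeled_docs:
        for token in tokens:
            counts[label][token] += 1
    return counts


def score(counts, label, tokens, vocab_size, alpha=1.0):
    """Sum of smoothed log-likelihoods of the tokens under one label."""
    total = sum(counts[label].values())
    return sum(
        math.log((counts[label].get(token, 0) + alpha) / (total + alpha * vocab_size))
        for token in tokens
    )


def classify(counts, tokens):
    """Score every label and return the one with the highest score."""
    vocab = {t for per_label in counts.values() for t in per_label}
    return max(counts, key=lambda label: score(counts, label, tokens, len(vocab)))
```

No token is ever mapped to an integer index here; the price is that the model store has to support string-keyed lookups.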
Re: Classifier Architecture
Ted Dunning 2010-03-01, 22:18
On Mon, Mar 1, 2010 at 1:33 PM, Robin Anil <[EMAIL PROTECTED]> wrote:
> I am kicking this discussion on how we are going to integrate RF, NB, CNB,
> WINNOW, SGD, SVM Phew!.
Great. Architecture is good. Especially after we have several examples.

> From what I think right now
> - Apart from SGD everything else a batch trainer. SGD is the only pure
> online trainer cum classifier.
And even SGD is currently written so as to be used in a batch setting. (Actually, Pegasos is online as well, but written in batch style like SGD.)

> Questions
> - Interfaces (how are they going to look like)
>   - Trainer
>   - Classifier (binary) (multi label classification)
>   - Test
We need command line and some day some kind of workflow interfaces. The batch orientation that we have right now is probably just fine for 99% of all applications.

> - Ensemble - bagging boosting ?
See random forests.
But really, let's see if people come up with a need. Bagging and boosting can be good ways to mitigate over-fitting problems, but let's see if Pegasos and SGD solve those problems for us.
> - What is the basic storage interface everyone should use Matrix? Then we
> can have hdfs backed matrix, hbase backed matrix, inmemory matrix
Matrix, yes.
But also I think that allowing Drew's avro document format with a randomizer (or field list for NB) specification would also be good.
Lucene index + randomizer would also be useful.
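The thread does not define "randomizer". Assuming it means hashing each token into a few slots of a fixed-width vector (often called feature hashing), a minimal sketch might look like the following. The function name `hashed_vector` and the parameters `num_features` and `probes` are hypothetical and not taken from any existing class.

```python
import hashlib


def hashed_vector(tokens, num_features=16, probes=2):
    """Map tokens to a fixed-width vector by hashing each token `probes` times."""
    vec = [0.0] * num_features
    for token in tokens:
        for probe in range(probes):
            # A stable hash, so the same token always lands in the same slots.
            digest = hashlib.md5(f"{probe}:{token}".encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % num_features
            vec[index] += 1.0
    return vec
```

Under that assumption, the same function could sit in front of a document record or an index reader, so the learner only ever sees vectors of a known width and no dictionary has to be built first.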
> - If basic storage could be different i mean decision tree is not a matrix,
> what is the fixed input output format.
Input formats are much more easily standardized. The only common characteristic I can think of for all output formats is that there should be a way to use the persistent output of classifier training to generate a model that can classify more inputs. Maybe there should be some vague requirement that it be possible to produce a more or less human readable representation.
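As a sketch of that minimal contract (persist the training output, reload it, classify new inputs, and keep the persisted form more or less readable), here is a toy nearest-centroid model that round-trips through JSON. The class `CentroidModel` and its methods are invented for illustration; the choice of JSON is an assumption, and any readable format would serve.

```python
import json


class CentroidModel:
    """Toy nearest-centroid classifier; stands in for any trained model."""

    def __init__(self, centroids):
        self.centroids = centroids  # label -> list of floats

    @classmethod
    def train(cls, examples):
        """Average the vectors seen for each label."""
        sums, counts = {}, {}
        for label, vec in examples:
            acc = sums.setdefault(label, [0.0] * len(vec))
            for i, x in enumerate(vec):
                acc[i] += x
            counts[label] = counts.get(label, 0) + 1
        return cls({l: [x / counts[l] for x in s] for l, s in sums.items()})

    def classify(self, vec):
        """Return the label whose centroid is closest in squared distance."""
        def dist(label):
            return sum((a - b) ** 2 for a, b in zip(self.centroids[label], vec))
        return min(self.centroids, key=dist)

    def dumps(self):
        """Persistent output of training, as indented, readable JSON."""
        return json.dumps({"centroids": self.centroids}, indent=2, sort_keys=True)

    @classmethod
    def loads(cls, text):
        """Rebuild a usable model from the persisted output."""
        return cls(json.loads(text)["centroids"])
```

Nothing outside the class needs to know what the persisted form contains, which is about as much as can be required of a decision tree and a weight matrix alike.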
Other than those very generic and vague requirements, I can't see what we can say about the output of a classifier. If anybody starts pushing for PMML output, that might be a nice way to meet the "more or less human readable" aspect.

> - How can we extend the test setup like Confusion matrix to capture info
> from all classifiers
If you can read a model from disk and apply new inputs, then it should be possible to generalize the evaluation process.

> - If we make some assumptions now what will we do when classifiers like
> HMM, CRF come into the picture. they need more than just vectors but also
> the order of features.
Drew's document format becomes very important there.
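Returning to the evaluation point earlier in this reply: if every loaded model can be reduced to a function from input to label, one evaluator can serve all classifiers. A rough sketch, with hypothetical names `confusion_matrix` and `accuracy`:

```python
from collections import Counter


def confusion_matrix(classify, labeled_inputs):
    """Count (actual, predicted) pairs for any callable that returns a label."""
    matrix = Counter()
    for actual, features in labeled_inputs:
        matrix[(actual, classify(features))] += 1
    return matrix


def accuracy(matrix):
    """Fraction of inputs whose predicted label matches the actual label."""
    total = sum(matrix.values())
    correct = sum(n for (actual, predicted), n in matrix.items() if actual == predicted)
    return correct / total if total else 0.0
```

The evaluator never inspects the model, so `features` could be a vector, a token list, or an ordered sequence for something like an HMM or CRF.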