I am kicking off this discussion on how we are going to integrate RF, NB, CNB,
WINNOW, SGD, and SVM. Phew!
Since I wrote NB and CNB, I will list (in subsequent emails, in blocks)
what my assumptions were, how it integrates with HDFS and HBase, and how
training, testing, online and batch classification are done.
From what I think right now:
- Apart from SGD, everything else is a batch trainer. SGD is the only pure
online trainer cum classifier.
- NB/CNB was designed as a binary feature classifier (the paper applied it
to text) and does multi-label classification as a simple score
- NB/CNB uses only tokens as features, so there is no need to convert text
features to integers back and forth. Randomizers remove this limitation, if
we go ahead with only that.
- NB/CNB uses the tokens as the row and column byte when looking up in the
- SVM and others use threshold values for each feature to decide the cutting
- RF uses vectors to store labels and does not use the multi label vector at
- Interfaces (what are they going to look like?)
- Classifier (binary) (multi label classification)
- Ensemble - bagging boosting ?
- What is the basic storage interface everyone should use? Matrix? Then we
can have an HDFS-backed matrix, an HBase-backed matrix, and an in-memory matrix.
- If the basic storage could be different (I mean, a decision tree is not a
matrix), what is the fixed input/output format?
- How can we extend the test setup, such as the confusion matrix, to capture
info from all classifiers?
- If we make some assumptions now, what will we do when classifiers like
HMM and CRF come into the picture? They need more than just vectors; they
also need the order of features.
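
To make the interface and test-setup questions above a bit more concrete, here
is a rough Java sketch of one possible shape. It is only a starting point for
discussion, not a proposal for the final API. Every name in it (ClassifierSketch,
Classifier, OnlineLearner, TokenCountClassifier, ConfusionMatrix, evaluate) is
hypothetical. The toy classifier just counts tokens per label; it is not NB or
CNB. The instance type is left generic on the assumption that this leaves room
for ordered inputs such as the sequences HMM or CRF would need.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class ClassifierSketch {

    /** Hypothetical interface: one score per known label for an instance of type I. */
    interface Classifier<I> {
        Map<String, Double> score(I instance);

        /** Picks the highest-scoring label, or null if no label is known yet. */
        default String classify(I instance) {
            String best = null;
            double bestScore = Double.NEGATIVE_INFINITY;
            for (Map.Entry<String, Double> e : score(instance).entrySet()) {
                if (e.getValue() > bestScore) {
                    best = e.getKey();
                    bestScore = e.getValue();
                }
            }
            return best;
        }
    }

    /** Hypothetical interface: a classifier that also learns one example at a time. */
    interface OnlineLearner<I> extends Classifier<I> {
        void train(String label, I instance);
    }

    /** Toy stand-in (not NB/CNB): uses tokens directly as features, no integer ids. */
    static final class TokenCountClassifier implements OnlineLearner<List<String>> {
        private final Map<String, Map<String, Integer>> countsByLabel = new HashMap<>();

        @Override
        public void train(String label, List<String> tokens) {
            Map<String, Integer> counts =
                countsByLabel.computeIfAbsent(label, k -> new HashMap<>());
            for (String t : tokens) {
                counts.merge(t, 1, Integer::sum);
            }
        }

        @Override
        public Map<String, Double> score(List<String> tokens) {
            Map<String, Double> scores = new HashMap<>();
            for (Map.Entry<String, Map<String, Integer>> e : countsByLabel.entrySet()) {
                double s = 0;
                for (String t : tokens) {
                    s += e.getValue().getOrDefault(t, 0);
                }
                scores.put(e.getKey(), s);
            }
            return scores;
        }
    }

    /** Confusion matrix keyed by label strings, so it does not care which classifier ran. */
    static final class ConfusionMatrix {
        private final Map<String, Map<String, Integer>> counts = new HashMap<>();

        void add(String actual, String predicted) {
            counts.computeIfAbsent(actual, k -> new HashMap<>()).merge(predicted, 1, Integer::sum);
        }

        int count(String actual, String predicted) {
            return counts.getOrDefault(actual, Collections.emptyMap()).getOrDefault(predicted, 0);
        }

        double accuracy() {
            int correct = 0;
            int total = 0;
            for (Map.Entry<String, Map<String, Integer>> row : counts.entrySet()) {
                for (Map.Entry<String, Integer> cell : row.getValue().entrySet()) {
                    total += cell.getValue();
                    if (row.getKey().equals(cell.getKey())) {
                        correct += cell.getValue();
                    }
                }
            }
            return total == 0 ? 0.0 : (double) correct / total;
        }
    }

    /** Runs any Classifier over labelled instances and fills a confusion matrix. */
    static <I> ConfusionMatrix evaluate(Classifier<I> c, List<String> actual, List<I> instances) {
        ConfusionMatrix cm = new ConfusionMatrix();
        for (int i = 0; i < instances.size(); i++) {
            cm.add(actual.get(i), c.classify(instances.get(i)));
        }
        return cm;
    }

    public static void main(String[] args) {
        TokenCountClassifier c = new TokenCountClassifier();
        c.train("sports", Arrays.asList("ball", "goal"));
        c.train("tech", Arrays.asList("code", "bug"));
        ConfusionMatrix cm = evaluate(c, Arrays.asList("sports", "tech"),
            Arrays.asList(Arrays.asList("ball"), Arrays.asList("code", "bug")));
        System.out.println("accuracy = " + cm.accuracy());
    }
}
```

A batch trainer would implement only Classifier and be built by a separate
training job, while SGD would implement OnlineLearner. Whether the backing
storage is a Matrix or something else is left open here on purpose.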