Re: Seeking classification advice
Jake Mannix 2012-06-08, 13:37
On Fri, Jun 8, 2012 at 12:22 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> That doesn't really do good multi-labeling.
>
> Also, with 100 categories, most classifiers begin to have some issues.
>

Hmm... we've been using multinomial LR on label sets of 100-200 categories at Twitter and having pretty good success - you have to recalibrate what it means to have good accuracy at this scale (i.e. 40-50% is actually *great*), but e.g. we do consistently *better* than inter-annotator agreement (think of how hard it is for a human to put things into one of 100+ buckets, esp. if there is any ambiguity in the label definitions).

Once you get this many categories, it seems to me that you kinda *have to* have your results not be a single category, but instead a (weighted) collection of the top results, and interpret it as "most likely your category is one of these". From there it's not a far jump to say that the way you put the labels on in the first place was relatively uncertain.

Alternatively, instead of LR, if you have training data which really does have a lot of multiply-labeled documents, then going with Labeled LDA <http://dl.acm.org/citation.cfm?id=1699543> is possibly better. This can be done (basically) in the LDA code we have in Mahout, if you make some slight modifications (I'll have to check how far behind trunk is from Twitter's fork - it may be that doing this on trunk is hard right now).

>
> On Fri, Jun 8, 2012 at 3:21 AM, Jake Mannix <[EMAIL PROTECTED]> wrote:
>
> > Wait, what's wrong with using the usual SGD for multinomial LR and picking
> > the top couple of classes by probability, if several are close in size?
> >
> > On Thu, Jun 7, 2012 at 3:46 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> >
> > > There are a variety of methods to use here.  I would recommend you try a
> > > variety of them to decide on your best approach.
> > >
> > > 1) first, count all the combinations of labels.  If there are not that
> > > many, you may just want to consider each combination a separate category.
> > > Another option is to separate the categories into independent sets of
> > > disjoint categories.
> > >
> > > 2) second, try to determine which categories are most confusable or
> > > similar.  One way to do this is to simply build a NB or CNB classifier and
> > > look at the confusion matrix to see which errors get made.  Those groups of
> > > categories that get confused are candidates for a super category.
> > >
> > > 3) now start building your classifiers.  You might try various tree
> > > structures of categories, including a flat tree, one built up of confusable
> > > classes, and one built up based on your intuitions.  At each level, build
> > > either a binary classifier per category or a 1 of n classifier if no two
> > > categories at that level ever are tagged.  You may want to build a
> > > secondary binary classifier for each category whose inputs are the outputs
> > > from each of the first level categorizers.
> > >
> > > 4) tune and adjust.  tune and adjust.
> > >
> > > On Fri, Jun 8, 2012 at 12:31 AM, David Engel <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I've been dabbling with Mahout off and on for a few months preparing
> > > > for a classification project.  It's now time to stop experimenting and
> > > > do something for real.  I've picked up a lot of things from following
> > > > this list, but would like some advice regarding a few things before
> > > > proceeding.  I'll start with a very brief description of the project
> > > > and then follow up with some questions.
> > > >
> > > > We need to classify potentially millions of documents into about 100
> > > > or so categories.  Most documents will probably only belong to 1
> > > > category, but some will belong to several.  It's also possible for
> > > > some documents to not belong to any of the chosen categories.
> > > >
> > > > As noted, we need to handle the case where a document belongs to
> > > > multiple categories.
> > > > My understanding is the classification

-jake
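
The idea in the reply above, of returning a weighted collection of the top classes from a multinomial LR model instead of a single label, could look roughly like the sketch below. This is not Mahout code: `softmax`, `top_labels`, `max_labels`, and `min_ratio` are hypothetical names chosen for illustration, and the per-class scores are assumed to come from whatever trained model you already have.

```python
import math


def softmax(scores):
    """Turn raw per-class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_labels(scores, labels, max_labels=3, min_ratio=0.5):
    """Return (label, probability) pairs for the most likely classes.

    The best class is always kept. Further classes are kept (up to
    max_labels in total) only if their probability is at least
    min_ratio times the best probability, i.e. "close in size".
    Both thresholds are illustrative assumptions, not tuned values.
    """
    probs = softmax(scores)
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    best_prob = ranked[0][1]
    return [(label, p) for label, p in ranked[:max_labels]
            if p >= min_ratio * best_prob]


if __name__ == "__main__":
    labels = ["sports", "politics", "tech", "arts"]
    # Two classes score almost the same, so both are reported.
    for label, p in top_labels([2.0, 1.9, 0.1, -1.0], labels):
        print(f"{label}: {p:.3f}")
```

A ratio against the best class, rather than a fixed probability cutoff, is used here because with 100+ categories even the winning class may have a small absolute probability.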