-
Tags generation? Lance Norskog 2012-08-03, 07:35
I'm looking for a good tag generator. A function from a document/term matrix onto a term list is a good bet, since it analyzes the interplay of documents and terms. I have an LSA implementation for grinding on document/term matrices. It is very effective but seems like overkill. Is there a simpler function from a document/term matrix onto a term list? Maybe the mean tf-idf or log-entropy?
-- Lance Norskog [EMAIL PROTECTED]
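A minimal sketch of the "mean tf-idf" idea raised in the question, assuming raw counts for tf and idf = log(N / df). The function name `mean_tfidf_terms` and the toy corpus are illustrative only and not taken from any library.

```python
import math
from collections import Counter

def mean_tfidf_terms(docs, top_n=5):
    """Rank terms by their mean tf-idf over all documents (hypothetical helper)."""
    n_docs = len(docs)
    counts = [Counter(doc) for doc in docs]
    # Document frequency: number of documents that contain each term.
    df = Counter(term for c in counts for term in c)
    totals = Counter()
    for c in counts:
        for term, tf in c.items():
            totals[term] += tf * math.log(n_docs / df[term])
    means = {term: total / n_docs for term, total in totals.items()}
    # Highest mean score first; ties are broken alphabetically.
    return sorted(means, key=lambda t: (-means[t], t))[:top_n]

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "the cat chased the dog".split(),
]
print(mean_tfidf_terms(docs, top_n=3))
```

A term that occurs in every document, such as "the" here, gets an idf of zero and falls to the bottom of the ranking.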
-
Re: Tags generation? Ted Dunning 2012-08-03, 07:43
tf-idf is a good approximation of the log-likelihood ratio (LLR) score for many applications and often gives useful signatures, although not always super pretty ones. It helps to set an overall minimum document frequency for terms to be considered as tags. This is the same as an IDF maximum.
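The equivalence between a minimum document frequency and an IDF maximum can be sketched as follows, assuming idf = log(N / df). The helper `top_tags` and its parameters `min_df` and `k` are hypothetical names chosen for this illustration.

```python
import math
from collections import Counter

def top_tags(doc, docs, min_df=2, k=3):
    """Top-k tf-idf terms of `doc`, skipping terms rarer than `min_df` documents."""
    n_docs = len(docs)
    df = Counter(term for d in docs for term in set(d))
    # With idf = log(N / df):  df >= min_df  <=>  idf <= log(N / min_df).
    idf_max = math.log(n_docs / min_df)
    scores = {}
    for term, tf in Counter(doc).items():
        idf = math.log(n_docs / df[term])
        if idf > idf_max:  # too rare to be a useful tag
            continue
        scores[term] = tf * idf
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

docs = [
    "mahout svd matrix factorization".split(),
    "mahout matrix clustering kmeans".split(),
    "lucene analyzer stemming matrix".split(),
    "lucene solr search zzyzx".split(),
]
print(top_tags(docs[0], docs, min_df=2))
```

The document passed as `doc` is assumed to be part of `docs`, so every term has a document frequency of at least one.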
-
Re: Tags generation? Pat Ferrel 2012-08-03, 15:23
We do what Ted describes by tossing frequently used terms with the IDF max, tossing stop words, and stemming with a Lucene analyzer. The stemming makes the tags less readable for sure, but without it the near-duplicate terms make for a strange-looking tag list. With or without stemming, the top TF-IDF terms work rather well for tags.
If you are using tags in a UI, the question becomes: what do you do when a user selects a tag? The classical answer is to search for that term, but if you do that you throw away the vector signature and are doing a single-word search. We are planning to reweight the term vector and use it to do a "MoreLikeThis" Solr search, if we ever get to it...
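One way to picture the reweighting idea is to boost the selected tag inside the document's term vector and turn the result into a boosted query string. This is only a sketch: it emits plain `term^boost` clauses in the Lucene query syntax rather than calling Solr's MoreLikeThis handler, and the field name `text`, the boost factor, and the function `boosted_query` are assumptions. Terms are not escaped.

```python
def boosted_query(term_weights, selected_tag, boost=2.0, field="text", max_terms=5):
    """Build a boosted OR query from a term vector, favoring the selected tag."""
    reweighted = dict(term_weights)
    # Multiply the weight of the tag the user clicked on (hypothetical policy).
    reweighted[selected_tag] = reweighted.get(selected_tag, 1.0) * boost
    top = sorted(reweighted.items(), key=lambda kv: (-kv[1], kv[0]))[:max_terms]
    return " OR ".join(f"{field}:{term}^{weight:.2f}" for term, weight in top)

vector = {"mahout": 0.9, "svd": 0.7, "matrix": 0.4}
print(boosted_query(vector, "svd"))
```

The remaining terms of the vector stay in the query, so the search is still driven by the whole signature rather than by a single word.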
-
Re: Tags generation? Ted Dunning 2012-08-03, 17:29
Unstemming is pretty simple. Just build an unstemming dictionary by recording which word forms have led to each stemmed form. Include frequencies. When unstemming in the context of a document, pick the most popular (corpus-wide) version that actually appears in the document.
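The unstemming dictionary described above might look roughly like this. The stemmer `toy_stem` is a deliberately crude stand-in (an assumption, not the Lucene analyzer mentioned in the thread), and all function names are illustrative.

```python
from collections import Counter, defaultdict

def toy_stem(word):
    """Stand-in stemmer (assumption): lowercases and strips a few suffixes."""
    word = word.lower()
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_unstem_dictionary(corpus, stem=toy_stem):
    """Map each stem to a Counter of the surface forms that produced it."""
    forms = defaultdict(Counter)
    for doc in corpus:
        for word in doc:
            forms[stem(word)][word.lower()] += 1
    return forms

def unstem(stemmed, doc, forms):
    """Pick the corpus-wide most frequent form that actually occurs in `doc`."""
    present = {w.lower() for w in doc}
    candidates = [(f, c) for f, c in forms.get(stemmed, {}).items() if f in present]
    if not candidates:
        return stemmed  # nothing better known; fall back to the stem itself
    return min(candidates, key=lambda fc: (-fc[1], fc[0]))[0]

corpus = [
    "clustering documents with clusters".split(),
    "the cluster centers".split(),
    "clusters of documents and more clusters".split(),
]
forms = build_unstem_dictionary(corpus)
print(unstem("cluster", corpus[0], forms), unstem("cluster", corpus[1], forms))
```

Frequencies are counted over the whole corpus, while the candidate set is restricted to the forms present in the document being tagged.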
-
Re: Tags generation? Dawid Weiss 2012-08-03, 19:05
> Unstemming is pretty simple. Just build an unstemming dictionary based on
> seeing what word forms have led to a stemmed form. Include frequencies.
This can lead to very funny (or not, depending on how you look at it) mistakes when different lemmas stem to the same token. How frequent and important this phenomenon is varies from language to language (and can be calculated a priori).
Dawid
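How often different lemmas collapse onto one stem could be estimated ahead of time from a lemma dictionary, roughly as sketched here. Both lookup tables are tiny hand-written assumptions standing in for a real morphological dictionary and a real stemmer.

```python
from collections import defaultdict

def ambiguous_stem_rate(lemma_of, stem):
    """Return the share of stems shared by more than one lemma, plus the offenders."""
    lemmas_per_stem = defaultdict(set)
    for word, lemma in lemma_of.items():
        lemmas_per_stem[stem(word)].add(lemma)
    ambiguous = {s: sorted(ls) for s, ls in lemmas_per_stem.items() if len(ls) > 1}
    return len(ambiguous) / len(lemmas_per_stem), ambiguous

# Hypothetical word -> lemma and word -> stem tables for illustration.
lemma_of = {
    "universe": "universe",
    "university": "university",
    "universities": "university",
    "cats": "cat",
    "cat": "cat",
}
stem_of = {
    "universe": "univers",
    "university": "univers",
    "universities": "univers",
    "cats": "cat",
    "cat": "cat",
}
rate, ambiguous = ambiguous_stem_rate(lemma_of, stem_of.get)
print(rate, ambiguous)
```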
-
Re: Tags generation? Ted Dunning 2012-08-03, 19:31
This is definitely just the first step. Similar goofs happen with inappropriate stemming. For instance, AIDS should not stem to aid.
A reasonable way to find and classify exceptional cases is to look at cooccurrence statistics. The contexts of the original forms can be examined to find cases where there is a clear semantic mismatch between one original form and the set of all forms that stem to the same form.
But just picking the most common form that is present in the document is a pretty good step, even though it produces some oddities. The results are much better than showing a user the stemmed forms.
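A rough sketch of the cooccurrence check: build a context vector for each surface form and compare it with the pooled contexts of the other forms that share its stem. Cosine similarity, the window size, and the toy sentences are assumptions; the thread does not say which statistic to use.

```python
import math
from collections import Counter

def context_vector(form, sentences, window=2):
    """Count the tokens that appear within `window` positions of `form`."""
    ctx = Counter()
    for sent in sentences:
        for i, tok in enumerate(sent):
            if tok == form:
                lo, hi = max(0, i - window), i + window + 1
                ctx.update(t for j, t in enumerate(sent[lo:hi], start=lo) if j != i)
    return ctx

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def context_agreement(forms, sentences):
    """Similarity of each form's contexts to the pooled contexts of its siblings."""
    vectors = {f: context_vector(f, sentences) for f in forms}
    scores = {}
    for f in forms:
        rest = Counter()
        for g in forms:
            if g != f:
                rest.update(vectors[g])
        scores[f] = cosine(vectors[f], rest)
    return scores

sentences = [
    "the AIDS virus spread".split(),
    "new AIDS virus drug".split(),
    "foreign aid budget cut".split(),
    "hearing aids budget cut".split(),
]
scores = context_agreement(["AIDS", "aid", "aids"], sentences)
print(scores)
```

A form whose score is far below that of its siblings is a candidate exception that should not be merged with them.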
-
Re: Tags generation? Dawid Weiss 2012-08-03, 20:08
I know, I know. :) Just wanted to mention that it could lead to funny results, that's all. There are lots of ways of doing proper form disambiguation, including shallow tagging, which then allows retrieving correct base forms for lemmas, not stems. Stemming is typically good enough (and fast), so your advice was 100% fine.
Dawid
-
Re: Tags generation? Lance Norskog 2012-08-04, 01:31
Thanks, everyone. I hadn't considered the stem/synonym problem. I have code for regularizing a doc/term matrix with tf, binary, log and augmented norm for the cells, combined with idf, gfidf, entropy, normal (term vector) and probabilistic inverse as global term weights. Running any of these, and then SVD, on a Reuters article may take 10-20 ms. This uses a sentence/term matrix for document summarization. After doing all of this, I realized that maybe just the regularized matrix was good enough.
One thing came through: parts-of-speech selection for nouns and verbs helped 5-10% in every combination of regularizers. All across the board. If you want good tags, select your parts of speech!
-- Lance Norskog [EMAIL PROTECTED]
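As an illustration of one of the cell/global weighting combinations listed above, here is a sketch of log-entropy weighting in its common textbook form: local weight log(1 + tf), global weight 1 + sum_j p_ij log(p_ij) / log(n). This is not the code referred to in the message; the matrix layout (rows are terms, columns are documents) and the function name are assumptions.

```python
import math

def log_entropy(matrix):
    """matrix[i][j] is the raw count of term i in document j (needs n_docs > 1)."""
    n_docs = len(matrix[0])
    weighted = []
    for row in matrix:
        gf = sum(row)  # global frequency of the term; assumed to be non-zero
        entropy = sum((tf / gf) * math.log(tf / gf) for tf in row if tf > 0)
        global_weight = 1.0 + entropy / math.log(n_docs)
        weighted.append([global_weight * math.log(1 + tf) for tf in row])
    return weighted

counts = [
    [1, 1, 1, 1],  # spread evenly over all documents: global weight near 0
    [4, 0, 0, 0],  # concentrated in one document: global weight 1
    [2, 2, 0, 0],  # in between: global weight 0.5
]
for row in log_entropy(counts):
    print([round(x, 3) for x in row])
```

Terms spread evenly over the corpus are pushed toward zero, which is the same intuition as a low idf.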
-
Re: Tags generation? Pat Ferrel 2012-08-05, 17:11
The way back from stem to tag is interesting from the standpoint of making tags human readable. I had assumed a lookup, but this seems much more satisfying and flexible. In order to keep frequencies, it will take something like a dictionary creation step in the analyzer. This in turn seems to imply a join, so a whole new map-reduce job. Maybe not completely trivial?
It seems that NLP can be used in two very different ways here: first as a filter (keep only nouns and verbs?), second to differentiate semantics (can:verb, can:noun). One method is a dimensional reduction technique; the other increases dimensions but can lead to orthogonal dimensions from the same term. I suppose both could be used together, as the above example indicates.
It sounds like you are using it to filter (only?). Can you explain what you mean by: "One thing came through: parts-of-speech selection for nouns and verbs helped 5-10% in every combination of regularizers."
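The two uses of part-of-speech information can be contrasted with a small sketch. The input is assumed to be already tagged with coarse labels by some tagger; the tagged sentence and both helper names are hypothetical.

```python
def filter_by_pos(tagged, keep=("noun", "verb")):
    """Use 1: drop every token whose part of speech is not in `keep`."""
    return [(word, pos) for word, pos in tagged if pos in keep]

def augment_with_pos(tagged):
    """Use 2: make the part of speech part of the term, e.g. 'can:verb'."""
    return [f"{word}:{pos}" for word, pos in tagged]

# Hypothetical tagger output for "we can the fish in a can".
tagged = [
    ("we", "pron"), ("can", "verb"), ("the", "det"), ("fish", "noun"),
    ("in", "prep"), ("a", "det"), ("can", "noun"),
]
kept = filter_by_pos(tagged)
print([word for word, _ in kept])   # filtering only: fewer dimensions
print(augment_with_pos(kept))       # filtering plus augmentation
```

Filtering alone leaves a single "can" dimension, while augmentation splits it into two separate terms.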
-
Re: Tags generation? Lance Norskog 2012-08-07, 01:19
I used the OpenNLP Parts-Of-Speech tool to label all words as 'noun', 'verb', etc. I removed all words that were not nouns or verbs. In my use case, this is a total win. In other cases, maybe not: Twitter has a quite varied non-grammar.
-- Lance Norskog [EMAIL PROTECTED]
-
Re: Tags generation? SAMIK CHAKRABORTY 2012-08-07, 12:37
Hi All,
We have developed an auto-tagging system for our micro-blogging platform. Here is what we have done:
The purpose of the system is to find tags in an article automatically when someone posts a link on our micro-blogging site. The goal was to allow us to follow a tag instead of (or in addition to) a person. So we used some custom code on top of Mahout, UIMA, OpenNLP, etc.
If you are interested in seeing how it works, take a look at: http://www.scoopspot.com/
One more thing: we also created a robot which visits some well-known web sites such as Read Write Web, Hacker News, and Tech Crunch, fetches articles from them, and publishes them to our micro-blog. As we already have tag following, we get the information without any problem. That's very cool (to us at least). You can see the output of the robot at this location: http://news.scoopspot.com/
I thought this might be an example of what Mahout can do and is related to this thread, so I felt like sharing it with you guys. Sorry if it looks off-topic.
Regards,
Samik
-
Re: Tags generation? Ted Dunning 2012-08-07, 12:54
Nice stuff. And glad that Mahout was able to help!