Shashikant Kore 2010-01-12, 12:49
Hi,
I am looking at LLR scores for two terms in a cluster which seem non-intuitive to me.
The corpus size is 706,120 and size of the cluster is 21964.
Term1 appears in 904 docs in the cluster and 1144 docs outside the cluster. Term2 appears in 36 docs in the cluster and 60280 docs outside the cluster.
As far as I can see, Term1 is rare outside the cluster but common inside it (relatively speaking). But when I calculate LLR scores, Term1's score (3569) is lower than that of Term2 (3622). This looks counter-intuitive to me. Is it the case that the LLR score is higher if a term is common outside the cluster and rare inside? Can this be "fixed"?
The k11, k12, k21, k22 values for Term1 and Term2 are as follows, if you wish to calculate.
        k11     k12     k21     k22
Term1   904     21060   1144    683012
Term2   36      21928   60280   623876
Thanks,
--shashi
-
Re: LLR Scoring question
Robin Anil 2010-01-12, 13:11
I don't have my code here to verify the result. Can you show the calculation here, I mean the values of the logs and so on? That may give a better idea.
-
Re: LLR Scoring question
Shashikant Kore 2010-01-12, 13:56
Not sure which values you asked for. Here are the entropy values as calculated in the following class:
http://svn.apache.org/viewvc/lucene/mahout/trunk/math/src/main/java/org/apache/mahout/math/stats/LogLikelihood.java?view=markup
Term1: rowEntropy 12226, columnEntropy 96057, matrixEntropy 110068, result 3569
Term2: rowEntropy 204240, columnEntropy 96031, matrixEntropy 302083, result 3622
--shashi
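For readers who want to check the two scores without the Mahout class, here is a minimal Python sketch. It assumes the standard G-test form of the log-likelihood ratio, 2 * sum(k * ln(k / expected)), rather than the row/column/matrix entropy decomposition quoted above, so only the final scores are comparable. The function name `llr` is illustrative and is not Mahout's API.

```python
import math

def llr(k11, k12, k21, k22):
    """Log-likelihood ratio (G-test) for a 2x2 contingency table.

    Illustrative helper, not the Mahout implementation.
    """
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)   # cluster size, rest of corpus
    cols = (k11 + k21, k12 + k22)   # docs with the term, docs without it
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # an empty cell contributes nothing to the sum
            expected = rows[i] * cols[j] / n
            total += k * math.log(k / expected)
    return 2.0 * total

term1 = llr(904, 21060, 1144, 683012)
term2 = llr(36, 21928, 60280, 623876)
print(round(term1), round(term2))
```

The two results should land close to the 3569 and 3622 reported in this thread, give or take rounding.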
-
Re: LLR Scoring question
Ted Dunning 2010-01-12, 19:20
Raw LLR has a large value whenever there is an anomaly. In this case, term2 is rare in the cluster and common outside and is thus an anomaly.
One thing that I do is to use a variant of the LLR score:
rootLLR = signum(k11/k1* - k21/k2*) * sqrt(LLR)
This score has two advantages over the basic LLR:
a) it is positive where k11 is bigger than expected, negative where it is lower. This resolves your current problem.
b) if there is no difference, it is asymptotically normally distributed. This lets people talk about "number of standard deviations", which is a more common frame of reference than the chi^2 distribution.
-- Ted Dunning, CTO DeepDyve
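A rough Python sketch of the signed score described above, applied to the two tables from the original question. It assumes k1* and k2* are the row sums (k11+k12) and (k21+k22), which is the reading Shashikant adopts later in the thread, and it uses the plain G-test form of LLR. The names `llr` and `root_llr` are illustrative, not Mahout's API.

```python
import math

def llr(k11, k12, k21, k22):
    """G-test log-likelihood ratio for a 2x2 table (illustrative helper)."""
    n = k11 + k12 + k21 + k22
    rows, cols = (k11 + k12, k21 + k22), (k11 + k21, k12 + k22)
    table = ((k11, k12), (k21, k22))
    return 2.0 * sum(
        table[i][j] * math.log(table[i][j] * n / (rows[i] * cols[j]))
        for i in range(2) for j in range(2) if table[i][j] > 0
    )

def root_llr(k11, k12, k21, k22):
    """Signed square root of LLR; assumes both rows are non-empty."""
    in_rate = k11 / (k11 + k12)    # k11 / k1*
    out_rate = k21 / (k21 + k22)   # k21 / k2*
    sign = (in_rate > out_rate) - (in_rate < out_rate)  # signum
    # max() guards against a tiny negative value from floating-point rounding
    return sign * math.sqrt(max(llr(k11, k12, k21, k22), 0.0))

root1 = root_llr(904, 21060, 1144, 683012)   # Term1
root2 = root_llr(36, 21928, 60280, 623876)   # Term2
print(root1, root2)
```

With these inputs Term1 comes out positive and Term2 negative, so sorting by the signed score puts Term1 ahead even though its raw LLR is slightly smaller.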
-
Re: LLR Scoring question
Ted Dunning 2010-01-12, 19:30
I should add that for collocations, this almost never matters because a pair of words can only occur less than expected if one of the words is very common.
The only example in English that I know offhand is the phrase "the the", which does occur (generally due to typographical error), but because "the" is sooo common, it occurs less often than expected.
Any pair that involves a word less common than "the" will tend to have a very low expected frequency. As such, it is hard to have a non-zero frequency that is less than expected. Even zero occurrences is not a whole lot less than the expected frequency unless you have a truly ginormous corpus.
For cluster labeling or classification features, however, it is quite plausible for a feature to be less common in the cluster of interest than in the rest of the corpus. Because the cluster may be relatively large, it is also quite plausible for this feature to have a non-zero count and a pretty respectable LLR.
-- Ted Dunning, CTO DeepDyve
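To put numbers on the "less than expected" argument, here is a small Python sketch of the expected k11 under independence (row total times column total over N). The first two tables are the ones from the original question; the collocation counts are made up purely for illustration, and `expected_k11` is a hypothetical helper name.

```python
def expected_k11(k11, k12, k21, k22):
    """Expected top-left count if rows and columns were independent."""
    n = k11 + k12 + k21 + k22
    return (k11 + k12) * (k11 + k21) / n

# Cluster-labeling tables from earlier in the thread.
term1_expected = expected_k11(904, 21060, 1144, 683012)   # observed k11 = 904
term2_expected = expected_k11(36, 21928, 60280, 623876)   # observed k11 = 36

# Hypothetical collocation: word A at 1,000 and word B at 500 of 1,000,000
# bigram positions, never seen together (made-up counts, not from the thread).
pair_expected = expected_k11(0, 1000, 500, 998500)

print(term1_expected, term2_expected, pair_expected)
```

For Term2 the expected in-cluster count is in the high hundreds to low thousands while only 36 were observed, so a large shortfall is possible. For the hypothetical rare pair the expected count is 0.5, so even zero occurrences is barely below expectation.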
-
Re: LLR Scoring question
Shashikant Kore 2010-01-13, 17:00
Ted,
Thank you for the tip.
> rootLLR = signum(k11/k1* - k21/k2*) * sqrt(LLR)
I didn't get what k1* and k2* are. I used (k11+k12) and (k21+k22) as the denominators. That gives the correct result.
--shashi
-
Re: LLR Scoring question
Ted Dunning 2010-01-13, 18:02
On Wed, Jan 13, 2010 at 9:00 AM, Shashikant Kore <[EMAIL PROTECTED]> wrote:
> I didn't get what k1* and k2* are. I used (k11+k12) and (k21+k22) in
> the denominator. That gives correct result.
Your interpretation is correct. I use the * as a shorthand.
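Written out with the star read as a sum over the index it replaces (the thread confirms this reading only for k1* and k2*), the score looks like this:

```latex
\[
k_{1*} = k_{11} + k_{12}, \qquad k_{2*} = k_{21} + k_{22}
\]
\[
\mathrm{rootLLR} = \operatorname{sgn}\!\left(\frac{k_{11}}{k_{1*}} - \frac{k_{21}}{k_{2*}}\right)\sqrt{\mathrm{LLR}}
\]
```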