Re: LLR Scoring question
Ted Dunning 2010-01-12, 19:20
Raw LLR has a large value whenever there is an anomaly, in either direction.
In this case, Term2 is rare in the cluster and common outside it, and is thus
an anomaly.
One thing that I do is to use a variant of the LLR score:
rootLLR = signum(k11/k1* - k21/k2*) * sqrt(LLR)
This score has two advantages over the basic LLR:
a) it is positive where k11 is bigger than expected, negative where it is
lower. This resolves your current problem.
b) if there is no difference, it is asymptotically normally distributed.
This allows people to talk about "number of standard deviations", which is a
more common frame of reference than the chi^2 distribution.
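
For concreteness, here is a rough Python sketch of the score above. It assumes
the usual 2x2 layout (k11 = occurrences of the term in the cluster, k12 = other
terms in the cluster, k21 = the term outside the cluster, k22 = other terms
outside), takes k1* and k2* to be the row sums, and assumes both rows are
non-empty. The function names and the example counts are made up for
illustration and are not taken from any library or from the numbers in this
thread.

```python
import math


def llr(k11, k12, k21, k22):
    """Raw log-likelihood ratio (G^2) for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)   # k1*, k2*
    cols = (k11 + k21, k12 + k22)   # k*1, k*2
    cells = ((k11, 0, 0), (k12, 0, 1), (k21, 1, 0), (k22, 1, 1))
    total = 0.0
    for k, i, j in cells:
        if k > 0:  # a zero cell contributes nothing (limit of k*log(k) is 0)
            total += k * math.log(k * n / (rows[i] * cols[j]))
    # Clamp tiny negative values caused by floating-point round-off.
    return max(0.0, 2.0 * total)


def root_llr(k11, k12, k21, k22):
    """Signed square root of LLR: positive when k11 is higher than expected."""
    diff = k11 / (k11 + k12) - k21 / (k21 + k22)
    sign = (diff > 0) - (diff < 0)  # signum
    return sign * math.sqrt(llr(k11, k12, k21, k22))


# Hypothetical counts: a term over-represented in the cluster, and the
# mirror-image table where it is under-represented by the same amount.
over = (100, 900, 10, 9990)
under = (10, 9990, 100, 900)

print(llr(*over), llr(*under))            # raw LLR cannot tell them apart
print(root_llr(*over), root_llr(*under))  # the signed root can
```

With these made-up counts the two tables get the same raw LLR, but the signed
root is positive for the first and negative for the second, so sorting by
rootLLR puts the over-represented term at the top.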
On Tue, Jan 12, 2010 at 4:49 AM, Shashikant Kore <[EMAIL PROTECTED]> wrote:
> As I can see Term1 is rarer outside the cluster, but common in the
> cluster (relatively speaking.) But, when I calculate LLR scores,
> Term1's score (3569) is lower than that of Term2 (3622). This looks
> counter-intuitive to me. Is it the case that LLR score is higher if
> term is common outside the cluster and rare inside? Can this be
Ted Dunning, CTO