Mahout, mail # user - "Direction" of co-occurence and log-likelihood ratio


Re: "Direction" of co-occurence and log-likelihood ratio
Sean Owen 2012-06-21, 20:55
Is this not just a matter of comparing the frequency of "the" with
"the the"? If "the" is 1/n of the words, then "the the" ought to be
1/n^2. If it's less, it's under-represented.
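
As a rough sketch of that comparison (Python is used only for illustration; the function name and its count arguments are hypothetical, not part of Mahout), the observed co-occurrence count can be divided by the count expected if the two items were independent:

```python
def representation_ratio(count_a, count_b, count_ab, total):
    """Observed co-occurrence count divided by the count expected under independence.

    count_a, count_b: number of samples containing A, and containing B
    count_ab:         number of samples containing both A and B
    total:            total number of samples
    A result below 1 suggests under-representation, above 1 over-representation.
    """
    if count_a == 0 or count_b == 0 or total == 0:
        raise ValueError("counts must be positive to form an expectation")
    # Under independence, P(A and B) = P(A) * P(B), so the expected count is
    # (count_a / total) * (count_b / total) * total.
    expected = count_a * count_b / total
    return count_ab / expected


# "the" is 100 of 1000 words, so "the the" would be expected about 10 times.
ratio = representation_ratio(100, 100, 5, 1000)
print(ratio)
```

Taking the logarithm of this ratio would put it on (-inf, inf), with 0 at independence, but unlike LLR it carries no notion of statistical significance.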

On Thu, Jun 21, 2012 at 9:01 PM, Nimrod Priell <[EMAIL PROTECTED]> wrote:
> I am wondering if there's a way to detect whether the deviation from independence means the co-occurrence is under-represented or over-represented with respect to random sampling. Ideally, I'd like a measure on, say, (-inf, inf) that is negative when the class where both A and B occur is under-represented, and positive when there is an overabundance of samples with (A intersection B).
>
> My initial guess was that LLR(k_11, k_12, k_21, k_22) has a single minimum with respect to k_11, i.e. keeping all other parameters fixed, it will be decreasing with k_11 up to a point, then increasing. That minimum is obviously where the co-occurrence is random.
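
One way to get the signed measure described in the quoted message, sketched here under assumptions rather than taken from the thread, is to compute the usual G^2 form of the log-likelihood ratio on the 2x2 table and attach a sign according to whether k_11 is above or below its expected value under independence. The function names `llr` and `signed_llr` are hypothetical.

```python
import math


def llr(k11, k12, k21, k22):
    """G^2 log-likelihood ratio for a 2x2 contingency table: 2 * sum(k * ln(k / E))."""
    n = k11 + k12 + k21 + k22
    rows = (k11 + k12, k21 + k22)
    cols = (k11 + k21, k12 + k22)
    cells = ((k11, k12), (k21, k22))
    total = 0.0
    for i in range(2):
        for j in range(2):
            k = cells[i][j]
            if k > 0:  # a zero cell contributes nothing (k * ln k -> 0)
                expected = rows[i] * cols[j] / n
                total += k * math.log(k / expected)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, 2.0 * total)


def signed_llr(k11, k12, k21, k22):
    """LLR with a sign: negative when A-and-B is under-represented, positive when over."""
    n = k11 + k12 + k21 + k22
    if n == 0:
        return 0.0
    # Expected k11 if A and B were independent: row total * column total / n.
    expected_11 = (k11 + k12) * (k11 + k21) / n
    score = llr(k11, k12, k21, k22)
    return score if k11 >= expected_11 else -score


print(signed_llr(10, 10, 10, 10))   # table is exactly independent
print(signed_llr(1, 20, 20, 100))   # k11 below its expected value
print(signed_llr(30, 10, 10, 100))  # k11 above its expected value
```

The magnitude is the ordinary LLR, so it is zero at independence and grows in either direction; only the sign is added.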