John Conwell 2012-01-27, 22:11
I used the CVB variant of LDA, and when I tried to run LDAPrintTopics I noticed the topicModel output datatypes changed from the original LDA implementation. So I figured I'd just write my own for CVB, and base it off the LDA implementation.
And I noticed something odd. When running LDAPrintTopics, it gathers the top N terms by topic (topWordsForTopics) and normalizes the values in the vector, which makes sense. But during the normalization calculation it also weights the vector by using Math.exp(score) instead of just the straight score for all calculations.
I get that using Math.exp(score) will give exponentially larger values a stronger weighting than smaller values, but why is this done in the normalization?
And if I was going to use the topicModel output as the input to some other algorithm, would I want to run the topicModel vectors through the same kind of weighting normalization? And if so, why not just persist the topicModel in this weighted normalized format in the first place?
And finally, should I also use this same weighting normalization on the docTopics output as well? The docTopics are normalized (well, they all add up to 1), but are they normalized in the same manner?
I'm just trying to figure out how to use the LDA output, and figure out if there are any steps I need to consider before I use it as input to something else.
--
Thanks, John C
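A minimal sketch of the two normalizations being compared in the question above. It assumes, and the thread does not confirm this, that the old LDA scores are log-space values; under that assumption, exponentiating before normalizing turns them back into ordinary probabilities, while dividing raw scores by their sum does not. The class and method names are hypothetical and are not taken from Mahout.

```java
import java.util.Arrays;

final class ExpNormalizeSketch {

    /** Plain normalization: each raw score divided by the sum of raw scores. */
    static double[] normalizeLinear(double[] scores) {
        double sum = 0.0;
        for (double s : scores) {
            sum += s;
        }
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            out[i] = scores[i] / sum;
        }
        return out;
    }

    /** Exponentiate each score, then divide by the sum of the exponentials. */
    static double[] normalizeExp(double[] scores) {
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            max = Math.max(max, s);
        }
        double[] out = new double[scores.length];
        double sum = 0.0;
        for (int i = 0; i < scores.length; i++) {
            // Subtracting the max guards against underflow; it cancels in the ratio.
            out[i] = Math.exp(scores[i] - max);
            sum += out[i];
        }
        for (int i = 0; i < out.length; i++) {
            out[i] /= sum;
        }
        return out;
    }

    public static void main(String[] args) {
        // Assumed log-space scores for three terms in one topic.
        double[] logScores = {Math.log(0.5), Math.log(0.3), Math.log(0.2)};
        System.out.println(Arrays.toString(normalizeExp(logScores)));
    }
}
```

If the stored values are already probabilities that sum to 1, as the docTopics output is described above, applying Math.exp again would distort them rather than normalize them.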
Re: LDA output questions
John Conwell 2012-01-28, 01:21
Ok, I think I just wrote all that (and wasted a couple hours) for nothing. It looks like topicModel output for the CVB algorithm is the normalized output from the last model generated from the tempState folder. Basically it automatically does for me some of what LDAPrintTopics does: it normalizes the topic word weights.
That means there is no reason to do the weighting normalization for CVB, correct? And we still have to manually pull out the top N terms by weight for the topic, and match their index in the vector with the dictionary in order to get a readable top N words per topic, correct?
But, I think in all that looking I found a bug in LDAPrintTopics. It is supposed to spit out the top N words per topic, where top N is based on the term weight for that topic. The function maybeEnqueue() uses a PriorityQueue<Pair<String,Double>>, but doesn't pass in a Comparator, so it uses the Comparable implementation for Pair<A,B>, which first compares the String in the Pair and only if the strings are equal compares the Double value. But that never happens, since no terms are duplicated for a topic, and hence the term weight value is never checked. I double checked by putting a breakpoint in the compareTo method in Pair, and it never made it past the string comparison.
All this means is that LDAPrintTopics is outputting the top N terms per topic by term string sort order, not by weight.
Thanks, John C
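The ordering problem described in this message can be reproduced in isolation. The sketch below is not the Mahout source: TermWeight is a hypothetical stand-in for a pair type whose compareTo checks the string first and the number second, which is how the message describes Pair<A,B>. A PriorityQueue built without a Comparator then orders purely by term whenever the terms are distinct.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

final class NaturalOrderingDemo {

    /** Hypothetical stand-in for a pair that compares the term first, then the weight. */
    static final class TermWeight implements Comparable<TermWeight> {
        final String term;
        final double weight;

        TermWeight(String term, double weight) {
            this.term = term;
            this.weight = weight;
        }

        @Override
        public int compareTo(TermWeight other) {
            int byTerm = term.compareTo(other.term);
            // The weight is consulted only when two terms are identical.
            return byTerm != 0 ? byTerm : Double.compare(weight, other.weight);
        }
    }

    /** Fills a queue that has no Comparator and returns the terms in poll order. */
    static List<String> pollOrder(String[] terms, double[] weights) {
        PriorityQueue<TermWeight> queue = new PriorityQueue<>();
        for (int i = 0; i < terms.length; i++) {
            queue.add(new TermWeight(terms[i], weights[i]));
        }
        List<String> order = new ArrayList<>();
        while (!queue.isEmpty()) {
            order.add(queue.poll().term);
        }
        return order;
    }

    public static void main(String[] args) {
        String[] terms = {"zebra", "apple", "mango"};
        double[] weights = {0.70, 0.05, 0.25};
        // The heaviest term comes out last only because "zebra" sorts last alphabetically.
        System.out.println(pollOrder(terms, weights)); // prints [apple, mango, zebra]
    }
}
```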
Re: LDA output questions
Jake Mannix 2012-01-28, 04:24
Hey John,
Sorry I didn't get back to respond to this earlier: you are essentially correct, but if you want to have the effect of "LDAPrintTopics" on the new CVB data, since the format is exactly that of a DistributedRowMatrix (i.e. a simple SequenceFile<IntWritable,VectorWritable>), you need do nothing other than:
  $MAHOUT_HOME/bin/vectordump -s <modelpath> -d <dictionarypath> \
      -dt sequencefile -p -sort -o ./output_topics.txt
If you only want the top N terms/features per topic, add "-vs N" to that option list (for example, "-vs 100" for the top 100).
Hope that helps.
-jake
p.s. yes, LDAPrintTopics does a lot of funny things, and might indeed be buggy. But I'm more interested in finding bugs / pieces of missing docs in the new CVB code, as we are probably removing the old code in the next release.
Re: LDA output questions
John Conwell 2012-01-30, 18:53
Totally understand on the bug fix. But for anyone who wants the fix, I've created a patch, attached to this email.
Basically just create a new Comparator when you create the PriorityQueue, and only compare the second value of each Pair (the double value), and ignore the string.
Thanks, John C
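The attached patch is not reproduced in this archive, so the following is only a sketch of the approach described in the message: pass a Comparator that looks at the weight alone, and keep a bounded min-heap so that the N heaviest terms survive. TermWeight and topTerms are hypothetical names, and the terms array stands in for a dictionary lookup by vector index.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class TopTermsByWeight {

    /** Hypothetical holder for one dictionary term and its weight in a topic. */
    static final class TermWeight {
        final String term;
        final double weight;

        TermWeight(String term, double weight) {
            this.term = term;
            this.weight = weight;
        }
    }

    /** Returns the n highest-weight terms, heaviest first. */
    static List<String> topTerms(String[] terms, double[] weights, int n) {
        List<String> top = new ArrayList<>();
        if (n <= 0) {
            return top;
        }
        // Min-heap ordered by weight only; the term string is ignored.
        PriorityQueue<TermWeight> queue =
            new PriorityQueue<>(n, Comparator.comparingDouble((TermWeight tw) -> tw.weight));
        for (int i = 0; i < terms.length; i++) {
            if (queue.size() < n) {
                queue.add(new TermWeight(terms[i], weights[i]));
            } else if (weights[i] > queue.peek().weight) {
                // Evict the current lightest entry to make room for a heavier one.
                queue.poll();
                queue.add(new TermWeight(terms[i], weights[i]));
            }
        }
        // Polling yields lightest first, so reverse to get heaviest first.
        while (!queue.isEmpty()) {
            top.add(queue.poll().term);
        }
        Collections.reverse(top);
        return top;
    }

    public static void main(String[] args) {
        String[] terms = {"zebra", "apple", "mango", "kiwi"};
        double[] weights = {0.60, 0.05, 0.25, 0.10};
        System.out.println(topTerms(terms, weights, 2)); // prints [zebra, mango]
    }
}
```

Keeping the heap bounded at N means memory stays proportional to N rather than to the vocabulary size, which matters when a topic vector has one entry per dictionary term.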