-Re: 2 questions about lda implementation
ivan obeso 2012-05-10, 15:55
Ok, I have started to migrate my program to Mahout 0.6 with the new LDA
version. First of all, I'm doing everything in Java code, no command-line
programs. The problem is that I can't use the sequence files that I generated
for the old version. I wrote them with a SequenceFile.Writer, writing a Text
as key and a Text as value. Now that's not allowed, because CVB0Driver wants
an IntWritable as key. I know that I have to use the
SparseVectorsFromSequenceFiles class to convert my sequence files to the
input files that CVB0Driver wants. Is that correct? My problem is the lack
of documentation about these classes. I don't know how to use
SparseVectorsFromSequenceFiles's run method. Can someone explain its usage
to me?
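For what it's worth, SparseVectorsFromSequenceFiles extends Mahout's AbstractJob, so rather than calling run() directly, the usual way to drive it from Java is through Hadoop's ToolRunner, passing the same flags as the "mahout seq2sparse" CLI. A minimal sketch (the paths are placeholders, and the package name is what I believe it is in 0.6):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.vectorizer.SparseVectorsFromSequenceFiles;

public class Seq2SparseExample {
    public static void main(String[] args) throws Exception {
        // Same flags as the "mahout seq2sparse" command line;
        // -i expects a directory of SequenceFile<Text,Text>,
        // -o receives the vectorized output directories.
        String[] seq2sparseArgs = {
            "-i", "/path/to/text-seqfiles",   // placeholder input path
            "-o", "/path/to/sparse-vectors",  // placeholder output path
        };
        int exitCode = ToolRunner.run(new Configuration(),
                new SparseVectorsFromSequenceFiles(), seq2sparseArgs);
        System.exit(exitCode);
    }
}
```

Note that, if I remember correctly, the vectors produced by seq2sparse still have Text keys; there is a rowid utility in Mahout that converts them to the IntWritable-keyed matrix CVB0Driver expects, which you would run as a second step.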
Also, I need a page explaining all the parameters of CVB0Driver's run
method (19 parameters is a lot!), because I don't know the meaning of some
of them and I can't find any useful information :(
On Tue, May 8, 2012 at 1:00 PM, Jake Mannix <[EMAIL PROTECTED]> wrote:
> Hi Ivan,
> First off, let me say that you should probably start migrating to using
> the new LDA implementation which came in 0.6, which is invoked via the
> "mahout" command, or by directly launching the
> o.a.m.clustering.lda.cvb.CVB0Driver in your code, as the old LDA which
> you're referencing will be going away.
> But for now, I'll try to answer your questions on the old impl:
> On Tue, May 8, 2012 at 8:54 AM, ivan obeso <[EMAIL PROTECTED]> wrote:
> > I'm using Mahout 0.6. I had run the "mahout lda..." command-line tool
> > to apply the LDA method to a corpus. But now I want to code it in my Java
> > program, and I'm having a lot of problems because it crashes. Can someone
> > give me an example of Java code that runs correctly?
> > Looking at the output of LDA, I have 2 folders:
> > - docTopics: which contains a Text key (the document ID) and a vector
> > value (the membership of this document to each topic).
> > - state-n: I assume that the IntPairWritable key is (topicID, wordID), so
> > it has as wordIDs all the words of the corpus for each topic. And the
> > DoubleWritable value, I don't know what it is. I think it's the
> > membership between the topic and the word, but I don't know what kind of
> > measure is used. For example, here is a split that I have printed:
> You're correct here - the values are unnormalized log(p(wordId | topicId))
> values. To recover probabilities, you need to exponentiate them and
> normalize, so that if you sum over all the values for a given topicId, the
> sum == 1.
> > ...
> > (4, 17847) -28.424714110200803
> > (4, 17848) -32.54168874531223
> > (4, 17849) -51.954687480087074
> > (4, 17850) -1.8811618929248652E-12
> > (4, 17851) -7.102634146221668
> > (4, 17852) 3.440324743165531
> > (4, 17853) 1.118778127312393
> > (4, 17854) 2.2973859313207385
> > (4, 17855) 2.1602327860824015
> > (4, 17856) -2.5362957334351677E-6
> > (4, 17857) -32.80559170476965
> > (4, 17858) -1.9791269423308222E-7
> > ...
> > Can somebody help me by explaining this?
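Jake's recipe (exponentiate the log values for a topic, then normalize so they sum to 1) can be sketched in plain Java, with no Mahout dependency. The class name and sample values below are just illustrative; subtracting the maximum before exponentiating is a standard trick to avoid overflow with large-magnitude log values like the ones printed above:

```java
import java.util.Arrays;

public class LogProbNormalize {
    // Convert unnormalized log p(word | topic) values into a proper
    // probability distribution: exponentiate, then divide by the sum.
    // Subtracting the max first keeps Math.exp from under/overflowing.
    static double[] toProbabilities(double[] logValues) {
        double max = Arrays.stream(logValues).max().orElse(0.0);
        double[] p = new double[logValues.length];
        double sum = 0.0;
        for (int i = 0; i < logValues.length; i++) {
            p[i] = Math.exp(logValues[i] - max);
            sum += p[i];
        }
        for (int i = 0; i < p.length; i++) {
            p[i] /= sum;
        }
        return p;
    }

    public static void main(String[] args) {
        // A few of the topic-4 values printed in the thread
        double[] logValues = {-28.42, -32.54, -1.88e-12, 3.44, 1.12};
        double[] p = toProbabilities(logValues);
        System.out.println("sum = " + Arrays.stream(p).sum());
    }
}
```

After this, summing p over all words of a given topic yields 1.0, which is the check Jake describes.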