2 questions about lda implementation
ivan obeso 2012-05-08, 15:54
I'm using Mahout 0.6. I ran the "mahout lda..." command-line tool to apply LDA to a corpus. Now I want to call it from my Java program, but I'm having a lot of problems because it crashes. Can someone give me a working Java example?

Looking at the output of LDA, I have 2 folders:
- docTopics: contains a Text key (the document ID) and a vector value (the membership of this document in each topic).
- state-n: I assume the IntPairWritable key is (topicID, wordID), so each topic has one entry for every word ID in the corpus. I don't know what the DoubleWritable value is. I think it is the membership between the topic and the word, but I don't know what kind of measure is used.

For example, here is part of a split that I printed:

...
(4, 17847) -28.424714110200803
(4, 17848) -32.54168874531223
(4, 17849) -51.954687480087074
(4, 17850) -1.8811618929248652E-12
(4, 17851) -7.102634146221668
(4, 17852) 3.440324743165531
(4, 17853) 1.118778127312393
(4, 17854) 2.2973859313207385
(4, 17855) 2.1602327860824015
(4, 17856) -2.5362957334351677E-6
(4, 17857) -32.80559170476965
(4, 17858) -1.9791269423308222E-7
...

Can somebody help explain this to me?
Re: 2 questions about lda implementation
Jake Mannix 2012-05-08, 17:00
Hi Ivan,
First off, let me say that you should probably start migrating to using the new LDA implementation which came in 0.6, which is invoked via the "mahout cvb..." command, or by directly launching the o.a.m.clustering.lda.cvb.CVB0Driver in your code, as the old LDA which you're referencing will be going away soon.
But for now, I'll try to answer your questions on the old impl:
On Tue, May 8, 2012 at 8:54 AM, ivan obeso <[EMAIL PROTECTED]> wrote:

> Im using mahout 0.6. I had runned the "mahout lda..." tool for command line
> for apply lda method in a corpus. But now, i want to code it in my java
> program and Im having a lot of problems because it crashes. Can someone
> give me an example java code running correctly?
>
> Looking at the output of LDA, I have 2 folders:
> - docTopics: wich contains a Text key (the document ID) and a vector Value
> (that is the membership of this document to each topic).
> -state-n: I assume that the intPairWritable is (topicID, wordID) so it have
> as wordID as all the corpus for each topic. And the DoubleWritable Value I
> dont know what is. I think its the membership between the topic and the
> word, but i dont know what type of meassure method is used. For example,
> here is an split that I have printed:
>

You're correct here - the values are unnormalized log( p(wordId | topicId) ) values. To recover probabilities, you need to exponentiate them, and normalize so that if you sum over all the values for a given topicId, the sum == 1.

> ...
> (4, 17847) -28.424714110200803
> (4, 17848) -32.54168874531223
> (4, 17849) -51.954687480087074
> (4, 17850) -1.8811618929248652E-12
> (4, 17851) -7.102634146221668
> (4, 17852) 3.440324743165531
> (4, 17853) 1.118778127312393
> (4, 17854) 2.2973859313207385
> (4, 17855) 2.1602327860824015
> (4, 17856) -2.5362957334351677E-6
> (4, 17857) -32.80559170476965
> (4, 17858) -1.9791269423308222E-7
> ...
>
> Can somebody help me explaining me this?
>
--
-jake
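
For reference, the exponentiate-and-normalize step described in the reply above could be sketched in plain Java roughly as follows. This is only an illustration, not code from Mahout: the class name TopicWordNormalizer and the method normalize are hypothetical, and it assumes the (wordId, value) pairs for a single topic have already been read out of the state-n files into a map. Subtracting the maximum before exponentiating is a common numerical-stability trick added here; it is not mentioned in the thread and does not change the result.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper, not part of Mahout. */
public class TopicWordNormalizer {

    /**
     * Turns the unnormalized log values of ONE topic into probabilities
     * that sum to 1. Keys are word IDs, values are the log values.
     */
    public static Map<Integer, Double> normalize(Map<Integer, Double> logValues) {
        // Find the largest log value so that exp() below cannot overflow.
        double max = Double.NEGATIVE_INFINITY;
        for (double v : logValues.values()) {
            max = Math.max(max, v);
        }

        // Sum of the shifted, exponentiated values (the normalizer).
        double sum = 0.0;
        for (double v : logValues.values()) {
            sum += Math.exp(v - max);
        }

        // exp(v - max) / sum equals exp(v) / sum(exp(all values)).
        Map<Integer, Double> probabilities = new LinkedHashMap<>();
        for (Map.Entry<Integer, Double> e : logValues.entrySet()) {
            probabilities.put(e.getKey(), Math.exp(e.getValue() - max) / sum);
        }
        return probabilities;
    }

    public static void main(String[] args) {
        // A few of the topic-4 values printed in the thread. This is only a
        // fragment of the vocabulary, so the numbers printed here are
        // normalized over the fragment and are NOT the real p(word | topic).
        Map<Integer, Double> topic4 = new LinkedHashMap<>();
        topic4.put(17851, -7.102634146221668);
        topic4.put(17852, 3.440324743165531);
        topic4.put(17853, 1.118778127312393);
        topic4.put(17854, 2.2973859313207385);
        topic4.put(17855, 2.1602327860824015);

        for (Map.Entry<Integer, Double> e : normalize(topic4).entrySet()) {
            System.out.println("(4, " + e.getKey() + ") " + e.getValue());
        }
    }
}
```

In real use the map would have to hold every word ID for the topic, since the sum has to run over all values for that topicId.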
Re: 2 questions about lda implementation
ivan obeso 2012-05-10, 15:55
OK, I have started migrating my program to the new LDA version in Mahout 0.6. To be clear, I'm doing everything in Java code, not with the command-line programs.

The problem is that I can't use the sequence files that I generated for the old version. I wrote them with a SequenceFile.Writer, using a Text key and a Text value. That no longer works because CVB0Driver wants an IntWritable key. I understand that I have to use the SparseVectorsFromSequenceFiles class to convert my sequence files into the input that CVB0Driver expects. Is that correct? My problem is the lack of documentation for these classes. I don't know how to use SparseVectorsFromSequenceFiles's run method. Can someone explain how to use it?

I also need a page explaining all the parameters of CVB0Driver's run method (19 parameters is too many!), because I don't know what some of them mean and I can't find any useful information :(
Thanks.