Re: LDA Questions
Gokhan Capan 2012-08-07, 18:01
Hi Jake,
Today I submitted the diff. It is available at
https://issues.apache.org/jira/browse/MAHOUT-1051

Thanks for the advice.

On Tue, Aug 7, 2012 at 1:06 AM, Jake Mannix <[EMAIL PROTECTED]> wrote:

> Sounds great Gokhan!
>
> On Mon, Aug 6, 2012 at 2:53 PM, Gokhan Capan <[EMAIL PROTECTED]> wrote:
>
> > Jake,
> >
> > I converted the ids to integers with rowid, and then modified
> > InMemoryCollapsedVariationBayes0.loadVectors() such that it returns a
> > SparseMatrix (instead of SparseRowMatrix) whose row ids are keys from
> > <IntWritable, VectorWritable> tf vectors. I am not sure if it works,
> > since the values of mapped integer ids (results of rowid) are in the
> > range [0, #ofDocuments), but I believe it does.
> >
> > Constructing SparseMatrix needs RandomAccessSparseVector as row vectors
> > and tf-vectors are sparse vectors, so I assumed that an incoming tf
> > vector itself, or getDelegate if it is a NamedVector, can be cast to
> > RandomAccessSparseVector.
> > I will submit the diff tomorrow, so you can review and commit.
> >
> > Thank you for your help.
> >
> >
> > On Mon, Aug 6, 2012 at 8:19 PM, Jake Mannix <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Gokhan,
> > >
> > > This looks like a bug in the
> > > InMemoryCollapsedVariationBayes0.loadVectors() method - it takes the
> > > SequenceFile<? extends Writable, VectorWritable> and ignores the
> > > keys, assigning the rows in order into an in-memory Matrix.
> > >
> > > If you run
> > > "$MAHOUT_HOME/bin/mahout rowid -i <your tf-vector-path> -o <output path>"
> > > this converts Text keys into IntWritable keys (and leaves behind an
> > > index file, a mapping of Text -> IntWritable which tells you which
> > > int is assigned to which original text key).
> > >
> > > Then what you'd want to do is modify
> > > InMemoryCollapsedVariationBayes0.loadVectors() to actually use the
> > > keys which are given to it, instead of reassigning to sequential
> > > ids. If you make this change, we'd love to have the diff - not too
> > > many people use the cvb0_local path (it's usually used for debugging
> > > and testing smaller data sets to see that topics are converging
> > > properly), but getting it to actually produce document -> topic
> > > outputs which correlate with original docIds would be very nice! :)
> > >
> > > On Mon, Aug 6, 2012 at 4:00 AM, Gokhan Capan <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > >
> > > > My question is about interpreting lda document-topics output.
> > > >
> > > > I am using trunk.
> > > >
> > > > I have a directory of documents, each of which is named by an
> > > > integer, and there is no sub-directory of the data directory.
> > > > The directory structure is as follows
> > > > $ ls /path/to/data/
> > > > 1
> > > > 2
> > > > 5
> > > > ...
> > > >
> > > > From those documents I want to detect topics, and output:
> > > > - topic - top terms
> > > > - document - top topics
> > > >
> > > > To this end, I first run seqdirectory on the directory:
> > > > $ mahout seqdirectory -i $DIR_IN -o $SEQDIR -c UTF-8 -chunk 1
> > > >
> > > > Then I run seq2sparse to create tf vectors of documents:
> > > > $ mahout seq2sparse -i $SEQDIR -o $SPARSEDIR --weight TF --maxDFSigma 3 --namedVector
> > > >
> > > > After creating vectors, I run cvb0_local on those tf-vectors:
> > > > $ mahout cvb0_local -i $SPARSEDIR/tf-vectors -do $LDA_OUT/docs -to $LDA_OUT/words -top 20 -m 50 --dictionary $SPARSEDIR/dictionary.file-0
> > > >
> > > > And to interpret the results, I use mahout's vectordump utility:
> > > > $ mahout vectordump -i $LDA_OUT/docs -o $LDA_HR_OUT/docs --vectorSize 10 -sort true -p true
> > > >
> > > > $ mahout vectordump -i $LDA_OUT/words -o $LDA_HR_OUT/words --dictionary $SPARSEDIR/dictionary.file-0 --dictionaryType sequencefile --vectorSize 10 -sort true -p true
> > > >
> > > > The resulting words file consists of #ofTopics lines.

Gokhan
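
For anyone reading this thread later, the key-handling issue discussed above can be illustrated with a small standalone sketch. It is not the Mahout code and not the diff attached to MAHOUT-1051: it uses plain Java collections in place of SequenceFile, IntWritable, VectorWritable and the Matrix classes, and the names RowIdSketch, Record, loadIgnoringKeys and loadUsingKeys are hypothetical, chosen only for illustration. It just contrasts numbering rows in read order with storing each row under the key it arrived with.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class RowIdSketch {

    /** One (key, vector) record; a stand-in for an <IntWritable, VectorWritable> pair. */
    static final class Record {
        final int key;
        final double[] vector;

        Record(int key, double[] vector) {
            this.key = key;
            this.vector = vector;
        }
    }

    /** The behaviour described as the bug: keys are ignored, rows are numbered in read order. */
    static Map<Integer, double[]> loadIgnoringKeys(List<Record> records) {
        Map<Integer, double[]> rows = new LinkedHashMap<>();
        int nextRow = 0;
        for (Record r : records) {
            rows.put(nextRow++, r.vector);
        }
        return rows;
    }

    /** The behaviour described as the fix: each row is stored under its own key. */
    static Map<Integer, double[]> loadUsingKeys(List<Record> records) {
        Map<Integer, double[]> rows = new LinkedHashMap<>();
        for (Record r : records) {
            rows.put(r.key, r.vector);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Records arrive in an order that differs from their integer keys.
        List<Record> records = new ArrayList<>();
        records.add(new Record(2, new double[] {0.0, 3.0}));
        records.add(new Record(0, new double[] {1.0, 0.0}));
        records.add(new Record(1, new double[] {0.0, 2.0}));

        // Row ids in the order they were assigned by each strategy.
        System.out.println("ignoring keys: " + loadIgnoringKeys(records).keySet());
        System.out.println("using keys:    " + loadUsingKeys(records).keySet());
    }
}
```

With the first strategy, the document whose key is 2 ends up in row 0, so a document -> topic line for row 0 no longer refers to the document with id 0. With the second, the row id can be looked up in the index that rowid leaves behind to recover the original document name.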