Mahout, mail # user - LDA Questions


Re: LDA Questions
Gokhan Capan 2012-08-07, 18:01
Hi Jake,

Today I submitted the diff. It is available at
https://issues.apache.org/jira/browse/MAHOUT-1051

Thanks for the advice.

On Tue, Aug 7, 2012 at 1:06 AM, Jake Mannix <[EMAIL PROTECTED]> wrote:

> Sounds great Gokhan!
>
> On Mon, Aug 6, 2012 at 2:53 PM, Gokhan Capan <[EMAIL PROTECTED]> wrote:
>
> > Jake,
> >
> > I converted the ids to integers with rowid, and then modified
> > InMemoryCollapsedVariationalBayes0.loadVectors() so that it returns a
> > SparseMatrix (instead of a SparseRowMatrix) whose row ids are the keys
> > of the <IntWritable, VectorWritable> tf vectors. I am not sure if it works,
> > since the mapped integer ids (the results of rowid) are in the range
> > [0, #ofDocuments), but I believe it does.
> >
> > Constructing a SparseMatrix needs RandomAccessSparseVector row vectors,
> > and tf-vectors are sparse vectors, so I assumed that an incoming tf vector
> > itself, or its getDelegate() if it is a NamedVector, can be cast to
> > RandomAccessSparseVector.
> > I will submit the diff tomorrow, so you can review and commit.
> >
> > Thank you for your help.
> >
> >
> > On Mon, Aug 6, 2012 at 8:19 PM, Jake Mannix <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Gokhan,
> > >
> > >   This looks like a bug in the
> > > InMemoryCollapsedVariationalBayes0.loadVectors() method: it takes the
> > > SequenceFile<? extends Writable, VectorWritable> and ignores the keys,
> > > assigning the rows in order into an in-memory Matrix.
> > >
> > >   If you run
> > > "$MAHOUT_HOME/bin/mahout rowid -i <your tf-vector-path> -o <output path>"
> > > this converts Text keys into IntWritable keys (and leaves behind an index
> > > file, a mapping of Text -> IntWritable which tells you which int is
> > > assigned to which original text key).
> > >
> > >   Then what you'd want to do is modify
> > > InMemoryCollapsedVariationalBayes0.loadVectors() to actually use the keys
> > > which are given to it, instead of reassigning to sequential ids.  If you
> > > make this change, we'd love to have the diff - not too many people use
> > > the cvb0_local path (it's usually used for debugging and testing smaller
> > > data sets to see that topics are converging properly), but getting it to
> > > actually produce document -> topic outputs which correlate with original
> > > docIds would be very nice! :)
> > >
> > > On Mon, Aug 6, 2012 at 4:00 AM, Gokhan Capan <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi,
> > > >
> > > > My question is about interpreting lda document-topics output.
> > > >
> > > > I am using trunk.
> > > >
> > > > I have a directory of documents, each of which is named by an integer,
> > > > and the data directory has no sub-directories.
> > > > The directory structure is as follows:
> > > > $ ls /path/to/data/
> > > >    1
> > > >    2
> > > >    5
> > > >    ...
> > > >
> > > > From those documents I want to detect topics, and output:
> > > > - topic - top terms
> > > > - document - top topics
> > > >
> > > > To this end, I first run seqdirectory on the directory:
> > > > $ mahout seqdirectory -i $DIR_IN -o $SEQDIR -c UTF-8 -chunk 1
> > > >
> > > > Then I run seq2sparse to create tf vectors of documents:
> > > > $ mahout seq2sparse -i $SEQDIR -o $SPARSEDIR --weight TF --maxDFSigma 3 --namedVector
> > > >
> > > > After creating vectors, I run cvb0_local on those tf-vectors:
> > > > $ mahout cvb0_local -i $SPARSEDIR/tf-vectors -do $LDA_OUT/docs -to $LDA_OUT/words -top 20 -m 50 --dictionary $SPARSEDIR/dictionary.file-0
> > > >
> > > > And to interpret the results, I use mahout's vectordump utility:
> > > > $ mahout vectordump -i $LDA_OUT/docs -o $LDA_HR_OUT/docs --vectorSize 10 -sort true -p true
> > > >
> > > > $ mahout vectordump -i $LDA_OUT/words -o $LDA_HR_OUT/words --dictionary $SPARSEDIR/dictionary.file-0 --dictionaryType sequencefile --vectorSize 10 -sort true -p true
> > > >
> > > > The resulting words file consists of #ofTopics lines.

Gokhan
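
The loadVectors() behaviour discussed in this thread can be illustrated without Mahout. The following is only a minimal sketch in plain Java under stated assumptions: KeyedRowLoadingSketch, KeyedVector, loadIgnoringKeys and loadUsingKeys are hypothetical names invented for illustration (they are not Mahout API and not the MAHOUT-1051 patch), plain maps stand in for the SequenceFile, the in-memory matrix and the rowid index file, and the document names and numbers are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class KeyedRowLoadingSketch {

    /** Hypothetical stand-in for one <IntWritable, VectorWritable> record. */
    static class KeyedVector {
        final int key;
        final double[] vector;

        KeyedVector(int key, double[] vector) {
            this.key = key;
            this.vector = vector;
        }
    }

    /** The behaviour described as a bug: keys are ignored, rows are numbered in read order. */
    static Map<Integer, double[]> loadIgnoringKeys(List<KeyedVector> records) {
        Map<Integer, double[]> rows = new TreeMap<>();
        int nextRow = 0;
        for (KeyedVector record : records) {
            rows.put(nextRow++, record.vector);
        }
        return rows;
    }

    /** The suggested behaviour: each row is stored under the key it arrived with. */
    static Map<Integer, double[]> loadUsingKeys(List<KeyedVector> records) {
        Map<Integer, double[]> rows = new TreeMap<>();
        for (KeyedVector record : records) {
            rows.put(record.key, record.vector);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Records arrive in an order that differs from their integer keys.
        List<KeyedVector> records = new ArrayList<>();
        records.add(new KeyedVector(2, new double[] {0.1, 0.9}));
        records.add(new KeyedVector(0, new double[] {0.7, 0.3}));
        records.add(new KeyedVector(1, new double[] {0.5, 0.5}));

        // Made-up index in the spirit of the one rowid leaves behind: int id -> original key.
        Map<Integer, String> docIndex = new HashMap<>();
        docIndex.put(0, "doc-1");
        docIndex.put(1, "doc-2");
        docIndex.put(2, "doc-5");

        // With keyed loading, every row id can be translated back to its original document.
        for (Map.Entry<Integer, double[]> row : loadUsingKeys(records).entrySet()) {
            System.out.println(docIndex.get(row.getKey()) + " -> " + Arrays.toString(row.getValue()));
        }
    }
}
```

With loadIgnoringKeys, the first record read would land in row 0 even though its key is 2, so looking up row 0 in the index would name the wrong document. Keeping the key as the row id avoids that mismatch.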