Mahout, mail # user - Using Mahout to train an CVB and retrieve it's topics


Re: Using Mahout to train an CVB and retrieve it's topics
Folcon Red 2012-07-29, 19:35
Thanks Dan and Jake,

The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_cvb_document/part-m-00000 is:

Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Count: 0

I'm not certain what went wrong.
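
One possible reading of this output, offered as a sketch rather than a confirmed diagnosis: a Count of 0 together with the 97-byte size shown in the directory listing quoted further down is what an empty, uncompressed SequenceFile would look like, that is, a header with no records after it. The arithmetic below assumes the version 6 SequenceFile header layout (magic bytes, two length-prefixed class names, two compression flags, an empty metadata block, and a 16-byte sync marker). The function name is made up for illustration and is not part of Hadoop or Mahout.

```python
def empty_seqfile_header_size(key_class: str, value_class: str) -> int:
    """Bytes in an uncompressed SequenceFile header that has no records.

    Assumption: version 6 header layout; this is not read from a real file.
    """
    magic_and_version = 3 + 1  # b"SEQ" followed by one version byte
    # Each class name is stored as a 1-byte length (for names under 128 bytes)
    # followed by the UTF-8 bytes of the name.
    key_name = 1 + len(key_class.encode("utf-8"))
    value_name = 1 + len(value_class.encode("utf-8"))
    compression_flags = 1 + 1  # "compressed" and "block compressed" booleans
    empty_metadata = 4         # 4-byte entry count of zero
    sync_marker = 16
    return (magic_and_version + key_name + value_name
            + compression_flags + empty_metadata + sync_marker)


size = empty_seqfile_header_size(
    "org.apache.hadoop.io.IntWritable",
    "org.apache.mahout.math.VectorWritable",
)
print(size)  # 97
```

If that assumption holds, the file contains no document/topic vectors at all, so the thing to investigate would be why the job wrote none rather than how to dump the file.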

Kind Regards,
Folcon

On 29 July 2012 18:49, DAN HELM <[EMAIL PROTECTED]> wrote:

> Folcon,
>
> I'm still using Mahout 0.6 so don't know much about changes in 0.7.
>
> Your output folder for "dt" looks correct. The relevant data would be
> in /user/sgeadmin/text_cvb_document/part-m-00000, which is what I would
> pass to the "-s" option. But I see its size is only 97 bytes, which
> looks suspicious. For starters, you can just view the file with: mahout
> seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000. The
> vector dumper command (as Jake pointed out) has a lot more options to post-process
> the data, but you may want to first just see what is in that file.
>
> Dan
>
> *From:* Folcon Red <[EMAIL PROTECTED]>
> *To:* Jake Mannix <[EMAIL PROTECTED]>
> *Cc:* [EMAIL PROTECTED]; DAN HELM <[EMAIL PROTECTED]>
> *Sent:* Sunday, July 29, 2012 1:08 PM
> *Subject:* Re: Using Mahout to train an CVB and retrieve it's topics
>
> Hi Guys,
>
> Thanks for replying. The problem is that whenever I use the -s flag I get the
> error "Unexpected -s while processing Job-Specific Options:"
>
> Also, I'm not sure if this is supposed to be the output of -dt:
>
> sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
> Found 3 items
> -rw-r--r--   3 sgeadmin supergroup          0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
> drwxr-xr-x   - sgeadmin supergroup          0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
> -rw-r--r--   3 sgeadmin supergroup         97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000
>
> Should I be using a newer version of mahout? I've just been using the 0.7
> distribution so far as apparently the compiled versions are missing parts
> that the distributed ones have.
>
> Kind Regards,
> Folcon
>
> PS: Thanks for the help so far!
>
> On 29 July 2012 04:52, Jake Mannix <[EMAIL PROTECTED]> wrote:
>
>
>
> On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[EMAIL PROTECTED]> wrote:
>
> Hi Folcon,
>
> In the folder you specified for the -dt option for cvb command
> there should be sequence files with the document to topic associations
> (Key: IntWritable, Value: VectorWritable).
>
>
> Yeah, this is correct, although this:
>
>
> You can dump in text format as: mahout seqdumper -s <sequence file>
>
>
> is not as good as using vectordumper:
>
>    mahout vectordump -s <sequence file> --dictionary <path to dictionary.file-0> \
>        --dictionaryType seqfile --vectorSize <num entries per topic you want to see> -sort
>
> This joins your topic vectors with the dictionary, then picks out the top
> k terms (with their probabilities) for each topic and prints them to the
> console (or to the file you specify with an --output option).
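
The join-and-rank step described here can be illustrated with a small sketch. It is not Mahout's VectorDumper code; the function name, the dictionary, and the topic weights below are toy values invented for illustration, assuming a topic vector maps term ids to probabilities and the dictionary maps term ids to words.

```python
from typing import Dict, List, Tuple


def top_terms(topic_vector: Dict[int, float],
              dictionary: Dict[int, str],
              k: int) -> List[Tuple[str, float]]:
    """Return the k highest-weighted terms of one topic as (word, weight) pairs."""
    # Sort term ids by weight, largest first, then keep the first k.
    ranked = sorted(topic_vector.items(), key=lambda item: item[1], reverse=True)
    # Join each surviving term id with the dictionary to recover the word.
    return [(dictionary[term_id], weight) for term_id, weight in ranked[:k]]


# Hypothetical data: four dictionary entries and one topic's term weights.
dictionary = {0: "hadoop", 1: "topic", 2: "vector", 3: "cluster"}
topic = {0: 0.05, 1: 0.60, 2: 0.25, 3: 0.10}

print(top_terms(topic, dictionary, 2))  # [('topic', 0.6), ('vector', 0.25)]
```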
>
> *although* I notice now that in trunk when I just checked, VectorDumper.java
> had a bug in it for "vectorSize" - line 175 asks for cmdline option
> "numIndexesPerVector" not vectorSize, ack!  So I took the liberty of fixing
> that, but you'll need to "svn up" and rebuild your jar before using
> vectordump like this.
>
>
> So in the text output from seqdumper, the key is a document id and the
> vector contains the topics and their scores for that document.  I think
> all topics are listed for each document, but many with a near-zero score.
> In my case I used rowid to convert the keys of the original sparse
> document vectors from Text to Integer before running cvb. This
> generates a mapping file, so I know the textual keys that correspond to
> the numeric document ids (since my original document ids were file names
> and I created named vectors).
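
As a rough illustration of how the document/topic output and the rowid mapping fit together, the sketch below takes per-document topic scores keyed by numeric id, drops the near-zero topics, and relabels each document with its original textual key. The data, the threshold, and the function name are hypothetical; reading the actual sequence files is not shown.

```python
from typing import Dict, List, Tuple


def dominant_topics(doc_topics: Dict[int, List[float]],
                    doc_index: Dict[int, str],
                    threshold: float = 0.01) -> Dict[str, List[Tuple[int, float]]]:
    """Map each original document key to its (topic, score) pairs above threshold."""
    result = {}
    for doc_id, scores in doc_topics.items():
        # Keep only topics whose score is not near zero.
        kept = [(topic, score) for topic, score in enumerate(scores)
                if score >= threshold]
        # Strongest topic first.
        kept.sort(key=lambda pair: pair[1], reverse=True)
        # Translate the numeric id back to the textual key from the mapping.
        result[doc_index[doc_id]] = kept
    return result


# Hypothetical stand-ins for the cvb -dt output and the rowid mapping file.
doc_topics = {0: [0.001, 0.949, 0.05], 1: [0.7, 0.299, 0.001]}
doc_index = {0: "report.txt", 1: "notes.txt"}

for name, topics in dominant_topics(doc_topics, doc_index).items():
    print(name, topics)
```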