Mahout, mail # user - Using Mahout to train an CVB and retrieve it's topics


Re: Using Mahout to train an CVB and retrieve it's topics
DAN HELM 2012-07-29, 20:29
Yep, something went wrong, most likely with the clustering.  The part file is empty.  It should look something like this:
 
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Key: 0: Value: {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}
Key: 1: Value: {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}
Key: 2: Value: {0:0.018575587065216153,1:1.236602313900785E-5,2:8.654629660837919E-6,3:5.820637306957196E-6}
...
...
 
The Key is a document id, and the Value holds the topic id:weight pairs assigned to that document.
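
As a rough illustration of reading that format, here is a minimal Python sketch that parses record lines like the ones above and picks the highest-weight topic for each document. The names parse_dump_line and top_topic are made up for this example and are not part of Mahout, and the sketch assumes each record sits on a single line exactly as in the sample.

```python
import re

# Matches record lines such as: Key: 0: Value: {0:0.0647,1:0.0107}
# (assumed layout, taken from the sample dump above)
LINE_RE = re.compile(r"^Key: (\d+): Value: \{(.*)\}\s*$")


def parse_dump_line(line):
    """Return (key, {index: weight}) for one record line, or None if it is not a record."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    key = int(match.group(1))
    weights = {}
    body = match.group(2)
    if body:
        for pair in body.split(","):
            index, weight = pair.split(":")
            weights[int(index)] = float(weight)
    return key, weights


def top_topic(weights):
    """Return the index that carries the largest weight."""
    return max(weights, key=weights.get)


if __name__ == "__main__":
    sample = [
        "Key: 0: Value: {0:0.06475650422868284,1:0.010728747158503565,2:0.005463535698651016,3:0.023451709705466457}",
        "Key: 1: Value: {0:0.01838885430227092,1:0.05068404879399544,2:0.002110418548647133,3:0.005566514441743756}",
    ]
    for line in sample:
        record = parse_dump_line(line)
        if record is not None:
            doc_id, weights = record
            print(doc_id, top_topic(weights))
```

Header lines such as "Key class: ..." do not match the pattern and are skipped, so the whole dump can be fed through line by line.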
 
So you need to figure out where things went wrong.  I assume folder /user/sgeadmin/text_lda also has empty part files?  Assuming the part files are there, run seqdumper on one.  It should have data like the above, except in this case the key will be a topic id and the vector will hold term id:weight pairs.
 
You can also check folder /user/sgeadmin/text_vec/tf-vectors to make sure sparse vectors were generated for your input to cvb.
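
Each of these checks comes down to whether a dump contains any records. As a small sketch, assuming the seqdumper text output includes a "Count: N" line (as in the output quoted further down in this thread), the following Python snippet reads that count from captured output. The function names record_count and is_empty_dump are hypothetical helpers, not Mahout tools.

```python
import re


def record_count(dump_text):
    """Return the integer from the 'Count: N' line, or None if no such line exists."""
    match = re.search(r"^Count:\s*(\d+)\s*$", dump_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def is_empty_dump(dump_text):
    """True only when the dump explicitly reports zero records."""
    return record_count(dump_text) == 0


if __name__ == "__main__":
    # Text copied from the seqdumper output quoted in this thread.
    dump_text = (
        "Input Path: /user/sgeadmin/text_cvb_document/part-m-00000\n"
        "Key class: class org.apache.hadoop.io.IntWritable "
        "Value Class: class org.apache.mahout.math.VectorWritable\n"
        "Count: 0\n"
    )
    print(is_empty_dump(dump_text))
```

Running the same check on the dumps for the tf-vectors, the topic-term output and the document-topic output should show which stage first produced nothing.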
 
Dan
 

________________________________
 From: Folcon Red <[EMAIL PROTECTED]>
To: DAN HELM <[EMAIL PROTECTED]>
Cc: Jake Mannix <[EMAIL PROTECTED]>; "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
Sent: Sunday, July 29, 2012 3:35 PM
Subject: Re: Using Mahout to train an CVB and retrieve it's topics
  

Thanks Dan and Jake,

The output I got from $MAHOUT_HOME/bin/mahout seqdumper -i /user/sgeadmin/text_cvb_document/part-m-00000 is:

Input Path: /user/sgeadmin/text_cvb_document/part-m-00000
Key class: class org.apache.hadoop.io.IntWritable Value Class: class org.apache.mahout.math.VectorWritable
Count: 0

I'm not certain what went wrong.

Kind Regards,
Folcon

On 29 July 2012 18:49, DAN HELM <[EMAIL PROTECTED]> wrote:

>Folcon,

>I'm still using Mahout 0.6 so don't know much about changes in 0.7.

>Your output folder for "dt" looks correct.  The relevant data would be in /user/sgeadmin/text_cvb_document/part-m-00000, which is what I would pass to the "-s" option.  But I see its size is only 97, which looks suspicious.  So, for starters, you can just view the file with: mahout seqdumper -s /user/sgeadmin/text_cvb_document/part-m-00000.  The vector dumper command (as Jake pointed out) has a lot more options for post-processing the data, but you may want to first just see what is in that file.

>Dan
>
>
> From: Folcon Red <[EMAIL PROTECTED]>
>To: Jake Mannix <[EMAIL PROTECTED]>
>Cc: [EMAIL PROTECTED]; DAN HELM <[EMAIL PROTECTED]>
>Sent: Sunday, July 29, 2012 1:08 PM
>Subject: Re: Using Mahout to train an CVB and retrieve it's topics
>  
>
>
>Hi Guys,
>
>
>Thanks for replying. The problem is that whenever I use the -s flag I get the error "Unexpected -s while processing Job-Specific Options:"
>
>
>Also I'm not sure if this is supposed to be the output of -dt
>
>
>sgeadmin@master:~$ dumbo ls /user/sgeadmin/text_cvb_document -hadoop starcluster
>Found 3 items
>-rw-r--r--   3 sgeadmin supergroup          0 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/_SUCCESS
>drwxr-xr-x   - sgeadmin supergroup          0 2012-07-29 16:50 /user/sgeadmin/text_cvb_document/_logs
>-rw-r--r--   3 sgeadmin supergroup         97 2012-07-29 16:51 /user/sgeadmin/text_cvb_document/part-m-00000
>
>
>Should I be using a newer version of Mahout? I've just been using the 0.7 distribution so far, as apparently the compiled versions are missing parts that the distributed ones have.
>
>
>Kind Regards,
>Folcon
>
>
>PS: Thanks for the help so far!
>
>
>On 29 July 2012 04:52, Jake Mannix <[EMAIL PROTECTED]> wrote:
>
>
>>
>>
>>On Sat, Jul 28, 2012 at 6:40 PM, DAN HELM <[EMAIL PROTECTED]> wrote:
>>
>>>Hi Folcon,
>>> 
>>>In the folder you specified for the -dt option for cvb command
>>>there should be sequence files with the document to topic associations (Key:
>>>IntWritable, Value: VectorWritable). 
>>
>>
>>Yeah, this is correct, although this:
>>
>>
>>You can dump in text format as: mahout seqdumper -s <sequence file>