Re: Pointer to Reference Docs
Julian Ortega 2012-09-17, 09:32
The *seqdirectory* command takes every file in the specified directory and
makes a Hadoop Sequence File <http://wiki.apache.org/hadoop/SequenceFile>
out of it. Sequence Files have a key and a value; when you turn a list of
files into Sequence Files, the file name is the key and the file contents
are the value. However, this is quite impractical if your corpus is large,
as disk reading and writing can become painfully slow. You might want to
have a look at this discussion, which covers how to use the Sequence File
API to transform a key-value CSV file into sequence files.
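
To make the key/value mapping concrete, here is a minimal Python sketch of the idea only. It does not write real Hadoop Sequence Files and is not how Mahout implements seqdirectory; the function name directory_to_pairs, the charset parameter, and the choice of the relative path as the key are my own assumptions for illustration.

```python
from pathlib import Path


def directory_to_pairs(root, charset="utf-8"):
    """Yield (key, value) pairs: relative file name -> file contents.

    Hypothetical helper for illustration; it mimics the mapping described
    above, not the Sequence File on-disk format.
    """
    root = Path(root)
    for path in sorted(root.rglob("*")):
        if path.is_file():
            # Key: the file name relative to the input directory.
            key = path.relative_to(root).as_posix()
            # Value: the whole file read as text.
            yield key, path.read_text(encoding=charset)
```

Every file costs one open and one full read, which is why this per-file approach gets slow on a large corpus of small files.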
The *seq2sparse* Mahout shell command converts the text documents in
Sequence File format to vectors using either TF or TF-IDF
<http://en.wikipedia.org/wiki/Tf*idf> weighting, with n-gram support.
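
As a rough illustration of what TF and TF-IDF weighting with n-grams means, here is a small Python sketch. It uses whitespace tokenization and the textbook tf * log(N / df) formula, both of which are my assumptions; Mahout's own analyzer and weighting formula may differ, and the names ngrams and vectorize are hypothetical.

```python
import math
from collections import Counter


def ngrams(tokens, max_n):
    """Return all n-grams of size 1..max_n as space-joined strings."""
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def vectorize(docs, max_n=1, weighting="tfidf"):
    """Map {key: text} to {key: {term: weight}} sparse vectors."""
    # Raw term frequencies per document (naive lowercase + whitespace split).
    counts = {key: Counter(ngrams(text.lower().split(), max_n))
              for key, text in docs.items()}
    if weighting == "tf":
        return {key: dict(tf) for key, tf in counts.items()}
    # Document frequency: number of documents containing each term.
    df = Counter(term for tf in counts.values() for term in tf)
    n_docs = len(counts)
    # Textbook TF-IDF: tf * log(N / df); assumed formula, not Mahout's.
    return {key: {term: freq * math.log(n_docs / df[term])
                  for term, freq in tf.items()}
            for key, tf in counts.items()}
```

With this formula a term that appears in every document gets weight 0, which is the point of the IDF part: it discounts terms that do not help tell documents apart.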
I suggest looking at this quick
now, but I would strongly recommend reading Mahout in Action,
specifically chapter 8.
Hope this helps
On Mon, Sep 17, 2012 at 11:18 AM, David Scarlatti <[EMAIL PROTECTED]> wrote:
> Hi, I'd appreciate any hint on the best source of reference information...
> I've found different examples and quick guides, but if I want to know, e.g.,
> what seqdirectory or seq2sparse exactly does and what the different
> command line options are, with a detailed description, I can't find the place...
> Is this something still to do in Mahout? Should I look at the source code
> to know this?
> Thanks in advance.