Re: tokenizer for text
Baoqiang Cao 2012-05-18, 14:56
In addition, you could try raising the word-occurrence thresholds via the
-s (minimum support) and -md (minimum document frequency) options.
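For instance (paths and threshold values here are just an illustration),
something like

  bin/mahout seq2sparse -i text-seqfiles -o text-vectors -s 5 -md 3

would prune any term that occurs fewer than 5 times across the whole corpus
or appears in fewer than 3 documents, which directly cuts the vector
dimensionality.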
On Fri, May 18, 2012 at 9:41 AM, John Conwell <[EMAIL PROTECTED]> wrote:
> What do you have in mind as far as a different tokenizer? Are you doing
> stopword filtering? Maybe look at the stopword list and see if there are
> other noise words you want to add. If you are using Lucene to filter
> stopwords, its stopword list is pretty small (20 or so words). Stemming is
> another method often used to reduce your feature space. You could also look
> at lemmatization instead of stemming. It won't reduce the feature space as
> much, but it can help normalize different terms that share the same lemma.
> You can put together your own Lucene analyzer with whatever filter
> pipeline you want and plug it into SparseVectorsFromSequenceFiles to
> replace the stock tokenizer, as sketched below.
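> A minimal sketch of such an analyzer (the class name, filter chain, and
> Lucene Version constant are illustrative, not prescribed; this assumes
> the Lucene 3.x API that Mahout bundles):
>
>   import java.io.Reader;
>   import org.apache.lucene.analysis.Analyzer;
>   import org.apache.lucene.analysis.LowerCaseFilter;
>   import org.apache.lucene.analysis.PorterStemFilter;
>   import org.apache.lucene.analysis.StopAnalyzer;
>   import org.apache.lucene.analysis.StopFilter;
>   import org.apache.lucene.analysis.TokenStream;
>   import org.apache.lucene.analysis.standard.StandardTokenizer;
>   import org.apache.lucene.util.Version;
>
>   // Example analyzer: standard tokenization, lowercasing, stopword
>   // removal, then Porter stemming to collapse inflected forms.
>   public final class MyClusteringAnalyzer extends Analyzer {
>     @Override
>     public TokenStream tokenStream(String fieldName, Reader reader) {
>       TokenStream stream = new StandardTokenizer(Version.LUCENE_35, reader);
>       stream = new LowerCaseFilter(Version.LUCENE_35, stream);
>       // Swap in your own, larger stopword set here to drop more noise words.
>       stream = new StopFilter(Version.LUCENE_35, stream,
>           StopAnalyzer.ENGLISH_STOP_WORDS_SET);
>       return new PorterStemFilter(stream);
>     }
>   }
>
> Put the class on the job classpath and pass its fully qualified name to
> seq2sparse via --analyzerName (-a). Mahout instantiates the analyzer by
> reflection, so keep a public no-arg constructor.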
> On Fri, May 18, 2012 at 7:15 AM, Jiaan Zeng <[EMAIL PROTECTED]> wrote:
>> Hi List,
>> I am trying to use Mahout to cluster text. The problem is that after
>> running SparseVectorsFromSequenceFiles, the dimensionality of the
>> tf-idf vectors is too high (about 50K), and it grows as the number
>> of documents increases. I think there are two ways to handle that. One
>> is to use dimension reduction. The other is to use a better
>> tokenizer, which seems like the better option.
>> My questions are:
>> 1) How can I change the default tokenizer? Or where can I find a new one?
>> 2) Is there a third option for dealing with the number of dimensions?
>> Thanks a lot.
> John C