Jiaan Zeng 2012-05-18, 14:15
Hi List,
I am trying to use Mahout to cluster text. The problem is that after running SparseVectorsFromSequenceFiles, the dimension of the tf-idf vectors is too high (about 50K), and it increases as the number of documents increases. I think there are two ways to handle that. One is to use dimension reduction. The other is to use a better tokenizer, which should be the better option.
My questions are
1) How can I change the default tokenizer? Or where can I find a new one?
2) Is there a third option for me to deal with the number of dimensions?
Thanks a lot.
-- Regards, Jiaan
John Conwell 2012-05-18, 14:41
What do you have in mind as far as a different tokenizer? Are you doing stopword filtering? Maybe look at the stopword list and see if there are other noise words you wish to add. If you are using Lucene to filter stopwords, its stopword list is pretty small (20 or so words). Stemming is another method often used to reduce your feature space. You could look at lemmatization instead of stemming. It won't reduce the feature space as much, but could help in normalizing different terms with the same lemma.
You can put together your own Lucene analyzer with whatever Lucene filter pipeline you want and pass it to SparseVectorsFromSequenceFiles in order to replace the stock tokenizer.
--
Thanks, John C
Baoqiang Cao 2012-05-18, 14:56
In addition, you could try increasing the word occurrence thresholds with the -s and -md options.
Jiaan Zeng 2012-05-18, 15:09
Thanks for the quick reply.
Stop word filtering or stemming may not help much, I think. Also, the point of using tf-idf vectors is to deal with high-frequency words, so stop word filtering or stemming seems to run counter to the intention of tf-idf. The problem is that the text has a lot of noise (it is OCR text, so it has lots of OCR errors). Is there a tokenizer with a noise filter that I can plug in? Or where can I find a noise filter to deal with that?
-- Regards, Jiaan
John Conwell 2012-05-18, 15:37
Noise in OCR often manifests itself as a whole bunch of singletons in the corpus of meaningless terms like "lsdjfdslkfj". So the minFrequency flag can help in filtering out these terms.
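To illustrate the idea, here is a minimal Python sketch of frequency-based pruning. It is not Mahout code: the function name `prune_vocabulary` and the parameters `min_support` and `min_df` are made up for this example, and it assumes the thresholds mean "minimum total count in the corpus" and "minimum number of documents containing the term". Check the Mahout documentation for the exact semantics of the real flags.

```python
from collections import Counter

def prune_vocabulary(docs, min_support=2, min_df=1):
    """Return the terms that occur at least min_support times in the whole
    corpus and in at least min_df distinct documents (hypothetical helper)."""
    total = Counter()     # corpus-wide term counts
    doc_freq = Counter()  # number of documents containing each term
    for tokens in docs:
        total.update(tokens)
        doc_freq.update(set(tokens))
    return {t for t in total
            if total[t] >= min_support and doc_freq[t] >= min_df}

# Toy corpus with two OCR-style garbage tokens that occur only once.
docs = [
    ["the", "ship", "sailed", "lsdjfdslkfj"],
    ["the", "ship", "docked", "qxzvv"],
    ["the", "crew", "sailed", "home"],
]

vocab = prune_vocabulary(docs, min_support=2, min_df=2)
filtered = [[t for t in tokens if t in vocab] for tokens in docs]
print(sorted(vocab))
print(filtered)
```

Note that the same threshold also drops legitimate rare words ("docked", "crew", "home" here), so the cutoff is a trade-off between noise removal and vocabulary coverage.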
Stopwords should be handled by tfidf. For example, the word "the" probably has a high frequency in every document in the corpus, so it'll have a low tfidf score. But trimming out stopwords is still a good way to reduce your feature space, even if it's just to reduce the size of your dataset and speed up processing. This can be very helpful when you have a very large corpus.
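As a rough illustration of why a word like "the" ends up with a low score, the sketch below uses the textbook inverse document frequency, log(N / df). This is an assumption made for the example: Mahout and Lucene use their own variants of the formula, and the helper name `idf` is hypothetical.

```python
import math

def idf(term, docs):
    """Textbook inverse document frequency: log(N / df).
    Returns 0.0 for a term that appears in no document."""
    df = sum(1 for tokens in docs if term in tokens)
    if df == 0:
        return 0.0
    return math.log(len(docs) / df)

docs = [
    ["the", "ship", "sailed"],
    ["the", "ship", "docked"],
    ["the", "crew", "went", "home"],
]

print(idf("the", docs))     # appears in every document
print(idf("docked", docs))  # appears in one document only
```

With this formula a term that appears in every document gets an idf of exactly zero, so its tf-idf weight is zero no matter how often it occurs. It still occupies a dimension in the vector, which is why removing it up front still shrinks the feature space.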
Lemmatization and stemming can actually enhance the tfidf score of influential terms. For example, say a document used the following list of terms, each term twice: "jog, jogging, jogged, jogs, jogger". Here are 5 terms that will each be treated as distinct values in your vector space, each with a frequency of 2. The document seems to have a lot to do with the act of jogging, but since each term gets a tfidf score based on its own frequency of 2, these terms won't strongly influence the similarity function when clustering. Stemming/lemmatization will normalize these 5 terms down to one term, "jog", with a frequency of 10, which will have a higher tfidf score than any of the individual terms (as long as the corpus of documents isn't all about running). This does two things: it dramatically reduces your feature space, and it can increase the influence of key terms in a document, which will give you stronger clustering results around these key terms.
Thanks, John C
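The jogging example can be worked through numerically. The sketch below is only an illustration under stated assumptions: `TOY_STEMS` is a hand-written lookup table standing in for a real stemmer or lemmatizer, and the weight is the textbook tf * log(N / df), not the exact formula Mahout applies.

```python
import math
from collections import Counter

# Hypothetical lookup table standing in for a real stemmer or lemmatizer.
TOY_STEMS = {"jogging": "jog", "jogged": "jog", "jogs": "jog", "jogger": "jog"}

def stem(token):
    """Map a token to its toy stem, leaving unknown tokens unchanged."""
    return TOY_STEMS.get(token, token)

def tfidf(docs):
    """Return one {term: tf * log(N / df)} dict per document."""
    n = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(tokens).items()}
            for tokens in docs]

# One document uses each of the five terms twice; the others are unrelated.
jog_doc = ["jog", "jogging", "jogged", "jogs", "jogger"] * 2
docs = [jog_doc, ["stock", "market", "report"], ["weather", "forecast", "rain"]]

raw = tfidf(docs)
stemmed = tfidf([[stem(t) for t in tokens] for tokens in docs])

print(len(raw[0]), len(stemmed[0]))           # dimensions used by the document
print(raw[0]["jogging"], stemmed[0]["jog"])   # weight of one term before and after
```

In this toy corpus the five separate dimensions collapse into one, and the single "jog" weight is five times the weight of any individual form. If the other documents also mentioned jogging, df would rise and the weight would drop, which matches the caveat about a corpus that is all about running.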
Jiaan Zeng 2012-05-18, 16:37
Very helpful info! Thanks a lot.
Regards, Jiaan