Solr, mail # user - Chinese chars are not indexed ?


Re: Chinese chars are not indexed ?
Andy 2010-06-28, 14:43
What if Chinese is mixed with English?

I have text that is entered by users and it could be a mix of Chinese, English, etc.

What's the best way to handle that?

Thanks.

--- On Mon, 6/28/10, Ahmet Arslan <[EMAIL PROTECTED]> wrote:

> From: Ahmet Arslan <[EMAIL PROTECTED]>
> Subject: Re: Chinese chars are not indexed ?
> To: [EMAIL PROTECTED]
> Date: Monday, June 28, 2010, 3:44 AM
> > oh yes, *...* works. thanks.
> >
> > I saw tokenizer is defined in schema.xml. There are a few
> > places that define the tokenizer. Wondering if it is enough
> > to define one for:
>
> It is better to define a brand new field type specific to Chinese.
>
> http://wiki.apache.org/solr/LanguageAnalysis?highlight=%28CJKtokenizer%29#Chinese.2C_Japanese.2C_Korean
>
> Something like:
>
> at index time:
> <tokenizer class="solr.CJKTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
>
> at query time:
> <tokenizer class="solr.CJKTokenizerFactory"/>
> <filter class="solr.LowerCaseFilterFactory"/>
> <filter class="solr.PositionFilterFactory" />
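
For reference, a minimal sketch of how the two analyzer chains quoted above might be put together as one field type in schema.xml. The type name text_cjk and the field name content are hypothetical and not from this thread; positionIncrementGap="100" is only a commonly used value, so adjust it to your setup. As far as I know, CJKTokenizer emits overlapping two-character tokens for CJK text and keeps runs of Latin letters and digits as whole words, so a field like this should also cope with mixed Chinese and English input, but please verify that on your own data (the analysis page in the Solr admin UI is handy for this).

```xml
<!-- Goes inside <types> in schema.xml. "text_cjk" is a hypothetical name. -->
<fieldType name="text_cjk" class="solr.TextField" positionIncrementGap="100">
  <!-- Index time: CJK bigrams, then lowercase any Latin tokens -->
  <analyzer type="index">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- Query time: same chain, plus the PositionFilterFactory from the quoted suggestion -->
  <analyzer type="query">
    <tokenizer class="solr.CJKTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PositionFilterFactory"/>
  </analyzer>
</fieldType>

<!-- Goes inside <fields>. "content" is a hypothetical field using the type above. -->
<field name="content" type="text_cjk" indexed="true" stored="true"/>
```

After changing the field type, the affected documents need to be reindexed for the new tokenization to take effect.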