Solr, mail # user - Tokenizing Chinese & multi-language search


Andy 2011-03-16, 01:07
Otis Gospodnetic 2011-03-16, 03:51
Re: Tokenizing Chinese & multi-language search
Andy 2011-03-16, 05:04
Hi Otis,

It doesn't look like the last 2 options would work for me. So I guess my best bet is to ask the user to specify the language when they type in the query.

Once I get that information from the user, how do I dynamically pick an analyzer for the query string?

Thanks

Andy
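
[Editor's sketch] In Solr an analyzer is attached to a field type, so picking an analyzer for the query string largely comes down to picking which field to search. The following is a minimal sketch under that assumption: it maps a user-supplied language code to one of the per-language fields named in the quoted message below (text_en, text_zh, text_ja, text_fr) and builds request parameters using Solr's `q` and `df` (default field) parameters. The function name, the mapping, and the English fallback are hypothetical and only for illustration.

```python
from urllib.parse import urlencode

# Per-language fields as named in the original question.
# Each is assumed to be configured with a language-appropriate analyzer.
LANGUAGE_FIELDS = {
    "en": "text_en",
    "zh": "text_zh",
    "ja": "text_ja",
    "fr": "text_fr",
}


def build_query_params(user_query, language, fallback="en"):
    """Return Solr request parameters that search the field for `language`.

    Hypothetical helper: unknown language codes fall back to `fallback`.
    """
    field = LANGUAGE_FIELDS.get(language, LANGUAGE_FIELDS[fallback])
    # `df` sets the default search field, so the query text is analyzed
    # with that field's query-time analyzer.
    return {"q": user_query, "df": field}


params = build_query_params("搜索引擎", "zh")
query_string = urlencode(params)  # ready to append to a /select URL
```

Whether this fits depends on the schema: it only helps if each text_* field really does declare a suitable query analyzer.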

--- On Tue, 3/15/11, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:

> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> Subject: Re: Tokenizing Chinese & multi-language search
> To: [EMAIL PROTECTED]
> Date: Tuesday, March 15, 2011, 11:51 PM
> Hi Andy,
>
> Is the "I don't know what language the query is in" something you could
> change by...
> - asking the user
> - deriving from HTTP request headers
> - identifying the query language (if queries are long enough and "texty")
> - ...
>
> Otis
> ----
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
>
>
>
> ----- Original Message ----
> > From: Andy <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]
> > Sent: Tue, March 15, 2011 9:07:36 PM
> > Subject: Tokenizing Chinese & multi-language search
> >
> > Hi,
> >
> > I remember reading in this list a while ago that Solr will only tokenize on
> > whitespace even when using CJKAnalyzer. That would make Solr unusable on
> > Chinese or any other languages that don't use whitespace as separator.
> >
> > 1) I remember reading about a workaround. Unfortunately I can't find the post
> > that mentioned it. Could someone give me pointers on how to address this issue?
> >
> > 2) Let's say I have fixed this issue and have properly analyzed and indexed
> > the Chinese documents. My documents are in multiple languages. I plan to use
> > separate fields for documents in different languages: text_en, text_zh,
> > text_ja, text_fr, etc. Each field will be associated with the appropriate
> > analyzer.
> >
> > My problem now is how to deal with the query string. I don't know what
> > language the query is in, so I won't be able to select the appropriate analyzer
> > for the query string. If I just use the standard analyzer on the query string,
> > any query that's in Chinese won't be tokenized correctly. So would the whole
> > system still work in this case?
> >
> > This must be a pretty common use case, handling multi-language search. What is
> > the recommended way of dealing with this problem?
> >
> > Thanks.
> > Andy
> >
> >
> >       
> >
>
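
[Editor's sketch] The quoted suggestion to derive the language from HTTP request headers could look roughly like the following. It reads an `Accept-Language` header value, orders the listed language tags by their `q` weights, and returns the first primary language that has a matching field. The function name, the supported-language tuple, and the default are hypothetical. Note the assumption baked in here: the browser's preferred language is treated as a guess at the query language, which will not always hold.

```python
def preferred_language(accept_language,
                       supported=("en", "zh", "ja", "fr"),
                       default="en"):
    """Pick the best supported primary language from an Accept-Language value.

    Hypothetical helper; a simplified reading of the header, not a full
    implementation of HTTP content negotiation.
    """
    candidates = []
    for position, part in enumerate(accept_language.split(",")):
        piece = part.strip()
        if not piece:
            continue
        tag, _, params = piece.partition(";")
        quality = 1.0  # a tag without an explicit q value has weight 1
        params = params.strip()
        if params.startswith("q="):
            try:
                quality = float(params[2:])
            except ValueError:
                quality = 0.0
        if quality <= 0:
            continue  # q=0 marks a language as not acceptable
        primary = tag.strip().lower().split("-")[0]  # "zh-CN" -> "zh"
        # Sort by descending weight, then by order of appearance.
        candidates.append((-quality, position, primary))

    for _, _, primary in sorted(candidates):
        if primary in supported:
            return primary
    return default


language = preferred_language("zh-CN,zh;q=0.9,en;q=0.8")
```

The result could then be used to choose among the text_en, text_zh, text_ja, and text_fr fields, ideally with a way for the user to override the guess.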