Re: Tokenizing Chinese & multi-language search
Andy 2011-03-16, 05:04
It doesn't look like the last two options would work for me, so I guess my best bet is to ask the user to specify the language when they type in the query.
Once I get that information from the user, how do I dynamically pick an analyzer for the query string?
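In Solr the query-time analyzer is determined by the field type in schema.xml, so once the user supplies a language code you don't need to pick an analyzer in code — routing the query to the matching per-language field is enough. A minimal sketch (the language-to-field map and the fallback to text_en are assumptions, not anything from the thread):

```java
import java.util.Map;

public class LanguageFieldRouter {
    // Hypothetical mapping from a user-supplied language code to the
    // per-language field names mentioned in the thread; adjust to taste.
    private static final Map<String, String> FIELD_BY_LANG = Map.of(
        "en", "text_en", "zh", "text_zh", "ja", "text_ja", "fr", "text_fr");

    // Build a field-qualified query string. Solr then applies whatever
    // query analyzer schema.xml configures for that field's type, so the
    // Chinese field gets Chinese analysis automatically.
    static String routedQuery(String lang, String userQuery) {
        String field = FIELD_BY_LANG.getOrDefault(lang, "text_en"); // assumed fallback
        return field + ":(" + userQuery + ")";
    }
}
```

For example, routedQuery("zh", "中文搜索") yields text_zh:(中文搜索), and the analyzer configured for text_zh's field type does the tokenizing. With (e)dismax you could equally pass the chosen field via the qf parameter instead of qualifying the query string.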
--- On Tue, 3/15/11, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
> From: Otis Gospodnetic <[EMAIL PROTECTED]>
> Subject: Re: Tokenizing Chinese & multi-language search
> To: [EMAIL PROTECTED]
> Date: Tuesday, March 15, 2011, 11:51 PM
> Hi Andy,
> Is the "I don't know what language the query is in" something you could
> change by:
> - asking the user
> - deriving it from HTTP request headers
> - identifying the query language (if queries are long enough and "texty")
> - ...
> Sematext :: http://sematext.com/ :: Solr - Lucene - Nutch
> Lucene ecosystem search :: http://search-lucene.com/
> ----- Original Message ----
> > From: Andy <[EMAIL PROTECTED]>
> > To: [EMAIL PROTECTED]
> > Sent: Tue, March 15, 2011 9:07:36 PM
> > Subject: Tokenizing Chinese & multi-language search
> > Hi,
> > I remember reading in this list a while ago that Solr will only tokenize
> > on whitespace even when using CJKAnalyzer. That would make Solr unusable
> > on Chinese or any other language that doesn't use whitespace as a
> > separator.
> > 1) I remember reading about a workaround. Unfortunately I can't find the
> > post that mentioned it. Could someone give me pointers on how to address
> > this issue?
> > 2) Let's say I have fixed this issue and have properly analyzed and
> > indexed the Chinese documents. My documents are in multiple languages. I
> > plan to use separate fields for documents in different languages:
> > text_en, text_zh, text_ja, text_fr, etc. Each field will be associated
> > with the appropriate analyzer.
> > My problem now is how to deal with the query string. I don't know what
> > language the query is in, so I won't be able to select the appropriate
> > analyzer for the query string. If I just use the standard analyzer on
> > the query string, any query that's in Chinese won't be tokenized
> > correctly. Would the whole system still work in this case?
> > This must be a pretty common use case, handling multi-language search.
> > What is the recommended way of dealing with this?
> > Thanks.
> > Andy
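On the whitespace-tokenization issue in the quoted question: CJKAnalyzer does handle Han text, by emitting overlapping character bigrams rather than splitting on whitespace. A minimal plain-Java sketch of that bigramming strategy (no Lucene dependency; this models the idea, not the actual Lucene classes):

```java
import java.util.ArrayList;
import java.util.List;

public class CjkBigram {
    // Emit overlapping two-character tokens, the approach CJKAnalyzer
    // takes for CJK runs: "中文搜索" -> ["中文", "文搜", "搜索"].
    static List<String> bigrams(String text) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i + 1 < text.length(); i++) {
            out.add(text.substring(i, i + 2));
        }
        // A lone character has no bigram, so emit it as a unigram.
        if (out.isEmpty() && !text.isEmpty()) {
            out.add(text);
        }
        return out;
    }
}
```

Because every indexed bigram overlaps its neighbor, a query analyzed the same way matches any contiguous substring of the original text, which is why bigramming works without real word segmentation.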