Lucene, mail # user - Two questions on RussianAnalyzer

Re: Two questions on RussianAnalyzer
Robert Muir 2012-04-19, 20:37
On Thu, Apr 19, 2012 at 7:26 AM, Vladimir Gubarkov <[EMAIL PROTECTED]> wrote:
> New analyzer:
> [aaa.bbb.com, 8888, a, b, c, d'e, f, g, h, i, j, k, l_m, n, o, p, q,
> r, s, t, u, v, z, y, z]
> Old analyzer:
> [aaa, bbb, com, 8888, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p,
> q, r, s, t, u, v, z, y, z]
> Please note the differences.

Right, the tokenizer has changed. This is mentioned in the javadocs:

> The most uncomfortable in new behaviour to me is that in past I used
> to search by subdomain like
> bbb.com:8888
> and have displayed results with www.bbb.com:8888, aaa.bbb.com:8888 and
> so on. Now I have 0 results.

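Don't simply set your version parameter to 3.6 without reindexing.
This is really important! The version parameter exists so that
query-time analysis keeps matching the tokens already in your index;
changing it against an old index defeats the whole purpose.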
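
To make the reindexing point concrete, here is a toy sketch in plain Java. The two tokenize methods are hypothetical stand-ins that only mimic the old and new outputs quoted above; they are not Lucene's tokenizers, and the class and method names are made up for illustration. It shows that terms indexed one way stop matching when the query is analyzed the other way.

```java
import java.util.HashSet;
import java.util.Set;

// Toy stand-ins only: these mimic the two outputs quoted above,
// they are NOT Lucene's real tokenizers.
final class VersionMismatchDemo {

  // "Old" style: split on anything that is not a letter or digit.
  static Set<String> oldStyleTokens(String text) {
    return split(text, "[^\\p{L}\\p{N}]+");
  }

  // "New" style: split on whitespace and colons, keep dotted names whole.
  static Set<String> newStyleTokens(String text) {
    return split(text, "[\\s:]+");
  }

  private static Set<String> split(String text, String regex) {
    Set<String> tokens = new HashSet<String>();
    for (String t : text.toLowerCase().split(regex)) {
      if (!t.isEmpty()) {
        tokens.add(t);
      }
    }
    return tokens;
  }

  // A query matches only if every query token was indexed.
  static boolean matches(Set<String> indexed, Set<String> query) {
    return indexed.containsAll(query);
  }
}
```

With an index built from oldStyleTokens("aaa.bbb.com:8888"), a query analyzed by oldStyleTokens("bbb.com:8888") matches, while the same query analyzed by newStyleTokens does not, because the token "bbb.com" was never indexed.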

> My questions are: 1) is this change by design (not a mistake), and
> 2) is the only option to achieve the old behaviour to use
> Version.LUCENE_30 for creating the analyzer?

No, this analyzer is just an example. You can always easily make your
own analyzer (just extend ReusableAnalyzerBase)
with maybe 6 or 7 lines of code to combine whichever Tokenizer you want
(ClassicTokenizer, UAX29URLEmailTokenizer, StandardTokenizer, ...)
with any combination of filters (stemmers and so on).
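
A rough sketch of such an analyzer, assuming Lucene 3.6 core plus the contrib analyzers jar on the classpath. The class name MyRussianAnalyzer is made up, and the choice of ClassicTokenizer, LowerCaseFilter and RussianLightStemFilter is only one possible combination; the constructor signatures are as I recall them for 3.6, so check them against the javadocs for your version.

```java
import java.io.Reader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.ReusableAnalyzerBase;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.ru.RussianLightStemFilter;
import org.apache.lucene.analysis.standard.ClassicTokenizer;
import org.apache.lucene.util.Version;

// Hypothetical example class, not part of Lucene.
public final class MyRussianAnalyzer extends ReusableAnalyzerBase {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // ClassicTokenizer keeps the pre-3.1 StandardTokenizer behaviour.
    Tokenizer source = new ClassicTokenizer(Version.LUCENE_36, reader);
    TokenStream result = new LowerCaseFilter(Version.LUCENE_36, source);
    // One of the available Russian stemming options; swap in another if preferred.
    result = new RussianLightStemFilter(result);
    return new TokenStreamComponents(source, result);
  }
}
```

Use the same analyzer class for both indexing and query parsing, so that both sides produce the same terms.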

> The other problem with RussianAnalyzer is with the letter Yo
> http://en.wikipedia.org/wiki/Yo_(Cyrillic), which in Russian is often
> replaced by the letter Ye http://en.wikipedia.org/wiki/Ye_(Cyrillic), and
> such words are considered the same.
> What I want to achieve is that my search by word with yo also yield
> words with this letter replaced to ye (and vice-versa).
> What I'm currently doing is roughly next:
> // NOTE: I have to define my class in this package, because method
> russianAnalyzer.createComponents is protected
> package org.apache.lucene.analysis.ru;

Don't try so hard to extend existing analyzers. Just make your own.
They are just examples.

> public class YoCharFilter extends CharFilter {

CharFilter is mostly for cases where you need to change the text
before the tokenizer and correct offsets too, for highlighting and such.

But you don't; this is a simple 1-1 mapping that won't affect
tokenization. It's just a trivial normalization. I would use a
TokenFilter instead.
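
As a sketch of that normalization, here is the in-place character mapping in plain Java. YoNormalizer is a hypothetical helper, not a Lucene class. The assumption is that a TokenFilter's incrementToken() would call normalize(...) on the term buffer (for example CharTermAttribute's buffer() and length()) for each token. Since every character maps to exactly one character, term length and offsets stay unchanged.

```java
// Hypothetical helper, not part of Lucene: maps Cyrillic yo to ye in place.
final class YoNormalizer {

  // Rewrites the first `length` chars of `buffer`; chars past `length` are untouched.
  static void normalize(char[] buffer, int length) {
    for (int i = 0; i < length; i++) {
      if (buffer[i] == '\u0451') {        // small yo
        buffer[i] = '\u0435';             // small ye
      } else if (buffer[i] == '\u0401') { // capital Yo
        buffer[i] = '\u0415';             // capital Ye
      }
    }
  }

  // Convenience wrapper for whole strings.
  static String normalize(String term) {
    char[] buffer = term.toCharArray();
    normalize(buffer, buffer.length);
    return new String(buffer);
  }
}
```

Applying the same mapping at index time and at query time makes a search for either spelling find both.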

> Maybe it may have sense to add a configuration option to
> RussianAnalyzer itself (distinguish or not yo & ye)?

I don't think so. There are tons of choices here (for example, we
provide 2 stemming options for Russian, and more exist), and
even this normalization is complicated: for example, some people might
have documents where Russian ye is actually Latin 'e', and so on.
I've seen it, so I know it exists.

So we generally provide XYZ_Analyzer mostly as an example; it
would be a lot for us to add every possible use case as an option
to every possible Analyzer. Instead, just make your own!