-
Two questions on RussianAnalyzer
Vladimir Gubarkov 2012-04-19, 11:26
Hi,

Upon updating to Lucene 3.6 I've noticed that the new RussianAnalyzer does not analyze text the same way as before. Please see this example:

    private List<String> getTokens(Analyzer theAnalyzer, String str) throws IOException {
        final TokenStream tokenStream =
                theAnalyzer.tokenStream(MessageFields.BODY, new StringReader(str));
        tokenStream.reset();
        final CharTermAttribute termAttribute =
                tokenStream.getAttribute(CharTermAttribute.class);
        List<String> tokens = new LinkedList<String>();
        while (tokenStream.incrementToken()) {
            final String term = new String(termAttribute.buffer(), 0, termAttribute.length());
            tokens.add(term);
            // System.out.println(">>" + term);
        }
        return tokens;
    }

    @Test
    public void testDots() throws IOException {
        final String str = "aaa.bbb.com:8888 "
                + "a,b;c/d'e$f&g*h+i-j%k/l_m#n@o!p?q>r\"s~t(u`v|z}y\\z";
        System.out.println("New analyzer:");
        System.out.println(getTokens(new RussianAnalyzer(Version.LUCENE_36), str));
        System.out.println("Old analyzer:");
        System.out.println(getTokens(new RussianAnalyzer(Version.LUCENE_30), str));
    }

This shows:

    New analyzer:
    [aaa.bbb.com, 8888, a, b, c, d'e, f, g, h, i, j, k, l_m, n, o, p, q, r, s, t, u, v, z, y, z]
    Old analyzer:
    [aaa, bbb, com, 8888, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q, r, s, t, u, v, z, y, z]

Please note the differences.

The most uncomfortable part of the new behaviour for me is that in the past I could search by subdomain, like bbb.com:8888, and get results with www.bbb.com:8888, aaa.bbb.com:8888 and so on. Now I get 0 results.

My questions are: 1) is this change by design (not a mistake), and 2) is using Version.LUCENE_30 to create the analyzer the only way to get the old behaviour?

The other problem with RussianAnalyzer is with the letter Yo http://en.wikipedia.org/wiki/Yo_(Cyrillic), which in Russian is often replaced by the letter Ye http://en.wikipedia.org/wiki/Ye_(Cyrillic), and such words are considered the same. What I want to achieve is that a search for a word with yo also yields words with this letter replaced by ye (and vice versa).

What I'm currently doing is roughly the following:

    // NOTE: I have to define my class in this package, because the method
    // russianAnalyzer.createComponents is protected
    package org.apache.lucene.analysis.ru;

    public class RussianAnalyzerImproved extends ReusableAnalyzerBase {
        private RussianAnalyzer russianAnalyzer = new RussianAnalyzer(LuceneVersion.VERSION);

        @Override
        protected Reader initReader(Reader reader) {
            return new YoCharFilter(CharReader.get(reader));
        }

        @Override
        protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
            return russianAnalyzer.createComponents(fieldName, reader);
        }
    }

    public class YoCharFilter extends CharFilter {
        public YoCharFilter(CharStream in) {
            super(in);
        }

        @Override
        public int read(char[] cbuf, int off, int len) throws IOException {
            final int charsRead = super.read(cbuf, off, len);
            if (charsRead > 0) {
                final int end = off + charsRead;
                while (off < end) {
                    if (cbuf[off] == 'ё' || cbuf[off] == 'Ё')
                        cbuf[off] = 'е';
                    off++;
                }
            }
            return charsRead;
        }
    }

But I'm not sure this is the correct approach. What do you think? Maybe it would make sense to add a configuration option to RussianAnalyzer itself (whether or not to distinguish yo and ye)?

Sincerely yours,
Vladimir
---------------------------------------------------------------------
-
RE: Two questions on RussianAnalyzer
Uwe Schindler 2012-04-19, 15:57
> My questions are: 1) is this change by design (not a mistake), and
> 2) is using Version.LUCENE_30 to create the analyzer the only way to
> get the old behaviour?

This is why this option is there!
---------------------------------------------------------------------
-
Re: Two questions on RussianAnalyzer
Vladimir Gubarkov 2012-04-19, 20:15
On Thu, Apr 19, 2012 at 7:57 PM, Uwe Schindler <[EMAIL PROTECTED]> wrote:
>> My questions are: 1) is this change by design (not a mistake), and
>> 2) is using Version.LUCENE_30 to create the analyzer the only way to
>> get the old behaviour?
>
> This is why this option is there!

Right, and it's great, but this doesn't actually answer my questions =).
---------------------------------------------------------------------
-
Re: Two questions on RussianAnalyzer
Robert Muir 2012-04-19, 20:37
On Thu, Apr 19, 2012 at 7:26 AM, Vladimir Gubarkov <[EMAIL PROTECTED]> wrote:
> New analyzer:
> [aaa.bbb.com, 8888, a, b, c, d'e, f, g, h, i, j, k, l_m, n, o, p, q,
> r, s, t, u, v, z, y, z]
> Old analyzer:
> [aaa, bbb, com, 8888, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p,
> q, r, s, t, u, v, z, y, z]
>
> Please note the differences.

Right, the tokenizer has changed. This is mentioned in the javadocs:
http://lucene.apache.org/core/3_6_0/api/contrib-analyzers/org/apache/lucene/analysis/ru/RussianAnalyzer.html

> The most uncomfortable part of the new behaviour for me is that in the
> past I could search by subdomain, like
> bbb.com:8888,
> and get results with www.bbb.com:8888, aaa.bbb.com:8888 and
> so on. Now I get 0 results.

Don't simply set your version parameter to 3.6 without reindexing.
This is really important!!!!!!!!!!!
Otherwise it defeats the whole purpose.

> My questions are: 1) is this change by design (not a mistake), and
> 2) is using Version.LUCENE_30 to create the analyzer the only way to
> get the old behaviour?

No, this analyzer is just an example. You can always easily make your own analyzer (just extend ReusableAnalyzerBase) with maybe 6 or 7 lines of code to combine whatever Tokenizer (ClassicTokenizer, UAX29URLEmail, StandardTokenizer, whatever), along with any combination of filters (such as stemmers, whatever) that you want.

> The other problem with RussianAnalyzer is with the letter Yo
> http://en.wikipedia.org/wiki/Yo_(Cyrillic), which in Russian is often
> replaced by the letter Ye http://en.wikipedia.org/wiki/Ye_(Cyrillic), and
> such words are considered the same.
> What I want to achieve is that a search for a word with yo also yields
> words with this letter replaced by ye (and vice versa).
>
> What I'm currently doing is roughly the following:
>
> // NOTE: I have to define my class in this package, because the method
> // russianAnalyzer.createComponents is protected
> package org.apache.lucene.analysis.ru;

Don't try so hard to extend existing analyzers. Just make your own. They are just examples.

> public class YoCharFilter extends CharFilter {

CharFilter is really mostly for the case where you also need to correct offsets, for highlighting and such, before the tokenizer. But you don't; this is a simple 1-1 mapping that won't affect tokenization. It's just a trivial normalization. I would use a TokenFilter instead.

> Maybe it would make sense to add a configuration option to
> RussianAnalyzer itself (whether or not to distinguish yo and ye)?

I don't think so. There are tons of choices here (for example we provide 2 stemming options for Russian, and more exist), and even this normalization is complicated: for example, some people might have documents where Russian ye is actually Latin 'e' and so on. I've seen it, so I know it exists.

So we generally provide XYZ_Analyzer mostly as an example; it would be a lot for us to add every possible use case as an option to every possible Analyzer. Instead just make your own!

--
lucidimagination.com
---------------------------------------------------------------------
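
For illustration, here is a minimal sketch of the 1-to-1 yo-to-ye folding discussed in this message, written as plain Java so that it does not depend on any Lucene class. The names YoFolding and fold are hypothetical and are not part of Lucene; the assumption is that a token filter would run the same loop over the term buffer exposed by CharTermAttribute (buffer() and length()) after each incrementToken(). Unlike the CharFilter quoted above, this sketch keeps the case of the letter and leaves lowercasing to a later step.

```java
/** Hypothetical helper: folds Cyrillic yo to ye in a term buffer, in place. */
final class YoFolding {
    private YoFolding() {}

    /**
     * Replaces yo (U+0451, U+0401) with ye (U+0435, U+0415) in
     * buffer[0..length) and returns the same buffer. The mapping is 1-to-1,
     * so the term length and all character offsets stay unchanged.
     */
    static char[] fold(char[] buffer, int length) {
        for (int i = 0; i < length; i++) {
            if (buffer[i] == '\u0451') {          // lowercase yo
                buffer[i] = '\u0435';             // lowercase ye
            } else if (buffer[i] == '\u0401') {   // uppercase yo
                buffer[i] = '\u0415';             // uppercase ye
            }
        }
        return buffer;
    }

    /** Convenience overload for whole strings. */
    static String fold(String term) {
        char[] chars = term.toCharArray();
        return new String(fold(chars, chars.length));
    }

    public static void main(String[] args) {
        // A word spelled with yo and the same word spelled with ye
        // fold to the same term.
        String withYo = "\u0451\u043b\u043a\u0430";
        String withYe = "\u0435\u043b\u043a\u0430";
        System.out.println(fold(withYo).equals(fold(withYe)));
    }
}
```

Because the same folding would be applied at index time and at query time, both spellings end up as one term, which is the behaviour asked for in the original question.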
-
RE: Two questions on RussianAnalyzer
Steven A Rowe 2012-04-19, 20:43
Hi Vladimir,

> The most uncomfortable part of the new behaviour for me is that in the
> past I could search by subdomain, like bbb.com:8888, and get results
> with www.bbb.com:8888, aaa.bbb.com:8888 and so on. Now I get 0
> results.

About domain names, see my response to a similar question today on the Solr users list: <http://markmail.org/message/3ddxwc7dunblthyt>.

Steve
-
Re: Two questions on RussianAnalyzer
Vladimir Gubarkov 2012-04-19, 20:51
Thank you, Robert, for the detailed reply.

On Fri, Apr 20, 2012 at 12:37 AM, Robert Muir <[EMAIL PROTECTED]> wrote:
> On Thu, Apr 19, 2012 at 7:26 AM, Vladimir Gubarkov <[EMAIL PROTECTED]> wrote:
>> New analyzer:
>> [aaa.bbb.com, 8888, a, b, c, d'e, f, g, h, i, j, k, l_m, n, o, p, q,
>> r, s, t, u, v, z, y, z]
>> Old analyzer:
>> [aaa, bbb, com, 8888, a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p,
>> q, r, s, t, u, v, z, y, z]
>>
>> Please note the differences.
>
> Right, the tokenizer has changed. This is mentioned in the javadocs:
> http://lucene.apache.org/core/3_6_0/api/contrib-analyzers/org/apache/lucene/analysis/ru/RussianAnalyzer.html
>
>> The most uncomfortable part of the new behaviour for me is that in the
>> past I could search by subdomain, like
>> bbb.com:8888,
>> and get results with www.bbb.com:8888, aaa.bbb.com:8888 and
>> so on. Now I get 0 results.
>
> Don't simply set your version parameter to 3.6 without reindexing.
> This is really important!!!!!!!!!!!
> Otherwise it defeats the whole purpose.

Hmmm... I know this and I reindexed!
I'll try to explain the problem (fortunately, already solved by using LUCENE_30) once again:
When indexing with the new analyzer, the whole lexeme "some.cool.site.com" goes into the index, not the 4 lexemes "some", "cool", "site", "com".
So it's now impossible to find this document with the query "site.com".
I have an RSS subscription for that search, and now it's broken.
---------------------------------------------------------------------
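
The mismatch described in this message can be shown with a toy model in plain Java. This is only a sketch under stated assumptions: splitOnDots and keepWhole are hypothetical stand-ins for the old and new tokenization of a hostname, not Lucene's tokenizers, and matches approximates a phrase query by requiring the query terms to appear consecutively among the document terms.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Toy model of hostname tokenization; not Lucene code. */
final class HostTokenDemo {
    private HostTokenDemo() {}

    /** Stand-in for the older behaviour: one term per dot-separated label. */
    static List<String> splitOnDots(String text) {
        return Arrays.asList(text.split("\\."));
    }

    /** Stand-in for the newer behaviour: the whole hostname is one term. */
    static List<String> keepWhole(String text) {
        return Collections.singletonList(text);
    }

    /** True if the query terms occur consecutively in the document terms. */
    static boolean matches(List<String> docTerms, List<String> queryTerms) {
        return Collections.indexOfSubList(docTerms, queryTerms) >= 0;
    }

    public static void main(String[] args) {
        String doc = "some.cool.site.com";
        String query = "site.com";
        // Old-style terms: the query terms are a run inside the document terms.
        System.out.println(matches(splitOnDots(doc), splitOnDots(query)));
        // New-style terms: the single query term differs from the single document term.
        System.out.println(matches(keepWhole(doc), keepWhole(query)));
    }
}
```

In this model the first check succeeds and the second fails, even though the document and the query are analyzed the same way in both cases, which is why reindexing alone does not bring the old results back.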
-
Re: Two questions on RussianAnalyzer
Robert Muir 2012-04-19, 20:58
On Thu, Apr 19, 2012 at 4:51 PM, Vladimir Gubarkov <[EMAIL PROTECTED]> wrote:
> Hmmm... I know this and I reindexed!
> I'll try to explain the problem (fortunately, already solved by using
> LUCENE_30) once again:
> When indexing with the new analyzer, the whole lexeme "some.cool.site.com"
> goes into the index, not the 4 lexemes "some", "cool", "site", "com".
> So it's now impossible to find this document with the query "site.com".
> I have an RSS subscription for that search, and now it's broken.

You are right: to search for prefixes of that (assuming it's a URL, which it may or may not be, depending on the domain and use case), you need something else. So I think Steven Rowe's advice is best here.

--
lucidimagination.com
---------------------------------------------------------------------
-
Re: Two questions on RussianAnalyzer
Robert Muir 2012-04-19, 21:01
On Thu, Apr 19, 2012 at 4:51 PM, Vladimir Gubarkov <[EMAIL PROTECTED]> wrote:
> So it's now impossible to find this document with the query "site.com".
> I have an RSS subscription for that search, and now it's broken.

Just to point out, it's not impossible. As I suggested before, if you were happy with the old tokenizer and you don't like passing LUCENE_30, then just make your own analyzer, using ClassicTokenizer + <your list of filters> instead.
-- lucidimagination.com
---------------------------------------------------------------------
-
Re: Two questions on RussianAnalyzer
Vladimir Gubarkov 2012-04-19, 21:30
Thank you, Steven, I'll look into this.

On Fri, Apr 20, 2012 at 12:43 AM, Steven A Rowe <[EMAIL PROTECTED]> wrote:
> Hi Vladimir,
>
>> The most uncomfortable part of the new behaviour for me is that in the
>> past I could search by subdomain, like bbb.com:8888, and get results
>> with www.bbb.com:8888, aaa.bbb.com:8888 and so on. Now I get 0
>> results.
>
> About domain names, see my response to a similar question today on the Solr users list: <http://markmail.org/message/3ddxwc7dunblthyt>.
>
> Steve
---------------------------------------------------------------------