On Tue, Mar 31, 2015 at 5:32 AM, Bruno Albuquerque <[EMAIL PROTECTED]> wrote:
There's no canonical form of "language support" in Lucy. There are only
Analyzers which happen to be tuned for content in a specific language.
What Analyzers do is tokenize and normalize content. You start with a Unicode
text string. Let's say it's the following:
Eats, Shoots and Leaves.
If you perform no analysis, the only search which will match that field is
the exact term query `Eats, Shoots and Leaves.` -- because there's only one
entry in the term dictionary and that's it.
# Tokens produced by analysis chain and stored in index:
['Eats, Shoots and Leaves.']
If you use an Analyzer which only splits on whitespace, you will be able
to search for individual terms, but your searches will be case-sensitive and
punctuation will get in the way. For example, a search for `Leaves` will fail
but a search for `Leaves.` will succeed.
['Eats,', 'Shoots', 'and', 'Leaves.']
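In Python pseudocode (illustrating the behavior, not Lucy's actual API), whitespace-only tokenization is just:

```python
text = "Eats, Shoots and Leaves."

# Split on runs of whitespace; punctuation stays glued to the tokens,
# which is why `Leaves` won't match but `Leaves.` will.
tokens = text.split()
```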
If you use an Analyzer which splits on whitespace and is intelligent about
removing punctuation, that problem is solved.
['Eats', 'Shoots', 'and', 'Leaves']
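A quick sketch of that step (again Python pseudocode, not Lucy's API) is to tokenize on word characters instead of splitting on whitespace, so punctuation never makes it into the token stream:

```python
import re

text = "Eats, Shoots and Leaves."

# Keep only runs of word characters; commas and periods are discarded.
tokens = re.findall(r"\w+", text)
```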
If you add case folding to the analysis chain, then searches for both `leaves`
and `Leaves` will succeed.
['eats', 'shoots', 'and', 'leaves']
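Sketched the same way, case folding is one more stage bolted onto the chain:

```python
import re

text = "Eats, Shoots and Leaves."

# Tokenize, then lowercase each token so `leaves` and `Leaves`
# normalize to the same term.
tokens = [t.lower() for t in re.findall(r"\w+", text)]
```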
(Note that no matter which Analyzer you use, the same transform must be
applied at search time in order to match.)
If you add an English Snowball stemmer, then searches for both `leaves` and
`leave` will match (though not `leaf`, which stems to `leaf` under Snowball).
['eat', 'shoot', 'and', 'leave']
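As a toy stand-in for the stemming stage (the real Snowball algorithm is far more sophisticated; stripping a trailing "s" merely happens to reproduce its output for this sentence):

```python
import re

def toy_stem(token):
    # Hypothetical simplification of an English Snowball stemmer,
    # purely for illustration.
    return token[:-1] if token.endswith("s") else token

text = "Eats, Shoots and Leaves."
tokens = [toy_stem(t.lower()) for t in re.findall(r"\w+", text)]
```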
So... to implement "language support" for another language, you need to create
an Analyzer which implements a Transform() method which applies
tokenization and normalization appropriate for that language.
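The whole chain can be sketched as a single transform method. This is Python pseudocode of the concept only; the class name and method are illustrative, and the real work happens against Lucy's Analyzer interface in your binding language:

```python
import re

class SketchAnalyzer:
    """Illustrative analysis chain, NOT Lucy's actual Analyzer API:
    tokenize on word characters, drop punctuation, case-fold."""

    def transform(self, text):
        # Apply every stage here; the same method must run on both
        # documents at index time and queries at search time.
        return [t.lower() for t in re.findall(r"\w+", text)]

analyzer = SketchAnalyzer()
tokens = analyzer.transform("Eats, Shoots and Leaves.")
```

A language-specific analyzer would swap in tokenization rules, stopword handling, and a stemmer suited to that language.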
Does that make sense?