Lucene.Net, mail # user - Spanish analyzer in ravendb


Re: Spanish analyzer in ravendb
vicente garcia 2012-06-15, 07:06
Thanks a lot, Simon! Maybe I could port a Spanish Lucene analyzer to
Lucene.Net...

Thanks :)

On Thu, Jun 14, 2012 at 7:44 PM, Simon Svensson <[EMAIL PROTECTED]> wrote:
> It's easy to write analyzers: you basically chain together a few
> TokenFilters and call it a day. To back up that statement, here is an
> example Spanish analyzer written by someone who basically threw his complete
> Spanish vocabulary into the stop word list. DictionaryLoader is a class
> that loads your Hunspell dictionaries (.aff and .dic files) from your
> storage (filesystem, embedded resources, etc.). There is some further
> development that can be done, like overriding ReusableTokenStream and
> verifying that the filters are in the correct order.
>
> using System;
> using System.Collections;
> using System.IO;
> using Lucene.Net.Analysis;
> using Lucene.Net.Analysis.Hunspell;
> using Lucene.Net.Analysis.Standard;
> using Version = Lucene.Net.Util.Version;
>
> public class SpanishHunspellAnalyzer : Analyzer {
>    private static readonly HunspellDictionary Dictionary = DictionaryLoader.Load(@"es_ES");
>    private static readonly Hashtable Stopwords = new Hashtable {
>        { "Me", null }, { "no", null }, { "habla", null }, { "español", null }
>    };
>
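>    public override TokenStream TokenStream(String fieldName, TextReader reader) {
>        // Split the input text into word tokens.
>        var stream = new StandardTokenizer(Version.LUCENE_29, reader);
>
>        // Lowercase, stem with the Hunspell dictionary, then drop stop words.
>        // Note: the stop list is matched against stems because StopFilter runs last.
>        TokenFilter filter = new LowerCaseFilter(stream);
>        filter = new HunspellStemFilter(filter, Dictionary);
>        filter = new StopFilter(true, filter, Stopwords, true);
>        return filter;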
>    }
> }
>
> // Simon
>
>
> On 2012-06-14 18:44, vicente garcia wrote:
>>
>> Thank you, Simon. You can specify a
>> "Raven.Database.Indexing.Collation.Cultures.EsCollationAnalyzer,
>> Raven.Database", but you can't perform full-text search queries because
>> this index doesn't tokenize the content.
>>
>> http://ravendb.net/docs/client-api/querying/static-indexes/customizing-results-order
>>
>> I saw that there is no SpanishAnalyzer; we only have a
>> SpanishStemmer, but I don't need a stemmer. I need a Spanish analyzer
>> with its stop words, etc.
>>
>> Does anyone have another idea on how to index Spanish content?
>>
>> Thank you very much :)
>>
>> On Thu, Jun 14, 2012 at 4:59 PM, Simon Svensson<[EMAIL PROTECTED]>  wrote:
>>>
>>> Welcome,
>>>
>>> See Configuring index options[1] to specify a custom analyzer that can
>>> handle Spanish content.
>>>
>>> A quick check shows that Contrib.Analyzers does not contain a Spanish
>>> analyzer. There is a SpanishStemmer available in the Snowball contrib.
>>> You could also use a Spanish Hunspell dictionary for stemming[2].
>>>
>>> // Simon
>>>
>>> [1] http://ravendb.net/docs/client-api/querying/static-indexes/configuring-index-options
>>> [2] https://github.com/sisve/Lucene.Net.Analysis.Hunspell
>>>
>>>
>>> On 2012-06-14 16:49, vicente garcia wrote:
>>>>
>>>> Hi all, this is my first mail to this list :)
>>>>
>>>> I'd like to index Spanish content in RavenDB. I have been searching a
>>>> lot, but I don't know how to do it.
>>>>
>>>> Could someone help me, please?
>>>>
>>>> Thanks :)
>>>>
>>
>>
>
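
A minimal sketch of the ReusableTokenStream override that Simon mentions as further development. It assumes the Lucene.Net 2.9-era Analyzer API (GetPreviousTokenStream / SetPreviousTokenStream; later releases expose this as a PreviousTokenStream property instead). The class name, the SavedStreams holder, and passing the dictionary and stop words through the constructor are illustrative choices, not part of Simon's code. It also assumes the Hunspell filter holds no state that has to be cleared between documents.

```csharp
using System.Collections;
using System.IO;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Hunspell;
using Lucene.Net.Analysis.Standard;
using Version = Lucene.Net.Util.Version;

// Hypothetical variant of the analyzer above that caches its filter chain.
public class ReusableSpanishHunspellAnalyzer : Analyzer {
    private readonly HunspellDictionary dictionary;
    private readonly Hashtable stopwords;

    // Keeps the tokenizer (so it can be pointed at new input)
    // together with the end of the filter chain built on top of it.
    private sealed class SavedStreams {
        public StandardTokenizer Source;
        public TokenStream Result;
    }

    public ReusableSpanishHunspellAnalyzer(HunspellDictionary dictionary, Hashtable stopwords) {
        this.dictionary = dictionary;
        this.stopwords = stopwords;
    }

    // Same filter order as in Simon's example: lowercase, stem, stop.
    private TokenStream BuildChain(TokenStream source) {
        TokenStream result = new LowerCaseFilter(source);
        result = new HunspellStemFilter(result, dictionary);
        result = new StopFilter(true, result, stopwords, true);
        return result;
    }

    public override TokenStream TokenStream(string fieldName, TextReader reader) {
        return BuildChain(new StandardTokenizer(Version.LUCENE_29, reader));
    }

    public override TokenStream ReusableTokenStream(string fieldName, TextReader reader) {
        var streams = (SavedStreams) GetPreviousTokenStream();
        if (streams == null) {
            // First call: build the chain once and remember it.
            streams = new SavedStreams();
            streams.Source = new StandardTokenizer(Version.LUCENE_29, reader);
            streams.Result = BuildChain(streams.Source);
            SetPreviousTokenStream(streams);
        } else {
            // Later calls: reuse the chain and only swap the input reader.
            streams.Source.Reset(reader);
        }
        return streams.Result;
    }
}
```

The idea is only to avoid rebuilding the tokenizer and filters for every field; whether it pays off depends on how many documents are indexed.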

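To illustrate why Simon suggests verifying the filter order, here is a self-contained toy in plain C# with no Lucene.Net types. ToyStem is a made-up stand-in for the Hunspell stem filter, and the two-word stop list is invented for the demonstration. When stemming runs before stop-word removal, the stop list is compared against stems, so an entry such as "los" may no longer match.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FilterOrderDemo {
    // Invented stop list, for demonstration only.
    private static readonly HashSet<string> Stopwords =
        new HashSet<string> { "los", "las" };

    // Hypothetical toy stemmer: strips one trailing "s" from tokens longer than two characters.
    private static string ToyStem(string token) {
        return token.Length > 2 && token.EndsWith("s", StringComparison.Ordinal)
            ? token.Substring(0, token.Length - 1)
            : token;
    }

    // Same order as the analyzer in the thread: lowercase, stem, then stop.
    public static List<string> StemThenStop(IEnumerable<string> tokens) {
        return tokens.Select(t => t.ToLowerInvariant())
                     .Select(t => ToyStem(t))
                     .Where(t => !Stopwords.Contains(t))
                     .ToList();
    }

    // Alternative order: lowercase, stop, then stem.
    public static List<string> StopThenStem(IEnumerable<string> tokens) {
        return tokens.Select(t => t.ToLowerInvariant())
                     .Where(t => !Stopwords.Contains(t))
                     .Select(t => ToyStem(t))
                     .ToList();
    }
}
```

With the input "Los", "libros", StemThenStop keeps both "lo" and "libro" because the stem "lo" is not in the stop list, while StopThenStem returns only "libro". With a real Hunspell dictionary the effect depends on which stems it produces, so it is worth checking against actual data.
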
--
LinkedIn profile: http://www.linkedin.com/in/vicentegarcia

Twitter: http://twitter.com/clrstack

Blog: http://geeks.ms/blogs/vgarcia