Re: Spanish analyzer in ravendb
vicente garcia 2012-06-15, 07:06
Thanks a lot, Simon! Maybe I could port a Spanish Lucene analyzer to
Lucene.Net... Thanks :)

On Thu, Jun 14, 2012 at 7:44 PM, Simon Svensson <[EMAIL PROTECTED]> wrote:
> It's easy to write analyzers, you basically chain together a few
> TokenFilters and call it a day. And to back up that statement I provide an
> example Spanish analyzer written by someone who basically threw his complete
> Spanish vocabulary into the stop word list. DictionaryLoader is a class
> which loads your hunspell dictionaries (.aff and .dic files) from your
> storage (filesystem, embedded resources, etc). There is some further
> development that can be done, like overriding/implementing
> ReusableTokenStream and verifying that the filters are in the correct order.
>
> using System;
> using System.Collections;
> using System.IO;
> using Lucene.Net.Analysis;
> using Lucene.Net.Analysis.Hunspell;
> using Lucene.Net.Analysis.Standard;
> using Version = Lucene.Net.Util.Version;
>
> public class SpanishHunspellAnalyzer : Analyzer {
>     private static readonly HunspellDictionary Dictionary =
>         DictionaryLoader.Load(@"es_ES");
>     private static readonly Hashtable Stopwords = new Hashtable {
>         { "Me", null }, { "no", null }, { "habla", null }, { "español", null }
>     };
>
>     public override TokenStream TokenStream(String fieldName, TextReader reader) {
>         var stream = new StandardTokenizer(Version.LUCENE_29, reader);
>
>         TokenFilter filter = new LowerCaseFilter(stream);
>         filter = new HunspellStemFilter(filter, Dictionary);
>         filter = new StopFilter(true, filter, Stopwords, true);
>         return filter;
>     }
> }
>
> // Simon
>
>
> On 2012-06-14 18:44, vicente garcia wrote:
>>
>> Thank you Simon, you can specify a
>> "Raven.Database.Indexing.Collation.Cultures.EsCollationAnalyzer,
>> Raven.Database" but you can't perform full text search queries because
>> this index doesn't tokenize the content.
>>
>> http://ravendb.net/docs/client-api/querying/static-indexes/customizing-results-order
>>
>> I saw that there is not a SpanishAnalyzer, we only have a
>> SpanishStemmer, but I don't need a stemmer, I need a Spanish analyzer
>> with its stop words, etc.
>>
>> Does anyone have another idea on how to index Spanish content?
>>
>> Thank you very much :)
>>
>> On Thu, Jun 14, 2012 at 4:59 PM, Simon Svensson<[EMAIL PROTECTED]> wrote:
>>>
>>> Welcome,
>>>
>>> See Configuring index options[1] to specify a custom analyzer that can
>>> handle Spanish content.
>>>
>>> A quick check shows that Contrib.Analyzers does not contain a Spanish
>>> analyzer. There is a SpanishStemmer available in the Snowball contrib.
>>> You could also use a Spanish hunspell dictionary for stemming[2].
>>>
>>> // Simon
>>>
>>> [1] http://ravendb.net/docs/client-api/querying/static-indexes/configuring-index-options
>>> [2] https://github.com/sisve/Lucene.Net.Analysis.Hunspell
>>>
>>>
>>> On 2012-06-14 16:49, vicente garcia wrote:
>>>>
>>>> Hi to all, this is my first mail to this list :)
>>>>
>>>> I'd like to index Spanish content in RavenDB. I have been searching a
>>>> lot, but I don't know how I can do it.
>>>>
>>>> Could someone help me, please?
>>>>
>>>> Thanks :)
>>>>
>>
>>
>

--
LinkedIn profile: http://www.linkedin.com/in/vicentegarcia
Twitter: http://twitter.com/clrstack
Blog: http://geeks.ms/blogs/vgarcia
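
Simon's remark about verifying that the filters are in the correct order can be illustrated without Lucene.Net at all. The following is a minimal sketch in plain C#. The plural-stripping `Stem` function and the two-word stop list are made up for illustration and are not part of Lucene.Net or Hunspell. It only shows that a stop filter placed after a stemmer sees stemmed tokens, so its stop list would need to contain stemmed forms.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class FilterOrderSketch
{
    // Toy stop word list (assumption for this sketch; a real Spanish list is much longer).
    private static readonly HashSet<string> Stopwords = new HashSet<string> { "las", "los" };

    // Hypothetical stand-in for a tokenizer: split on spaces and basic punctuation.
    private static IEnumerable<string> Tokenize(string text)
    {
        return text.Split(new[] { ' ', ',', '.' }, StringSplitOptions.RemoveEmptyEntries);
    }

    // Hypothetical stand-in for a stemmer: strip one trailing "s" from words longer than two letters.
    private static string Stem(string word)
    {
        return word.Length > 2 && word.EndsWith("s", StringComparison.Ordinal)
            ? word.Substring(0, word.Length - 1)
            : word;
    }

    public static void Main()
    {
        // Lowercase first, as the quoted analyzer does with LowerCaseFilter.
        var lowered = Tokenize("Las casas blancas").Select(t => t.ToLowerInvariant()).ToList();

        // Order A: remove stop words, then stem.
        var stopThenStem = lowered.Where(t => !Stopwords.Contains(t)).Select(t => Stem(t));

        // Order B: stem, then remove stop words. "las" has already become "la",
        // which is not in the stop list, so it slips through.
        var stemThenStop = lowered.Select(t => Stem(t)).Where(t => !Stopwords.Contains(t));

        Console.WriteLine(string.Join(" ", stopThenStem)); // casa blanca
        Console.WriteLine(string.Join(" ", stemThenStop)); // la casa blanca
    }
}
```

In the analyzer quoted above, the StopFilter runs after the HunspellStemFilter, so the same consideration may apply to its stop word list, depending on how the Hunspell dictionary stems those words.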