On Thu, Feb 3, 2011 at 8:55 AM, Emmanuel Espina
<[EMAIL PROTECTED]> wrote:
> It uses fuzzy queries instead of a ngram query, and then I rank the results
> by word frequency in the text with the aid of a python script (all that is
> explained in the post). I got pretty good results (between 50% and 90%
> improvements), but slower (about double time).
>
Hi Emmanuel:
I think it's great you are evaluating different techniques here; our
spelling could use some help :)
By the way: we added a new spellchecking technique that sounds quite
similar to what you describe (DirectSpellChecker), but hopefully
without the performance issues.
It's only available in trunk
(http://svn.apache.org/repos/asf/lucene/dev/trunk/). I tried to do a
very rough evaluation on its JIRA issue:
https://issues.apache.org/jira/browse/LUCENE-2507, but nothing very
serious or as in-depth as what it looks like you did.
Anyway, if you want to play you can experiment with it either at the
lucene level (it's in contrib/spellchecker) or via solr, by using
DirectSolrSpellChecker... though I think the parameters in the example
solrconfig are likely not the best :)
I have an app using this more fleshed-out config (in combination with
the new collation options), and it seems to be reasonable:
<!-- a spellchecker that uses no auxiliary index -->
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">text</str>
<str name="classname">solr.DirectSolrSpellChecker</str>
<str name="minPrefix">1</str>
<str name="maxEdits">2</str>
<str name="maxInspections">25</str> <!-- probably way too high for most apps though -->
<str name="minQueryLength">3</str>
<str name="comparatorClass">freq</str>
<str name="thresholdTokenFrequency">1</str>
<str name="distanceMeasure">org.apache.lucene.search.spell.JaroWinklerDistance</str>
</lst>
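
In case it helps to see roughly what those knobs do, here is a toy
Python sketch of the general idea: take candidates straight from the
term dictionary, keep those within maxEdits that share the first
minPrefix characters, and order them by frequency. This is only an
illustration under my own assumptions, not how DirectSpellChecker is
implemented: the suggest() function, its argument names, and the
sample term_freqs are all made up, it uses plain Levenshtein distance
rather than JaroWinklerDistance, it treats the threshold as an
absolute count, and it does not model maxInspections at all.

```python
from collections import Counter


def edit_distance(a, b):
    """Plain Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def suggest(word, term_freqs, max_edits=2, min_prefix=1,
            min_query_length=3, threshold_freq=1, count=5):
    """Hypothetical helper: suggest terms from term_freqs for word.

    The keyword arguments mirror the config names above only loosely;
    their exact semantics in Solr may differ.
    """
    if len(word) < min_query_length:
        return []  # too short to attempt a correction
    prefix = word[:min_prefix]
    candidates = []
    for term, freq in term_freqs.items():
        if term == word or freq < threshold_freq:
            continue
        if not term.startswith(prefix):
            continue
        dist = edit_distance(word, term)
        if dist <= max_edits:
            candidates.append((term, freq, dist))
    # "freq"-style ordering: most frequent first, then closest, then by name
    candidates.sort(key=lambda c: (-c[1], c[2], c[0]))
    return [term for term, _, _ in candidates[:count]]


# Made-up term frequencies standing in for the index's term dictionary.
term_freqs = Counter({"lucene": 40, "lucent": 3, "license": 12,
                      "luck": 7, "solr": 25})
print(suggest("lucen", term_freqs))
```

Note that with frequency-first ordering a more common term at distance
2 can outrank a rarer one at distance 1, which is the trade-off you
accept when picking the freq comparator.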