|
Andrew Nagy
2007-07-06, 16:39
Tristan Vittorio
2007-07-06, 23:51
Otis Gospodnetic
2007-07-07, 14:28
Tristan Vittorio
2007-07-08, 00:55
climbingrose
2007-07-09, 07:32
Tristan Vittorio
2007-07-09, 09:04
climbingrose
2007-07-09, 10:20
Tristan Vittorio
2007-07-09, 10:46
Charles Hornberger
2007-07-09, 18:26
climbingrose
2007-08-10, 15:53
climbingrose
2007-08-11, 03:40
climbingrose
2007-08-11, 03:49
Pieter Berkel
2007-08-11, 05:19
climbingrose
2007-08-11, 08:36
karl wettin
2007-08-11, 22:04
climbingrose
2007-08-12, 00:35
Pieter Berkel
2007-08-12, 01:03
karl wettin
2007-08-12, 01:08
climbingrose
2007-08-12, 12:24
karl wettin
2007-08-17, 15:18
climbingrose
2007-08-17, 16:28
Otis Gospodnetic
2007-10-08, 04:55
|
-
Spell Check Handler
Andrew Nagy 2007-07-06, 16:39
Hello, is there any documentation on how to use the new spell check module?
Thanks,
Andrew
-
Re: Spell Check Handler
Tristan Vittorio 2007-07-06, 23:51
I couldn't find any documentation on the spell check handler either, but found enough information in the solrconfig.xml file. Simply search for "SpellCheckerRequestHandler" (online version here):

http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/solrconfig.xml

You can view the original development discussion in JIRA (not sure how helpful that will be for you, though):

https://issues.apache.org/jira/browse/SOLR-81

In a nutshell, the configuration parameters available are:

suggestionCount: determines how many spelling suggestions are returned.
accuracy: a float value between 0.0 and 1.0 indicating how closely the suggested words should match the original word being checked.
spellcheckerIndexDir and termSourceField: check solrconfig.xml for a full explanation.

To use the spell checking handler for the first time, you need to explicitly build the spelling index with a query something like this:

http://localhost:8080/solr/select/?q=macrosoft&qt=spellchecker&cmd=rebuild

Depending on how large your main index is, this rebuild operation could take a while. Subsequent queries can omit '&cmd=rebuild' and will return results much faster:

http://localhost:8080/solr/select/?q=macrosoft&qt=spellchecker

The order of the suggestions returned seems to be based on the accuracy figure (i.e. how closely each one matches the original word). It would be great to be able to sort the suggestions by the term frequency / document frequency of the suggested word in the main index, since the most accurate suggestion may not always be the most relevant.

As far as I can tell there is currently no way of doing this using the spellchecker handler alone. (You could always run separate standard queries on each suggested word and order by numDocs, but that would be very inefficient.) Has anybody else tried to achieve this?

cheers,
Tristan
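The two request URLs above differ only in the cmd parameter. A minimal Python sketch of building them follows; it assumes the host, port and handler name (qt=spellchecker) from the examples in this message, and spellcheck_url is a hypothetical helper name, not part of Solr.

```python
from urllib.parse import urlencode

# Base URL copied from the examples in this thread; adjust for your install.
SOLR_SELECT = "http://localhost:8080/solr/select/"


def spellcheck_url(word, rebuild=False, base=SOLR_SELECT):
    """Build a spellchecker request URL for a single word.

    Hypothetical helper for illustration only; not a Solr API.
    """
    params = [("q", word), ("qt", "spellchecker")]
    if rebuild:
        # cmd=rebuild asks the handler to (re)build the spelling index first.
        params.append(("cmd", "rebuild"))
    return base + "?" + urlencode(params)


# First request: build the spelling index. Later requests: omit cmd=rebuild.
print(spellcheck_url("macrosoft", rebuild=True))
print(spellcheck_url("macrosoft"))
```

Using urlencode rather than string concatenation keeps the word being checked safely escaped if it contains characters such as spaces or ampersands.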
-
Re: Spell Check Handler
Otis Gospodnetic 2007-07-07, 14:28
Tristan - good summary - want to copy that to the Solr Wiki?
Thanks,
Otis

Simpy -- http://www.simpy.com/ - Tag - Search - Share
-
Re: Spell Check Handler
Tristan Vittorio 2007-07-08, 00:55
Hi Otis,

I have written a draft wiki entry for the spell checker:

http://wiki.apache.org/solr/SpellCheckerRequestHandler

I've learned that my initial observation about the suggestion ordering was incorrect: it does in fact order the results by popularity (or term frequency) of the word in the termSourceField. The problem I experienced was caused by setting termSourceField to a field of type "text", which heavily stemmed and analyzed the words. I found that using the StandardTokenizer and StandardFilter, and removing the PorterStemmer and LowerCaseFilter from the field schema, really improved the spell checker performance.

I haven't included this info on the wiki page yet; I'll try to update it soon when I have a bit more time.

cheers,
Tristan
-
Re: Spell Check Handler
climbingrose 2007-07-09, 07:32
Hi Tristan,

Is this spellchecker available in the 1.2 release, or do I have to build the trunk? I tried your instructions but Solr returns nothing:

http://localhost:8984/solr/select/?q=title_text:java&qt=spellchecker&cmd=rebuild

Result:

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">3</int>
      </lst>
      <str name="cmdExecuted">rebuild</str>
      <arr name="suggestions"/>
    </response>

Thanks.

Regards,
Cuong Hoang
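A response like the one above can be checked programmatically. The sketch below parses it with Python's standard library; the element names come from the response shown in this message, while the assumption that non-empty suggestions arrive as str elements inside the suggestions array is mine and should be verified against a real response. parse_spellcheck_response is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Response body copied from the message above.
SAMPLE = """<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3</int>
  </lst>
  <str name="cmdExecuted">rebuild</str>
  <arr name="suggestions"/>
</response>"""


def parse_spellcheck_response(xml_text):
    """Extract status, executed command and suggestions from a response.

    Hypothetical helper. Assumes suggestions are <str> children of
    <arr name="suggestions">, which is not shown in the thread.
    """
    root = ET.fromstring(xml_text)
    status = root.find("./lst[@name='responseHeader']/int[@name='status']")
    cmd = root.find("./str[@name='cmdExecuted']")
    suggestions = root.findall("./arr[@name='suggestions']/str")
    return {
        "status": int(status.text),
        "cmdExecuted": cmd.text if cmd is not None else None,
        "suggestions": [s.text for s in suggestions],
    }


result = parse_spellcheck_response(SAMPLE)
print(result)
```

A status of 0 with an empty suggestions list, as here, means the request itself succeeded but no candidate words were found.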
-
Re: Spell Check Handler
Tristan Vittorio 2007-07-09, 09:04
The spellchecker should be available in the 1.2 release. Your query is incorrect; try the following:

http://localhost:8984/solr/select/?q=java&qt=spellchecker&termSourceField=title_text&cmd=rebuild

The 'q' parameter must contain only the word being checked; you must specify the field separately. You can set "termSourceField" in your solrconfig.xml file so you do not need to set it explicitly each time you run a spell check query. Also make sure your field isn't heavily processed (e.g. with Porter stemmer analyzers), otherwise the suggestions will look a bit weird / mangled.

Take a look at the wiki page for more info:

http://wiki.apache.org/solr/SpellCheckerRequestHandler

cheers,
Tristan
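The correction above amounts to moving the field name out of 'q' and into 'termSourceField'. A small sketch of that transformation, assuming the parameter names used in this thread; spellcheck_params is a hypothetical helper, not a Solr function.

```python
from urllib.parse import urlencode


def spellcheck_params(query, default_field=None):
    """Split a 'field:word' query into spellchecker request parameters.

    Hypothetical helper: the word goes into q, the field (if any)
    into termSourceField, as described in the message above.
    """
    field, sep, word = query.partition(":")
    if not sep:
        # No "field:" prefix; fall back to the caller's default field.
        field, word = default_field, query
    params = [("q", word), ("qt", "spellchecker")]
    if field:
        params.append(("termSourceField", field))
    return params


# The incorrect query from the earlier message, rewritten into the suggested form.
params = spellcheck_params("title_text:java") + [("cmd", "rebuild")]
print(urlencode(params))
```

If termSourceField is already set in solrconfig.xml, calling the helper with a bare word and no default field simply omits the parameter.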
-
Re: Spell Check Handler
climbingrose 2007-07-09, 10:20
Thanks for the quick reply. However, I'm still not able to set up the spellchecker. Solr does create the spell directory under data but doesn't seem to build the spellchecker index. Here are snippets of my schema.xml and solrconfig.xml:

    <field name="title" type="string" indexed="true" stored="true"/>

    <requestHandler name="spellchecker" class="solr.SpellCheckerRequestHandler" startup="lazy">
      <!-- default values for query parameters -->
      <lst name="defaults">
        <int name="suggestionCount">1</int>
        <float name="accuracy">0.5</float>
      </lst>

      <!-- Main init params for handler -->

      <!-- The directory where your SpellChecker Index should live. -->
      <!-- May be absolute, or relative to the Solr "dataDir" directory. -->
      <!-- If this option is not specified, a RAM directory will be used -->
      <str name="spellcheckerIndexDir">spell</str>

      <!-- the field in your schema that you want to be able to build -->
      <!-- your spell index on. This should be a field that uses a very -->
      <!-- simple FieldType without a lot of Analysis (ie: string) -->
      <str name="termSourceField">title</str>
    </requestHandler>

I tried this URL:

http://localhost:8984/solr/select/?q=Accountent&qt=spellchecker&cmd=rebuild

and received this:

    <response>
      <lst name="responseHeader">
        <int name="status">0</int>
        <int name="QTime">2</int>
      </lst>
      <str name="cmdExecuted">rebuild</str>
      <arr name="suggestions"/>
    </response>

Regards,
Cuong Hoang
-
Re: Spell Check Handler
Tristan Vittorio 2007-07-09, 10:46
I think there is some confusion regarding how the spell checker actually uses the termSourceField. It is suggested that you use a simple field type such as "string"; however, since this field type does not tokenize or split words, it is only useful in situations where the whole field is considered a dictionary "word":

    <add>
      <doc>
        <field name="title">Accountant</field>
        <field name="title">Auditor</field>
        <field name="title">Solicitor</field>
      </doc>
    </add>

The following example will not work with the spell checker, since the whole field is considered a single word or string:

    <add>
      <doc>
        <field name="title">Accountant reveals that Accounting is boring</field>
      </doc>
    </add>

I might suggest that you create an additional field in your schema that takes advantage of the StandardTokenizer and StandardFilter. These don't perform a great deal of processing on the field, yet should provide decent results when used with the spell checker:

    <fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <tokenizer class="solr.StandardTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
        <filter class="solr.StandardFilterFactory"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>

If you want this field to be automatically populated with the contents of the title field when a document is added to the index, simply use a copyField:

    <copyField source="title" dest="spell"/>

Hope this helps. Let me know if this is still not clear; I will probably add it to the wiki page soon.

cheers,
Tristan
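The difference between the two field types can be illustrated outside Solr. The Python sketch below is only a rough stand-in: the regular expression is an assumed approximation of word splitting, not the behaviour of StandardTokenizer, and both function names are hypothetical.

```python
import re


def terms_for_string_field(value):
    """A "string" field contributes its whole value as a single term."""
    return {value}


def terms_for_tokenized_field(value):
    """Rough stand-in for a tokenized field: split on non-word characters.

    This is an approximation for illustration, not StandardTokenizer.
    """
    return set(re.findall(r"\w+", value))


title = "Accountant reveals that Accounting is boring"

# One long "word": a misspelling such as "Accountent" has nothing close to match.
print(terms_for_string_field(title))

# Individual words: "Accountant" is now a candidate for "Accountent".
print(sorted(terms_for_tokenized_field(title)))
```

With the string field the spelling dictionary holds a single six-word entry, which is why the rebuild succeeds but no suggestion is ever returned for a one-word query.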
-
Re: Spell Check Handler
Charles Hornberger 2007-07-09, 18:26
For what it's worth, I recently did a quick implementation of the spellchecker feature, and I simply created another field in my schema (like 'spell' in Tristan's example). After feeding content into my search index, I used the spell field to add one single-field document for every distinct word in my document collection (I'm assuming the content folks have run spell-checkers :-)). E.g.:

    <doc><field name="spell">aardvark</field></doc>
    <doc><field name="spell">abacus</field></doc>
    <doc><field name="spell">abbot</field></doc>
    <doc><field name="spell">acacia</field></doc>
    etc.

I also added some extra documents for proper names that appear in my documents. For instance, there are a couple of fields that hold comma-separated lists of names, so for each of those, in addition to the documents for "john", "doe", and "jane" that were generated by the naive word-splitting done in the first pass, I added documents like so:

    <doc><field name="spell">john doe</field></doc>
    <doc><field name="spell">jane doe</field></doc>
    etc.

You could do the same for other searchable multi-word tokens in your input -- song/album/book/movie titles, publisher names, geographic names (cities, neighborhoods, etc.), product names, and so on.

-Charlie
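The first pass described above, one single-field document per distinct word, could be generated along these lines. This is a sketch under assumptions: the word-splitting regex, the lowercasing and the wrapping add element are my choices, and word_documents is a hypothetical helper; only the doc/field shape comes from the message.

```python
import re
from xml.sax.saxutils import escape


def word_documents(texts, field="spell"):
    """Emit one single-field <doc> per distinct word found in texts.

    Hypothetical helper. Splitting on \\w+ and lowercasing are assumed
    stand-ins for the "naive word-splitting" mentioned in the message.
    """
    words = set()
    for text in texts:
        words.update(w.lower() for w in re.findall(r"\w+", text))
    docs = [
        '<doc><field name="%s">%s</field></doc>' % (field, escape(w))
        for w in sorted(words)
    ]
    return "<add>\n" + "\n".join(docs) + "\n</add>"


xml_body = word_documents(["Aardvark abacus", "abbot acacia abacus"])
print(xml_body)
```

Multi-word entries such as "john doe" would need a second pass over the fields that hold them, since this splitter always breaks on whitespace.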
-
Re: Spell Check Handlerclimbingrose 2007-08-10, 15:53
The spellchecker handler doesn't seem to work with multi-word query. For
example, when I tried to spellcheck "Java developar", it returns nothing while if I tried "developar", spellchecker correctly returns "developer". I followed the setup on the wiki. Regards, Cuong Hoang On 7/10/07, Charles Hornberger <[EMAIL PROTECTED]> wrote: > > For what it's worth, I recently did a quick implementation of the > spellchecker feature, and I simply created another field in my schema > (Iike 'spell' in Tristan's example below). After feeding content into > my search index, I used the spell field into add one single-field > document for every distinct word in my document collection (I'm > assuming the content folks have run spell-checkers :-)). E.g.: > > <doc><field name="spell">aardvark</field></doc> > <doc><field name="spell">abacus</field></doc> > <doc><field name="spell">abbot</field></doc> > <doc><field name="spell">acacia</field></doc> > etc. > > I also added some extra documents for proper names that appear in my > documents. For instance, there are a couple fields that have > comma-separated list of names, so I for each of those -- in addition > to documents for "john", "doe", and "jane", which were generated by > the naive word-splitting done in the first pass -- I added documents > like so: > > <doc><field name="spell">john doe</field></doc> > <doc><field name="spell">jane doe</field></doc> > etc. > > You could do the same for other searchable multi-word tokens in your > input -- song/album/book/movie titles, publisher names, geographic > names (cities, neighborhoods, etc.), product names, and so on. > > -Charlie > > On 7/9/07, Tristan Vittorio <[EMAIL PROTECTED]> wrote: > > I think there is some confusion regarding how the spell checker actually > > uses the termSourceField. 
> > It is suggested that you use a simple field type such as "string";
> > however, since this field type does not tokenize or split words, it is
> > only useful in situations where the whole field is considered a
> > dictionary "word":
> >
> > <add>
> >   <doc>
> >     <field name="title">Accountant</field>
> >     <field name="title">Auditor</field>
> >     <field name="title">Solicitor</field>
> >   </doc>
> > </add>
> >
> > The following example will not work with the spell checker, since the
> > whole field is considered a single word or string:
> >
> > <add>
> >   <doc>
> >     <field name="title">Accountant reveals that Accounting is boring</field>
> >   </doc>
> > </add>
> >
> > I might suggest that you create an additional field in your schema that
> > takes advantage of the StandardTokenizer and StandardFilter, which don't
> > perform a great deal of processing on the field yet should provide decent
> > results when used with the spell checker:
> >
> > <fieldType name="spell" class="solr.TextField" positionIncrementGap="100">
> >   <analyzer type="index">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >     <filter class="solr.StandardFilterFactory"/>
> >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >   </analyzer>
> >   <analyzer type="query">
> >     <tokenizer class="solr.StandardTokenizerFactory"/>
> >     <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
> >     <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
> >     <filter class="solr.StandardFilterFactory"/>
> >     <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
> >   </analyzer>
> > </fieldType>
> >
> > If you want this field to be automatically populated with the contents of
> > the title field when a document is added to the index, simply use a
> > copyField:
> >
> > <copyField source="title" dest="spell"/>
> >
> > Hope this helps, let me know if this is still not clear, I probably will

Regards, Cuong Hoang
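The difference between a non-tokenized and a tokenized termSourceField described in the quoted message can be illustrated with a small Python sketch. This is not Solr code: `dictionary_terms` is a hypothetical helper, and the regular-expression split is only a rough stand-in for a tokenizing analyzer (it ignores stop-word removal and the other filters).

```python
import re


def dictionary_terms(field_values, tokenize):
    """Return the set of terms a spelling dictionary would be built from.

    Hypothetical helper for illustration only; it does not call Solr or Lucene.
    """
    terms = set()
    for value in field_values:
        if tokenize:
            # Rough stand-in for a tokenizing analyzer: split on non-word characters.
            terms.update(re.findall(r"\w+", value))
        else:
            # A non-tokenized ("string"-like) field contributes the whole value as one term.
            terms.add(value)
    return terms


titles = ["Accountant", "Accountant reveals that Accounting is boring"]

whole = dictionary_terms(titles, tokenize=False)
split = dictionary_terms(titles, tokenize=True)

print(sorted(whole))  # the long title appears as a single "word"
print(sorted(split))  # individual words such as "Accounting" become terms
```

With the non-tokenized variant, a misspelling such as "Accountng" has no single-word entry "Accounting" to be matched against, which is the situation the quoted message warns about.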
-
Re: Spell Check Handlerclimbingrose 2007-08-11, 03:40
After looking at the SpellChecker code, I realised that it only supports single-word queries. I made a very naive modification of SpellCheckerHandler to get multi-word support.

Now the other problem that I have is how to have different fields in the SpellChecker index. For example, since my query has two parts, "description" and "location", I don't want to build a spellchecker index which combines both "description" and "location" into one termSourceField. I want to check the "description" part against the "description" field in the spellchecker index and the "location" part against the "location" field. Otherwise I might get irrelevant suggestions for the "location" part, since the number of terms in "location" is generally much smaller than that of "description". Any ideas?

Thanks.

Regards, Cuong Hoang
-
Re: Spell Check Handlerclimbingrose 2007-08-11, 03:49
OK, I just need to define 2 spellcheckers in solrconfig.xml for my purpose.

Regards, Cuong Hoang
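The idea of keeping one spelling dictionary per field, so that the "location" part of a query is never corrected with "description" terms, can be sketched as follows. This is only an illustration under stated assumptions: the term lists are made up, `suggest` is a hypothetical helper, and Python's `difflib` similarity ratio stands in for whatever distance measure the real spell checker uses.

```python
import difflib

# Hypothetical term lists standing in for two separate spellchecker indexes,
# one built from each source field.
DICTIONARIES = {
    "description": ["java", "developer", "accountant", "auditor"],
    "location": ["sydney", "melbourne", "brisbane"],
}


def suggest(field, word, cutoff=0.5):
    """Return the closest term from the dictionary of the given field, or None."""
    matches = difflib.get_close_matches(
        word.lower(), DICTIONARIES[field], n=1, cutoff=cutoff
    )
    return matches[0] if matches else None


# Each part of the query is checked only against its own field's dictionary.
print(suggest("description", "developar"))
print(suggest("location", "sydny"))
```

In Solr terms this corresponds to two handler definitions, each with its own termSourceField and index directory; the sketch only shows why the separation keeps suggestions relevant.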
-
Re: Spell Check HandlerPieter Berkel 2007-08-11, 05:19
On 11/08/07, climbingrose <[EMAIL PROTECTED]> wrote:
> The spellchecker handler doesn't seem to work with multi-word query. For
> example, when I tried to spellcheck "Java developar", it returns nothing
> while if I tried "developar", spellchecker correctly returns "developer".
> I followed the setup on the wiki.

While I suppose the general case for using the spelling checker would be a query containing a single misspelled word, it would be quite useful if the handler applied the analyzer specified by the termSourceField fieldType to the query input and then checked the spelling of each query token. This would seem to be the most flexible way of supporting multi-word queries (provided the termSourceField didn't use any stemmer filters, I suppose).

Piete
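The per-token approach suggested above (split the query into tokens, then check each token on its own) can be sketched in Python. This is a minimal illustration, not the handler's implementation: `correct_query` is a hypothetical function, whitespace splitting stands in for the field's analyzer, `difflib.get_close_matches` stands in for the Lucene spell checker, and the `accuracy` argument here is simply passed to difflib as its similarity cutoff.

```python
import difflib


def correct_query(query, dictionary, accuracy=0.5):
    """Correct each token of a multi-word query independently.

    Tokens found in the dictionary are kept as typed; unknown tokens are
    replaced by their closest dictionary term, if one is similar enough.
    """
    corrected = []
    for token in query.split():  # stand-in for running the field's analyzer
        if token.lower() in dictionary:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(
            token.lower(), dictionary, n=1, cutoff=accuracy
        )
        # Fall back to the original token when nothing is close enough.
        corrected.append(matches[0] if matches else token)
    return " ".join(corrected)


words = ["java", "developer", "accountant", "auditor"]
print(correct_query("Java developar", words))  # prints "Java developer"
```

Returning a single corrected phrase built from the best match per token is the simplest design; returning several candidate phrases would require combining the top matches for each token.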
-
Re: Spell Check Handlerclimbingrose 2007-08-11, 08:36
That's exactly what I did with my custom version of the SpellCheckerHandler. However, I didn't handle suggestionCount and only returned the one corrected phrase, which contains the "best" corrected terms.

There is an issue on the Lucene issue tracker regarding a multi-word spellchecker: https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Regards, Cuong Hoang
-
Re: Spell Check Handlerkarl wettin 2007-08-11, 22:04
On 11 Aug 2007, at 10:36, climbingrose wrote:
> There is an issue on
> Lucene issue tracker regarding multi-word spellchecker:
> https://issues.apache.org/jira/browse/LUCENE-550

I think you mean LUCENE-626, which sort of depends on LUCENE-550.

--
karl
-
Re: Spell Check Handlerclimbingrose 2007-08-12, 00:35
Yeah. How stable is the patch, Karl? Is it possible to use it in a production environment?

Regards, Cuong Hoang
-
Re: Spell Check HandlerPieter Berkel 2007-08-12, 01:03
On 11/08/07, climbingrose <[EMAIL PROTECTED]> wrote:
> That's exactly what I did with my custom version of the SpellCheckerHandler.
> However, I didn't handle suggestionCount and only returned the one corrected
> phrase which contains the "best" corrected terms. There is an issue on
> Lucene issue tracker regarding multi-word spellchecker:
> https://issues.apache.org/jira/browse/LUCENE-550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

I'd be interested to take a look at your modifications to the SpellCheckerHandler. How did you handle phrase queries? Maybe we can open a JIRA issue to expand the spell checking functionality to perform analysis on multi-word input values.

I did find http://issues.apache.org/jira/browse/LUCENE-626 after looking at LUCENE-550, but since these patches are not yet included in the Lucene trunk, it might be a little difficult to justify implementing them in Solr.
-
Re: Spell Check Handlerkarl wettin 2007-08-12, 01:08
On 12 Aug 2007, at 02:35, climbingrose wrote:
>> I think you mean LUCENE-626
> Yeah. Is it possible to use it in product environment?

It's been running live for a long time at this one place, but the code is stuck at Lucene 2.0 and an old version of 550. I don't really do any more Solr than monitor the forums and use some analysis code, so I could not say how much work it would take you to get it running.

I'm aiming to give the code an overview and bring it up to date with the Lucene trunk any day, week, month or year now, depending on workload and on whether I manage to fix a version of 550 that is accepted into the trunk.

You are welcome to break out the TokenPhraseSuggester and NgramTokenSuggester, the parts I think you are interested in. If you do, feel free to report on it and post a patch in the issue.

--
karl
-
Re: Spell Check Handlerclimbingrose 2007-08-12, 12:24
I'm happy to contribute code for the SpellCheckerRequestHandler. I'll post the code once I strip off stuff related to our product.

Regards, Cuong Hoang
-
Re: Spell Check Handlerkarl wettin 2007-08-17, 15:18
I updated LUCENE-626 last night. It should now run smoothly without LUCENE-550, but more smoothly with it.

Perhaps it is something you can use.
-
Re: Spell Check Handlerclimbingrose 2007-08-17, 16:28
Thanks Karl. I'll check it out!

Regards, Cuong Hoang
-
Re: Spell Check HandlerOtis Gospodnetic 2007-10-08, 04:55
Hello,

Did I miss this contribution or did it not happen? I'm referring to the change to the SpellCheckerRequestHandler to handle spelling corrections/suggestions for multi-word queries. Any chance you can provide a patch?

Thanks!
Otis

Simpy -- http://www.simpy.com/ - Tag - Search - Share
|