Dyer, James
2011-01-12, 22:19
Markus Jelsma
2011-01-12, 22:25
Jayendra Patil
2011-01-12, 22:41
Markus Jelsma
2011-01-12, 22:43
Dyer, James
2011-01-12, 23:23
Markus Jelsma
2011-01-12, 23:48
Jonathan Rochkind
2011-01-13, 15:53
Dyer, James
2011-01-13, 16:36
Jan Høydahl
2012-02-01, 14:28
-
StopFilterFactory and "qf" containing some fields that use it and some that do not
Dyer, James 2011-01-12, 22:19
I'm running into a problem with StopFilterFactory in conjunction with (e)dismax queries that have a mix of fields, only some of which use StopFilterFactory. It seems that if even 1 field on the "qf" parameter does not use StopFilterFactory, then stop words are not removed when searching any fields. Here's an example of what I mean:
- I have 2 fields indexed:
  > Title is "textStemmed", which includes StopFilterFactory (see below).
  > Contributor is "textSimple", which does not include StopFilterFactory (see below).
- "The" is a stop word in stopwords.txt
- q=life&defType=edismax&qf=Title ... returns 277,635 results
- q=the life&defType=edismax&qf=Title ... returns 277,635 results
- q=life&defType=edismax&qf=Title Contributor ... returns 277,635 results
- q=the life&defType=edismax&qf=Title Contributor ... returns 0 results

It seems as if the stop words are not being stripped from the query because "qf" contains a field that doesn't use StopFilterFactory. I did testing with combining Stemmed fields with not Stemmed fields in "qf" and it seems as if stemming gets applied regardless. But stop words do not.

Does anyone have ideas on what is going on? Is this a feature or possibly a bug? Any known workarounds? Any advice is appreciated.

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
________________________________

<fieldType name="textSimple" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>

<fieldType name="textStemmed" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="1" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="0" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="0" splitOnNumerics="0" stemEnglishPossessive="1" />
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.PorterStemFilterFactory"/>
  </analyzer>
</fieldType>
-
Re: StopFilterFactory and "qf" containing some fields that use it and some that do not
Markus Jelsma 2011-01-12, 22:25
I haven't used edismax, but I can imagine it's a feature. This is because inconsistent use of stopwords in the analyzers of the fields specified in qf can yield really unexpected results because of the mm parameter. In dismax, if one analyzer removes stopwords and the other doesn't, the mm parameter goes crazy.
-
Re: StopFilterFactory and "qf" containing some fields that use it and some that do not
Jayendra Patil 2011-01-12, 22:41
I have used edismax and stopword filters as well, but I usually use the fq parameter, e.g. fq=title:the life, and never had any issues.

Can you turn on debugQuery and check what query is formed for each of the combinations you mentioned?

Regards,
Jayendra
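One way to collect the parsed query for each combination is to script the four requests with debugQuery turned on. The following is only a minimal sketch: the endpoint URL and the helper name debug_url are assumptions made for illustration, while q, defType, qf, rows and debugQuery are the ordinary Solr request parameters.

```python
from urllib.parse import urlencode

# Hypothetical Solr endpoint (assumption); adjust host, port and core to your setup.
SOLR_SELECT = "http://localhost:8983/solr/select"


def debug_url(q, qf):
    """Build a select URL that asks Solr to report the parsed query."""
    params = {
        "q": q,
        "defType": "edismax",
        "qf": qf,
        "rows": 0,             # only the debug section is of interest here
        "debugQuery": "true",  # adds the parsed query to the response
    }
    return SOLR_SELECT + "?" + urlencode(params)


# The four combinations from the original post.
COMBINATIONS = [
    ("life", "Title"),
    ("the life", "Title"),
    ("life", "Title Contributor"),
    ("the life", "Title Contributor"),
]

if __name__ == "__main__":
    for q, qf in COMBINATIONS:
        print(debug_url(q, qf))
```

Fetching each printed URL and comparing the parsedquery entries in the debug section should show how the clauses differ between the single-field and mixed-field cases.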
-
Re: StopFilterFactory and "qf" containing some fields that use it and some that do not
Markus Jelsma 2011-01-12, 22:43
> Have used edismax and Stopword filters as well. But usually use the fq
> parameter e.g. fq=title:the life and never had any issues.

That is because filter queries are not affected by the mm parameter, which applies only to the main query.
-
RE: StopFilterFactory and "qf" containing some fields that use it and some that do not
Dyer, James 2011-01-12, 23:23
Here is what debug says each of these queries parses to:

1. q=life&defType=edismax&qf=Title ... returns 277,635 results
2. q=the life&defType=edismax&qf=Title ... returns 277,635 results
3. q=life&defType=edismax&qf=Title Contributor ... returns 277,635
4. q=the life&defType=edismax&qf=Title Contributor ... returns 0 results

1. +DisjunctionMaxQuery((Title:life))
2. +((DisjunctionMaxQuery((Title:life)))~1)
3. +DisjunctionMaxQuery((CTBR_SEARCH:life | Title:life))
4. +((DisjunctionMaxQuery((Contributor:the)) DisjunctionMaxQuery((Contributor:life | Title:life)))~2)

I see what's going on here. Because "the" is a stop word for Title, it gets removed from the Title side of the first clause, which leaves only Contributor:the. Since the ~2 requires both clauses to match, "Contributor" is required to contain "the". dismax does the same thing too. I guess I should have run debug before asking the mailing list!

It looks like the only workarounds I have are to either filter out the stopwords in the client when this happens, or enable stop words for all the fields that are used in "qf" alongside stopword-enabled fields. Unless...someone has a better idea??

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
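The behaviour in the debug output can be reproduced with a small simulation. This is a rough sketch, not Solr's actual code: the two analyzer functions are simplified stand-ins for the textStemmed and textSimple query analyzers (stemming and synonyms are left out), the document is invented, and mm is assumed to require every clause, as the ~2 in query 4 suggests.

```python
STOPWORDS = {"the"}


def analyze_title(term):
    """Stand-in for textStemmed: lowercase, drop stop words (no stemming here)."""
    token = term.lower()
    return None if token in STOPWORDS else token


def analyze_contributor(term):
    """Stand-in for textSimple: lowercase only, stop words are kept."""
    return term.lower()


ANALYZERS = {"Title": analyze_title, "Contributor": analyze_contributor}


def build_clauses(query, qf):
    """One disjunction per query term; a field is left out of the
    disjunction when its analyzer removes the term."""
    clauses = []
    for term in query.split():
        disjuncts = []
        for field in qf:
            token = ANALYZERS[field](term)
            if token is not None:
                disjuncts.append((field, token))
        if disjuncts:  # a term removed by every field produces no clause
            clauses.append(disjuncts)
    return clauses


def matches(doc, clauses, mm):
    """doc maps field name -> set of indexed tokens; at least mm clauses must hit."""
    hits = sum(
        any(token in doc.get(field, set()) for field, token in clause)
        for clause in clauses
    )
    return hits >= mm


if __name__ == "__main__":
    # Invented document: "the" never reaches the Title index, and the
    # contributor name does not contain it either.
    doc = {"Title": {"life"}, "Contributor": {"smith"}}
    for qf in (["Title"], ["Title", "Contributor"]):
        clauses = build_clauses("the life", qf)
        print(qf, clauses, matches(doc, clauses, mm=len(clauses)))
```

With qf=Title alone, "the" disappears entirely and only one clause has to match. With both fields, "the" survives as a clause that only Contributor can satisfy, so a document matching on "life" alone is rejected.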
-
Re: StopFilterFactory and "qf" containing some fields that use it and some that do not
Markus Jelsma 2011-01-12, 23:48
Here's another thread on the subject:
http://lucene.472066.n3.nabble.com/Dismax-Minimum-Match-Stopwords-Bug-td493483.html

And slightly off topic: you might also want to look at using common grams; they are really useful for phrase queries that contain stopwords.

http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#solr.CommonGramsFilterFactory
-
Re: StopFilterFactory and "qf" containing some fields that use it and some that do not
Jonathan Rochkind 2011-01-13, 15:53
It's a known 'issue' in dismax (really an inherent part of dismax's design, with no clear way to do anything about it) that qf over fields with different stop word definitions will produce odd results for a query containing a stopword.

Here's my understanding of what's going on:
http://bibwild.wordpress.com/2010/04/14/solr-stop-wordsdismax-gotcha/
-
RE: StopFilterFactory and "qf" containing some fields that use it and some that do not
Dyer, James 2011-01-13, 16:36
I appreciate the reply and blog posting. For now, I just enabled stopwords for all the fields in "qf". We have a very short list anyhow, and our legacy search engine didn't even allow field-by-field configuration (stopwords are global on that system).

I do wonder...what if (e)dismax had a flag you could set that would tell it that if any analyzers removed a term, then that term would become optional for any fields for which it remained? I'm not sure what the development effort would be, but perhaps it would be a nice way to circumvent this problem in a future release...

James Dyer
E-Commerce Systems
Ingram Content Group
(615) 213-4311
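The flag idea can be sketched in a few lines to show what it would change. Everything here is hypothetical: no such flag is described anywhere in this thread as existing, the analyzers are simplified stand-ins for the two field types, and the "required" marker is an invented illustration of the proposal rather than anything Solr exposes.

```python
STOPWORDS = {"the"}

# Hypothetical stand-ins for the query analyzers of the two field types.
ANALYZERS = {
    "Title": lambda t: None if t.lower() in STOPWORDS else t.lower(),
    "Contributor": lambda t: t.lower(),
}


def build_clauses(query, qf):
    """Return (disjuncts, required) pairs, one per surviving query term."""
    clauses = []
    for term in query.split():
        tokens = {field: ANALYZERS[field](term) for field in qf}
        disjuncts = [(f, tok) for f, tok in tokens.items() if tok is not None]
        if not disjuncts:
            continue  # removed by every analyzer: no clause at all
        # The imagined flag: a term removed by any analyzer becomes
        # optional for the fields where it remained.
        required = all(tok is not None for tok in tokens.values())
        clauses.append((disjuncts, required))
    return clauses


def matches(doc, clauses):
    """Every required clause must hit; optional clauses never reject a document."""
    for disjuncts, required in clauses:
        hit = any(tok in doc.get(f, set()) for f, tok in disjuncts)
        if required and not hit:
            return False
    return True


if __name__ == "__main__":
    doc = {"Title": {"life"}, "Contributor": {"smith"}}  # invented document
    clauses = build_clauses("the life", ["Title", "Contributor"])
    print(clauses)
    print(matches(doc, clauses))
```

Under this rule the Contributor:the clause would still be generated, so it could contribute to scoring, but it would no longer count toward the number of clauses that must match.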
-
Re: StopFilterFactory and "qf" containing some fields that use it and some that do not
Jan Høydahl 2012-02-01, 14:28
Reviving this thread.

You say:
> I do wonder...what if (e)dismax had a flag you could set that would tell it that if any analyzers removed a term, then that term would become optional for any fields for which it remained? I'm not sure what the development effort would be, but perhaps it would be a nice way to circumvent this problem in a future release...

I created a JIRA issue to investigate whether it is possible to implement this. See https://issues.apache.org/jira/browse/SOLR-3085

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com
Solr Training - www.solrtraining.com