|
Bernd Fehling
2011-10-26, 06:32
Uwe Schindler
2011-10-26, 06:58
Simon Willnauer
2011-10-26, 07:21
Uwe Schindler
2011-10-26, 07:32
Chris Male
2011-10-26, 07:37
Bernd Fehling
2011-10-26, 08:06
Uwe Schindler
2011-10-26, 08:26
Bernd Fehling
2011-10-26, 12:05
Robert Muir
2011-10-26, 12:09
DM Smith
2011-10-26, 12:28
Simon Willnauer
2011-10-26, 12:33
Bernd Fehling
2011-10-26, 12:49
Chris Hostetter
2011-10-29, 20:27
Robert Muir
2011-10-29, 21:23
Chris Hostetter
2011-10-29, 21:33
Robert Muir
2011-10-29, 21:36
Simon Willnauer
2011-10-29, 21:52
DM Smith
2011-10-29, 22:36
|
-
accessing the query string from inside TokenFilter
Bernd Fehling
2011-10-26, 06:32
Dear list,
while writing a TokenFilter for my analyzer chain, I need access to the query string from inside my TokenFilter for a comparison, but filters work on a TokenStream and only get separate tokens. So far I have not found any way to access the query string.

It would be great to have such functionality in Lucene/Solr.

Should I open a JIRA issue for it, or is there a wish list somewhere?

Best regards
Bernd
-
RE: accessing the query string from inside TokenFilter
Uwe Schindler
2011-10-26, 06:58
Hi,
QueryParser and TokenStreams are clearly separated. There is no way to get the query string from inside a TokenStream (and there cannot be, because the QP is just a consumer of the TS, and the TS is not used only for query parsing). The only chance you have is to use a ThreadLocal that you set before the query is parsed and then read in the TokenFilter.

Uwe
-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [EMAIL PROTECTED]
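The ThreadLocal workaround mentioned above could look roughly like the following. This is only a sketch in plain Java with no Lucene dependency: `QueryStringHolder`, `filterToken`, and `parse` are hypothetical names, `filterToken` merely stands in for the per-token work of a TokenFilter, and the whitespace split stands in for the real analysis chain.

```java
import java.util.ArrayList;
import java.util.List;

public class ThreadLocalQueryDemo {

    /** Hypothetical holder for the raw query string; not a Lucene class. */
    static final class QueryStringHolder {
        private static final ThreadLocal<String> CURRENT = new ThreadLocal<>();

        static void set(String query) { CURRENT.set(query); }
        static String get() { return CURRENT.get(); }
        static void clear() { CURRENT.remove(); }
    }

    /** Stand-in for a TokenFilter: sees one token, but can look up the whole query. */
    static String filterToken(String token) {
        String query = QueryStringHolder.get();
        // Example comparison: mark a token that makes up the entire query.
        boolean wholeQuery = query != null && query.trim().equals(token);
        return wholeQuery ? token + "[whole]" : token;
    }

    /** Stand-in for query parsing: set the holder, analyze, always clean up. */
    static List<String> parse(String query) {
        QueryStringHolder.set(query);
        try {
            List<String> out = new ArrayList<>();
            for (String token : query.trim().split("\\s+")) {
                out.add(filterToken(token));
            }
            return out;
        } finally {
            // Do not leak the value to the next request on a pooled thread.
            QueryStringHolder.clear();
        }
    }

    public static void main(String[] args) {
        System.out.println(parse("lucene"));
        System.out.println(parse("lucene solr"));
    }
}
```

The `finally` block matters in a servlet container such as the one Solr runs in, where threads are reused: without it, a filter could see the query string of an earlier request.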
-
Re: accessing the query string from inside TokenFilter
Simon Willnauer
2011-10-26, 07:21
What Uwe says is correct, though. What we could possibly do is add a query attribute that is set in a query parser (you can do that yourself, though). Not sure if it is worth it and if we should do it.

simon
-
RE: accessing the query string from inside TokenFilter
Uwe Schindler
2011-10-26, 07:32
Hi Simon,
The problem is the exchanged consumer/producer role. Once the TokenStream calls clearAttributes(), the attributes are gone, but the query parser can only set the attribute *before* calling incrementToken(). So you have no chance to read it: the Tokenizer has cleared it before any filter can see it (unless we use an attribute whose clear() is a no-op, which would fail lots of tests, as it's a hack).

Uwe
-
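The ordering problem with clearAttributes() described in Uwe's reply can be illustrated with a toy model. This is not Lucene code: the shared `Map` is a simplified stand-in for the attribute state of a TokenStream chain, and `ToyTokenizer`, `ToyFilter`, and `queryVisibleToFilter` are hypothetical names chosen for the sketch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

public class ClearAttributesDemo {

    /** Toy tokenizer: wipes the shared attributes, then emits the next token. */
    static final class ToyTokenizer {
        private final Map<String, String> attributes;
        private final Iterator<String> words;

        ToyTokenizer(Map<String, String> attributes, String text) {
            this.attributes = attributes;
            this.words = Arrays.asList(text.trim().split("\\s+")).iterator();
        }

        boolean incrementToken() {
            attributes.clear();                 // models clearAttributes()
            if (!words.hasNext()) return false;
            attributes.put("term", words.next());
            return true;
        }
    }

    /** Toy filter: runs after the tokenizer and tries to read the "query" attribute. */
    static final class ToyFilter {
        private final Map<String, String> attributes;
        private final ToyTokenizer input;
        String seenQuery;                       // what the filter saw on its last call

        ToyFilter(Map<String, String> attributes, ToyTokenizer input) {
            this.attributes = attributes;
            this.input = input;
        }

        boolean incrementToken() {
            if (!input.incrementToken()) return false;
            seenQuery = attributes.get("query");
            return true;
        }
    }

    /** Consumer in the query parser role: sets the attribute, then pulls one token. */
    static String queryVisibleToFilter(String query) {
        Map<String, String> attributes = new HashMap<>();
        ToyFilter filter = new ToyFilter(attributes, new ToyTokenizer(attributes, query));
        attributes.put("query", query);         // set *before* incrementToken()
        filter.incrementToken();
        return filter.seenQuery;                // null: the tokenizer cleared it first
    }

    public static void main(String[] args) {
        System.out.println(queryVisibleToFilter("foo bar")); // prints null
    }
}
```

In this model the only way for the filter to see the value would be a `clear()` that skips the "query" entry, which corresponds to the no-op clear() hack that the reply rules out.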
Re: accessing the query string from inside TokenFilter
Chris Male
2011-10-26, 07:37
We've also lost the full query string by the time the QP creates its TokenStream, right? Because the QP tokenizes on whitespace.

--
Chris Male | Software Developer | DutchWorks | www.dutchworks.nl
-
Re: accessing the query string from inside TokenFilter
Bernd Fehling
2011-10-26, 08:06
From what I can see in the debugger, the analyzer chain is implemented as a stack with the last filter at the bottom and the first filter at the top.

An analyzer query chain of:
charFilter: MappingCharFilterFactory
tokenizer : WhitespaceTokenizerFactory
filter : PatternReplaceFilterFactory
filter : LowerCaseFilterFactory
filter : ShingleFilterFactory
filter : SynonymFilterFactory

has a chain of:
this.input(SynonymFilter) --> input(ShingleFilter) -->
input(LowerCaseFilter) --> input(PatternReplaceFilter) -->
input(WhitespaceTokenizer) --> input(MappingCharFilter) -->
input(CharReader) --> input(StringReader).str

So I can always "see" the input of StringReader, but can I access it?

Bernd

*************************************************************
Bernd Fehling                Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)           Universitätsstr. 25
Tel. +49 521 106-4060        Fax. +49 521 106-4052
[EMAIL PROTECTED]            33615 Bielefeld

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
-
RE: accessing the query string from inside TokenFilter
Uwe Schindler
2011-10-26, 08:26
The input from StringReader does not help you:
- in the case of QueryParser it is *not* the query string!!!
- storing it in an attribute would blow up your heap for real documents

Uwe
-
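Why the StringReader at the bottom of the chain does not hold the query string can be seen in a toy model of a parser that splits on whitespace before it analyzes anything, as Chris Male and Uwe describe for the QP. This is a plain-Java sketch with hypothetical names (`drain`, `analyzerInputs`); a real query parser also handles fields, operators, and quoting, which the sketch ignores.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class ReaderFragmentDemo {

    /** Reads everything a Reader holds: the most an analyzer chain could ever see. */
    static String drain(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            char[] buf = new char[256];
            int n;
            while ((n = reader.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Toy query parser: splits on whitespace first, then analyzes each chunk alone. */
    static List<String> analyzerInputs(String query) {
        List<String> seen = new ArrayList<>();
        for (String chunk : query.trim().split("\\s+")) {
            // Each chunk gets its own reader, so no reader contains the full query.
            seen.add(drain(new StringReader(chunk)));
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(analyzerInputs("new york city")); // prints [new, york, city]
    }
}
```

Under these assumptions, a filter that reached down to the reader would find only "new", then "york", then "city", never "new york city".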
Re: accessing the query string from inside TokenFilter
Bernd Fehling
2011-10-26, 12:05
OK, I think "query string" is a bit too specific. More generally, what I need is access from inside a filter to the complete string (not only the token) being analyzed.

A very dirty workaround would be a "collector filter" which collects all tokens after WhitespaceTokenizer and makes them somehow available to the following filters, wouldn't it? So at least at the last run of incrementToken() I would have the original string.

Bernd
-
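The "collector filter" idea from Bernd's message could be sketched as follows. This is a plain-Java toy, not a Lucene TokenFilter: tokens are modeled as an `Iterator<String>`, and `CollectorFilter`, `collectedText`, and `run` are hypothetical names. Unlike the proposal, which has the full string only at the last incrementToken() call, this variant buffers everything up front so the text is available from the first token on.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class CollectorFilterDemo {

    /** Toy "collector filter": buffers every upstream token, then replays them. */
    static final class CollectorFilter implements Iterator<String> {
        private final List<String> buffered = new ArrayList<>();
        private final Iterator<String> replay;

        CollectorFilter(Iterator<String> upstream) {
            while (upstream.hasNext()) {
                buffered.add(upstream.next());
            }
            replay = buffered.iterator();
        }

        /** Tokens joined by single spaces; the original spacing is not recoverable. */
        String collectedText() {
            return String.join(" ", buffered);
        }

        @Override public boolean hasNext() { return replay.hasNext(); }
        @Override public String next() { return replay.next(); }
    }

    /** Downstream stage: pairs each token with the full collected text. */
    static List<String> run(String text) {
        Iterator<String> tokenizer = Arrays.asList(text.trim().split("\\s+")).iterator();
        CollectorFilter collector = new CollectorFilter(tokenizer);
        List<String> out = new ArrayList<>();
        while (collector.hasNext()) {
            out.add(collector.next() + " <- " + collector.collectedText());
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run("foo  bar")); // prints [foo <- foo bar, bar <- foo bar]
    }
}
```

Two limits follow directly from the thread: buffering the whole input is what Uwe warns about for real documents, and if the query parser has already split on whitespace, the collector only ever sees one fragment per stream.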
Re: accessing the query string from inside TokenFilter
Robert Muir
2011-10-26, 12:09
Use a queryparser that doesn't break on whitespace as a workaround?
Or, we can start thinking about how to fix QueryParser (https://issues.apache.org/jira/browse/LUCENE-2605)

The bug is that QueryParser tries to be a Tokenizer and breaks on whitespace. Allowing the tokenizer access to the query string would just mean that your tokenizer hacks around this by trying to be a QueryParser, too, making matters even worse!

lucidimagination.com
-
Re: accessing the query string from inside TokenFilter
DM Smith
2011-10-26, 12:28
+1 please fix the QP bug. It should only identify query keywords and non-keywords.
-
Re: accessing the query string from inside TokenFilterSimon Willnauer 2011-10-26, 12:33
On Wed, Oct 26, 2011 at 2:09 PM, Robert Muir <[EMAIL PROTECTED]> wrote:
> Use a queryparser that doesn't break on whitespace as a workaround?
> Or, we can start thinking about how to fix QueryParser
> (https://issues.apache.org/jira/browse/LUCENE-2605)

+1

> The bug is that QueryParser tries to be a Tokenizer and breaks on whitespace.
> Allowing tokenizer access to the query string would just mean that
> your tokenizer hacks around this by trying to be a QueryParser, too,
> making matters even worse!
-
Re: accessing the query string from inside TokenFilterBernd Fehling 2011-10-26, 12:49
Thanks, Robert, for pointing me to the issue. That's exactly my problem,
because I'm trying to implement "query time synonym expansion". It is therefore necessary to "clean up" the synonym result with the help of the query string.

Interestingly, my FAST system calls synonym twice for query parsing:
... synonym parse synonym ...

I would be pleased to have this fixed so that QueryParser is not also a tokenizer. But having looked into QueryParser (which scared me to death), is it possible to fix it at all without introducing other bad side effects?

Using a phrase query works so far for getting the complete query string to the analyzer at once.

On 26.10.2011 14:09, Robert Muir wrote:
> Use a queryparser that doesn't break on whitespace as a workaround?
> Or, we can start thinking about how to fix QueryParser
> (https://issues.apache.org/jira/browse/LUCENE-2605)
>
> The bug is that QueryParser tries to be a Tokenizer and breaks on whitespace.
> Allowing tokenizer access to the query string would just mean that
> your tokenizer hacks around this by trying to be a QueryParser, too,
> making matters even worse!

*************************************************************
Bernd Fehling                Universitätsbibliothek Bielefeld
Dipl.-Inform. (FH)                        Universitätsstr. 25
Tel. +49 521 106-4060                   Fax. +49 521 106-4052
[EMAIL PROTECTED]                             33615 Bielefeld

BASE - Bielefeld Academic Search Engine - www.base-search.net
*************************************************************
-
Re: accessing the query string from inside TokenFilterChris Hostetter 2011-10-29, 20:27
: The bug is that QueryParser tries to be a Tokenizer and breaks on whitespace.
: Allowing tokenizer access to the query string would just mean that

Calling this a bug in the QueryParser is grossly misleading -- saying that QueryParser is buggy because it parses on whitespace characters is like saying it's buggy because it doesn't treat + and - as literal input characters.

Whitespace characters (and +, and -, and quotes, and parens, etc.) that are not quoted or escaped are syntactically meaningful markup characters to the QueryParser -- they tell the query parser where one clause of a boolean query ends and another clause begins.

If this isn't the parsing behavior that you want, then either escape the whitespace characters, or don't use the Lucene QueryParser -- use some other parser that doesn't have meta characters.

-Hoss
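To make the point about unescaped whitespace concrete, here is a minimal sketch in plain Java. It is not Lucene's QueryParser and does not use its grammar; the class `ClauseSplitSketch` and the method `splitClauses` are hypothetical names invented for this illustration, and operators such as + and - are ignored. It only shows how whitespace acts as a clause delimiter unless it is backslash-escaped or inside double quotes, which is why an analyzer downstream sees "new" and "york" separately in the first case.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration only: a toy clause splitter, NOT Lucene's QueryParser grammar. */
public class ClauseSplitSketch {

    /**
     * Splits the input into clauses on whitespace that is neither
     * backslash-escaped nor inside double quotes.
     */
    public static List<String> splitClauses(String input) {
        List<String> clauses = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (c == '\\' && i + 1 < input.length()) {
                // Escaped character: keep the next character literally.
                current.append(input.charAt(++i));
            } else if (c == '"') {
                // Quotes toggle "literal" mode and are not part of the clause text.
                inQuotes = !inQuotes;
            } else if (Character.isWhitespace(c) && !inQuotes) {
                // Unescaped, unquoted whitespace ends the current clause.
                if (current.length() > 0) {
                    clauses.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        if (current.length() > 0) {
            clauses.add(current.toString());
        }
        return clauses;
    }

    public static void main(String[] args) {
        System.out.println(splitClauses("new york city"));       // [new, york, city]
        System.out.println(splitClauses("new\\ york city"));     // [new york, city]
        System.out.println(splitClauses("\"new york\" city"));   // [new york, city]
    }
}
```

In the first call each word becomes its own clause, so any per-clause analysis never sees "new york" as a unit; escaping or quoting keeps the two words together.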
-
Re: accessing the query string from inside TokenFilterRobert Muir 2011-10-29, 21:23
On Sat, Oct 29, 2011 at 4:27 PM, Chris Hostetter
<[EMAIL PROTECTED]> wrote:
>
> : The bug is that QueryParser tries to be a Tokenizer and breaks on whitespace.
> : Allowing tokenizer access to the query string would just mean that
>
> Calling this a bug in the QueryParser is grossly misleading -- saying
> that QueryParser is buggy because it parses on whitespace characters
> is like saying it's buggy because it doesn't treat + and - as
> literal input characters.

It's not really misleading. It's a bug.

>
> if this isn't the parsing behavior that you want, then either escape the
> whitespace characters, or don't use the Lucene QueryParser -- use some
> other parser that doesn't have meta characters.
>

The queryparser's grammar/behavior is hardly set in stone. We can improve it; that's why the issue is open for anyone who figures out a good solution here.

--
lucidimagination.com
-
Re: accessing the query string from inside TokenFilterChris Hostetter 2011-10-29, 21:33
: it's not really misleading. It's a bug.
	...
: the queryparser's grammar/behavior is hardly set in stone. we can
: improve it, that's why the issue is open for anyone that figures out a
: good solution here.

It's working exactly as designed: whitespace delimits clauses.

A new parser (or a new syntax) is a great idea that's been a long time coming -- but claiming the current query parser implementation is broken because it works this way is *absolutely* misleading. Among other things, it suggests to novice users/developers that the QP is *not* working as designed, and that there may be some way to "fix" the QP w/o changing its syntax. I have yet to see/hear anyone propose anything close to a solution that would actually accomplish that. (It seems damn near impossible to come up with a way for whitespace to simultaneously be markup to the queryparser and also literal text -- but I look forward to hearing an actual suggestion.)

-Hoss
-
Re: accessing the query string from inside TokenFilterRobert Muir 2011-10-29, 21:36
On Sat, Oct 29, 2011 at 5:33 PM, Chris Hostetter
<[EMAIL PROTECTED]> wrote:
> : it's not really misleading. It's a bug.
> ...
> : the queryparser's grammar/behavior is hardly set in stone. we can
> : improve it, that's why the issue is open for anyone that figures out a
> : good solution here.
>
> it's working exactly as designed: whitespace delimits clauses.
>

That doesn't mean it's not a bug; designs can have bugs too. There might be a way to fix this problem with the current QP that's reasonably compatible with the existing syntax (a fair change for 4.0, in my opinion).

--
lucidimagination.com
-
Re: accessing the query string from inside TokenFilterSimon Willnauer 2011-10-29, 21:52
On Sat, Oct 29, 2011 at 11:36 PM, Robert Muir <[EMAIL PROTECTED]> wrote:
> On Sat, Oct 29, 2011 at 5:33 PM, Chris Hostetter
> <[EMAIL PROTECTED]> wrote:
>> : it's not really misleading. It's a bug.
>> ...
>> : the queryparser's grammar/behavior is hardly set in stone. we can
>> : improve it, that's why the issue is open for anyone that figures out a
>> : good solution here.
>>
>> it's working exactly as designed: whitespace delimits clauses.
>>
>
> that doesn't mean it's not a bug, designs can have bugs too. There
> might be a way to fix this problem with the current QP that's
> reasonably compatible with the existing syntax (fair change for 4.0 in
> my opinion)

+1. I think we should revise this design; it limits query parsing a lot. Just look at multi-term synonyms etc. Our analyzer needs to see the query terms earlier, i.e. before we split on whitespace.

simon
-
Re: accessing the query string from inside TokenFilterDM Smith 2011-10-29, 22:36
Just a thought:
One way to maintain backward compatibility would be to have a two-stage parser. The first stage breaks the input on keywords. The second does the rest of the work that the current QP does. Those who want the old behavior use both stages, but those who want the new behavior use just the first stage.

Analyzers such as Standard, Simple, ... could use the two-stage parser for the sake of backward compatibility.

-- DM
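The two-stage idea can be sketched roughly as follows, in plain Java with no Lucene dependency. This is one possible reading of the proposal, not an implementation anyone in the thread wrote: the class `TwoStageSketch`, its methods, and the choice of AND, OR and NOT as the "keywords" are all assumptions made for illustration. Stage one breaks only on those keywords and leaves multi-word runs intact; stage two then splits each run on whitespace, which approximates the existing behavior.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustration only: hypothetical two-stage splitting, not Lucene code. */
public class TwoStageSketch {

    // Assumed keyword set; the real QueryParser syntax has more markup than this.
    private static final List<String> KEYWORDS = Arrays.asList("AND", "OR", "NOT");

    /**
     * Stage 1: break the query only on operator keywords. Words between
     * keywords stay together as one run (whitespace is normalized to single spaces).
     */
    public static List<String> stageOne(String query) {
        List<String> segments = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (String word : query.trim().split("\\s+")) {
            if (KEYWORDS.contains(word)) {
                flush(run, segments);
                segments.add(word);
            } else {
                if (run.length() > 0) {
                    run.append(' ');
                }
                run.append(word);
            }
        }
        flush(run, segments);
        return segments;
    }

    /** Stage 2: the legacy-style step, splitting every run on whitespace. */
    public static List<String> stageTwo(List<String> segments) {
        List<String> out = new ArrayList<>();
        for (String segment : segments) {
            out.addAll(Arrays.asList(segment.split(" ")));
        }
        return out;
    }

    private static void flush(StringBuilder run, List<String> segments) {
        if (run.length() > 0) {
            segments.add(run.toString());
            run.setLength(0);
        }
    }

    public static void main(String[] args) {
        List<String> first = stageOne("new york AND pizza");
        System.out.println(first);            // [new york, AND, pizza]
        System.out.println(stageTwo(first));  // [new, york, AND, pizza]
    }
}
```

With only stage one, an analyzer would receive "new york" as a single unit, which is what multi-term synonym handling needs; running both stages gives back the per-word clauses that existing users expect.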