RE: Looking for a code pattern to pass stop words as an attribute
Uwe Schindler 2012-08-22, 08:03
You could misuse the attributes API:
All filters in a chain share the same attributes. The chaining achieves this: new TokenFilter(other TS) shares the attributes of the stream it wraps.

What you could do to be non-linear in chaining: create "helpers" that are not part of the chain by linking them to the input TokenStream, but never call incrementToken() on them. Their internals will always see the same attributes and attribute contents, so you could call accept(), if it were not protected.

The stream is controlled by our TokenFilter, so we call incrementToken() only on ours. We just misuse the accept() method, because it operates on the attributes we already populated by our own call to incrementToken():

    stopwordMarkFilter = new TokenFilter(....) {
      private final markerAtt = addAttribute(...);
      private final FilteringTokenFilter japanesePOS =
          new JapanesePartOfSpeechStopFilter(true, input, stoptags);
      private final FilteringTokenFilter stopfilter =
          new StopFilter(matchVersion, input, stopwords);

      public boolean incrementToken() throws IOException {
        if (!input.incrementToken()) return false;
        if (!japanesePOS.accept() || !stopfilter.accept()) {
          // mark the current token as a stopword.
          markerAtt.setIsStopword(true);
        }
        return true;
      }
    }

The only problem: as accept() is not intended to be called from the outside, it is of course protected...

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [EMAIL PROTECTED]

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Dawid Weiss
> Sent: Wednesday, August 22, 2012 8:51 AM
> To: [EMAIL PROTECTED]
> Subject: Re: Looking for a code pattern to pass stop words as an attribute
>
> Thanks for replies Steve, Uwe.
>
> > if you don't want to create your own "marker filter", you can use
> > KeywordMarkerFilter (http://goo.gl/OOgf4) instead
>
> This is pretty much what I had come up with, although I used a custom filter
> class (with a similar attribute). The thing I have trouble with is, however,
> that stop words may be based not only on term images but also on other
> attributes. In particular, the Japanese pipeline uses _two_ term suppression
> classes:
>
>     stream = new JapanesePartOfSpeechStopFilter(true, stream, stoptags);
>     ...
>     stream = new StopFilter(matchVersion, stream, stopwords);
>
> Of course I can just copy/paste the source of these and build my own keyword
> marker, this is clear to me. But I'd rather build a filter that delegates to
> these original classes and aggregates their output so that I don't have to
> rebuild things on every upgrade, and this is where I'm kind of stuck.
> Something like:
>
>     if (!japanesePOS.accept() || !stopfilter.accept()) {
>       // mark the current token as a stopword.
>     }
>
> I'm just not sure I can create such a non-linear filter pipeline without
> confusing the attribute management code. Note that the above filters
> (japanesePOS, blah) would _not_ be part of the token stream, they would be
> attached to one of the filters. Don't know if I'm clear.
>
> Dawid
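
For anyone who wants to play with the idea from the reply without a Lucene checkout, here is a minimal plain-Java model of it. It is only a sketch and does not use the Lucene API: SharedAttributes, StopwordMarker and StopwordMarkerDemo are hypothetical names, the BooleanSupplier helpers merely stand in for the accept() methods of the two stop filters, and the stop words and tags are made-up sample data. What it shows is the mechanism: the helpers read the same state that the controlling stream populates, and only the controlling stream ever advances the input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.BooleanSupplier;

/** Demo driver. All names here are hypothetical; none of this is Lucene code. */
final class StopwordMarkerDemo {
    static List<String> run() {
        SharedAttributes atts = new SharedAttributes();
        Set<String> stopwords = new HashSet<>(Arrays.asList("the", "of"));
        Set<String> stopTags = new HashSet<>(Arrays.asList("particle"));

        // Helpers mimic accept(): they only read the shared state and never advance the input.
        List<BooleanSupplier> helpers = Arrays.asList(
                () -> !stopTags.contains(atts.partOfSpeech),
                () -> !stopwords.contains(atts.term));

        // Sample tokens as {term, partOfSpeech} pairs.
        List<String[]> tokens = Arrays.asList(
                new String[] {"the", "article"},
                new String[] {"cat", "noun"},
                new String[] {"wa", "particle"},
                new String[] {"sat", "verb"});

        StopwordMarker marker = new StopwordMarker(atts, tokens.iterator(), helpers);
        List<String> out = new ArrayList<>();
        while (marker.incrementToken()) {
            // Flagged tokens stay in the stream and get a trailing '*'.
            out.add(atts.stopword ? atts.term + "*" : atts.term);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}

/** Stands in for the attribute state shared by every filter in a chain. */
final class SharedAttributes {
    String term;
    String partOfSpeech;
    boolean stopword; // plays the role of the marker attribute
}

/** The only component that advances the input; it flags tokens instead of dropping them. */
final class StopwordMarker {
    private final SharedAttributes atts;
    private final Iterator<String[]> input;
    private final List<BooleanSupplier> helpers;

    StopwordMarker(SharedAttributes atts, Iterator<String[]> input, List<BooleanSupplier> helpers) {
        this.atts = atts;
        this.input = input;
        this.helpers = helpers;
    }

    boolean incrementToken() {
        if (!input.hasNext()) {
            return false;
        }
        String[] token = input.next();
        atts.term = token[0];
        atts.partOfSpeech = token[1];
        atts.stopword = false;
        for (BooleanSupplier accept : helpers) {
            if (!accept.getAsBoolean()) {
                atts.stopword = true; // at least one helper would have dropped this token
                break;
            }
        }
        return true;
    }
}
```

Running StopwordMarkerDemo.run() yields the four terms in order, with "the" and "wa" carrying the marker. In real Lucene code the helpers would be the actual filters constructed on the same input, and the access problem with the protected accept() method would still have to be solved separately.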