Jamie Johnson
2012-02-09, 23:30
Robert Muir
2012-02-09, 23:44
Jamie Johnson
2012-02-10, 01:28
Robert Muir
2012-02-10, 01:44
Jamie Johnson
2012-02-10, 01:54
Robert Muir
2012-02-10, 02:02
Jamie Johnson
2012-02-10, 03:38
Jamie Johnson
2012-02-10, 03:47
-
custom TokenFilter
Jamie Johnson 2012-02-09, 23:30
I need to take user input and index it in a unique fashion: the value is some string (say "abcdefghijk") that needs to be converted into a set of tokens (say 1 2 3 4). I have currently implemented a custom TokenFilter to do this. Is this appropriate?

When I index things slowly (i.e. 1 at a time) this works fine, but when I send 10,000 things to Solr (all in one thread) I see exceptions that suggest the generated instance variable is being used by several threads. Is my implementation appropriate, or is there a more appropriate way to do this? Are TokenFilters reused? Would it be better to convert the stream to a single space-separated token and then run that through a WhitespaceTokenizer? Any guidance on this would be greatly appreciated.

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.LinkedList;
    import java.util.List;

    import org.apache.lucene.analysis.TokenFilter;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
    import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
    import org.apache.lucene.util.AttributeSource;

    class CustomFilter extends TokenFilter {
        private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
        private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);

        protected CustomFilter(TokenStream input) {
            super(input);
        }

        Iterator<AttributeSource> replacement;

        // Tokens generated from the current input token, waiting to be emitted.
        // Assumed declaration; the field is used but not declared in the post.
        private List<AttributeSource> generated;

        @Override
        public boolean incrementToken() throws IOException {
            if (generated == null) {
                // Set up generated from the next input token.
                if (!input.incrementToken()) {
                    return false;
                }

                //clearAttributes();
                List<String> cells = StaticClass.generateTokens(termAtt.toString());
                generated = new ArrayList<AttributeSource>(cells.size());
                boolean first = true;
                for (String cell : cells) {
                    AttributeSource newTokenSource = this.cloneAttributes();

                    CharTermAttribute newTermAtt = newTokenSource.addAttribute(CharTermAttribute.class);
                    newTermAtt.setEmpty();
                    newTermAtt.append(cell);
                    OffsetAttribute newOffsetAtt = newTokenSource.addAttribute(OffsetAttribute.class);
                    PositionIncrementAttribute newPosIncAtt = newTokenSource.addAttribute(PositionIncrementAttribute.class);
                    newOffsetAtt.setOffset(0, 0);
                    // The first generated token advances the position; the rest stack on it.
                    newPosIncAtt.setPositionIncrement(first ? 1 : 0);
                    generated.add(newTokenSource);
                    first = false;
                }
            }
            // Emit the buffered tokens one at a time. Once the buffer is empty the
            // stream ends, so only the first input token is ever expanded.
            if (!generated.isEmpty()) {
                copy(this, generated.remove(0));
                return true;
            }
            return false;
        }

        private void copy(AttributeSource target, AttributeSource source) {
            if (target != source)
                source.copyTo(target);
        }

        // The members below are not used by incrementToken().
        private LinkedList<AttributeSource> buffer;
        private LinkedList<AttributeSource> matched;
        private boolean exhausted;

        private AttributeSource nextTok() throws IOException {
            if (buffer != null && !buffer.isEmpty()) {
                return buffer.removeFirst();
            } else {
                if (!exhausted && input.incrementToken()) {
                    return this;
                } else {
                    exhausted = true;
                    return null;
                }
            }
        }

        @Override
        public void reset() throws IOException {
            super.reset();
            generated = null;
        }
    }
-
Re: custom TokenFilter
Robert Muir 2012-02-09, 23:44
If you are writing a custom TokenStream, I recommend using some of the resources in Lucene's test-framework.jar to test it. These find lots of bugs (including thread-safety bugs)!

For a filter, I recommend using the assertions in BaseTokenStreamTestCase: assertTokenStreamContents, assertAnalyzesTo, and especially checkRandomData:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/analysis/BaseTokenStreamTestCase.java

When testing your filter, for even more checks, don't use the Whitespace or Keyword tokenizers; use MockTokenizer, which has more checks:
http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/analysis/MockTokenizer.java

For some examples, you can look at the tests in modules/analysis.

And of course enable assertions (-ea) when testing!

--
lucidimagination.com
-
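One property that tests of this kind can exercise is consistency under reuse: a long-lived instance should produce exactly the tokens that a freshly created instance produces for the same text. The following is a minimal plain-Java sketch of that idea, not Lucene code and not the implementation of checkRandomData. It assumes token streams are reused across documents, as the original question suspects. An analyzer is modelled as a `Function<String, List<String>>`, and `ReuseCheck`, `consistent`, `stateless`, and `leaky` are hypothetical names invented for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Function;
import java.util.function.Supplier;

/** Lucene-free model of a "reused instance must agree with a fresh one" check. */
class ReuseCheck {

    /** Builds a random lowercase string of length 1..maxLen. */
    static String randomText(Random rnd, int maxLen) {
        int len = 1 + rnd.nextInt(maxLen);
        StringBuilder sb = new StringBuilder(len);
        for (int i = 0; i < len; i++) {
            sb.append((char) ('a' + rnd.nextInt(26)));
        }
        return sb.toString();
    }

    /**
     * Feeds the same random texts to one long-lived analyzer and to a fresh
     * analyzer per text. Returns false as soon as the two disagree.
     */
    static boolean consistent(Supplier<Function<String, List<String>>> factory,
                              long seed, int rounds) {
        Random rnd = new Random(seed);
        Function<String, List<String>> reused = factory.get();
        for (int i = 0; i < rounds; i++) {
            String text = randomText(rnd, 20);
            List<String> expected = factory.get().apply(text);
            List<String> actual = reused.apply(text);
            if (!expected.equals(actual)) {
                return false;
            }
        }
        return true;
    }

    /** Well-behaved analyzer: all state is local to a single call. */
    static Function<String, List<String>> stateless() {
        return text -> {
            List<String> out = new ArrayList<>();
            for (char c : text.toCharArray()) {
                out.add(String.valueOf(c));
            }
            return out;
        };
    }

    /** Buggy analyzer: its buffer outlives each call and is never cleared. */
    static Function<String, List<String>> leaky() {
        List<String> buffer = new ArrayList<>();
        return text -> {
            for (char c : text.toCharArray()) {
                buffer.add(String.valueOf(c));
            }
            return new ArrayList<>(buffer);
        };
    }
}
```

Calling `ReuseCheck.consistent(ReuseCheck::leaky, 42L, 50)` reports the leftover state on the second text, while the stateless variant passes. In a real filter the equivalent fix is to clear every buffering field in reset().

-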
Re: custom TokenFilter
Jamie Johnson 2012-02-10, 01:28
Thanks Robert, I'll take a look there. Does it sound like I'm on the right track with what I'm implementing? In other words, is a TokenFilter appropriate, or is there something else that would be a better fit for what I've described?
-
Re: custom TokenFilter
Robert Muir 2012-02-10, 01:44
I can't say for sure, to be honest, because it's a bit too abstract. I don't know the reasoning behind trying to convert "abcdefghijk" to 1 2 3 4, and I'm not sure I really understand what that means either.

But in general: if you are taking the whole content of a field and making it into tokens, then it's best implemented as a tokenizer.
-
Re: custom TokenFilter
Jamie Johnson 2012-02-10, 01:54
Again, thanks. I'll take a stab at that. Are you aware of any resources or examples of how to do this? I figured I'd start with WhitespaceTokenizer, but wasn't sure if there was a simpler place to start.
-
Re: custom TokenFilter
Robert Muir 2012-02-10, 02:02
Well, the easiest is if you can build what you need out of existing resources...

But if you need to write your own, and your input is not massive documents (that is, you have no problem processing the whole field in RAM at once), you could try looking at PatternTokenizer for an example:
http://svn.apache.org/repos/asf/lucene/dev/trunk/modules/analysis/common/src/java/org/apache/lucene/analysis/pattern/PatternTokenizer.java
-
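The "process the whole field in RAM at once" approach can be sketched without Lucene. The plain-Java model below reads the entire Reader into a String on reset, generates all tokens up front, and then hands them out one at a time. It is only an illustration of the shape, not PatternTokenizer's code and not a Lucene Tokenizer: `WholeFieldTokenizer`, `readFully`, and `next` are hypothetical names, and `generateTokens` (fixed three-character chunks) is a made-up stand-in for whatever the real conversion does. A real Tokenizer would also fill in term, offset, and position attributes in incrementToken().

```java
import java.io.IOException;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Lucene-free model of a tokenizer that buffers the whole field first. */
class WholeFieldTokenizer {
    private final List<String> tokens = new ArrayList<>();
    private int index;

    /** Reads the entire input into memory; only sensible for small fields. */
    static String readFully(Reader in) {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        try {
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    /** Hypothetical token generator: splits the value into chunks of up to 3 chars. */
    static List<String> generateTokens(String value) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < value.length(); i += 3) {
            out.add(value.substring(i, Math.min(i + 3, value.length())));
        }
        return out;
    }

    /** Prepares the instance for a new field, discarding any earlier state. */
    void reset(Reader in) {
        tokens.clear();
        tokens.addAll(generateTokens(readFully(in)));
        index = 0;
    }

    /** Returns the next token, or null once the field is exhausted. */
    String next() {
        return index < tokens.size() ? tokens.get(index++) : null;
    }
}
```

After `reset(new StringReader("abcdefghijk"))`, repeated calls to `next()` yield the chunks in order and then null. Because reset rebuilds all state, the same instance can be used for the next field.

-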
Re: custom TokenFilter
Jamie Johnson 2012-02-10, 03:38
Thanks Robert, that worked perfectly for the index side of the house. Now on the query side I have a similar Tokenizer, but it's not operating quite the way I want it to. The query tokenizer generates the tokens properly, except that I'm ending up with a phrase query, i.e. field:"1 2 3 4", when I really want field:1 OR field:2 OR field:3 OR field:4. Is there something in the tokenizer that needs to be set to generate this type of query, or is it something in the query parser?
-
Re: custom TokenFilter
Jamie Johnson 2012-02-10, 03:47
Think I figured it out: the tokens just needed the same position attribute.
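-
The effect of token positions on the generated query can be modelled in a few lines of plain Java. This is a simplified sketch and not Lucene's query parser. It assumes tokens at successive positions are turned into a phrase, while tokens that all share one position (a position increment of 0 after the first token) are treated as alternatives and OR'ed together. `PositionModel`, `absolutePositions`, and `queryFor` are hypothetical names, and a mix of stacked and successive tokens is not modelled separately (it falls into the phrase branch here).

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

/** Simplified model of how token positions shape the resulting query. */
class PositionModel {

    /** Turns position increments into absolute positions, starting at 0. */
    static List<Integer> absolutePositions(int[] increments) {
        List<Integer> positions = new ArrayList<>();
        int pos = -1;
        for (int inc : increments) {
            pos += inc;
            positions.add(pos);
        }
        return positions;
    }

    /** Renders the query string this model would produce for the tokens. */
    static String queryFor(String field, String[] terms, int[] increments) {
        if (terms.length != increments.length) {
            throw new IllegalArgumentException("one increment per term is required");
        }
        if (terms.length == 0) {
            return "";
        }
        if (terms.length == 1) {
            return field + ":" + terms[0];
        }
        // All tokens on one position: treat them as alternatives.
        boolean stacked = new HashSet<>(absolutePositions(increments)).size() == 1;
        if (stacked) {
            List<String> clauses = new ArrayList<>();
            for (String term : terms) {
                clauses.add(field + ":" + term);
            }
            return String.join(" OR ", clauses);
        }
        // Tokens spread over several positions: treat them as a phrase.
        return field + ":\"" + String.join(" ", terms) + "\"";
    }
}
```

With increments 1, 1, 1, 1 the four tokens occupy positions 0 to 3 and the model renders a phrase; with increments 1, 0, 0, 0 they all sit at position 0 and the model renders the OR form asked for in the thread.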