Solr, mail # user - custom TokenFilter


Re: custom TokenFilter
Jamie Johnson 2012-02-10, 01:28
Thanks Robert, I'll take a look there.  Does it sound like I'm on the
right track with what I'm implementing?  In other words, is a
TokenFilter appropriate, or is there something else that would be a
better fit for what I've described?
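
For readers following the thread: the pattern under discussion (one incoming token expanded into several outgoing tokens, with the pending tokens held in an instance field) can be modeled without Lucene. The sketch below is a plain-Java analogue and not Lucene's TokenFilter API; ExpandingFilter, next(), reset(), and the hyphen-splitting expander are hypothetical names chosen for illustration, with the expander standing in for StaticClass.generateTokens. It assumes the same filter instance may be pointed at a new input before the previous one was fully consumed, which is one possible way leftover buffered state could surface as errors.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.function.Function;

/** Plain-Java model of a one-to-many token filter. Not Lucene's API. */
class ExpandingFilter {
    // Hypothetical stand-in for StaticClass.generateTokens(String).
    private final Function<String, List<String>> expander;
    // Upstream tokens; replaced whenever the instance is reused.
    private Iterator<String> input;
    // Generated tokens that have not been emitted yet (per-instance state).
    private final Deque<String> pending = new ArrayDeque<>();

    ExpandingFilter(Function<String, List<String>> expander) {
        this.expander = expander;
    }

    /** Reuse this instance for a new input, discarding leftover tokens. */
    void reset(Iterator<String> newInput) {
        input = newInput;
        pending.clear();
    }

    /** Returns the next generated token, or null when the input is exhausted. */
    String next() {
        // Refill from upstream until something is buffered; an input token
        // may expand to zero tokens, hence the loop.
        while (pending.isEmpty()) {
            if (input == null || !input.hasNext()) {
                return null;
            }
            pending.addAll(expander.apply(input.next()));
        }
        return pending.poll();
    }
}

class ExpandingFilterDemo {
    public static void main(String[] args) {
        ExpandingFilter filter =
                new ExpandingFilter(s -> Arrays.asList(s.split("-")));
        filter.reset(Arrays.asList("a-b", "c").iterator());
        for (String t = filter.next(); t != null; t = filter.next()) {
            System.out.println(t);
        }
    }
}
```

The point of the model is that all buffered state lives in the instance and is cleared in reset(); if reset() left `pending` untouched, tokens from a previous input would leak into the next one.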

On Thu, Feb 9, 2012 at 6:44 PM, Robert Muir <[EMAIL PROTECTED]> wrote:
> If you are writing a custom tokenstream, I recommend using some of the
> resources in Lucene's test-framework.jar to test it.
> These find lots of bugs! (including thread-safety bugs)
>
> For a filter, I recommend using the assertions in
> BaseTokenStreamTestCase: assertTokenStreamContents, assertAnalyzesTo,
> and especially checkRandomData:
> http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/analysis/BaseTokenStreamTestCase.java
>
> When testing your filter, don't use WhitespaceTokenizer or
> KeywordTokenizer; use MockTokenizer, which performs more checks:
> http://svn.apache.org/repos/asf/lucene/dev/trunk/lucene/test-framework/src/java/org/apache/lucene/analysis/MockTokenizer.java
>
> For some examples, you can look at the tests in modules/analysis.
>
> And of course enable assertions (-ea) when testing!
>
> On Thu, Feb 9, 2012 at 6:30 PM, Jamie Johnson <[EMAIL PROTECTED]> wrote:
>> I need to take user input and index it in a unique fashion:
>> essentially the value is some string (say "abcdefghijk") that needs to
>> be converted into a set of tokens (say 1 2 3 4).  I have currently
>> implemented a custom TokenFilter to do this; is this appropriate?  In
>> cases where I am indexing things slowly (i.e. one at a time) this works
>> fine, but when I send 10,000 things to Solr (all in one thread) I am
>> noticing exceptions where it seems that the generated instance
>> variable is being used by several threads.  Is my implementation
>> appropriate, or is there a more appropriate way to do this?  Are
>> TokenFilters reused?  Would it be more appropriate to convert the
>> stream to a single space-separated token and then run that through a
>> WhitespaceTokenizer?  Any guidance on this would be greatly
>> appreciated.
>>
>>        class CustomFilter extends TokenFilter {
>>                private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
>>                private final PositionIncrementAttribute posAtt = addAttribute(PositionIncrementAttribute.class);
>>                protected CustomFilter(TokenStream input) {
>>                        super(input);
>>                }
>>
>>                Iterator<AttributeSource> replacement;
>>                @Override
>>                public boolean incrementToken() throws IOException {
>>
>>
>>                        if(generated == null){
>>                                //setup generated
>>                                if(!input.incrementToken()){
>>                                        return false;
>>                                }
>>
>>                                //clearAttributes();
>>                                List<String> cells = StaticClass.generateTokens(termAtt.toString());
>>                                generated = new ArrayList<AttributeSource>(cells.size());
>>                                boolean first = true;
>>                                for(String cell : cells) {
>>                                        AttributeSource newTokenSource = this.cloneAttributes();
>>
>>                                        CharTermAttribute newTermAtt = newTokenSource.addAttribute(CharTermAttribute.class);
>>                                        newTermAtt.setEmpty();
>>                                        newTermAtt.append(cell);
>>                                        OffsetAttribute newOffsetAtt = newTokenSource.addAttribute(OffsetAttribute.class);
>>                                        PositionIncrementAttribute newPosIncAtt = newTokenSource.addAttribute(PositionIncrementAttribute.class);