-
PatternTokenizer failure
Jay Luker 2011-11-28, 17:01
Hi all,
I'm trying to use PatternTokenizer and not getting expected results. Not sure where the failure lies. What I'm trying to do is split my input on whitespace except in cases where the whitespace is preceded by a hyphen character. So to do this I'm using a negative look behind assertion in the pattern, e.g. "(?<!-)\s+".
Expected behavior:
"foo bar" -> ["foo","bar"] - OK
"foo \n bar" -> ["foo","bar"] - OK
"foo- bar" -> ["foo- bar"] - OK
"foo-\nbar" -> ["foo-\nbar"] - OK
"foo- \n bar" -> ["foo- \n bar"] - FAILS
Here's a test case that demonstrates the failure:
public void testPattern() throws Exception {
    Map<String,String> args = new HashMap<String, String>();
    args.put( PatternTokenizerFactory.GROUP, "-1" );
    args.put( PatternTokenizerFactory.PATTERN, "(?<!-)\\s+" );
    Reader reader = new StringReader("blah \n foo bar- baz\nfoo-\nbar- baz foo- \n bar");
    PatternTokenizerFactory tokFactory = new PatternTokenizerFactory();
    tokFactory.init( args );
    TokenStream stream = tokFactory.create( reader );
    assertTokenStreamContents(stream, new String[] {
        "blah", "foo", "bar- baz", "foo-\nbar- baz", "foo- \n bar" });
}
This fails with the following output: "org.junit.ComparisonFailure: term 4 expected:<foo- [\n bar]> but was:<foo- []>"
Am I doing something wrong? Incorrect expectations? Or could this be a bug?
Thanks, --jay
-
Re: PatternTokenizer failure
Erick Erickson 2011-11-29, 14:20
Hmmm, I tried this in straight Java, no Solr/Lucene involved and the behavior I'm seeing is that no example works if it has more than one whitespace character after the hyphen, including your failure example.
I haven't lived inside regexes long enough to know what the right regex should be, but it doesn't appear to be a Solr problem.
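The plain-Java check described above can be reproduced with a sketch along these lines. It uses only `java.util.regex.Pattern` and the pattern from the original post; the class name `OriginalPatternDemo` and the helper `split` are made up for illustration and are not part of Solr or Lucene.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class OriginalPatternDemo {
    // Pattern from the original post: a whitespace run not preceded by a hyphen.
    static final Pattern ORIGINAL = Pattern.compile("(?<!-)\\s+");

    // Hypothetical helper: split the input on the original pattern.
    static String[] split(String input) {
        return ORIGINAL.split(input);
    }

    public static void main(String[] args) {
        // The five inputs from the "expected behavior" list.
        String[] inputs = {
            "foo bar", "foo \n bar", "foo- bar", "foo-\nbar", "foo- \n bar"
        };
        for (String input : inputs) {
            String[] parts = split(input);
            System.out.println(parts.length + " part(s): " + Arrays.toString(parts));
        }
    }
}
```

Only the last input is split unexpectedly: it comes back as two parts, "foo- " and "bar", which is consistent with the JUnit failure reported above.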
Sorry I can't be more helpful. Erick
On Mon, Nov 28, 2011 at 12:01 PM, Jay Luker <[EMAIL PROTECTED]> wrote:
> Hi all,
>
> I'm trying to use PatternTokenizer and not getting expected results.
> Not sure where the failure lies. What I'm trying to do is split my
> input on whitespace except in cases where the whitespace is preceded
> by a hyphen character. So to do this I'm using a negative look behind
> assertion in the pattern, e.g. "(?<!-)\s+".
>
> Expected behavior:
> "foo bar" -> ["foo","bar"] - OK
> "foo \n bar" -> ["foo","bar"] - OK
> "foo- bar" -> ["foo- bar"] - OK
> "foo-\nbar" -> ["foo-\nbar"] - OK
> "foo- \n bar" -> ["foo- \n bar"] - FAILS
>
> Here's a test case that demonstrates the failure:
>
> public void testPattern() throws Exception {
>     Map<String,String> args = new HashMap<String, String>();
>     args.put( PatternTokenizerFactory.GROUP, "-1" );
>     args.put( PatternTokenizerFactory.PATTERN, "(?<!-)\\s+" );
>     Reader reader = new StringReader("blah \n foo bar- baz\nfoo-\nbar- baz foo- \n bar");
>     PatternTokenizerFactory tokFactory = new PatternTokenizerFactory();
>     tokFactory.init( args );
>     TokenStream stream = tokFactory.create( reader );
>     assertTokenStreamContents(stream, new String[] {
>         "blah", "foo", "bar- baz", "foo-\nbar- baz", "foo- \n bar" });
> }
>
> This fails with the following output:
> "org.junit.ComparisonFailure: term 4 expected:<foo- [\n bar]> but was:<foo- []>"
>
> Am I doing something wrong? Incorrect expectations? Or could this be a bug?
>
> Thanks,
> --jay
-
Re: PatternTokenizer failure
Michael Kuhlmann 2011-11-29, 14:37
On 29.11.2011 15:20, Erick Erickson wrote:
> Hmmm, I tried this in straight Java, no Solr/Lucene involved and the
> behavior I'm seeing is that no example works if it has more than
> one whitespace character after the hyphen, including your failure
> example.
>
> I haven't lived inside regexes for long enough that I don't know what
> the right regex should be, but it doesn't appear to be a Solr problem
Jay, I think the problem is this:
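You're checking that the character preceding a run of one or more whitespace characters is not a hyphen.
However, when there is more than one whitespace character, like this: "foo- \n bar", there is a second run of whitespace, "\n ", which is preceded by the first whitespace character, " ". The lookbehind fails at the space right after the hyphen, but the regex engine then retries one position later, where the preceding character is a space rather than a hyphen, so the match succeeds there.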
Therefore, you'll need to not only check for preceding hyphens, but also for preceding whitespaces.
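To make the mechanism concrete, here is a small sketch that asks `java.util.regex.Matcher` where the original pattern first matches in the failing input. The class name `LookbehindMatchDemo` and the helper `firstMatch` are hypothetical names chosen for this illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookbehindMatchDemo {
    // Hypothetical helper: returns {start, end} of the first match, or null if none.
    static int[] firstMatch(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.find() ? new int[] { m.start(), m.end() } : null;
    }

    public static void main(String[] args) {
        // Indices: f=0 o=1 o=2 -=3 ' '=4 '\n'=5 ' '=6 b=7 a=8 r=9
        String input = "foo- \n bar";

        int[] span = firstMatch("(?<!-)\\s+", input);
        // The space at index 4 is rejected (preceded by '-'), so the
        // match starts at index 5, where the preceding character is ' '.
        System.out.println("start=" + span[0] + " end=" + span[1]);
    }
}
```

The match covers indices 5 to 7 (end exclusive), that is, the "\n " run, so the token before it ends with the space that follows the hyphen.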
I'll leave this as an exercise for you. ;)
-Kuli
-
Re: PatternTokenizer failure
Jay Luker 2011-11-30, 14:14
On Tue, Nov 29, 2011 at 9:37 AM, Michael Kuhlmann <[EMAIL PROTECTED]> wrote:
> Jay,
> I think the problem is this:
>
> You're checking whether the character preceding the array of at least one
> whitespace is not a hyphen.
>
> However, when you've more than one whitespace, like this:
> "foo- \n bar"
> then there's another array of whitespaces - "\n " - which is preceded by the
> first whitespace - " ".
>
> Therefore, you'll need to not only check for preceding hyphens, but also for
> preceding whitespaces.
>
> I'll leave this as an exercise for you. ;)
>
> -Kuli
Just for the sake of closure, you were correct. I needed to update the regex to include a whitespace character in the negative look-behind, i.e., "(?<![-\s])\s+".
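For anyone landing on this thread later, the corrected pattern can be checked in plain Java against the input string and expected tokens from the original test case. This is only a sketch using `Pattern.split` rather than `PatternTokenizerFactory`, so it assumes the tokenizer with group -1 splits the same way; the class name `FixedPatternDemo` is made up for the example.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

public class FixedPatternDemo {
    // Corrected pattern: a whitespace run preceded by neither a hyphen nor whitespace.
    static final Pattern FIXED = Pattern.compile("(?<![-\\s])\\s+");

    // Hypothetical helper: split the input on the corrected pattern.
    static String[] split(String input) {
        return FIXED.split(input);
    }

    public static void main(String[] args) {
        // Input and expected tokens taken from the test case in the original post.
        String input = "blah \n foo bar- baz\nfoo-\nbar- baz foo- \n bar";
        String[] expected = {
            "blah", "foo", "bar- baz", "foo-\nbar- baz", "foo- \n bar"
        };

        String[] actual = split(input);
        System.out.println("matches expected tokens: " + Arrays.equals(expected, actual));
    }
}
```

Adding `\s` to the lookbehind means a match can only begin at the first character of a whitespace run, so a run that directly follows a hyphen is never split, however long it is.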
Thanks, --jay