Home | About | Sematext search-lucene.com search-hadoop.com
 Search Lucene and all its subprojects:

Switch to Threaded View
Lucene, mail # dev - SynonymFilter, FST, and Aho-Corasick algorithm


Copy link to this message
-
Re: SynonymFilter, FST, and Aho-Corasick algorithm
Dawid Weiss 2012-07-12, 19:50
bq. I rather like Wikipedia's definition:
http://en.wikipedia.org/wiki/Aho–Corasick_string_matching_algorithm

I did a similar thing but:

1) based on entire words as individual "tokens" (instead of letter-by-letter),
2) all words present in input patterns can be encoded as a separate
data structure which maps to an unique integer
3) the matcher is essentially tracking the following:

Match {
  automaton_arc toNextNode;
  final int matchStart;
  int matchLength;
}

you then process the input word-by-word and advance each Match if
there is an arc leaving "toNextNode" and matching the current word. If
the toNextNode arc is final then you've hit a match and need to record
it (it may not be the longest match so if you only care about the
longest matches then additional processing is required).

You create new Match objects and discard mismatched existing Matches
as you process the input. Essentially, it's as if you tried to walk
down in the automaton starting on every single position in the input.
This may seem costly but in reality the matches are infrequent
compared to the input text and they are rarely very, very long (to
create lots of states). I used the above approach for entity matching
and it worked super-fast.

All this said, an Aho-Corasick transition graph would of course be
more efficient. The question is how much more efficient and how much
code/ work you'll need to put into it to make it work :)

Dawid

---------------------------------------------------------------------