Solr, mail # user - CJKBigram filter questions: single character queries, bigrams created across script/character types


Burton-West, Tom 2012-04-27, 17:43
I have a few questions about the CJKBigram filter.

About 10% of our queries that contain Han characters are single-character queries. It looks like the CJKBigram filter only outputs a single character when the input has no adjacent bigrammable character to pair it with. This means we would have to create a separate field to index Han unigrams in order to support single-character queries. Is this correct?
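To make the concern concrete, here is a toy Python model of the behavior as I understand it. It is not Lucene code, and `bigram_tokens` is a name made up for this sketch: a lone character is emitted as a unigram, while a longer run yields only overlapping bigrams.

```python
def bigram_tokens(run):
    """Toy model: overlapping bigrams for a run of CJK characters.

    A run of exactly one character is emitted unchanged (a unigram);
    longer runs produce bigrams only, with no unigrams.
    """
    if len(run) < 2:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


# Tokens produced at index time for a three-character run.
indexed = bigram_tokens("革命歌")

# Tokens produced at query time for a single-character query.
query = bigram_tokens("命")

# The query token is a unigram, but the index holds only bigrams,
# so no indexed token equals the query token.
matches = [token for token in query if token in indexed]
```

Under this model `matches` is empty, which is why a separate unigram field looks necessary to me.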

For Japanese, the default settings form bigrams across character types. So for a string containing both Hiragana and Han characters, a bigram that mixes Hiragana and Han ("は革" below) is formed along with the same-script bigrams:
いろは革命歌   =>   "いろ" "ろは" "は革" "革命" "命歌"

Is there a way to specify that you don’t want bigrams across character types?
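To illustrate the difference I am asking about, here is a small Python sketch. It is not Lucene code; `script_of` and `bigrams_within_scripts` are hypothetical helpers, and the script ranges are a simplification that covers only the basic Hiragana, Katakana, and CJK Unified Ideographs blocks.

```python
from itertools import groupby


def script_of(ch):
    """Rough script classification by Unicode block (simplified assumption)."""
    cp = ord(ch)
    if 0x3040 <= cp <= 0x309F:
        return "hiragana"
    if 0x30A0 <= cp <= 0x30FF:
        return "katakana"
    if 0x4E00 <= cp <= 0x9FFF:
        return "han"
    return "other"


def bigrams(run):
    """Overlapping bigrams; a single character is emitted as a unigram."""
    if len(run) < 2:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


def bigrams_within_scripts(text):
    """Split the text into same-script runs, then bigram each run separately."""
    tokens = []
    for _, group in groupby(text, key=script_of):
        tokens.extend(bigrams("".join(group)))
    return tokens


text = "いろは革命歌"
across = bigrams(text)                 # bigrams formed across script boundaries
within = bigrams_within_scripts(text)  # no bigram spans the Hiragana/Han boundary
```

Here `across` contains the mixed bigram "は革", while `within` does not; the second form is the output I would like to be able to request.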

Tom

Tom Burton-West
Digital Library Production Service
University of Michigan Library

http://www.hathitrust.org/blogs/large-scale-search