Re: Import Handler for tokenizing facet string into multi-valued solr.StrField..
Erick Erickson 2011-01-27, 15:32
Tokenization is fine with facets. That caution is about, say, faceting
on the tokenized body of a document, where you have potentially
a huge number of unique tokens.
But if there is a controlled number of distinct values, you shouldn't have
to do anything except index to a tokenized field. I'd remove stemming,
WordDelimiterFilterFactory, etc. though; in fact I'd probably just go with
WhitespaceTokenizer and, maybe, LowerCaseFilter.
But if you have a huge number of unique values, it doesn't matter whether
they are tokenized or strings; it'll still be a problem.
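
If you do decide you want plain multi-valued string fields anyway, the split
and de-duplication can also be done on the client side before the documents
are sent to Solr. Below is a minimal sketch in Python, not a Solr API. It
assumes the separators are ";" and "--" as in your example, and that you want
each distinct part as its own value; the function name is made up for
illustration.

```python
import re

# Assumption: values are separated by ";" and sub-levels by "--",
# as in the example "Water -- Irrigation ; Water -- Sewage".
_SEPARATORS = re.compile(r"\s*(?:;|--)\s*")


def split_facet_values(raw):
    """Split a raw facet string into unique values, keeping first-seen order."""
    seen = set()
    values = []
    for part in _SEPARATORS.split(raw.strip()):
        # Skip empty pieces (e.g. from a trailing separator) and repeats.
        if part and part not in seen:
            seen.add(part)
            values.append(part)
    return values


print(split_facet_values("Water -- Irrigation ; Water -- Sewage"))
```

Each element of the returned list would then be sent as a separate value of
the multi-valued field.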
One note: when faceting for the first time on a newly-started Solr instance,
the caches are filled and the *first* query will be slower, so measure
On Thu, Jan 27, 2011 at 9:09 AM, Dennis Schafroth <[EMAIL PROTECTED]> wrote:
> I'm pretty new to SOLR coding, but I'm looking for hints about how (if not
> already done) to implement a PatternTokenizer that would index this into
> multi-valued fields of solr.StrField for faceting. Ex.
> Water -- Irrigation ; Water -- Sewage
> should be tokenized into
> in multi-valued non-tokenized fields due to performance. I could do it from
> the outside, but I would like to use this as an opportunity to learn about SOLR.
> It "works" as I want with the PatternTokenizerFactory when I am using
> solr.TextField, but not when I am using the non-tokenized solr.StrField. But
> according to what I have read, facet performance is better on non-tokenized fields.
> We need better performance on our faceted searches on these multi-value
> fields. (25 million documents, three multi-valued facets)
> I would also need a filter that filters out identical values, as the
> feeds have redundant data as shown above.
> Can anyone point me in the right direction?