Solr, mail # dev - Re: [Solr Wiki] Update of "SolrFacetingOverview" by JJLarrea


Re: [Solr Wiki] Update of "SolrFacetingOverview" by JJLarrea
Erik Hatcher 2006-12-28, 03:37
JJ:  Fantastic - this is excellent info, and sharing it helps a LOT!

Erik
On Dec 27, 2006, at 7:25 PM, Apache Wiki wrote:

> Dear Wiki user,
>
> You have subscribed to a wiki page or wiki category on "Solr Wiki"  
> for change notification.
>
> The following page has been changed by JJLarrea:
> http://wiki.apache.org/solr/SolrFacetingOverview
>
> The comment on the change is:
> Added page per 12/8/06 suggestion by Yonik
>
> New page:
> = Faceting Overview =
>
> Solr provides a [http://incubator.apache.org/solr/docs/api/org/apache/solr/request/SimpleFacets.html Simple Faceting toolkit]
> which can be reused by various Request Handlers to include "Facet
> counts" based on some simple criteria. Both the
> StandardRequestHandler and the DisMaxRequestHandler currently use  
> these utilities.  Detailed descriptions of the parameters used to  
> control faceting can be found (along with several examples) at  
> [SimpleFacetParameters].
>
> This page briefly provides some general background information:
>
> = Facet Indexing =
>
> Faceting is done on __indexed__ rather than __stored__ values.  
> This is because the primary use for faceting is drilldown into a  
> subset of hits resulting from a query, and so the chosen facet  
> value is used to construct a filter query which literally matches  
> that value in the index.  For the stock Solr request handlers this  
> is done by adding an `fq=<facet-field>:<quoted facet-value>`  
> parameter and resubmitting the query.
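
A minimal sketch of that drill-down step, in Python. The field name `author_facet`, the helper `drilldown_params`, and the base query are hypothetical and only illustrate how the extra `fq` parameter might be built; the backslash and double-quote escaping is an assumption about what a quoted value needs.

```python
from urllib.parse import urlencode

def drilldown_params(params, facet_field, facet_value):
    """Return a copy of params with an fq filter for the chosen facet value.

    The value is wrapped in double quotes so that it is matched literally;
    backslashes and embedded double quotes are escaped first (assumption).
    """
    escaped = facet_value.replace("\\", "\\\\").replace('"', '\\"')
    return list(params) + [("fq", f'{facet_field}:"{escaped}"')]

# Hypothetical original request: a search with faceting on one field.
base = [("q", "java"), ("facet", "true"), ("facet.field", "author_facet")]

# The user clicks the "Schildt, Herbert" facet value, so the same query is
# resubmitted with one more fq parameter.
refined = drilldown_params(base, "author_facet", "Schildt, Herbert")
query_string = urlencode(refined)
```

Because the original parameters are kept and only an `fq` is appended, each further drill-down can repeat the same call on the refined list.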
>
> Because faceting fields are often specified to serve two purposes,  
> human-readable text and drill-down query value, they are frequently  
> indexed differently from fields used for searching and sorting:
>   * They are not tokenized into separate words
>   * They are not mapped into lower case
>   * Human-readable punctuation is not removed (other than
> double quotes)
>   * There is often no need to store them, since stored values would  
> look much like indexed values and the faceting mechanism is used  
> for value retrieval.
>   * Depending on how the field is defined the SimpleFacets  
> mechanism may only allow for a single value per field per document  
> (see below)
>
> As an example, if I had a field with a list of authors, such as:
>
>   Schildt, Herbert; Wolpert, Lewis; Davies, P.
>
> I might want to index the same data differently in three different  
> fields (perhaps using the Solr [:SchemaXml#Copy Fields:copyField]  
> directive):
>   * For searching: Tokenized, case-folded, punctuation-stripped:
>       schildt / herbert / wolpert / lewis / davies / p
>   * For sorting: Untokenized, case-folded, punctuation-stripped:
>       schildt herbert wolpert lewis davies p
>   * For faceting: Primary author only, using a `solr.StrField`:
>       Schildt, Herbert
>
> Then when the user drills down on the "Schildt, Herbert" string I
> would reissue the query with an added
> `fq=<facet-field>:"Schildt, Herbert"` parameter.
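
The three treatments of the author string can be sketched in plain Python. This only imitates the effect of the analysis; in Solr the work would be done by field types declared in the schema, and the function names here are hypothetical.

```python
import re

def search_tokens(raw):
    """Searching: case-folded, punctuation-stripped, split into words."""
    return re.sub(r"[^\w\s]", " ", raw.lower()).split()

def sort_key(raw):
    """Sorting: the same normalization, kept as one untokenized string."""
    return " ".join(search_tokens(raw))

def facet_value(raw):
    """Faceting: primary author only, case and punctuation preserved
    (other than double quotes, which are removed)."""
    return raw.split(";")[0].strip().replace('"', "")

authors = "Schildt, Herbert; Wolpert, Lewis; Davies, P."

tokens = search_tokens(authors)   # one entry per word
key = sort_key(authors)           # single lower-case string
facet = facet_value(authors)      # exact string used for drill-down
```

The facet value is the only one of the three that is still fit to show to a user, which is why it can double as both the display text and the filter value.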
>
> = Facet Operation =
>
> Currently SimpleFacets has 3 modes of operation:
>
> == FacetQueries ==
>
> Any number of [:SimpleFacetParameters#facet.query:facet.query]  
> parameters can be passed to the request handler.  Each distinct  
> facet.query will first be executed against the entire index, with
> the results cached as a set of document IDs: a hashed set if it holds
> fewer documents than the hashDocSet max size, or a bit set if it holds
> more (see [:SolrCaching#The hashDocSet Max Size:hashDocSet]).
> Then every time that facet.query
> is used for faceting a query, the cached set will be intersected  
> against the set of document ids returned by the query to count the  
> number of documents for which the facet.query condition is true.
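
The caching and intersection described above can be sketched as follows. This is not Solr's implementation: a Python `frozenset` stands in for the hashed set, a Python `int` stands in for the bit set, and the threshold, class name, and fake index are all hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Union

HASH_DOCSET_MAX_SIZE = 3  # hypothetical threshold, kept tiny for the demo

# A cached result is either a hashed set or an int used as a bit set.
DocSet = Union[FrozenSet[int], int]

def make_docset(doc_ids: Iterable[int],
                max_size: int = HASH_DOCSET_MAX_SIZE) -> DocSet:
    """Small results stay a hashed set; larger ones become a bit set."""
    ids = frozenset(doc_ids)
    if len(ids) < max_size:
        return ids
    bits = 0
    for doc_id in ids:
        bits |= 1 << doc_id
    return bits

def intersection_count(cached: DocSet, result_ids: FrozenSet[int]) -> int:
    """Count the query hits that also satisfy the cached facet.query."""
    if isinstance(cached, frozenset):
        return len(cached & result_ids)
    return sum(1 for doc_id in result_ids if (cached >> doc_id) & 1)

class FacetQueryCache:
    def __init__(self, run_query: Callable[[str], Iterable[int]]) -> None:
        self._run_query = run_query          # runs against the whole index
        self._cache: Dict[str, DocSet] = {}

    def count(self, facet_query: str, result_ids: Iterable[int]) -> int:
        # Execute each distinct facet.query only once, then reuse the set.
        if facet_query not in self._cache:
            self._cache[facet_query] = make_docset(self._run_query(facet_query))
        return intersection_count(self._cache[facet_query],
                                  frozenset(result_ids))

# Fake "index": facet.query string -> IDs of all matching documents.
fake_index = {"price:[0 TO 10]": [1, 2], "inStock:true": [1, 2, 3, 4, 5]}
executed = []

def run_query(q):
    executed.append(q)
    return fake_index[q]

cache = FacetQueryCache(run_query)
hits_a = {2, 3, 9}   # document IDs returned by one user query
hits_b = {1, 5}      # document IDs returned by another user query
in_stock_a = cache.count("inStock:true", hits_a)
cheap_a = cache.count("price:[0 TO 10]", hits_a)
in_stock_b = cache.count("inStock:true", hits_b)
```

The second count for `inStock:true` reuses the cached set, so `run_query` is called only once per distinct facet.query regardless of how many user queries are faceted.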
>
> == FacetFields ==
>
> Any number of [:SimpleFacetParameters#facet.field:facet.field]  
> parameters can be passed to the request handler.  For each  
> facet.field, one of two approaches will be used:
>
>     * Field Queries:  If the facet field is defined in the schema  
> as multi-valued, boolean, or tokenized, then every indexed value