Solr >> mail # user >> HTMLStripCharFilterFactory not working in Solr4?


Re: HTMLStripCharFilterFactory not working in Solr4?
You can use LegacyHTMLStripCharFilterFactory to get the previous behavior.
See https://issues.apache.org/jira/browse/LUCENE-3690 for more details.
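A minimal sketch of that swap, assuming the text_general type from the quoted message below. Only the charFilter class name changes; every other element and attribute is copied from the original configuration. The charFilter is listed ahead of the tokenizer here only to mirror the order in which the analysis chain runs, since char filters process the input before it is tokenized. The same change would apply to the query analyzer.

```xml
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <!-- Swapped in for solr.HTMLStripCharFilterFactory to get the pre-LUCENE-3690 behavior -->
    <charFilter class="solr.LegacyHTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- The query analyzer from the original message would get the same charFilter swap -->
</fieldType>
```

A full reindex is likely needed after changing the index-time analyzer, since terms already in the index were produced by the old chain.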

-Yonik
http://www.lucidimagination.com

On Tue, Jan 24, 2012 at 1:34 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:
> We recently updated to the latest build of Solr4 and everything is working
> really well so far!  There is one case that does not work the same way it
> did in Solr 3.4: we strip out certain HTML constructs (like trademark and
> registered, for example) in a field as defined below.  It worked in
> Solr 3.4 with the configuration shown here, but does not work the same way
> in Solr4.
>
> The label field is defined as type="text_general"
> <field name="label" type="text_general" indexed="true" stored="false"
> required="false" multiValued="true"/>
>
> Here's the type definition for text_general field:
> <fieldType name="text_general" class="solr.TextField"
> positionIncrementGap="100">
>            <analyzer type="index">
>                <tokenizer class="solr.StandardTokenizerFactory"/>
>                <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"
>                        enablePositionIncrements="true"/>
>                <filter class="solr.LowerCaseFilterFactory"/>
>            </analyzer>
>            <analyzer type="query">
>                <tokenizer class="solr.StandardTokenizerFactory"/>
>                <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"
>                        enablePositionIncrements="true"/>
>                <filter class="solr.LowerCaseFilterFactory"/>
>            </analyzer>
>        </fieldType>
>
>
> In Solr 3.4, that configuration completely stripped HTML constructs
> out of the indexed field, which is exactly what we wanted.  In Solr4, if we
> facet on the label field, as in the test below, we get some terms in the
> response that we do not want there.
>
>
> // test case (groovy)
> void specialHtmlConstructsGetStripped() {
>    SolrInputDocument inputDocument = new SolrInputDocument()
>    inputDocument.addField('label', 'Bose&#174; &#8482;')
>
>    solrServer.add(inputDocument)
>    solrServer.commit()
>
>    QueryResponse response = solrServer.query(new SolrQuery('bose'))
>    assert 1 == response.results.numFound
>
>    SolrQuery facetQuery = new SolrQuery('bose')
>    facetQuery.facet = true
>    facetQuery.set(FacetParams.FACET_FIELD, 'label')
>    facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
>
>    response = solrServer.query(facetQuery)
>    FacetField ff = response.facetFields.find {it.name == 'label'}
>
>    List suggestResponse = []
>
>    for (FacetField.Count facetField in ff?.values) {
>        suggestResponse << facetField.name
>    }
>
>    assert suggestResponse == ['bose']
> }
>
> With the upgrade to Solr4, the assertion fails: the suggested response
> contains 174 and 8482 as terms.  Test output is:
>
> Assertion failed:
>
> assert suggestResponse == ['bose']
>       |               |
>       |               false
>       [174, 8482, bose]
>
>
> I just tried again using the latest build from today, namely:
> https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> getting the failing assertion. Is there a different way to configure the
> HTMLStripCharFilterFactory in Solr4?
>
> Thanks in advance for any tips!
>
> Mike