Solr >> mail # user >> HTMLStripCharFilterFactory not working in Solr4?


Re: HTMLStripCharFilterFactory not working in Solr4?
You can use LegacyHTMLStripCharFilterFactory to get the previous behavior.
See https://issues.apache.org/jira/browse/LUCENE-3690 for more details.

-Yonik
http://www.lucidimagination.com

On Tue, Jan 24, 2012 at 1:34 PM, Mike Hugo <[EMAIL PROTECTED]> wrote:
> We recently updated to the latest build of Solr 4 and everything is working
> really well so far! There is one case that does not work the way it did in
> Solr 3.4: we strip certain HTML constructs (such as the trademark and
> registered symbols) from a field defined as shown below. That worked in
> Solr 3.4 with this configuration, but it does not work the same way in
> Solr 4.
>
> The label field is defined as type="text_general"
> <field name="label" type="text_general" indexed="true" stored="false"
> required="false" multiValued="true"/>
>
> Here's the type definition for text_general field:
> <fieldType name="text_general" class="solr.TextField"
> positionIncrementGap="100">
>            <analyzer type="index">
>                <tokenizer class="solr.StandardTokenizerFactory"/>
>                <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"
>                        enablePositionIncrements="true"/>
>                <filter class="solr.LowerCaseFilterFactory"/>
>            </analyzer>
>            <analyzer type="query">
>                <tokenizer class="solr.StandardTokenizerFactory"/>
>                <charFilter class="solr.HTMLStripCharFilterFactory"/>
>                <filter class="solr.StopFilterFactory" ignoreCase="true"
> words="stopwords.txt"
>                        enablePositionIncrements="true"/>
>                <filter class="solr.LowerCaseFilterFactory"/>
>            </analyzer>
>        </fieldType>
>
>
> In Solr 3.4, that configuration completely stripped HTML constructs out of
> the indexed field, which is exactly what we wanted. Now if, for example, we
> facet on the label field, as in the test below, we get some terms in the
> response that we do not want there.
>
>
> // test case (groovy)
> void specialHtmlConstructsGetStripped() {
>    SolrInputDocument inputDocument = new SolrInputDocument()
>    inputDocument.addField('label', 'Bose&#174; &#8482;')
>
>    solrServer.add(inputDocument)
>    solrServer.commit()
>
>    QueryResponse response = solrServer.query(new SolrQuery('bose'))
>    assert 1 == response.results.numFound
>
>    SolrQuery facetQuery = new SolrQuery('bose')
>    facetQuery.facet = true
>    facetQuery.set(FacetParams.FACET_FIELD, 'label')
>    facetQuery.set(FacetParams.FACET_MINCOUNT, '1')
>
>    response = solrServer.query(facetQuery)
>    FacetField ff = response.facetFields.find {it.name == 'label'}
>
>    List suggestResponse = []
>
>    for (FacetField.Count facetField in ff?.values) {
>        suggestResponse << facetField.name
>    }
>
>    assert suggestResponse == ['bose']
> }
>
> With the upgrade to Solr 4, the assertion fails: suggestResponse
> contains 174 and 8482 as terms. The test output is:
>
> Assertion failed:
>
> assert suggestResponse == ['bose']
>       |               |
>       |               false
>       [174, 8482, bose]
>
>
> I just tried again using the latest build from today, namely:
> https://builds.apache.org/job/Lucene-Solr-Maven-trunk/369/ and we're still
> getting the failing assertion. Is there a different way to configure the
> HTMLStripCharFilterFactory in Solr4?
>
> Thanks in advance for any tips!
>
> Mike
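
For reference, the suggestion in the reply at the top of this thread could be sketched against the quoted text_general type roughly as follows. This is only an illustration, assuming the build in use ships solr.LegacyHTMLStripCharFilterFactory as described in LUCENE-3690; everything else is copied from the quoted configuration. Only the index analyzer is shown, and the query analyzer would presumably change the same way. The charFilter is listed first here simply because char filters operate on the raw character stream before the tokenizer sees it.

```xml
<fieldType name="text_general" class="solr.TextField"
           positionIncrementGap="100">
  <analyzer type="index">
    <!-- Assumption: swapped in for solr.HTMLStripCharFilterFactory to get
         the pre-LUCENE-3690 stripping behavior, per the reply above -->
    <charFilter class="solr.LegacyHTMLStripCharFilterFactory"/>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true"
            words="stopwords.txt"
            enablePositionIncrements="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <!-- the query analyzer from the quoted message would be adjusted likewise -->
</fieldType>
```

Whether this makes the quoted test pass is not confirmed anywhere in the thread; after changing the analyzer, the document would need to be reindexed before rerunning the facet assertion.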