-
Should ASCIIFoldingFilter be deprecated? David Smiley 2011-02-08, 03:34
ISOLatin1AccentFilter is deprecated, presumably because you can (and should) use MappingCharFilter configured with mapping-ISOLatin1Accent.txt. By that same reasoning, shouldn't ASCIIFoldingFilter be deprecated in favor of using mapping-FoldToASCII.txt?
~ David Smiley
-----
Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context: http://lucene.472066.n3.nabble.com/Should-ASCIIFoldingFilter-be-deprecated-tp2448919p2448919.html
Sent from the Solr - Dev mailing list archive at Nabble.com.
-
RE: Should ASCIIFoldingFilter be deprecated? Steven A Rowe 2011-02-08, 03:51
AFAIK, ISOLatin1AccentFilter was deprecated because ASCIIFoldingFilter provides a superset of its mappings.
I haven't done any benchmarking, but I'm pretty sure that ASCIIFoldingFilter can achieve a significantly higher throughput rate than MappingCharFilter, and given that, it probably makes sense to keep both, to allow people to make the choice about the tradeoff between the flexibility provided by the human-readable (and editable) mapping file and the speed provided by ASCIIFoldingFilter.
Steve
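To make the "human-readable (and editable) mapping file" side of that tradeoff concrete, here is a minimal sketch of a table-driven fold. It assumes a simplified rule syntax of the form `"source" => "target"` with `#` comments, modeled loosely on the mapping-*.txt files; the function names are hypothetical, escape sequences are not handled, and only single-character sources are applied. It is not Lucene's MappingCharFilter.

```python
import re

# Assumed, simplified rule syntax:  "source" => "target"
# Lines starting with '#' are comments. Escape sequences are NOT handled.
RULE = re.compile(r'^"(.+)"\s*=>\s*"(.*)"$')


def parse_rules(text):
    """Parse rule lines into a dict of source string -> replacement string."""
    rules = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = RULE.match(line)
        if match is None:
            raise ValueError(f"cannot parse rule: {line!r}")
        rules[match.group(1)] = match.group(2)
    return rules


def apply_rules(rules, text):
    """Replace every character that has a rule; leave the rest untouched."""
    return "".join(rules.get(ch, ch) for ch in text)


EXAMPLE_RULES = '''
# fold a few accented letters
"é" => "e"
"ü" => "u"
"ß" => "ss"
'''

rules = parse_rules(EXAMPLE_RULES)
folded = apply_rules(rules, "Straße né über")
print(folded)
```

Changing the behavior here means editing a text line rather than recompiling code, which is the flexibility being weighed against the speed of a hard-coded filter.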
-
Re: Should ASCIIFoldingFilter be deprecated? Chris Hostetter 2011-02-08, 04:06
: ISOLatin1AccentFilter is deprecated, presumably because you can (and should)
: use MappingCharFilter configured with mapping-ISOLatin1Accent.txt. By that
: same reasoning, shouldn't ASCIIFoldingFilter be deprecated in favor of using
: mapping-FoldToASCII.txt ?
CharFilters and TokenFilters have different purposes though...
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#When_To_use_a_CharFilter_vs_a_TokenFilter
(i.e., if you use MappingCharFilter, you can't then tokenize on some of the characters you filtered away)
-Hoss
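The ordering issue can be shown with a toy pipeline. This is only a sketch under stated assumptions: `map_chars` and `tokenize` are hypothetical stand-ins for a char filter and a tokenizer, and the mapping that deletes hyphens is invented for illustration.

```python
import re


def map_chars(text, mapping):
    """Stand-in for a character-level mapping step."""
    return "".join(mapping.get(ch, ch) for ch in text)


def tokenize(text):
    """Stand-in tokenizer: split on whitespace and hyphens."""
    return [tok for tok in re.split(r"[\s\-]+", text) if tok]


mapping = {"-": ""}  # hypothetical rule: delete hyphens
text = "wi-fi router"

# Mapping BEFORE tokenizing: the hyphen is gone, so it can no longer
# act as a token boundary.
char_filter_first = tokenize(map_chars(text, mapping))

# Mapping AFTER tokenizing: the tokenizer still sees the hyphen.
token_filter_after = [map_chars(tok, mapping) for tok in tokenize(text)]

print(char_filter_first)
print(token_filter_after)
```

The same mapping produces different token streams depending on which side of the tokenizer it runs on.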
-
Re: Should ASCIIFoldingFilter be deprecated? Robert Muir 2011-02-08, 11:55
On Mon, Feb 7, 2011 at 10:51 PM, Steven A Rowe <[EMAIL PROTECTED]> wrote:
> I haven't done any benchmarking, but I'm pretty sure that ASCIIFoldingFilter can achieve a significantly higher throughput rate than MappingCharFilter, and given that, it probably makes sense to keep both, to allow people to make the choice about the tradeoff between the flexibility provided by the human-readable (and editable) mapping file and the speed provided by ASCIIFoldingFilter.
I agree... have you seen http://bugs.icu-project.org/trac/ticket/7743 ?
Hopefully something along those lines would allow us to support the flexibility in a factory or whatever (even better as described, when you just want a small tweak) but still with good performance.
-
RE: Should ASCIIFoldingFilter be deprecated? David Smiley 2011-02-08, 14:12
Chris Hostetter-3 wrote:
> CharFilters and TokenFilters have different purposes though...
> http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters#When_To_use_a_CharFilter_vs_a_TokenFilter
> (ie: If you use MappingCharFilter, you can't then tokenize on some of the characters you filtered away)
Right, but it’s hard to imagine wanting to tokenize on an accent character or some other modification specified in these particular mapping files.
Steven A Rowe wrote:
> AFAIK, ISOLatin1AccentFilter was deprecated because ASCIIFoldingFilter provides a superset of its mappings.
*If* that is the case, then this file should also be removed: solr/example/solr/conf/mapping-ISOLatin1Accent.txt
Steven A Rowe wrote:
> I haven't done any benchmarking, but I'm pretty sure that ASCIIFoldingFilter can achieve a significantly higher throughput rate than MappingCharFilter, and given that, it probably makes sense to keep both, to allow people to make the choice about the tradeoff between the flexibility provided by the human-readable (and editable) mapping file and the speed provided by ASCIIFoldingFilter.
I'm skeptical that the difference, whatever it is, is relevant in the scheme of things. The cost of keeping it is confusion for users and more code to maintain.
~ David Smiley
-
Re: Should ASCIIFoldingFilter be deprecated? Robert Muir 2011-02-08, 14:50
On Tue, Feb 8, 2011 at 9:12 AM, David Smiley (@MITRE.org) <[EMAIL PROTECTED]> wrote:
> I'm skeptical that the difference, whatever it is, is relevant in the scheme of things. The cost of keeping it is confusion for users and more code to maintain.
It's pretty significant. CharFilters are not reusable, and box every character and look it up in a HashMap (I made a patch to fix the reusability, but no one has commented): https://issues.apache.org/jira/browse/LUCENE-2788
ASCIIFoldingFilter does a huge switch (which still isn't optimal), but it's way, way faster than MappingCharFilter, especially since it's a no-op for chars < 0x7F.
ICUFoldingFilter precompiles a recursively decomposed trie, so its lookup is a Unicode folded trie (icu-project.org/docs/papers/foldedtrie_iuc21.ppt). I think it's a tad slower than ASCIIFoldingFilter, but it also incorporates case folding and Unicode normalization: neither ASCIIFoldingFilter nor MappingCharFilter will properly fold http://www.geonames.org/search.html?q=Ab%C5%AB+Z%CC%A7aby&country=, because there is no composed form for Z + combining cedilla, but ICUFoldingFilter will.
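The "Z + combining cedilla" case can be reproduced with Python's standard `unicodedata` module. This is a sketch of the general idea only: the one-entry table stands in for any mapping keyed on precomposed characters, and decompose-then-strip is a simple substitute for illustration, not how ICUFoldingFilter is implemented (the message above describes a precompiled trie). The function names are hypothetical.

```python
import unicodedata

# The query from the geonames URL above, URL-decoded:
# "Ab" + U+016B (u with macron) + " Z" + U+0327 (combining cedilla) + "aby"
name = "Ab\u016b Z\u0327aby"


def fold_precomposed(text, table):
    """Fold using a table keyed on single (precomposed) characters."""
    return "".join(table.get(ch, ch) for ch in text)


def fold_by_decomposition(text):
    """Decompose (NFD), then drop every combining mark."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# NFC cannot compose Z + U+0327 into one code point, so a table of
# precomposed characters has no key that could match it.
still_two_code_points = unicodedata.normalize("NFC", "Z\u0327")

table_result = fold_precomposed(name, {"\u016b": "u"})  # illustrative table
decomposed_result = fold_by_decomposition(name)

print(len(still_two_code_points))
print(decomposed_result)
```

The table-driven fold handles the precomposed U+016B but leaves the stray combining cedilla in place, while the normalization-aware fold removes both marks.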
-
Re: Should ASCIIFoldingFilter be deprecated? David Smiley 2011-02-08, 15:05
Robert Muir wrote:
> On Tue, Feb 8, 2011 at 9:12 AM, David Smiley (@MITRE.org) <[EMAIL PROTECTED]> wrote:
>> I'm skeptical that the difference, whatever it is, is relevant in the scheme of things. The cost of keeping it is confusion for users and more code to maintain.
> It's pretty significant. CharFilters are not reusable, and box every character and look it up in a HashMap (I made a patch to fix the reusability, but no one has commented): https://issues.apache.org/jira/browse/LUCENE-2788
> ASCIIFoldingFilter does a huge switch (which still isn't optimal), but it's way, way faster than MappingCharFilter, especially since it's a no-op for chars < 0x7F.
Well then I see a path forward to speed up MappingCharFilter substantially. There's your LUCENE-2788, and then you could easily add the same no-op optimization for any char below the smallest char value in the HashMap.
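A minimal sketch of that proposed fast path, assuming a mapping whose keys are all single characters (`make_folder` is a hypothetical helper, not Lucene code). It shows the control flow only and says nothing about actual speed, which would need measuring in the real filter.

```python
def make_folder(mapping):
    """Build a fold function that skips the lookup for low characters.

    Assumes every key in `mapping` is a single character.
    """
    lowest = min(mapping) if mapping else None

    def fold(text):
        if lowest is None:
            return text  # nothing to map at all
        out = []
        for ch in text:
            if ch < lowest:
                # Fast path: no key can match, so skip the lookup.
                out.append(ch)
            else:
                out.append(mapping.get(ch, ch))
        return "".join(out)

    return fold


fold = make_folder({"é": "e", "ü": "u"})
result = fold("café über")
print(result)
```

With accent-folding tables the smallest key typically lies above the ASCII range, so plain ASCII input would take the fast path on every character.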
-
Re: Should ASCIIFoldingFilter be deprecated? Robert Muir 2011-02-08, 15:15
On Tue, Feb 8, 2011 at 10:05 AM, David Smiley (@MITRE.org) <[EMAIL PROTECTED]> wrote:
> Well then I see a path forward to speed up MappingCharFilter substantially. There's your LUCENE-2788, and then you could easily add the same no-op optimization for any char below the smallest char value in the HashMap.
Only for the smallest starter, and MappingCharFilter still has to maintain an array of any offset changes (this is now binary searched) for correctOffset.
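The offset bookkeeping can be modeled as follows. This is a simplified sketch, not Lucene's implementation: whenever a replacement changes the text length, the mapper records the output offset and the cumulative shift, and `correct_offset` (named after the method mentioned above, but written from scratch here) binary searches those records to translate an offset in the mapped text back to the original text.

```python
from bisect import bisect_right


def map_with_offsets(text, mapping):
    """Apply single-char mappings and record where offsets shift."""
    out = []
    positions = []  # output offsets from which a new cumulative shift applies
    shifts = []     # cumulative (input offset - output offset) at that point
    shift = 0
    out_len = 0
    for ch in text:
        repl = mapping.get(ch, ch)
        out.append(repl)
        out_len += len(repl)
        if len(repl) != 1:
            shift += 1 - len(repl)
            positions.append(out_len)
            shifts.append(shift)
    return "".join(out), positions, shifts


def correct_offset(offset, positions, shifts):
    """Translate an offset in the mapped text to the original text."""
    i = bisect_right(positions, offset) - 1
    return offset if i < 0 else offset + shifts[i]


mapped, positions, shifts = map_with_offsets("Straße x", {"ß": "ss"})
# "x" sits at offset 8 in the mapped text but at offset 7 in the original.
x_original = correct_offset(mapped.index("x"), positions, shifts)
print(mapped, x_original)
```

This per-stream array is state that a token filter operating on already-tokenized terms does not need to keep.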
-
Re: Should ASCIIFoldingFilter be deprecated? Robert Zotter 2011-02-08, 16:00
unsubscribe