Walter Underwood 2006-11-21, 05:27
I've written a simple phonetic token filter (and factory) based on the Double Metaphone implementation in Jakarta Codecs to contribute. Three questions:
1. Does this sound like a generally useful addition?
2. Should we have a Jira issue first?
3. This adds a dependency on the codecs jar. How do we add that to the distro?
The code is very simple, but I need to learn the contribution process and build some tests, so this won't happen in one day.
wunder -- Walter Underwood Search Guru, Netflix
Re: Phonetic Token Filter
Bertrand Delacretaz 2006-11-21, 09:01
On 11/21/06, Walter Underwood <[EMAIL PROTECTED]> wrote:
> ...I've written a simple phonetic token filter (and factory) based
> on the Double Metaphone implementation in Jakarta Codecs to
> contribute. Three questions:
>
> 1. Does this sound like a generally useful addition?...
Sure!
Do you know if it is supposed to work for non-English languages? I'm interested in testing it on French (and maybe German) texts, once your patch is ready.
-Bertrand
Re: Phonetic Token Filter
Walter Underwood 2006-11-21, 15:52
On 11/21/06 1:01 AM, "Bertrand Delacretaz" <[EMAIL PROTECTED]> wrote:
> Do you know if it is supposed to work for non-english languages? I'm
> interested in testing it on French (and maybe German) texts, once your
> patch is ready.
Double Metaphone has several rules for non-English words, but it assumes English pronunciation. I think the biggest problems would be silent consonants and vowel sounds. For example, it codes "Paris" as "PRS" rather than with a silent "s" as in French, and "Jonas" as "JNS", whereas in German it would be pronounced "yonas". And "Wim Winders" is coded as "AM ANTR", treating the "W" as a vowel instead of a "V" sound.
It is worth a try. Most implementations of Double Metaphone are well-commented, so you could change it for other languages.
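To make the "change it for other languages" idea concrete, here is a minimal sketch of a phonetic filter whose encoder is pluggable. It is not the patch discussed in this thread, and it does not use the Solr or Lucene token filter API: `PhoneticFilterSketch`, `filter`, and `toyEncode` are hypothetical names, and `toyEncode` is a crude stand-in that only drops non-initial vowels, not Double Metaphone.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Illustrative sketch only: hypothetical names, not the contributed filter.
final class PhoneticFilterSketch {

    // Replaces each token with its phonetic code, using whatever encoder is supplied.
    // Swapping the encoder is how a language-specific variant would be plugged in.
    static List<String> filter(List<String> tokens, Function<String, String> encoder) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            out.add(encoder.apply(token));
        }
        return out;
    }

    // Toy stand-in encoder (NOT Double Metaphone): uppercases the word,
    // keeps the first letter, and drops every later vowel.
    static String toyEncode(String word) {
        StringBuilder code = new StringBuilder();
        for (char c : word.toUpperCase(Locale.ROOT).toCharArray()) {
            if (!Character.isLetter(c)) {
                continue; // ignore digits, spaces, punctuation
            }
            boolean vowel = "AEIOUY".indexOf(c) >= 0;
            if (code.length() > 0 && vowel) {
                continue; // drop vowels after the first kept letter
            }
            code.append(c);
        }
        return code.toString();
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("Paris", "Jonas");
        // prints [PRS, JNS]
        System.out.println(filter(tokens, PhoneticFilterSketch::toyEncode));
    }
}
```

With the codecs jar on the classpath, the same `filter` call could presumably be given `new DoubleMetaphone()::doubleMetaphone` (from `org.apache.commons.codec.language`) in place of the toy encoder; a French or German variant would be another function of the same shape.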
wunder -- Walter Underwood Search Guru, Netflix
Re: Phonetic Token Filter
Bertrand Delacretaz 2006-11-21, 15:53
On 11/21/06, Walter Underwood <[EMAIL PROTECTED]> wrote:
> ...It is worth a try. Most implementations of Double Metaphone are
> well-commented, so you could change it for other languages...
Ok, I'll see if I find some time to test that, thanks for the clarifications!
-Bertrand
Re: Phonetic Token Filter
Yonik Seeley 2006-11-21, 05:36
On 11/21/06, Walter Underwood <[EMAIL PROTECTED]> wrote:
> I've written a simple phonetic token filter (and factory) based
> on the Double Metaphone implementation in Jakarta Codecs to
> contribute. Three questions:
>
> 1. Does this sound like a generally useful addition?
Definitely useful. If it's generally applicable enough and lightweight enough, then it should go in the core. Otherwise it could go in contrib (which we don't really have yet, but we will when the need arises).
This sounds like it should probably go in the core.
> 2. Should we have a Jira issue first?
Yes, please.
> 3. This adds a dependency on the codecs jar. How do we add that
> to the distro?
It would go in the lib directory if it ends up in Solr proper.
-Yonik
Chris Hostetter 2006-11-21, 07:39