Lucene, mail # user - PhoneticFilterFactory's inject parameter


Re: PhoneticFilterFactory's inject parameter
Ian Lea 2012-04-26, 12:51
There are useful tips in the FAQ,
http://wiki.apache.org/lucene-java/LuceneFAQ#Why_am_I_getting_no_hits_.2BAC8_incorrect_hits.3F.

I still think you should come up with small self-contained example code.
--
Ian.
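
For readers following the thread, here is a rough, self-contained sketch of what the inject parameter is generally understood to do: with inject=true the phonetic code is added to the token stream alongside the original term at the same position, and with inject=false the code replaces the original term. This is a toy model in plain Python, not Lucene code. The names toy_encode, phonetic_filter and analyze are hypothetical, toy_encode is a crude stand-in and not Double Metaphone, and the order in which the two terms are emitted simply mirrors the parsed query quoted further down in the thread.

```python
def toy_encode(term):
    """Crude stand-in for a phonetic encoder (NOT Double Metaphone).

    Drops vowels, maps 'c' to 'K' and upper-cases the rest, which is
    just enough to turn 'compete' into 'KMPT' as quoted in the thread.
    """
    out = []
    for ch in term.lower():
        if ch in "aeiouy":
            continue
        out.append("K" if ch == "c" else ch.upper())
    return "".join(out)


def phonetic_filter(tokens, inject):
    """Apply the toy encoder to a list of (term, position_increment) pairs.

    inject=True  -> emit the code, then the original term at the same
                    position (position increment 0).
    inject=False -> emit only the code; the original term is dropped.
    """
    result = []
    for term, pos_inc in tokens:
        code = toy_encode(term)
        if not code or code == term:
            # Nothing useful to add; pass the token through unchanged.
            result.append((term, pos_inc))
            continue
        result.append((code, pos_inc))
        if inject:
            result.append((term, 0))
    return result


def analyze(text, inject):
    """Toy analyzer: whole input as one lower-cased token, then the filter."""
    tokens = [(text.lower(), 1)]  # assumption: mimics keyword tokenizer + lowercase
    return phonetic_filter(tokens, inject)


print(analyze("compete", inject=True))   # code plus original term
print(analyze("compete", inject=False))  # code only
```

Under this model, a field indexed with inject=false holds only the codes, so a term query for the original word finds nothing in that field. That would be consistent with the deleted-documents episode described further down.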
On Wed, Apr 25, 2012 at 4:02 PM, Elmer van Chastelet
<[EMAIL PROTECTED]> wrote:
> Thanks for your suggestion Ian, but I just found out that if I replace the
> KeywordTokenizer with a WhitespaceTokenizer, all seems to work fine.
>
> Just to test what happens, I created another field 'orig', using this
> analyzer:
> analyzer KeywordLowered{
>    tokenizer = KeywordTokenizer
>    tokenfilter = LowerCaseFilter
> }
>
> Guess what.. exactly the same problem, also in Luke.
> It finds no documents for the query:
> orig:strange
> While the term 'strange' is in the index for the field 'orig'.
>
> Does anybody have a clue why documents are not matched when using the
> KeywordTokenizer? Remember that none of the queries or terms contain
> whitespace.
>
>
> Thanks again.
> -Elmer
>
>
> On 04/25/2012 02:53 PM, Ian Lea wrote:
>>
>> You seem to be quietly going round in circles, by yourself!  I suggest
>> a small self-contained program/test case with a RAM index created from
>> scratch.  You can then experiment with inject on or off and if you
>> still can't figure it out, post the code and hopefully someone will be
>> able to help you make sense of it.
>>
>> Make sure you tell us what version of Lucene you are using.  If not
>> the latest, wouldn't hurt to try with the latest.
>>
>>
>> --
>> Ian.
>>
>>
>> On Wed, Apr 25, 2012 at 1:22 PM, Elmer van Chastelet
>> <[EMAIL PROTECTED]>  wrote:
>>>
>>> I keep replying to myself, it all gets a bit confusing.
>>> The problem still exists and I don't understand why, and why it worked
>>> once.
>>>
>>> I have the same behavior again as posted in my first mail:
>>> - Inject parameter is set to true.
>>> - The index has _no deleted documents_ and is optimized.
>>> - The term 'compete' is in there.
>>> - If I ask Luke to show all docs for term 'compete' it shows me the one and
>>> only document that represents this word. But...
>>> - If I perform the query 'value:compete' in Luke again, it says there are no
>>> results.
>>>
>>> Here is the index I'm currently using. It contains various fields for the
>>> available phonetic filter encoders:
>>> https://www.box.com/s/34212e82227e102f6734
>>>
>>> Can somebody explain this behavior? What's the real use of the inject
>>> parameter of the PhoneticFilterFactory?
>>>
>>> Thanks in advance.
>>>
>>> -Elmer
>>>
>>>
>>> On 04/25/2012 12:25 PM, Elmer van Chastelet wrote:
>>>>
>>>> Problem solved. Long story short: for some reason I had deleted documents
>>>> in the index and the non-deleted documents used the phonetic filter with
>>>> inject set to false.
>>>>
>>>> Works fine now :)
>>>>
>>>> On 04/23/2012 09:27 PM, Elmer van Chastelet wrote:
>>>>>
>>>>> Hi all,
>>>>>
>>>>> (scroll to bottom for question)
>>>>>
>>>>> I was setting up a simple web app to play around with phonetic filters.
>>>>> The idea is simple: I create a document for each word in the English
>>>>> dictionary, each document containing a single search field holding the
>>>>> value after it is preprocessed using the following analyzer definition
>>>>> (in our own DSL syntax, which gets transformed to Java):
>>>>>
>>>>> analyzer soundslike{
>>>>>    tokenizer = KeywordTokenizer
>>>>>    tokenfilter = LowerCaseFilter
>>>>>    tokenfilter = PhoneticFilter(encoder="DoubleMetaphone", inject="true")
>>>>> }
>>>>>
>>>>> I can run the web app and I get results that indeed (in some way) sound
>>>>> like the original query term.
>>>>>
>>>>> But what confuses me is the ranking of the results, knowing that I set
>>>>> the inject param to true. If I search for the query term 'compete', the
>>>>> parsed query becomes '(value:KMPT value:compete)', and therefore I expect
>>>>> the word 'compete' to be ranked higher in the list than any other
>>>>> word....