Finotti Simone 2012-07-25, 16:03
Hi,
is there a tokenizer and/or a combination of filters that removes the first term from a field?
For example:
The quick brown fox
should be tokenized as:
quick
brown
fox
Thank you in advance,
S
Ahmet Arslan 2012-07-25, 16:10
> is there a tokenizer and/or a combination of filter to remove the first term from a field?
>
> For example:
> The quick brown fox
>
> should be tokenized as:
> quick
> brown
> fox
There is no such filter that I know of. However, you can implement one by modifying the source code of LengthFilterFactory or StopFilterFactory; both of them remove tokens. Out of curiosity, what is the use case for this?
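As a rough illustration of what such a custom filter would have to do, here is a minimal sketch in plain Python. It is not Lucene or Solr code: a real implementation would be a Java token filter plugged into the analysis chain, and the function name `skip_first_token` is hypothetical.

```python
from typing import Iterable, Iterator


def skip_first_token(tokens: Iterable[str]) -> Iterator[str]:
    """Yield every token of the stream except the first one."""
    iterator = iter(tokens)
    next(iterator, None)  # discard the first token, if there is one
    yield from iterator


# Whitespace splitting stands in for the tokenizer here.
print(list(skip_first_token("The quick brown fox".split())))
# prints ['quick', 'brown', 'fox']
```

Unlike a stop or length filter, the decision depends only on the token's position in the stream, not on its text.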
Finotti Simone 2012-07-26, 07:05
Hi Ahmet,
the business asked me to apply EdgeNGram with minGramSize=1 on the first term and with minGramSize=3 on the subsequent terms.
We are developing a search suggestion mechanism. The idea is that if the user types "D", the engine should suggest "Dolce & Gabbana", but if the user types "G", it should suggest other brands. Only if the user types "Gab" should it suggest "Dolce & Gabbana".
Thanks,
S
Chantal Ackermann 2012-07-26, 16:32
Hi,
use two fields:
1. KeywordTokenizer (= single token) with ngram minsize=1 and maxsize=2, for inputs of length < 3;
2. the other one tokenized as appropriate, with minsize=3 and longer, for all longer inputs.
Cheers,
Chantal
in.abdul 2012-07-26, 18:36
That's the best option. I have also used the shingle filter factory.
Thanks and regards,
Syed Abdul Kather
Finotti Simone 2012-07-27, 09:05
Hi Chantal,
if I understand correctly, this implies that I have to populate different fields according to the length of their content. Since I'm not aware of any logical condition that can be applied to the copyField directive, this logic would have to be implemented by the process that populates the Solr core. Is this assumption correct?
That would be unfortunate, because I'd like to keep this kind of "rule" in the Solr configuration. Of course, if that's the only way... :)
Thank you
Chantal Ackermann 2012-07-27, 09:20
Hi Simone,
no, I meant that you populate both fields with the same input, which is best done via the copyField directive.
The first field will contain ngrams of size 1 and 2. The other field will contain ngrams of size 3 and longer (you might want to set a sensible maxsize there).
The query for the autocomplete list uses the first field when the input typed by the user is one or two characters long. In your example that is "D" or "G", and then "Do" or "Ga". Such a query searches only the single-token field, which for the value "Dolce & Gabbana" contains only the ngrams "D" and "Do". So only the input "D" or "Do" results in a hit on "Dolce & Gabbana"; "G" and "Ga" do not.
Once the user has typed the third letter ("Dol" or "Gab"), you query the second, more tokenized field, which for "Dolce & Gabbana" contains the ngrams "Dol", "Dolc", "Dolce", "Gab", "Gabb", "Gabba", etc. Both inputs "Gab" and "Dol" then return "Dolce & Gabbana".
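To make the contents of the two fields concrete, the sketch below imitates the two analysis chains in plain Python. It is only an approximation under stated assumptions: whitespace splitting stands in for the whitespace tokenizer, tokens shorter than the minimum gram size are assumed to produce no grams, and the function names are hypothetical rather than part of Solr.

```python
def edge_ngrams(token, min_size, max_size):
    """Front edge n-grams of one token, from min_size up to max_size characters."""
    upper = min(len(token), max_size)
    return [token[:n] for n in range(min_size, upper + 1)]


def short_field_terms(value):
    """Keyword-style field: the whole value is one token, grams of size 1 and 2."""
    return set(edge_ngrams(value, 1, 2))


def long_field_terms(value):
    """Whitespace-tokenized field: grams of size 3 to 10 for every word."""
    terms = set()
    for token in value.split():
        terms.update(edge_ngrams(token, 3, 10))
    return terms


brand = "Dolce & Gabbana"
print(sorted(short_field_terms(brand)))
print(sorted(long_field_terms(brand)))
```

With this model the short field holds only "D" and "Do", so "G" and "Ga" cannot match, while the long field holds "Gab" and the longer prefixes of each word. The "&" token is shorter than three characters and contributes nothing.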
1. First field type:
    <tokenizer class="solr.KeywordTokenizerFactory"/>
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="2" side="front"/>
2. Second field type:
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- maybe add WordDelimiter etc. -->
    <filter class="solr.EdgeNGramFilterFactory" minGramSize="3" maxGramSize="10" side="front"/>
3. Field declarations:
    <field name="short_prefix" type="short_ngram" … />
    <field name="long_prefix" type="long_ngram" … />
    <copyField source="short_prefix" dest="long_prefix" />
Chantal
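On the client side, the choice between the two fields could be sketched as follows. This is an illustration only: the field names come from the declarations above, but the helper names, the threshold parameter, and the quoting of the input are assumptions. It also assumes single-word input and that query-time analysis does not break the typed text into further n-grams, which the thread does not specify.

```python
def suggestion_field(user_input, threshold=3):
    """Return the field to search: short_prefix below the threshold, long_prefix otherwise."""
    return "short_prefix" if len(user_input) < threshold else "long_prefix"


def suggestion_query(user_input):
    """Build a field-qualified query string for the typed prefix."""
    # Escape backslashes and double quotes so the input stays inside the quoted term.
    escaped = user_input.replace("\\", "\\\\").replace('"', '\\"')
    return f'{suggestion_field(user_input)}:"{escaped}"'


for typed in ("D", "Ga", "Gab"):
    print(typed, "->", suggestion_query(typed))
```

The resulting string would be passed as the `q` parameter of the search request; the rule itself stays out of the indexing process, which only has to send the value once.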
Finotti Simone 2012-07-27, 09:46
Brilliant! Thank you very much :)
Chantal Ackermann 2012-07-27, 10:48
You're welcome :-)
C