-
Stemmer Question
Jamie Johnson 2012-03-08, 13:18
I was previously using the PorterStemmer for stemming and ran into an issue where it was overly aggressive with some words and abbreviations that I needed left alone. I recently switched to KStem, and I believe the issue is less pronounced, but I am still wondering whether there is a way to define a list of stop words that should not be stemmed, or a way to tell the stemmer to store the unstemmed version as well.

For instance, if a query came in for "Ahmed", the PorterStemmer would turn that into "Ahm", whereas here Ahmed is a name and I want to search for it unstemmed. With a stop word list, I could compile the words I don't want stemmed. Alternatively, if the stemmer could also create a token for the unstemmed word, what went into the index for Ahmed would be "ahmed" and "ahm", so both cases would be covered. What are the drawbacks of providing both?
-
Re: Stemmer Question
Ahmet Arslan 2012-03-08, 13:36
> I was wondering still if there was a way to set a number of stop words for which you didn't want stemming to occur or if there was a way to tell the Stemmer to store the unstemmed version as well.

StemmerOverrideFilterFactory and KeywordMarkerFilterFactory are used for these kinds of purposes.
http://wiki.apache.org/solr/LanguageAnalysis#Customizing_Stemming
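To make the two mechanisms concrete, here is a minimal plain-Java sketch of the idea behind them: a protected-word list that bypasses the stemmer, and an override dictionary that supplies a fixed stem. It is not the Lucene or Solr API. The class name `ProtectedStemSketch`, the `toyStem` helper (a crude suffix stripper, not Porter or KStem), and the sample word lists are all assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only; this is NOT the Lucene/Solr API.
final class ProtectedStemSketch {

    // Hypothetical toy stemmer: strips a trailing "ed" or "s".
    // It stands in for a real stemmer and is neither Porter nor KStem.
    static String toyStem(String term) {
        if (term.length() > 4 && term.endsWith("ed")) {
            return term.substring(0, term.length() - 2);
        }
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Lowercases each term, then applies (in order) the override
    // dictionary, the protected-word list, and finally the toy stemmer.
    static List<String> analyze(List<String> terms,
                                Set<String> protectedWords,
                                Map<String, String> overrides) {
        List<String> out = new ArrayList<>();
        for (String raw : terms) {
            String term = raw.toLowerCase(Locale.ROOT);
            if (overrides.containsKey(term)) {
                out.add(overrides.get(term));   // fixed stem from the dictionary
            } else if (protectedWords.contains(term)) {
                out.add(term);                  // protected: left unstemmed
            } else {
                out.add(toyStem(term));         // everything else is stemmed
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [ahmed, walk, mouse]
        System.out.println(analyze(
                List.of("Ahmed", "walked", "mice"),
                Set.of("ahmed"),
                Map.of("mice", "mouse")));
    }
}
```

Without the protected list, the toy stemmer would reduce "ahmed" to "ahm", which mirrors the problem described in the original question. In Solr itself the equivalent behavior is configured on the field type's analyzer rather than written by hand.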
-
Re: Stemmer Question
Jamie Johnson 2012-03-08, 15:40
Thanks, the KeywordMarkerFilterFactory seems to be what I was looking for. I'm still wondering about keeping the unstemmed word as a token, though. I know this would increase the index size slightly, but what would the other downsides be? It seems less destructive, since both the unstemmed and the stemmed versions are always stored; if the unstemmed version is not stored, there is no way to go back without reindexing.

If I wanted to implement this, I'm assuming a custom tokenizer would be most appropriate? Does something like this already exist?
-
Re: Stemmer Question
Ahmet Arslan 2012-03-08, 16:16
> If I wanted to implement this I'm assuming a custom tokenizer would be most appropriate? Does something like this already exist?

Not out of the box. I actually use your idea myself: I implemented a custom token filter that combines a synonym filter with a stem filter. This is useful for wildcard queries, and for normal queries it could rank exact matches higher.
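The filter mentioned here was not posted to the list, so the following is only a guess at the general technique, written as a plain-Java sketch rather than a Lucene `TokenFilter`. It emits the original term and then, when the stem differs, the stem with a position increment of 0, so that both tokens occupy the same position, much as injected synonyms do. The names `KeepOriginalStemSketch`, `Token`, and `toyStem` are hypothetical, and the toy stemmer is neither Porter nor KStem.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Plain-Java illustration only; this is NOT the Lucene/Solr API
// and not the filter discussed in the thread.
final class KeepOriginalStemSketch {

    // A token's text plus its position increment relative to the
    // previous token (0 means "same position as the previous token").
    static final class Token {
        final String text;
        final int positionIncrement;

        Token(String text, int positionIncrement) {
            this.text = text;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return text + "(+" + positionIncrement + ")";
        }
    }

    // Hypothetical toy stemmer: strips a trailing "ed" or "s".
    static String toyStem(String term) {
        if (term.length() > 4 && term.endsWith("ed")) {
            return term.substring(0, term.length() - 2);
        }
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Emits every lowercased original term, followed by its stem at the
    // same position whenever the stem differs from the original.
    static List<Token> analyze(List<String> terms) {
        List<Token> out = new ArrayList<>();
        for (String raw : terms) {
            String original = raw.toLowerCase(Locale.ROOT);
            out.add(new Token(original, 1));
            String stem = toyStem(original);
            if (!stem.equals(original)) {
                out.add(new Token(stem, 0));    // stacked on the original
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Prints [ahmed(+1), ahm(+0), runs(+1), run(+0), fast(+1)]
        System.out.println(analyze(List.of("Ahmed", "runs", "fast")));
    }
}
```

Because the unstemmed form stays in the index, a query for the exact surface form can match it directly, which is the basis for the exact-match ranking remark above. The cost is extra terms in the index for every word whose stem differs from the original.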
-
Re: Stemmer Question
Jamie Johnson 2012-03-09, 03:58
I'd be very interested to see how you did this, if it is available. Does this seem like something that would be useful to the community at large?
-
Re: Stemmer Question
Ahmet Arslan 2012-03-09, 14:53
> I'd be very interested to see how you did this if it is available.

I PMed it to you. The filter is not a big deal; I just modified org.apache.lucene.wordnet.SynonymTokenFilter. If requested, I can provide it publicly too.
-
Re: Stemmer Question
Jamie Johnson 2012-03-09, 19:27
OK, so I'm digging through the code, and I noticed that org.apache.lucene.analysis.synonym.SynonymFilter mentions a keepOrig attribute. Some googling led me to http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters, which describes a preserveOriginal="1" attribute on solr.WordDelimiterFilterFactory. So it seems I can get the functionality I am looking for by setting preserveOriginal. Is that correct?
-
Re: Stemmer Question
Jamie Johnson 2012-03-09, 19:53
Further digging leads me to believe this is not the case. The synonym filter supports this, but the stemming filter does not.

Ahmet, would you be willing to provide your filter as well? I wonder if we can make it aware of the preserveOriginal attribute on WordDelimiterFilterFactory?
-
Re: Stemmer Question
Jamie Johnson 2012-03-09, 21:04
So I've quickly thrown something together, based on what Ahmet sent, that I believe will preserve the original token as well as the stemmed version. I didn't go as far as weighting them differently using payloads, however. I am not sure how to use the preserveOriginal attribute from WordDelimiterFilterFactory. Can anyone provide guidance on that?
-
Re: Stemmer Question
Jamie Johnson 2012-03-11, 02:36
Barring the horrible name, I am wondering if folks would be interested in having something like this as an alternative to the standard kstemmer. It is largely based on the SynonymFilter, except that it builds tokens using the kstemmer and the original input. I've created a JIRA for this to start discussion, and I'd be really interested in comments and thoughts on it.
https://issues.apache.org/jira/browse/SOLR-3231