|
Terry Steichen
2003-02-28, 18:43
Terry Steichen
2003-02-10, 19:28
Terry Steichen
2003-02-08, 23:36
Doug Cutting
2003-02-10, 18:57
Terry Steichen
2003-01-27, 02:12
Leo Galambos
2003-01-27, 19:15
Terry Steichen
2003-01-26, 16:27
Doug Cutting
2003-02-07, 19:37
Leo Galambos
2003-01-26, 16:56
Terry Steichen
2003-01-25, 01:49
Otis Gospodnetic
2003-01-25, 07:09
|
-
Re: Computing Relevancy Differently
Terry Steichen 2003-02-28, 18:43
Doug,
I've implemented a subclass of DefaultSimilarity (called WESimilarity.java, copy attached) which defines a new lengthNorm() method more or less as you suggested. I then added a line prior to using my IndexWriter: writer.setSimilarity(new WESimilarity()), and a similar line prior to using my IndexSearcher: searcher.setSimilarity(new WESimilarity()).

The result:
1) There's no change whatsoever in the computed scores, and
2) The debugging messages never get printed out.

I know the WESimilarity is being used (because if I rename it I get an exception), but it does not appear that the new lengthNorm() method is being called. It's probably some silly goof, but I can't figure out where it is.

If you (or anyone else, of course) have any ideas/suggestions, I'd appreciate them.

Regards,

Terry
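The thread does not record what the cause turned out to be, so the following is a guess rather than a diagnosis. Two things seem worth checking. First, length norms are computed when documents are indexed and stored in the index, so an index built before the new Similarity was set on the writer has to be rebuilt before scores can change. Second, a method in a subclass only replaces the inherited one if its signature matches exactly; otherwise it is a separate overload that the library never calls. The sketch below shows the second point in plain Java, with no Lucene dependency. The class names `BaseSimilarity`, `MismatchedSimilarity`, `MatchingSimilarity` and `OverrideCheck` are hypothetical stand-ins, and `@Override` assumes a modern Java compiler.

```java
// Hypothetical stand-in for a library base class; not Lucene's actual code.
class BaseSimilarity {
    float lengthNorm(String fieldName, int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }
}

// Parameter type differs (long instead of int), so this is an overload.
// Callers that hold a BaseSimilarity reference never reach it.
class MismatchedSimilarity extends BaseSimilarity {
    float lengthNorm(String fieldName, long numTerms) {
        return 1.0f;
    }
}

// Same signature as the base method, so this is a true override.
// @Override makes the compiler reject the class if the signature drifts.
class MatchingSimilarity extends BaseSimilarity {
    @Override
    float lengthNorm(String fieldName, int numTerms) {
        return 1.0f;
    }
}

final class OverrideCheck {
    // Calls through the base type, the way library code would.
    static float normVia(BaseSimilarity similarity, int numTerms) {
        return similarity.lengthNorm("body", numTerms);
    }

    public static void main(String[] args) {
        System.out.println("mismatched: " + normVia(new MismatchedSimilarity(), 4));
        System.out.println("matching:   " + normVia(new MatchingSimilarity(), 4));
    }
}
```

With the mismatched subclass the base formula still runs, which would look exactly like "no change in scores and no debug output".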
-
Re: Computing Relevancy Differently
Terry Steichen 2003-02-10, 19:28
Doug,
That's excellent. Just what I've been looking for. I'll start experimenting shortly.

Regards,

Terry
-
Re: Computing Relevancy Differently
Terry Steichen 2003-02-08, 23:36
Doug,
Can you give me an idea of what to replace the lengthNorm() method with in order to, for example, remove any special weight given to shorter matching documents?

I can certainly go through a bunch of trial-and-error efforts, but it would help if I had some grasp of the logic initially.

For example, from DefaultSimilarity, here's the lengthNorm() method:

  public float lengthNorm(String fieldName, int numTerms) {
    return (float)(1.0 / Math.sqrt(numTerms));
  }

Should I (for the purpose of eliminating any size bias) override it to always return a 1?

How would I boost the headline field here? Is that how you are supposed to use the (presently unused) fieldName parameter? If that's the case, I assume I would logically (to do what I'm trying to do) make this factor greater than 1 for the 'headline' field, and 1 for all other fields?

Regards,

Terry
-
Re: Computing Relevancy Differently
Doug Cutting 2003-02-10, 18:57
Terry Steichen wrote:
> Can you give me an idea of what to replace the lengthNorm() method with to,
> for example, remove any special weight given to shorter matching documents?

The goal of the default implementation is not to give any special weight to shorter documents, but rather to remove the advantage longer documents have. Longer documents are likely to have more matches simply because they contain more terms. Also, for the query "foo", a document containing just "foo" is a better match than a longer one containing "foo bar baz", since the match is more exact.

However, one problem with this approach can be that very short documents are in fact not very informative. Thus a bias against very short documents is sometimes useful.

> I can certainly go through a bunch of trial-and-error efforts, but it would
> help if I had some grasp of the logic initially.
>
> For example, from DefaultSimilarity, here's the lengthNorm() method:
>
>   public float lengthNorm(String fieldName, int numTerms) {
>     return (float)(1.0 / Math.sqrt(numTerms));
>   }
>
> Should I (for the purpose of eliminating any size bias) override it to
> always return a 1?

That's something to try, although, as mentioned above, I suspect your top hits will be dominated by long documents. Try it. It's really not a difficult experiment!

One trick I've used to keep very short documents, which are good matches but not informative, from dominating results is to override this with something like:

  public float lengthNorm(String fieldName, int numTerms) {
    return super.lengthNorm(fieldName, Math.max(numTerms, 100));
  }

This way all fields shorter than 100 terms are scored like fields containing 100 terms. Long documents are still normalized, but search is biased a bit against very short documents.

> How would I boost the headline field here? Is that how you are supposed to
> use the (presently unused) fieldName parameter? If that's the case, I
> assume I would logically (to do what I'm trying to do) make this factor
> greater than 1 for the 'headline' field, and 1 for all other fields?

You could do that here too. So, for example, you could do something like:

  public float lengthNorm(String fieldName, int numTerms) {
    float n = super.lengthNorm(fieldName, Math.max(numTerms, 100));
    if (fieldName.equals("headline"))
      n *= 4.0f;
    return n;
  }

Equivalently, you could create your documents with something like:

  Document d = new Document();
  Field f = Field.Text("headline", headline);
  f.setBoost(4.0f);
  ...

But headlines tend to be short, and naturally benefit from the default lengthNorm implementation. So what you really might want is something like:

  public float lengthNorm(String fieldName, int numTerms) {
    if (fieldName.equals("headline"))
      return 4.0f * super.lengthNorm(fieldName, numTerms);
    else
      return super.lengthNorm(fieldName, Math.max(numTerms, 100));
  }

This is probably what I'd try first.

Doug
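To get a feel for what these variants do to the numbers, here is a small standalone sketch. It does not use Lucene at all: it only reproduces the formulas quoted in this message as plain static methods, so the class name `LengthNormSketch` and the method names are made up for illustration, and the sample field lengths are arbitrary.

```java
// Standalone sketch of the lengthNorm variants discussed above.
// Mirrors the formulas only; it is not Lucene's Similarity class.
final class LengthNormSketch {

    // Default formula quoted in the thread: 1 / sqrt(numTerms).
    static float defaultNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Fields shorter than 100 terms are scored as if they had 100 terms.
    static float clampedNorm(int numTerms) {
        return defaultNorm(Math.max(numTerms, 100));
    }

    // Last variant in the message: headlines get 4x the unclamped default,
    // every other field gets the clamped norm.
    static float headlineAwareNorm(String fieldName, int numTerms) {
        if ("headline".equals(fieldName)) {
            return 4.0f * defaultNorm(numTerms);
        }
        return clampedNorm(numTerms);
    }

    public static void main(String[] args) {
        int[] lengths = {4, 25, 100, 400, 1600};
        for (int n : lengths) {
            System.out.printf("numTerms=%4d default=%.4f clamped=%.4f%n",
                    n, defaultNorm(n), clampedNorm(n));
        }
        System.out.printf("16-term headline=%.4f, 16-term body=%.4f%n",
                headlineAwareNorm("headline", 16), headlineAwareNorm("body", 16));
    }
}
```

Under the default formula a 4-term field gets a norm of 0.5 and a 400-term field gets 0.05, a tenfold advantage for the short field. With the clamp, the 4-term field drops to 0.1, so the advantage over the 400-term field shrinks to twofold, while fields of 100 terms or more are unaffected.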
-
Re: Computing Relevancy Differently
Terry Steichen 2003-01-27, 02:12
I admit to a bit of frustration.
With the past several messages, I simply asked (or, more accurately, tried to ask) how to alter the way that Lucene ranks relevancy, and I asked whether the selective boost mechanism might do the trick. I admitted that I don't know (nor care to know) the theory behind how relevancy is computed.

So far I've been told to review the archives (which I've done), and then this (which I don't understand - see my embedded [==>]comments below).

What's next? Seems that I'm getting a message: "Figure it out on your own, you dummy." Maybe I've gotten on the wrong list by mistake?

Terry

----- Original Message -----
From: "Leo Galambos" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Sunday, January 26, 2003 11:56 AM
Subject: Re: Computing Relevancy Differently

> 1) Lucene uses the Vector model, if you want to use different model

==>I have no idea of what that means, nor what the alternative to the "Vector model" might be.

> you must understand what you are doing

==>which I don't, as I've already stated several times.

> and you must change similarity calculations.

==>which means what? Is that part of Lucene?

> AFAIK you would set the normalization factor to a constant value (1.0 or so).

==>Does this mean not to use boost?

> 2) you are trying to search for DATA, not INFORMATION. It is a big
> difference. For your task, you could rather use simpler engine that is
> based on RDBMS and B+.

==>I didn't know I was excluding one for the other. Do I interpret all this to mean Lucene can't be adjusted to do what I was asking? That it's too complicated?
-
Re: Computing Relevancy Differently
Leo Galambos 2003-01-27, 19:15
> What's next? Seems that I'm getting a message: "Figure it out on your own,
> you dummy."

And from your letter I understood that you want someone to do your homework (for nothing). :) Right? One would say: ask not what others can do for you, ask what you can do for them.

> >you must understand what you are doing
>
> ==>which I don't, as I've already stated several times.

It is hard for me to say anything. AFAIK (friendly speaking) Lucene does not offer a click-click interface...

> >and you must change similarity calculations.
>
> ==>which means what? Is that part of Lucene?

Doug explained it a few days ago. I hope it is still in the Similarity.java file.

> >AFAIK you would set the normalization factor to a constant value (1.0 or
> so).
>
> ==>Does this mean not to use boost?

I am not God. The final decision is yours.

> ==>I didn't know I was excluding one for the other. Do I interpret all this
> to mean Lucene can't be adjusted to do what I was asking? That it's too
> complicated?

It means Lucene offers much more than you want, so you can use a simpler package that can be configured faster. E.g. UdmSearch uses a simple SQL query...

-g-
-
Re: Computing Relevancy Differently
Terry Steichen 2003-01-26, 16:27
I read all the relevant references I could find in the Users (not Developers) list, and I still don't exactly know what to do. Let me explain a bit more.

The documents I index are all news stories. The typical document body ranges in size from 200 to 2000 words. The document is structured into a couple of dozen indexed fields, but nearly all searching is done in two: the headline and the body.

What I'd like to do is get a relevancy-based order in which (a) longer documents tend to get more weight than shorter ones, (b) a document body with 'X' instances of a query term gets a higher ranking than one with fewer than 'X' instances, and (c) a term found in the headline (usually in addition to finding the same term in the body) is more highly ranked than one with the term only in the body.

But that's not what happens with the default scoring, and I'd like to change that.

I'm guessing, but maybe if I check the document length at indexing time and boost longer documents, that will help. Maybe I could also (at index time) give an extra boost to the headline field. Would that be the most I could do without changing the Lucene core source?

Regards,

Terry

PS: I'm also wondering whether having so many other fields may affect the ranking in a way that diminishes the relevance of the headline and/or body fields.

PPS: I'd just like to clarify another point. Much of the background information on the scoring algorithms is beyond me and I have no interest whatsoever in pushing the boundaries of this part of the technology. All I want to do is use it so it comes out in a way that seems reasonable (without having to become an expert in the complex theory behind this).
-
Re: Computing Relevancy Differently
Doug Cutting 2003-02-07, 19:37
Terry Steichen wrote:
> I read all the relevant references I could find in the Users (not
> Developers) list, and I still don't exactly know what to do.
>
> What I'd like to do is get a relevancy-based order in which (a) longer
> documents tend to get more weight than shorter ones, (b) a document body
> with 'X' instances of a query term gets a higher ranking than one with fewer
> than 'X' instances. and (c) a term found in the headline (usually in
> addition to finding the same term in the body) is more highly ranked than
> one with the term only in the body.

In the latest sources this can all be done by defining your own Similarity implementation. You can make longer documents score higher by overriding the lengthNorm() method. You can boost headlines there, or with Field.setBoost(), or at query time with Query.setBoost().

Doug
-
Re: Computing Relevancy Differently
Leo Galambos 2003-01-26, 16:56
> What I'd like to do is get a relevancy-based order in which (a) longer
> documents tend to get more weight than shorter ones, (b) a document body
> with 'X' instances of a query term gets a higher ranking than one with fewer
> than 'X' instances. and (c) a term found in the headline (usually in
> addition to finding the same term in the body) is more highly ranked than
> one with the term only in the body.
>
> But that's not what happens with the default scoring, and I'd like to change
> that.

I am not a Lucene developer, but:

1) Lucene uses the Vector model. If you want to use a different model, you must understand what you are doing and you must change the similarity calculations. AFAIK you would set the normalization factor to a constant value (1.0 or so).

2) You are trying to search for DATA, not INFORMATION. It is a big difference. For your task, you could rather use a simpler engine that is based on RDBMS and B+.

-g-
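As a rough illustration of what a constant normalization factor does to the ordering, the sketch below uses a deliberately simplified score, sqrt(termFrequency) * norm, and ignores every other factor that real scoring applies. It is standalone Java, not Lucene code; the class `NormEffectSketch`, its methods and the two sample documents (a 200-term story with 4 matches and a 2000-term story with 9 matches) are invented for the example.

```java
// Simplified illustration only: score = sqrt(termFreq) * norm.
// Real scoring has more factors; this isolates the effect of the norm.
final class NormEffectSketch {

    // Term-frequency factor: grows with the number of matches, but sublinearly.
    static float tf(int freq) {
        return (float) Math.sqrt(freq);
    }

    // Length normalization as quoted elsewhere in the thread: 1 / sqrt(numTerms).
    static float defaultNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    static float score(int freq, float norm) {
        return tf(freq) * norm;
    }

    public static void main(String[] args) {
        int shortLen = 200, shortFreq = 4;   // hypothetical short story
        int longLen = 2000, longFreq = 9;    // hypothetical long story

        System.out.printf("default norm:  short=%.4f long=%.4f%n",
                score(shortFreq, defaultNorm(shortLen)),
                score(longFreq, defaultNorm(longLen)));
        System.out.printf("constant norm: short=%.4f long=%.4f%n",
                score(shortFreq, 1.0f),
                score(longFreq, 1.0f));
    }
}
```

In this toy setup the short story ranks first under the default norm and the long story ranks first once the norm is fixed at 1.0, simply because the long story has more matches.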
-
Computing Relevancy Differently
Terry Steichen 2003-01-25, 01:49
How would one go about altering the formula for relevancy? (That is, which modules and which code?) I'm certain that the current algorithm is well founded in logic and probably works well in many environments.
However, I find that, as I index news stories, the current algorithm frequently doesn't produce meaningful rankings. In previous discussions in this list about relevancy, the algorithm seemed to be very complex, possibly too complex for my poor brain to fully grasp. But I'd like to try some other options and see if they result in rankings more in line with what my average viewer would expect.

Regards,

Terry
-
Re: Computing Relevancy Differently
Otis Gospodnetic 2003-01-25, 07:09
Check the lucene-user archives, search for subject "custom scoring api questions". I think that may give you the answer....

Otis
|