-
FunctionQuery and boosting documents using date arithmetic
Pieter Berkel 2007-08-06, 07:21
I've been using a simple variation of the boost function given in the examples used to boost more recent documents:
recip(rord(creationDate),1,1000,1000)^1.3
While it seems to work pretty well, I've realised that this may not be quite as effective as I had hoped, given that the calculation is based on the ordinal of the field value rather than the value of the field itself. In cases where the field type is 'date' and the actual field values are not distributed evenly across all documents in the index, the value returned by rord() is not going to give a true reflection of document age. For example, using Hoss' new date faceting feature, I can see that the rate at which documents have been added to the index I'm maintaining has been slowly but steadily increasing over the past few months, and I fear this will skew the boost value calculated by the function listed above.
There doesn't currently seem to be any way of performing date arithmetic or converting a date field into an integer (seconds since epoch?). Ideally I'd like to be able to do something like:
recip(intval(parseDate('NOW')-parseDate(creationDate)),1,1000,1000)^1.3
so that the function calculates the boost based on the actual document age, rather than the relative age. Does anybody have any thoughts or comments on this approach?
cheers, Piete
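To make the concern above concrete, here is a minimal Python sketch (not Solr code). It assumes recip(x,m,a,b) computes a/(m*x+b) and that rord() assigns 1 to the newest distinct value; the ingest history (one document per day at first, then ten per day) is invented purely for illustration.

```python
def recip(x, m, a, b):
    """Reciprocal function as used in the boost: a / (m*x + b)."""
    return a / (m * x + b)

def rord_map(values):
    """Reverse ordinal over distinct values: the largest value gets 1."""
    distinct = sorted(set(values), reverse=True)
    return {v: i + 1 for i, v in enumerate(distinct)}

MIN_PER_DAY = 1440

# Hypothetical ingest history, expressed as document age in minutes.
recent_ages = [i * 144 for i in range(300)]                   # 10 docs/day, last 30 days
old_ages = [(30 + i) * MIN_PER_DAY for i in range(30)]        # 1 doc/day before that
ages = recent_ages + old_ages

# Creation time relative to "now": newer documents have larger values.
rord = rord_map([-age for age in ages])

for age_days in (0, 15, 30, 59):
    r = rord[-age_days * MIN_PER_DAY]
    print(age_days, "days old -> rord", r, "boost", round(recip(r, 1, 1000, 1000), 4))
```

In this made-up index the 29 days between the 30-day-old and 59-day-old documents span only 29 ordinals, while the most recent 30 days span 300, so the ordinal-based boost falls much faster per day in the busy period than in the quiet one.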
-
Re: FunctionQuery and boosting documents using date arithmetic
Pieter Berkel 2007-08-06, 12:17
Actually, thinking about this a bit more: adding a function call such as parseDate() might add too much overhead to the actual query. Perhaps it would be better to first convert the date to a timestamp at index time and store it in a field of type slong? This might be more efficient, but it still leaves the problem of obtaining the current timestamp to use in the boost function.
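As a rough sketch of the index-time conversion suggested above, the following Python converts a UTC date string into seconds since epoch before the document is sent for indexing. The field name creationEpoch and the helper names are hypothetical, and the timestamp format is assumed to be the ISO 8601 UTC form (for example 2007-08-06T07:21:00Z).

```python
from datetime import datetime, timezone

def to_epoch_seconds(utc_date: str) -> int:
    """Parse a UTC timestamp such as '2007-08-06T07:21:00Z' into epoch seconds."""
    dt = datetime.strptime(utc_date, "%Y-%m-%dT%H:%M:%SZ")
    return int(dt.replace(tzinfo=timezone.utc).timestamp())

def age_seconds(created_epoch: int, now_epoch: int) -> int:
    """Document age in seconds, clamped at zero for clock skew."""
    return max(0, now_epoch - created_epoch)

# Hypothetical document prepared for indexing; 'creationEpoch' is an
# illustrative extra field that would be stored as a sortable long.
doc = {"id": "doc-1", "creationDate": "2007-08-06T07:21:00Z"}
doc["creationEpoch"] = to_epoch_seconds(doc["creationDate"])
print(doc)
```

The query side would still need a value for the current time, which is the open problem mentioned in the message.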
-
Re: FunctionQuery and boosting documents using date arithmetic
Chris Hostetter 2007-08-11, 02:30
: Actually, just thinking about this a bit more, perhaps adding a function
: call such as parseDate() might add too much overhead to the actual query,
: perhaps it would be better to first convert the date to a timestamp at index
: time and store it in a field type slong? This might be more efficient but
i would agree with you there, this is where a more robust (ie: less efficient) DateField-ish class that supports configuration options to specify:
  1) the output format
  2) the input format(s)
  3) the indexed format
...as SimpleDateFormat pattern strings would be handy. The ValueSource it uses could return seconds (or some other unit based on another config option) since epoch as the intValue.
it's been discussed before, but there are a lot of tricky issues involved which is probably why no one has really tackled it.
: that still leaves the problem of obtaining the current timestamp to use in
: the boost function.
it would be pretty easy to write a ValueSource that just knew about "now" as seconds since epoch.
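The idea of a value source that "just knows about now" can be sketched in Python as follows. This is not the Solr ValueSource API; the class names ConstantNow and AgeSeconds are hypothetical, and the only point illustrated is that "now" is captured once so every document in a query sees the same value.

```python
import time

class ConstantNow:
    """Captures 'now' (epoch seconds) once, at construction time."""

    def __init__(self, now_epoch=None):
        self.now_epoch = int(time.time()) if now_epoch is None else int(now_epoch)

    def value(self, doc_id):
        # Same answer for every document, however long the query takes.
        return self.now_epoch

class AgeSeconds:
    """Age of a document: now minus its stored creation timestamp."""

    def __init__(self, created_epoch_by_doc, now_source):
        self.created = created_epoch_by_doc
        self.now = now_source

    def value(self, doc_id):
        return self.now.value(doc_id) - self.created[doc_id]

now = ConstantNow(1_000_000)  # fixed value so the example is repeatable
age = AgeSeconds({"d1": 999_000, "d2": 913_600}, now)
print(age.value("d1"), age.value("d2"))
```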
: > While it seems to work pretty well, I've realised that this may not be
: > quite as effective as i had hoped given that the calculation is based on the
: > ordinal of the field value rather than the value of the field itself. In
: > cases where the field type is 'date' and the actual field values are not
: > distributed evenly across all documents in the index, the value returned by
: > rord() is not going to give a true reflection of document age. For example,
be careful what you wish for. you are 100% correct that functions using the (r)ord value of a DateField aren't a function of true age, but depending on how you look at it that may be better than using the real age (i think so anyway). While it sounds appealing to say that docA should score half as high as docB if it is twice as old, that typically isn't all that important when dealing with recent dates; and when dealing with older dates the ordinal value tends to approximate it decently well ... where a true measure of age might screw you up is when you have situations where few/no new articles get published on weekends (or late at night). it's also very confusing to people when the ordering of documents changes even though no new documents have been published -- that can easily happen if you are heavily boosting on a true age calculation but will never happen when dealing with an ordinal ranking of documents by age.
(although, this could be compensated for by doing all of your true age calculations relative to the "min age" of all articles in your index -- but you would still get really weird 'big' shifts in scores as soon as that first article gets published on monday morning.)
-Hoss
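The point about result ordering changing without any new documents can be illustrated with a small Python sketch. It assumes the date boost is simply added to a relevance score and that recip(x,m,a,b) is a/(m*x+b); the two documents, their relevance scores and the constants are invented for illustration.

```python
def recip(x, m, a, b):
    return a / (m * x + b)

# Two hypothetical documents: A is newer, B is older but more relevant.
DOCS = {
    "A": {"relevance": 1.0, "created_day": 10},
    "B": {"relevance": 1.3, "created_day": 0},
}

def rank_by_true_age(now_day):
    """Score = relevance + boost computed from age in days at query time."""
    scores = {
        name: d["relevance"] + recip(now_day - d["created_day"], 1, 1, 1)
        for name, d in DOCS.items()
    }
    return sorted(scores, key=scores.get, reverse=True)

def rank_by_ordinal():
    """Score = relevance + boost computed from the reverse ordinal (newest = 1)."""
    newest_first = sorted(DOCS, key=lambda n: DOCS[n]["created_day"], reverse=True)
    rord = {name: i + 1 for i, name in enumerate(newest_first)}
    scores = {n: DOCS[n]["relevance"] + recip(rord[n], 1, 1, 1) for n in DOCS}
    return sorted(scores, key=scores.get, reverse=True)

print(rank_by_true_age(10))  # queried on the day A was published
print(rank_by_true_age(15))  # same index, five days later
print(rank_by_ordinal())     # does not depend on the query time
```

With the true-age boost the two documents swap places between day 10 and day 15 although the index is unchanged; the ordinal-based ranking stays the same until a new document is added.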
-
Re: FunctionQuery and boosting documents using date arithmetic
Pieter Berkel 2007-08-12, 12:02
On 11/08/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> i would agree with you there, this is where a more robust (ie:
> less efficient) DateField-ish class that supports configuration options
> to specify:
> 1) the output format
> 2) the input format(s)
> 3) the indexed format
> ...as SimpleDateFormatter pattern strings would be handy. The
> ValueSource it uses could return seconds (or some other unit based on
> another config option) since epoch as the intValue.
That definitely sounds like a sensible and flexible approach, I'll have to take a closer look at the ValueSource and FunctionQuery classes and see what I can come up with.
> it's been discussed before, but there are a lot of tricky issues involved
> which is probably why no one has really tackled it.
It does seem somehow related to the issue of making the value of NOW constant during the entire execution of a query; hopefully that's not in the too-hard basket.
> be careful what you wish for. you are 100% correct that functions using
> the (r)ord value of a DateField aren't a function of true age, but
> depending on how you look at it that may be better than using the real age
> (i think so anyway).
I understand the problems you describe with using true age values, although I wonder how much recip() (or perhaps a logarithmic function) would be able to dampen any unpleasant side-effects created by unusual publishing patterns, not publishing on weekends, etc. Using "min age" rather than NOW also sounds like a much better way to avoid the described weirdness, but that might increase the complexity of the function.
I'm still keen to get something working, at least to compare the results it generates with the current ordinal method.
Piete
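The "min age" variant discussed above (measuring age relative to the newest document rather than to NOW) can be sketched in Python like this. The creation times are hours on an arbitrary clock, the constants passed to recip are made up, and recip(x,m,a,b) is assumed to be a/(m*x+b).

```python
def recip(x, m, a, b):
    return a / (m * x + b)

def boosts_relative_to_newest(created_hour_by_doc):
    """Boost each document by its age relative to the newest document."""
    newest = max(created_hour_by_doc.values())
    return {
        doc: recip(newest - created, 1, 24, 24)
        for doc, created in created_hour_by_doc.items()
    }

# Hypothetical index over a weekend: nothing published after Friday.
weekend = {"thu": 0, "fri": 24}
# Monday morning: one new article arrives 72 hours after Friday's.
monday = dict(weekend, mon=96)

print(boosts_relative_to_newest(weekend))
print(boosts_relative_to_newest(monday))
```

Over the weekend the boosts do not move at all, since the clock is not involved, but the Friday article falls from 1.0 to 0.25 the moment the Monday article is indexed, which is the sudden shift Hoss warns about.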
-
Re: FunctionQuery and boosting documents using date arithmetic
climbingrose 2007-08-11, 08:52
I'm having the same date boosting problem as well. I'm using this function: F = recip(rord(creationDate),1,1000,1000)^10. However, since I have around 10,000 documents added in one day, rord(creationDate) returns very different values for documents created on the same day. For example, the last document added today will have rord(creationDate) = 1 while the first document added today will have rord(creationDate) = 10,000. When rord(creationDate) > 10,000, the value of F approaches 0. Therefore, the boost query doesn't make any difference between the last document added today and the document added 10 days ago. Now if I replace 1000 in F with a large number, say 100000, the boost function suddenly gives the last few documents an enormous boost and makes the other query scores irrelevant.
So in my case (and many others', I believe), the "true" date value would be more appropriate. I'm thinking along the same lines, i.e. adding a timestamp. It wouldn't add much overhead this way, would it?
Regards,
Cuong Hoang
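For a feel of the numbers in the message above, this Python sketch evaluates recip(rord,1,1000,1000) at a few ordinals. It assumes recip(x,m,a,b) is a/(m*x+b), that every document has a distinct timestamp (so each gets its own ordinal), and a steady 10,000 documents per day; the trailing ^10 is left out.

```python
def recip(x, m, a, b):
    return a / (m * x + b)

DOCS_PER_DAY = 10_000  # figure quoted in the message

# Reverse ordinals under the assumptions stated above.
samples = [
    ("newest document today", 1),
    ("oldest document today", DOCS_PER_DAY),
    ("document from 10 days ago", 10 * DOCS_PER_DAY),
]

for label, rord in samples:
    print(f"{label}: rord={rord}, recip={recip(rord, 1, 1000, 1000):.4f}")
```

Under these assumptions most of the drop happens within a single day's worth of documents (from about 0.999 to about 0.091), and everything older is squeezed into the small range below that.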
-
Re: FunctionQuery and boosting documents using date arithmetic
Pieter Berkel 2007-08-12, 12:10
Do you consistently add 10,000 documents to your index every day or does the number of new documents added per day vary?
-
Re: FunctionQuery and boosting documents using date arithmetic
climbingrose 2007-08-12, 12:15
We add around 10,000 docs during week days and 5,000 during weekends.
-- Regards,
Cuong Hoang
-
Re: FunctionQuery and boosting documents using date arithmetic
Chris Hostetter 2007-08-12, 23:09
: I'm having the date boosting function as well. I'm using this function:
: F = recip(rord(creationDate),1,1000,1000)^10. However, since I have around
: 10,000 of documents added in one day, rord(createDate) returns very
: different values for the same createDate. For example, the last document
you may want to consider rounding dates down to the nearest day when indexing, that way everything published on the same day would have the same value and thus the same ordinal value.
the hypothetical DateField-ish class i described before (let's call it "CustomizableFormatsDateField" for the sake of vocabulary) could make this trivial.
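Rounding down to the day before indexing could look roughly like the Python below. The helper names are hypothetical and the timestamps are invented; the sketch only shows that same-day documents end up sharing one value and therefore one reverse ordinal.

```python
from datetime import datetime

FMT = "%Y-%m-%dT%H:%M:%SZ"

def round_down_to_day(utc_date: str) -> str:
    """Truncate a UTC timestamp string to midnight of the same day."""
    dt = datetime.strptime(utc_date, FMT)
    return dt.strftime("%Y-%m-%dT00:00:00Z")

def rord_map(values):
    """Reverse ordinal over distinct values: the largest value gets 1."""
    distinct = sorted(set(values), reverse=True)
    return {v: i + 1 for i, v in enumerate(distinct)}

# Hypothetical creation dates; ISO 8601 strings sort chronologically.
raw = ["2007-08-11T08:52:10Z", "2007-08-11T17:03:44Z", "2007-08-10T23:59:59Z"]
rounded = [round_down_to_day(v) for v in raw]

print(rord_map(raw))      # three distinct values, three ordinals
print(rord_map(rounded))  # two distinct values, two ordinals
```

Fewer distinct values also means fewer unique terms for the date field in the index.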
: approaching 0. Therefore, the boost query doesn't make any difference
: between the last document added today and the document added 10 days
: ago. Now if I replace 1000 in F with a large number, say 100000, the boost
: function suddenly gives the last few documents enormous boost and make the
: other query scores irrelevant.
you might be able to mitigate your problem by capping the contribution from the date boost -- there is no "min" ValueSource out of the box, but there is a MaxFloatFunction and with some creative uses of LinearFloatFunction i think you could get your date function queries to cap out at a fixed number of your choice.
-Hoss
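One possible reading of the "creative use" suggested above is the identity min(x, cap) = -max(-x, -cap). The Python sketch below checks that identity with plain functions; linear(x,m,c) is assumed to mean m*x+c and max_const(x,c) the larger of a value and a constant, mirroring what LinearFloatFunction and MaxFloatFunction are described as doing, but none of this is Solr code.

```python
def linear(x, m, c):
    """m*x + c, standing in for a linear function of another value."""
    return m * x + c

def max_const(x, c):
    """Larger of a value and a constant."""
    return max(x, c)

def capped(x, cap):
    """min(x, cap) built only from linear() and max_const()."""
    negated = linear(x, -1, 0)            # -x
    floored = max_const(negated, -cap)    # max(-x, -cap)
    return linear(floored, -1, 0)         # -max(-x, -cap) == min(x, cap)

for boost in (0.5, 2, 99.9):
    print(boost, "->", capped(boost, 2))
```

Values below the cap pass through unchanged, and anything above it is clamped to the cap.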
-
Re: FunctionQuery and boosting documents using date arithmetic
Yonik Seeley 2007-08-14, 01:46
On 8/12/07, Chris Hostetter <[EMAIL PROTECTED]> wrote:
> : I'm having the date boosting function as well. I'm using this function:
> : F = recip(rord(creationDate),1,1000,1000)^10. However, since I have around
> : 10,000 of documents added in one day, rord(createDate) returns very
> : different values for the same createDate. For example, the last document
>
> you may want to consider rounding dates down to the nearest day when
> indexing, that way everything published on the same day would have the
> same value and thus the same ordinal value.
Yeah, and that will save index space and a lot of memory (smaller FieldCache entry) too.
-Yonik