britske
2011-09-26, 09:51
David Smiley
2011-09-27, 06:27
Chris Hostetter
2011-10-01, 02:25
Mikhail Khludnev
2011-10-01, 17:57
Geert-Jan Brits
2011-10-03, 11:09
Geert-Jan Brits
2011-10-03, 11:42
Mikhail Khludnev
2011-10-03, 12:52
Chris Hostetter
2011-10-11, 01:21
Geert-Jan Brits
2011-10-11, 09:43
Chris Hostetter
2011-11-01, 23:12
-
multiple dateranges/timeslots per doc: modeling openinghours.britske 2011-09-26, 09:51
Sorry for the somewhat lengthy post. I would like to make clear that I have covered
my bases here and am looking for an alternative solution, because the more trivial solutions don't seem to work for my use case.

Consider bars, museums, etc. These places have multiple opening hours that can depend on:

REQ 1. day of week
REQ 2. special days on which they are closed, or otherwise have different opening hours than their related 'day of week'

Now, I want to model these 'places' in a way that lets me do temporal queries like:

- which bars are open NOW (and stay open for at least another 3 hours)
- which museums are (already) open at 25-12-2011 10AM and stay open until (at least) 3PM

I believe having opening/closing hours available for each day at least gives me the data needed to answer the queries above. (Note that having dayOfWeek*openinghours is not enough, because of the special cases in REQ 2.)

Okay, knowing I need openinghours*dates for each place, how would I format this in documents?

OPTION A)
-----------
Considering granularity: I want documents to represent Places and not Places*dates. Although the latter would trivially allow me to do the querying mentioned above, it has these disadvantages:

- The same place is returned multiple times (each with a different date) when queries are not constrained to a date.
- Lots of data needs to be duplicated, all for the conceptually 'simple' functionality of needing multiple date ranges. It feels bad, and a simpler solution should exist?
- It explodes the result set (documents = say, 100 dates * 1,000,000 places = 100,000,000). Suddenly the size of the result set goes from 'easily doable' to 'hmmm, I have to think about this'. Given that places also have some other fields to sort on, Lucene FieldCache memory usage would explode by a factor of 100.

OPTION B)
----------
Another, faulty, option would be to model opening/closing hours in 2 multivalued date fields, i.e. open and close, and insert open/close for each day, e.g.:

open: 2011-11-08:1800 - close: 2011-11-09:0300
open: 2011-11-09:1700 - close: 2011-11-10:0500
open: 2011-11-10:1700 - close: 2011-11-11:0300

And queries would be of the form:

'open < now && close > now+3h'

But since there is no way to indicate that 'open' and 'close' are pairwise related, I will get a lot of false positives. E.g. the above document would be returned for:

open < 2011-11-09:0100 && close > 2011-11-09:0600

because SOME open date is before 2011-11-09:0100 (i.e. 2011-11-08:1800) and SOME close date is after 2011-11-09:0600 (for example 2011-11-11:0300), but these open and close dates are not pairwise related.

OPTION C) The best of what I have now:
---------------------------------------
I have been thinking about a totally different approach using Solr dynamic fields, in which each and every opening and closing date gets its own dynamic field, e.g.:

_date_2011-11-08_open: 1800
_date_2011-11-09_close: 0300
_date_2011-11-09_open: 1700
_date_2011-11-10_close: 0500
_date_2011-11-10_open: 1700
_date_2011-11-11_close: 0300

Then the client should know the date to query, and thus the correct fields to query. This would solve the problem, since start date and end date are now pairwise related, but I fear this can be a big issue from a performance standpoint (especially memory consumption of the Lucene FieldCache).

IDEAL OPTION D)
----------------
I'm pretty sure this does not exist out of the box, but Solr might be extended to support it. Okay, Solr has a fieldtype 'date', but what if it also had a fieldtype 'Daterange'? A Daterange would be modeled as <DateTimeA,DateTimeB> or <DateTimeA,Delta DateTimeA>.

Then this problem would be really easily modelled as a multivalued field 'openinghours' of type 'Daterange'. However, I have the feeling that the standard range-query implementation can't be used on this fieldtype, or perhaps would have to be run for each of the N Daterange values in 'openinghours'.
To make matters worse (I didn't want to introduce this above):

REQ 3. It may be possible that certain places have multiple opening hours / timeslots each day. Consider a museum in Spain which gets closed around noon because of siesta time.

Option D would be able to handle this natively; none of the other options can.

I would very much appreciate any pointers on:

- how to start with Option D, and whether this approach is at all feasible.
- whether Option C would suffice (excluding REQ 3), and whether I'm likely to run into performance / memory troubles.
- any other possible solutions I haven't thought of to tackle this.

Thanks a lot.

Cheers,
Geert-Jan

View this message in context: http://lucene.472066.n3.nabble.com/multiple-dateranges-timeslots-per-doc-modeling-openinghours-tp3368790p3368790.html
Sent from the Solr - User mailing list archive at Nabble.com.
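
The false-positive problem described under Option B can be reproduced outside Solr in a few lines. The following is only a minimal Python sketch of the matching semantics, not Solr code; the helper names (dt, matches_unpaired, matches_paired) are made up for illustration, and the data is the three-slot example from the post.

```python
from datetime import datetime


def dt(text):
    """Parse the 'YYYY-MM-DD:HHMM' notation used in the post."""
    return datetime.strptime(text, "%Y-%m-%d:%H%M")


# The three opening slots from the Option B example, kept as (open, close) pairs.
SLOTS = [
    (dt("2011-11-08:1800"), dt("2011-11-09:0300")),
    (dt("2011-11-09:1700"), dt("2011-11-10:0500")),
    (dt("2011-11-10:1700"), dt("2011-11-11:0300")),
]


def matches_unpaired(slots, start, end):
    """Option B semantics: 'open' and 'close' are independent multivalued fields."""
    opens = [o for o, _ in slots]
    closes = [c for _, c in slots]
    return any(o < start for o in opens) and any(c > end for c in closes)


def matches_paired(slots, start, end):
    """Desired semantics: a single slot has to cover the whole window."""
    return any(o < start and c > end for o, c in slots)


start, end = dt("2011-11-09:0100"), dt("2011-11-09:0600")
print(matches_unpaired(SLOTS, start, end))  # True  (the false positive)
print(matches_paired(SLOTS, start, end))    # False (the place closes at 03:00)
```

Any workable model has to behave like matches_paired, which is what the rest of the thread tries to achieve in different ways.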
-
Re: multiple dateranges/timeslots per doc: modeling openinghours.David Smiley 2011-09-27, 06:27
In case anyone is curious, I responded to him with a solution using either
SOLR-2155 (Geohash prefix query filter) or LSP:
https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13115244#comment-13115244

~ David Smiley

-----
Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context: http://lucene.472066.n3.nabble.com/multiple-dateranges-timeslots-per-doc-modeling-openinghours-tp3368790p3371747.html
-
Re: multiple dateranges/timeslots per doc: modeling openinghours.Chris Hostetter 2011-10-01, 02:25
: Another, faulty, option would be to model opening/closing hours in 2
: multivalued date-fields, i.e: open, close. and insert open/close for each
: day, e.g:
:
: open: 2011-11-08:1800 - close: 2011-11-09:0300
: open: 2011-11-09:1700 - close: 2011-11-10:0500
: open: 2011-11-10:1700 - close: 2011-11-11:0300
:
: And queries would be of the form:
:
: 'open < now && close > now+3h'
:
: But since there is no way to indicate that 'open' and 'close' are pairwise
: related I will get a lot of false positives, e.g the above document would be
: returned for:

This isn't possible out of the box, but the general idea of "position linked" queries is possible using the same approach as the FieldMaskingSpanQuery...

https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/spans/FieldMaskingSpanQuery.html
https://issues.apache.org/jira/browse/LUCENE-1494

...implementing something like this that would work with (Numeric)RangeQueries however would require some additional work, but it should certainly be doable -- i've suggested this before but no one has taken me up on it...

http://markmail.org/search/?q=hoss+FieldMaskingSpanQuery

If we take it as a given that you can do multiple ranges "at the same position", then you can imagine supporting all of your "regular" hours using just two fields ("open" and "close") by encoding the day+time of each range of open hours into them -- even if a store is open for multiple sets of ranges per day (ie: closed for siesta)...

open: mon_12_30, tue_12_30, wed_07_30, wed_15_30, ...
close: mon_20_00, tue_20_30, wed_12_30, wed_22_30, ...

then asking for "stores open now and for the next 3 hours" on "wed" at "2:13PM" becomes a query for...

sameposition(open:[* TO wed_14_13], close:[wed_17_13 TO *])

For the special case part of your problem, when there are certain dates that a store will be open atypical hours, i *think* that could be solved using some special docs and the new "join" QParser in a filter query...

https://wiki.apache.org/solr/Join

imagine you have your "regular" docs with all the normal data about a store, and the open/close fields i describe above. but in addition to those, for any store that you know is "closed on dec 25" or "only open 12:00-15:00 on Jan 01" you add an additional small doc encapsulating the information about the store's closures on that special date -- so that each special case would be its own doc, even if one store had 5 days where there was a special case...

specialdoc1:
  store_id: 42
  special_date: Dec-25
  status: closed
specialdoc2:
  store_id: 42
  special_date: Jan-01
  status: irregular
  open: 09_30
  close: 13_00

then when you are executing your query, you use an "fq" to constrain to stores that are (normally) open right now (like i mentioned above) and you use another fq to find all docs *except* those resulting from a join against these special case docs based on the current date.

so if your query is "open now and for the next 3 hours" and "now" == "sunday, 2011-12-25 @ 10:17AM" your query would be something like...

q=...user input...
time=sameposition(open:[* TO sun_10_17], close:[sun_13_17 TO *])
fq={!v=$time}
fq={!join from=store_id to=unique_key v=$vv}
vv=-(+special_date:Dec-25 +(status:closed OR _query_:"{!v=$time}"))

That join based approach for dealing with the special dates should work regardless of whether someone implements a way to do pairwise "sameposition()" range queries ... so if you can live w/o the multiple open/close pairs per day, you can just use the "one field per day of the week" type approach you mentioned combined with the "join" for special case days of the year and everything you need should already work w/o any code (on trunk).

(disclaimer: obviously i haven't tested that query, the exact syntax may be off but the principle for modeling the "special docs" and using them in a join should work)

-Hoss
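
One practical detail of the day+time encoding is that range queries over string terms compare the keys lexicographically, so the keys have to sort in chronological order. Day names such as mon/tue/wed do not sort in weekday order in general, and an unpadded hour (say wed_3_30) sorts after wed_14_13. The Python sketch below shows one possible sortable key and emulates the proposed sameposition() semantics; week_key and same_position are hypothetical helper names, and sameposition() itself does not exist in Solr.

```python
DAYS = ["mon", "tue", "wed", "thu", "fri", "sat", "sun"]


def week_key(day, hour, minute):
    """Hypothetical sortable key: day index plus zero-padded hour and minute."""
    return f"{DAYS.index(day)}_{hour:02d}_{minute:02d}"


def same_position(opens, closes, start_key, end_key):
    """Emulate sameposition(): the i-th open and i-th close form one range."""
    return any(o <= start_key and c >= end_key for o, c in zip(opens, closes))


# Plain day names and unpadded hours do not sort chronologically as strings.
print(sorted(DAYS))               # alphabetical order, not weekday order
print("wed_3_30" <= "wed_14_13")  # False, although 03:30 precedes 14:13

# A store that closes for siesta on Wednesday: 07:30-12:30 and 15:30-22:30.
opens = [week_key("wed", 7, 30), week_key("wed", 15, 30)]
closes = [week_key("wed", 12, 30), week_key("wed", 22, 30)]

print(same_position(opens, closes, week_key("wed", 16, 0), week_key("wed", 19, 0)))  # True
print(same_position(opens, closes, week_key("wed", 12, 0), week_key("wed", 16, 0)))  # False
```

The sketch ignores slots that wrap around the end of the week (Sunday night into Monday morning); those would need extra handling.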
-
Re: multiple dateranges/timeslots per doc: modeling openinghours.Mikhail Khludnev 2011-10-01, 17:57
I agree about SpanQueries. It's a viable measure against "false-positive
matches on multivalue fields". We implemented this approach some time ago. Please find details at

http://blog.griddynamics.com/2011/06/solr-experience-search-parent-child.html

and

http://blog.griddynamics.com/2011/07/solr-experience-search-parent-child.html

We are going to publish a third post about the implementation approaches.

--
Sincerely yours
Mikhail (Mike) Khludnev
Developer
Grid Dynamics
tel. 1-415-738-8644
Skype: mkhludnev
<http://www.griddynamics.com>
-
Re: multiple dateranges/timeslots per doc: modeling openinghours.Geert-Jan Brits 2011-10-03, 11:09
Interesting! Reading your previous blog posts, I gather that the to-be-posted
'implementation approaches' post includes a way of making the SpanQueries available within Solr?

Also, with your approach, would (numeric) RangeQueries be possible, as Hoss suggests?

Looking forward to that 'implementation post'.

Cheers,
Geert-Jan
-
Re: multiple dateranges/timeslots per doc: modeling openinghours.Geert-Jan Brits 2011-10-03, 11:42
Thanks Hoss for that in-depth walkthrough.

I like your solution of using (something akin to) FieldMaskingSpanQuery (https://lucene.apache.org/java/3_4_0/api/core/org/apache/lucene/search/spans/FieldMaskingSpanQuery.html). Conceptually the join approach looks like it would work on paper, although I'm not a big fan of introducing a lot of complexity to the frontend / querying part of the solution.

As an alternative, what about using your FieldMaskingSpanQuery approach alone (without the join approach) and encoding open/close on a per-day basis? I didn't mention it, but I 'only' need 100 days of data, which would lead to 100 open and 100 close values, not counting the POIs with multiple opening hours per day, which are pretty rare. The index is rebuilt each night, refreshing the date data.

I'm not sure what the performance implications would be, but somehow that feels doable. Perhaps it even offsets the extra time needed for doing the joins; only one way to find out, I guess. A disadvantage would be fewer cache hits when using FQ.

Data then becomes:

open: 20111020_12_30, 20111021_12_30, 20111022_07_30, ...
close: 20111020_20_00, 20111021_26_30, 20111022_12_30, ...

Notice the 20111021_26_30, which indicates closing at 2:30 AM the next day, which would work (in contrast to encoding it like 20111022_02_30).

Alternatively, how would you compare your suggested approach with the approach by David Smiley using either SOLR-2155 (Geohash prefix query filter) or LSP:
https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13115244#comment-13115244
That would work right now, and the LSP approach seems pretty elegant to me. FQ-style caching is probably not possible though.

Geert-Jan
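
The per-date keys with an hour of 24 or more for closing times after midnight can be generated mechanically. The following Python sketch shows one way to do it; slot_keys is a hypothetical helper, and the key layout (YYYYMMDD_HH_MM anchored on the opening date) is taken from the example values in the message above.

```python
from datetime import datetime


def slot_keys(open_dt, close_dt):
    """Return (open_key, close_key), both anchored on the opening date.

    A close time after midnight is written with an hour of 24 or more,
    so it still sorts after every time on the opening date.
    """
    day = open_dt.date()
    extra_days = (close_dt.date() - day).days
    prefix = day.strftime("%Y%m%d")
    open_key = f"{prefix}_{open_dt.hour:02d}_{open_dt.minute:02d}"
    close_hour = close_dt.hour + 24 * extra_days
    close_key = f"{prefix}_{close_hour:02d}_{close_dt.minute:02d}"
    return open_key, close_key


open_key, close_key = slot_keys(datetime(2011, 10, 21, 12, 30),
                                datetime(2011, 10, 22, 2, 30))
print(open_key, close_key)  # 20111021_12_30 20111021_26_30

# String comparison against a window end of 01:30 on the 22nd:
print(close_key >= "20111022_01_30")  # False: phrased with the calendar date
print(close_key >= "20111021_25_30")  # True: phrased relative to the opening date
```

As the last two lines suggest, with this encoding a query window that falls after midnight appears to need its bounds expressed relative to the previous date (hour + 24) in order to match a slot that started the evening before.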
-
Re: multiple dateranges/timeslots per doc: modeling openinghours.Mikhail Khludnev 2011-10-03, 12:52
On Mon, Oct 3, 2011 at 3:09 PM, Geert-Jan Brits wrote:

> Interesting! Reading your previous blogposts, I gather that the to be posted
> 'implementation approaches' includes a way of making the SpanQueries
> available within SOLR?

It's going to be posted in two days. But please don't expect much from it; it's just a proof of concept. It's not code for production nor for contribution. E.g. we've chosen a 'quick hack' way of converting boolean queries instead of XmlQuery, SurroundParser, contrib's query parser, etc. I.e. we can share only core ideas, some of which are possibly wrong.

> Also, would with your approach would (numeric) RangeQueries be possible as
> Hoss suggests?

Basically, range queries over numbers are just disjunctions of the matching terms (sometimes that's not great at all). If you encode your terms in a sortable manner, e.g. A0715 for Monday 7:15 am, you'll be able to build the equivalent span 'disjunction' yourself: new SpanOrQuery(new SpanTermQuery(..), ...).

Regards
Mikhail

> Looking forward to that 'implementation post'
> Cheers,
> Geert-Jan

--
Sincerely yours
Mikhail (Mike) Khludnev
Developer
Grid Dynamics
tel. 1-415-738-8644
Skype: mkhludnev
<http://www.griddynamics.com>
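
The idea of replacing a range query by an OR over the individual indexed terms can be illustrated without Lucene. In the Python sketch below, term follows the A0715 example (A = Monday); the letters B to G for the remaining weekdays and the helper expand_range are assumptions made for illustration. The returned list corresponds to the terms one would wrap in SpanTermQuery instances and combine in a SpanOrQuery.

```python
from bisect import bisect_left, bisect_right

DAY_LETTERS = "ABCDEFG"  # A = Monday ... G = Sunday (B-G are an assumed mapping)


def term(day_index, hour, minute):
    """Encode a weekly time as a sortable term, e.g. Monday 07:15 -> 'A0715'."""
    return f"{DAY_LETTERS[day_index]}{hour:02d}{minute:02d}"


def expand_range(indexed_terms, low, high):
    """Return the indexed terms within [low, high]; these get OR-ed together."""
    terms = sorted(indexed_terms)
    return terms[bisect_left(terms, low):bisect_right(terms, high)]


indexed = [term(0, 7, 15), term(0, 12, 30), term(1, 9, 0), term(2, 7, 30)]
print(term(0, 7, 15))                                         # A0715
print(expand_range(indexed, term(0, 0, 0), term(1, 23, 59)))  # ['A0715', 'A1230', 'B0900']
```

The number of OR clauses grows with the number of distinct terms inside the range, which is probably why a coarser time granularity helps with this approach.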
-
Re: multiple dateranges/timeslots per doc: modeling openinghours.Chris Hostetter 2011-10-11, 01:21
: Conceptually
: the Join-approach looks like it would work from paper, although I'm not a
: big fan of introducing a lot of complexity to the frontend / querying part
: of the solution.

you lost me there -- i don't see how using join would impact the front end / query side at all. your query clients would never even know that a join had happened (your indexing code would certainly have to know about creating those special case docs to join against, obviously)

: As an alternative, what about using your fieldMaskingSpanQuery-approach
: solely (without the JOIN-approach) and encode open/close on a per day
: basis?
: I didn't mention it, but I 'only' need 100 days of data, which would lead to
: 100 open and 100 close values, not counting the pois with multiple
...
: Data then becomes:
:
: open: 20111020_12_30, 20111021_12_30, 20111022_07_30, ...
: close: 20111020_20_00, 20111021_26_30, 20111022_12_30, ...

aw hell ... i assumed you needed to support an arbitrarily large number of special case open+close pairs per doc.

if you only have to support a fixed number (N=100) of open+close values you could just have N*2 date fields and a BooleanQuery containing N 2-clause BooleanQueries, each containing range queries against one pair of your date fields. ie...

((+open00:[* TO NOW] +close00:[NOW+3HOURS TO *])
 (+open01:[* TO NOW] +close01:[NOW+3HOURS TO *])
 (+open02:[* TO NOW] +close02:[NOW+3HOURS TO *])
 ...etc...
 (+open99:[* TO NOW] +close99:[NOW+3HOURS TO *]))

...for a lot of indexes, 100 clauses is small potatoes as far as number of boolean clauses go, especially if many of them are going to short circuit out because there won't be any matches at all.

: Alternatively, how would you compare your suggested approach with the
: approach by David Smiley using either SOLR-2155 (Geohash prefix query
: filter) or LSP:
: https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13115244#comment-13115244.
: That would work right now, and the LSP-approach seems pretty elegant to me.

I'm afraid i'm totally ignorant of how the LSP stuff works so i can't really comment there.

If i understand what you mean about mapping the open/close concepts to lat/lon concepts, then i can see how it would be useful for multiple pairwise (absolute) date ranges, but i'm not really sure how you would deal with the diff open+close pairs per day (or on diff days of the week, or special days of the year) using the lat+lon conceptual model ... I guess if the LSP stuff supports arbitrary N-dimensional spaces then you could model day of week as a dimension .. but it still seems like you'd need multiple fields for the special case days, right?

How it would compare performance wise: no idea.

-Hoss
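
Writing out 100 nearly identical clauses by hand is error-prone, so the query string would normally be generated. A minimal Python sketch follows; the field naming (openNN / closeNN) comes from the message above, while the function name open_window_query and its parameters are hypothetical.

```python
def open_window_query(n_slots, start="NOW", end="NOW+3HOURS"):
    """Build one two-clause sub-query per open/close field pair.

    Field names are zero-padded to a common width (open00 ... open99 for N=100).
    """
    width = len(str(n_slots - 1))
    clauses = [
        f"(+open{i:0{width}d}:[* TO {start}] +close{i:0{width}d}:[{end} TO *])"
        for i in range(n_slots)
    ]
    return "(" + " ".join(clauses) + ")"


print(open_window_query(2))
# ((+open0:[* TO NOW] +close0:[NOW+3HOURS TO *]) (+open1:[* TO NOW] +close1:[NOW+3HOURS TO *]))
```

Since the top-level clauses carry no + prefix they are optional, so a document matches as soon as any single open/close pair covers the window.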
-
Re: multiple dateranges/timeslots per doc: modeling openinghours.Geert-Jan Brits 2011-10-11, 09:43
On 11 October 2011 at 03:21, Chris Hostetter wrote:

> : Conceptually
> : the Join-approach looks like it would work from paper, although I'm not a
> : big fan of introducing a lot of complexity to the frontend / querying part
> : of the solution.
>
> you lost me there -- i don't see how using join would impact the front end
> / query side at all. your query clients would never even know that a join
> had happened (your indexing code would certianly have to know about
> creating those special case docs to join against obviuosly)
>
> : As an alternative, what about using your fieldMaskingSpanQuery-approach
> : solely (without the JOIN-approach) and encode open/close on a per day
> : basis?
> : I didn't mention it, but I 'only' need 100 days of data, which would lead to
> : 100 open and 100 close values, not counting the pois with multiple
> ...
> : Data then becomes:
> :
> : open: 20111020_12_30, 20111021_12_30, 20111022_07_30, ...
> : close: 20111020_20_00, 20111021_26_30, 20111022_12_30, ...
>
> aw hell ... i assumed you needed to suport an arbitrarily large number
> of special case open+close pairs per doc.

I didn't express myself well. A POI can have multiple open+close pairs per day, but each night I only index the coming 100 days. So MOST POIs will have 100 open+close pairs (one set of opening hours per day), but some have more.

> if you only have to support a fix value (N=100) open+close values you
> could just have N*2 date fields and a BooleanQuery containing N 2-clause
> BooleanQueries contain ranging queries against each pair of your date
> fields. ie...
>
> ((+open00:[* TO NOW] +close00:[NOW+3HOURS TO *])
> (+open01:[* TO NOW] +close01:[NOW+3HOURS TO *])
> (+open02:[* TO NOW] +close02:[NOW+3HOURS TO *])
> ...etc...
> (+open99:[* TO NOW] +close99:[NOW+3HOURS TO *]))
>
> ...for a lot of indexes, 100 clauses is small potatoes as far as number of
> boolean clauses go, especially if many of them are going to short circut
> out because there won't be any matches at all.

Given that I need multiple open+close pairs per day, this can't be used directly. However, when setting a logical upper bound on the maximum number of opening hours per day (say 3), which would be possible, this could be extended to:

open00 = day0 --> open00-0 = day0 timeslot 0, open00-1 = day0 timeslot 1, etc.

So:

((+open00-0:[* TO NOW] +close00-0:[NOW+3HOURS TO *])
 (+open00-1:[* TO NOW] +close00-1:[NOW+3HOURS TO *])
 (+open00-2:[* TO NOW] +close00-2:[NOW+3HOURS TO *])
 (+open01-0:[* TO NOW] +close01-0:[NOW+3HOURS TO *])
 (+open01-1:[* TO NOW] +close01-1:[NOW+3HOURS TO *])
 (+open01-2:[* TO NOW] +close01-2:[NOW+3HOURS TO *])
 ...etc...
 (+open99-2:[* TO NOW] +close99-2:[NOW+3HOURS TO *]))

This would need 2*3*100 = 600 dynamic fields to cover the opening hours. You mention this is peanuts for constructing a BooleanQuery, but how about memory consumption? I'm particularly concerned about the Lucene FieldCache getting populated for each of the 600 fields. (I had some nasty OOM experiences with that in the past. 2-3 years ago memory consumption of the Lucene FieldCache couldn't be controlled; I'm not sure how that is now, to be honest.)

I will not be sorting on any of the 600 dynamic fields, by the way. Instead I will only use them as part of the above BooleanQuery, which I will likely define as a filter query. Just to be sure: in this situation, the Lucene FieldCache won't be touched, correct? If so, this will probably be a good, workable solution!

> : Alternatively, how would you compare your suggested approach with the
> : approach by David Smiley using either SOLR-2155 (Geohash prefix query
> : filter) or LSP:
> : https://issues.apache.org/jira/browse/SOLR-2155?focusedCommentId=13115244#comment-13115244.
> : That would work right now, and the LSP-approach seems pretty elegant to me.
>
> I'm afraid i'm totally ignorant of how the LSP stuff works so i can't
> really comment there.
>
> If i understand what you mean about mapping the open/close concepts to ...

I planned to do the following using LSP (with help from David):

Each <open,close> tuple would be modeled as a point (x,y), with x = open and y = close. So a POI can have many (100 or more) points, each representing an <open,close> tuple.

Given 100 days of lookahead and a granularity of 5 minutes, we can map dimensions x and y to [0,30000].

E.g.:

- indexing starts at / baseline is at: 2011-11-01:0000
- poi open: 2011-11-08:1800 - poi close: 2011-11-09:0300
- (query) user visit: 2011-11-08:2300 - user depart: 2011-11-09:0200

Would map to:

- poi open: 2232 - poi close: 2340 = point(x,y) = (2232,2340)
- (query) user visit: 2292 - user depart: 2328 = bbox filter with the ranges x:[0 TO 2292], y:[2328 TO 30000]

All POIs are returned which have one or more points within the bbox.

Both approaches seem pretty good to me. I'll be testing both soon. Thanks!

Geert-Jan
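
The mapping from timestamps to 5-minute slot numbers, and the bounding-box test on the resulting points, can be checked with a short Python sketch. The baseline, granularity and example times are taken from the message above; to_slot and open_during are hypothetical helper names, and the plain Python comparison merely stands in for the spatial filter. Counted strictly from the 2011-11-01:0000 baseline, the example slot becomes the point (2232, 2340) and the query bounds become 2292 and 2328; adding the same constant offset to every value would not change the outcome of the test.

```python
from datetime import datetime

BASELINE = datetime(2011, 11, 1, 0, 0)  # start of the indexed 100-day window
SLOT_MINUTES = 5                         # granularity from the message


def to_slot(moment):
    """Number of whole 5-minute slots between BASELINE and moment."""
    return int((moment - BASELINE).total_seconds() // (SLOT_MINUTES * 60))


def open_during(points, visit, depart):
    """True if some (open, close) point lies in the box x <= visit, y >= depart."""
    return any(x <= visit and y >= depart for x, y in points)


poi = [(to_slot(datetime(2011, 11, 8, 18, 0)), to_slot(datetime(2011, 11, 9, 3, 0)))]
visit = to_slot(datetime(2011, 11, 8, 23, 0))
depart = to_slot(datetime(2011, 11, 9, 2, 0))

print(poi, visit, depart)               # [(2232, 2340)] 2292 2328
print(open_during(poi, visit, depart))  # True
```

Because each point carries its own open and close value, a POI with several points cannot produce the cross-slot false positives of two independent multivalued fields.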
-
Re: multiple dateranges/timeslots per doc: modeling openinghours.Chris Hostetter 2011-11-01, 23:12
: This would need 2*3*100 = 600 dynamicfields to cover the openinghours. You
: mention this is peanuts for constructing a booleanquery, but how about
: memory consumption?
: I'm particularly concerned about the Lucene FieldCache getting populated for
: each of the 600 fields. (Since I had some nasty OOM experiences with that in
: the past. 2-3 years ago memory consumption of Lucene FieldCache couldn't be
: controlled, I'm not sure how that is now to be honest)
:
: I will not be sorting on any of the 600 dynamicfields btw. Instead I will
: only use them as part of the above booleanquery, which I will likely define
: as a Filter Query.
: Just to be sure, in this situation, Lucene FieldCache won't be touched,
: correct? If so, this will probably be a good workable solution!

correct. searching on fields doesn't use the FieldCache (unless you are doing a function query - you aren't in this case) so the memory usage of FieldCache wouldn't be a factor here at all.

-Hoss