|
Grant Ingersoll
2010-04-14, 15:00
Chris Male
2010-04-14, 15:06
Yonik Seeley
2010-04-14, 15:23
Helleringer, Nicolas
2010-04-14, 15:28
Grant Ingersoll
2010-04-14, 16:05
Helleringer, Nicolas
2010-04-14, 15:23
Grant Ingersoll
2010-04-14, 16:07
Chris Male
2010-04-14, 16:12
Yonik Seeley
2010-04-14, 16:24
Chris Male
2010-04-14, 16:32
Helleringer, Nicolas
2010-04-14, 16:36
Helleringer, Nicolas
2010-04-14, 16:42
Grant Ingersoll
2010-04-14, 17:17
Grant Ingersoll
2010-04-14, 17:12
|
-
[SPATIAL] Best Fit CalculationGrant Ingersoll 2010-04-14, 15:00
LUCENE-2359 changed the best fit calculation. I admit, I'm not entirely certain which one is right, so I thought we should step back and talk about what we are trying to achieve.
Please correct me if/where I am wrong. Looking at the problem of tiers/tiles/grids in general, we are taking a sphere, projecting it into a 2D plane. Next, we are dividing up the plane into nested grids/tiers. Each tier contains 2^tier id boxes. Thus, tier level 2 divides the earth up into 4 boxes. 2^15 = 32,768 boxes. We then, for each box, give it a unique label which then becomes the token that we index. During indexing, we typically will index many tiers, i.e. tiers 4 through 15. During search, we take in a lat/lon and a radius. The goal is to do a search using the fewest terms possible. Thus, we need to pick the tier that contains/covers the radius with the fewest number of boxes so that we can enumerate a very small number of documents. Thus, we need to calculate the best fit, which is a method inside of the CartesianTierPlotter. In the old way, we did: bf = min( 15, ceil(log2( earth_circumference / ( ( miles/2) - sqrt( (miles/2)^2 / 2 ) ) ) + 1 ) // we won't go higher than 15 for accuracy reasons The new way is: bf' = ceil ( log2( earth_circumference / ( 2 * miles ) ) ) These are obviously two different calculations, never mind the min(15) issue, we can easily resolve that one. AFAICT, the new way is much less accurate, but will likely be faster. So, which is right? Unfortunately, I find almost zero documentation on this, probably b/c the nomenclature is off, but... -Grant --------------------------------------------------------------------- +
Grant Ingersoll 2010-04-14, 15:00
-
Re: [SPATIAL] Best Fit CalculationChris Male 2010-04-14, 15:06
Hi,
My understanding of the benefits of the new algorithm is that it means a lower tier level resulting in fewer boxes, but more documents inside those boxes that are outside of the search radius. While having fewer boxes means fewer term queries to make against the index, more documents means more costly calculations to filter out those extraneous documents. For those doing just Cartesian Tier filtering it seems like the new approach is a win, but for those doing distance calculations on those documents passing the filter, it seems to come at a cost. Cheers Chris On Wed, Apr 14, 2010 at 5:00 PM, Grant Ingersoll <[EMAIL PROTECTED]>wrote: > LUCENE-2359 changed the best fit calculation. I admit, I'm not entirely > certain which one is right, so I thought we should step back and talk about > what we are trying to achieve. > > Please correct me if/where I am wrong. > > Looking at the problem of tiers/tiles/grids in general, we are taking a > sphere, projecting it into a 2D plane. Next, we are dividing up the plane > into nested grids/tiers. Each tier contains 2^tier id boxes. Thus, tier > level 2 divides the earth up into 4 boxes. 2^15 = 32,768 boxes. We then, > for each box, give it a unique label which then becomes the token that we > index. During indexing, we typically will index many tiers, i.e. tiers 4 > through 15. > > During search, we take in a lat/lon and a radius. The goal is to do a > search using the fewest terms possible. Thus, we need to pick the tier that > contains/covers the radius with the fewest number of boxes so that we can > enumerate a very small number of documents. Thus, we need to calculate the > best fit, which is a method inside of the CartesianTierPlotter. > > In the old way, we did: > > bf = min( 15, ceil(log2( earth_circumference / ( ( miles/2) - sqrt( > (miles/2)^2 / 2 ) ) ) + 1 ) // we won't go higher than 15 for accuracy > reasons > > The new way is: > > bf' = ceil ( log2( earth_circumference / ( 2 * miles ) ) ) > > These are obviously two different calculations, never mind the min(15) > issue, we can easily resolve that one. > > AFAICT, the new way is much less accurate, but will likely be faster. > > So, which is right? > > Unfortunately, I find almost zero documentation on this, probably b/c the > nomenclature is off, but... > > -Grant > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > -- Chris Male | Software Developer | JTeam BV.| www.jteam.nl +
Chris Male 2010-04-14, 15:06
-
Re: [SPATIAL] Best Fit CalculationYonik Seeley 2010-04-14, 15:23
On Wed, Apr 14, 2010 at 11:06 AM, Chris Male <[EMAIL PROTECTED]> wrote:
> While having fewer boxes means fewer term queries to make against the index, > more documents means more costly calculations to filter out those extraneous > documents. Filtering out documents (greater selectivity) seems like it should be the primary goal. But perhaps the problem could be parameterized? What if you gave a minimum tile size that you wanted to use? Then there would be only one correct answer I believe? So the problem would essentially be boiled down to this: - find the minimum area that still encompasses the entire circle - minimize the number of tiles - don't use tiles smaller than minTile That minTile param allows you to trade off between filtering accuracy and faster tile filtering. Without the param (or until it can be implemented) the correct approach seems like the above, without a minTile. This sounds to me like the old approach is correct. -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague --------------------------------------------------------------------- +
Yonik Seeley 2010-04-14, 15:23
-
Re: [SPATIAL] Best Fit CalculationHelleringer, Nicolas 2010-04-14, 15:28
>
> That minTile param allows you to trade off between filtering accuracy > and faster tile filtering. Without the param (or until it can be > implemented) the correct approach seems like the above, without a > minTile. This sounds to me like the old approach is correct. > minTier and maxTier at CartesianTierPlotter level have been commited into the trunk today : see https://issues.apache.org/jira/browse/LUCENE-2184 <https://issues.apache.org/jira/browse/LUCENE-2184>Regards, Nicolas > > -Yonik > Apache Lucene Eurocon 2010 > 18-21 May 2010 | Prague > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > +
Helleringer, Nicolas 2010-04-14, 15:28
-
Re: [SPATIAL] Best Fit CalculationGrant Ingersoll 2010-04-14, 16:05
On Apr 14, 2010, at 11:28 AM, Helleringer, Nicolas wrote: > That minTile param allows you to trade off between filtering accuracy > and faster tile filtering. Without the param (or until it can be > implemented) the correct approach seems like the above, without a > minTile. This sounds to me like the old approach is correct. > minTier and maxTier at CartesianTierPlotter level have been commited into the trunk today : see https://issues.apache.org/jira/browse/LUCENE-2184 Actually, I changed those, as the CartTierPlotter doesn't need them. > > Regards, > > Nicolas > > > -Yonik > Apache Lucene Eurocon 2010 > 18-21 May 2010 | Prague > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > -------------------------- Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search +
Grant Ingersoll 2010-04-14, 16:05
-
Re: [SPATIAL] Best Fit CalculationHelleringer, Nicolas 2010-04-14, 15:23
I ll try to find a little bit of time tonight to make a sample data set go
through the two calculations to see the differences. I ll make a summary table. I ll comment the issue with some comments on 'my' version of the alogrithm right now. Nicolas 2010/4/14 Chris Male <[EMAIL PROTECTED]> > Hi, > > My understanding of the benefits of the new algorithm is that it means a > lower tier level resulting in fewer boxes, but more documents inside those > boxes that are outside of the search radius. > > While having fewer boxes means fewer term queries to make against the > index, more documents means more costly calculations to filter out those > extraneous documents. > > For those doing just Cartesian Tier filtering it seems like the new > approach is a win, but for those doing distance calculations on those > documents passing the filter, it seems to come at a cost. > > Cheers > Chris > > > On Wed, Apr 14, 2010 at 5:00 PM, Grant Ingersoll <[EMAIL PROTECTED]>wrote: > >> LUCENE-2359 changed the best fit calculation. I admit, I'm not entirely >> certain which one is right, so I thought we should step back and talk about >> what we are trying to achieve. >> >> Please correct me if/where I am wrong. >> >> Looking at the problem of tiers/tiles/grids in general, we are taking a >> sphere, projecting it into a 2D plane. Next, we are dividing up the plane >> into nested grids/tiers. Each tier contains 2^tier id boxes. Thus, tier >> level 2 divides the earth up into 4 boxes. 2^15 = 32,768 boxes. We then, >> for each box, give it a unique label which then becomes the token that we >> index. During indexing, we typically will index many tiers, i.e. tiers 4 >> through 15. >> >> During search, we take in a lat/lon and a radius. The goal is to do a >> search using the fewest terms possible. Thus, we need to pick the tier that >> contains/covers the radius with the fewest number of boxes so that we can >> enumerate a very small number of documents. Thus, we need to calculate the >> best fit, which is a method inside of the CartesianTierPlotter. >> >> In the old way, we did: >> >> bf = min( 15, ceil(log2( earth_circumference / ( ( miles/2) - sqrt( >> (miles/2)^2 / 2 ) ) ) + 1 ) // we won't go higher than 15 for accuracy >> reasons >> >> The new way is: >> >> bf' = ceil ( log2( earth_circumference / ( 2 * miles ) ) ) >> >> These are obviously two different calculations, never mind the min(15) >> issue, we can easily resolve that one. >> >> AFAICT, the new way is much less accurate, but will likely be faster. >> >> So, which is right? >> >> Unfortunately, I find almost zero documentation on this, probably b/c the >> nomenclature is off, but... >> >> -Grant >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> > > > -- > Chris Male | Software Developer | JTeam BV.| www.jteam.nl > +
Helleringer, Nicolas 2010-04-14, 15:23
-
Re: [SPATIAL] Best Fit CalculationGrant Ingersoll 2010-04-14, 16:07
On Apr 14, 2010, at 11:06 AM, Chris Male wrote: > Hi, > > My understanding of the benefits of the new algorithm is that it means a lower tier level resulting in fewer boxes, but more documents inside those boxes that are outside of the search radius. > > While having fewer boxes means fewer term queries to make against the index, more documents means more costly calculations to filter out those extraneous documents. > > For those doing just Cartesian Tier filtering it seems like the new approach is a win, but for those doing distance calculations on those documents passing the filter, it seems to come at a cost. Currently, this is only used for filtering. AIUI, Tiers aren't really that useful for distance calculations, are they? After all, all you have is a box id and you'd have to reverse out the calc of that to be able to calc a distance, no? Perhaps I'm missing something. I'm not sure, however, that it is a win for filtering. It seems like you end up including docs in the result set that should be in there. I'll wait for Nicolas' summary table, but I'm inclined to revert and then someone can refactor if they want to offer alternate implementations. -Grant --------------------------------------------------------------------- +
Grant Ingersoll 2010-04-14, 16:07
-
Re: [SPATIAL] Best Fit CalculationChris Male 2010-04-14, 16:12
Hi,
On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll <[EMAIL PROTECTED]>wrote: > > On Apr 14, 2010, at 11:06 AM, Chris Male wrote: > > > Hi, > > > > My understanding of the benefits of the new algorithm is that it means a > lower tier level resulting in fewer boxes, but more documents inside those > boxes that are outside of the search radius. > > > > While having fewer boxes means fewer term queries to make against the > index, more documents means more costly calculations to filter out those > extraneous documents. > > > > For those doing just Cartesian Tier filtering it seems like the new > approach is a win, but for those doing distance calculations on those > documents passing the filter, it seems to come at a cost. > > Currently, this is only used for filtering. AIUI, Tiers aren't really that > useful for distance calculations, are they? After all, all you have is a > box id and you'd have to reverse out the calc of that to be able to calc a > distance, no? Perhaps I'm missing something. > > How Spatial Lucene currently works (or at least one of the ways it was designed to work), is using a 2 step filtering process. Step 1 is the Cartesian Tier filtering. The resulting set of Documents is then passed on through to Step 2 which then calculates the distance from each Document to the search centre. If the distance is greater than the radius, the Document is filtered out. This means that after both filtering steps you have only those Documents that are in the search radius. How this impacts this algorithm choice is that the more Documents the pass through Step 1, the more calculations that have to be done in Step 2. > I'm not sure, however, that it is a win for filtering. It seems like you > end up including docs in the result set that should be in there. > > I'll wait for Nicolas' summary table, but I'm inclined to revert and then > someone can refactor if they want to offer alternate implementations. > > -Grant > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > -- Chris Male | Software Developer | JTeam BV.| www.jteam.nl +
Chris Male 2010-04-14, 16:12
-
Re: [SPATIAL] Best Fit CalculationYonik Seeley 2010-04-14, 16:24
On Wed, Apr 14, 2010 at 12:12 PM, Chris Male <[EMAIL PROTECTED]> wrote:
> On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll <[EMAIL PROTECTED]> >> On Apr 14, 2010, at 11:06 AM, Chris Male wrote: >> > For those doing just Cartesian Tier filtering it seems like the new >> > approach is a win, but for those doing distance calculations on those >> > documents passing the filter, it seems to come at a cost. >> >> Currently, this is only used for filtering. AIUI, Tiers aren't really >> that useful for distance calculations, are they? After all, all you have is >> a box id and you'd have to reverse out the calc of that to be able to calc a >> distance, no? Perhaps I'm missing something. >> > > How Spatial Lucene currently works (or at least one of the ways it was > designed to work), is using a 2 step filtering process. Step 1 is the > Cartesian Tier filtering. The resulting set of Documents is then passed on > through to Step 2 which then calculates the distance from each Document to > the search centre. IMO, being able to just do a tier or bounding box filter is also useful (step 1). One example is if someone is going to sort by distance anyway... they may want to do only a bounding-box type filter for greater performance. We should keep both concepts (bounding box filter and distance filter) regardless of how the distance filter is implemented. -Yonik Apache Lucene Eurocon 2010 18-21 May 2010 | Prague --------------------------------------------------------------------- +
Yonik Seeley 2010-04-14, 16:24
-
Re: [SPATIAL] Best Fit CalculationChris Male 2010-04-14, 16:32
On Wed, Apr 14, 2010 at 6:24 PM, Yonik Seeley <[EMAIL PROTECTED]>wrote:
> On Wed, Apr 14, 2010 at 12:12 PM, Chris Male <[EMAIL PROTECTED]> wrote: > > On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll <[EMAIL PROTECTED]> > >> On Apr 14, 2010, at 11:06 AM, Chris Male wrote: > >> > For those doing just Cartesian Tier filtering it seems like the new > >> > approach is a win, but for those doing distance calculations on those > >> > documents passing the filter, it seems to come at a cost. > >> > >> Currently, this is only used for filtering. AIUI, Tiers aren't really > >> that useful for distance calculations, are they? After all, all you > have is > >> a box id and you'd have to reverse out the calc of that to be able to > calc a > >> distance, no? Perhaps I'm missing something. > >> > > > > How Spatial Lucene currently works (or at least one of the ways it was > > designed to work), is using a 2 step filtering process. Step 1 is the > > Cartesian Tier filtering. The resulting set of Documents is then passed > on > > through to Step 2 which then calculates the distance from each Document > to > > the search centre. > > IMO, being able to just do a tier or bounding box filter is also > useful (step 1). > One example is if someone is going to sort by distance anyway... they > may want to do only a bounding-box type filter for greater > performance. > > We should keep both concepts (bounding box filter and distance filter) > regardless of how the distance filter is implemented. > Definitely. > > -Yonik > Apache Lucene Eurocon 2010 > 18-21 May 2010 | Prague > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > -- Chris Male | Software Developer | JTeam BV.| www.jteam.nl +
Chris Male 2010-04-14, 16:32
-
Re: [SPATIAL] Best Fit CalculationHelleringer, Nicolas 2010-04-14, 16:36
Here are the summary tables :
First a table to remind metrics on the Tiers : Tile Level TierLegnth TierBoxes TileLength (miles) 0 1 1 24902 1 2 4 12451 2 4 16 6225,5 3 8 64 3112,75 4 16 256 1556,375 5 32 1024 778,1875 6 64 4096 389,09375 7 128 16384 194,546875 8 256 65536 97,2734375 9 512 262144 48,63671875 10 1024 1048576 24,31835938 11 2048 4194304 12,15917969 12 4096 16777216 6,079589844 13 8192 67108864 3,039794922 14 16384 268435456 1,519897461 15 32768 1073741824 0,75994873 Then the comparaison table between legacy and new bestFit : Radius (miles) legacy bestFit legacy bestFit TileLength legacy bestFit max number of Box to fetch new bestFit new bestFit TileLength new bestFit number of Box to fetch 1 18 0,75994873 9 14 1,519897461 4 5 16 0,75994873 64 12 6,079589844 4 10 15 0,75994873 225 11 12,15917969 4 25 13 3,039794922 100 9 24,31835938 9 50 12 6,079589844 100 8 97,2734375 4 100 11 12,15917969 100 7 194,546875 4 250 10 24,31835938 144 6 389,09375 4 500 9 48,63671875 144 5 778,1875 4 1000 8 97,2734375 144 4 1556,375 4 2500 7 194,546875 196 3 3112,75 4 5000 6 389,09375 196 2 6225,5 4 10000 5 778,1875 196 1 12451 4 I hope mailers will keep the formating ... If not I shall post on JIRA. Formulas : TileLength is 24902 (earth circumference) / TierLength bestFit formulas as summarized by Grant in his email. number of box to fetch : pow(ceil(TileLength/Radius)+1,2) => TileLength/Radius is for how many tiles are needed to cover the radius, +1 is because you are not always well aligned, the pow(X,2) because there is two directions/axis Best regards, Nicolas 2010/4/14 Chris Male <[EMAIL PROTECTED]> > Hi, > > On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll <[EMAIL PROTECTED]>wrote: > >> >> On Apr 14, 2010, at 11:06 AM, Chris Male wrote: >> >> > Hi, >> > >> > My understanding of the benefits of the new algorithm is that it means a >> lower tier level resulting in fewer boxes, but more documents inside those >> boxes that are outside of the search radius. >> > >> > While having fewer boxes means fewer term queries to make against the >> index, more documents means more costly calculations to filter out those >> extraneous documents. >> > >> > For those doing just Cartesian Tier filtering it seems like the new >> approach is a win, but for those doing distance calculations on those >> documents passing the filter, it seems to come at a cost. >> >> Currently, this is only used for filtering. AIUI, Tiers aren't really >> that useful for distance calculations, are they? After all, all you have is >> a box id and you'd have to reverse out the calc of that to be able to calc a >> distance, no? Perhaps I'm missing something. >> >> > How Spatial Lucene currently works (or at least one of the ways it was > designed to work), is using a 2 step filtering process. Step 1 is the > Cartesian Tier filtering. The resulting set of Documents is then passed on > through to Step 2 which then calculates the distance from each Document to > the search centre. If the distance is greater than the radius, the Document > is filtered out. This means that after both filtering steps you have only > those Documents that are in the search radius. > > How this impacts this algorithm choice is that the more Documents the pass > through Step 1, the more calculations that have to be done in Step 2. > > >> I'm not sure, however, that it is a win for filtering. It seems like you >> end up including docs in the result set that should be in there. >> >> I'll wait for Nicolas' summary table, but I'm inclined to revert and then >> someone can refactor if they want to offer alternate implementations. >> >> -Grant >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: [EMAIL PROTECTED] >> For additional commands, e-mail: [EMAIL PROTECTED] >> >> > > > -- > Chris Male | Software Developer | JTeam BV.| www.jteam.nl > +
Helleringer, Nicolas 2010-04-14, 16:36
-
Re: [SPATIAL] Best Fit CalculationHelleringer, Nicolas 2010-04-14, 16:42
Tables are well on JIRA : https://issues.apache.org/jira/browse/LUCENE-2359
<https://issues.apache.org/jira/browse/LUCENE-2359>Nicolas 2010/4/14 Helleringer, Nicolas <[EMAIL PROTECTED]> > Here are the summary tables : > > First a table to remind metrics on the Tiers : > Tile Level TierLegnth TierBoxes TileLength (miles) 0 1 1 24902 1 2 4 12451 > 2 4 16 6225,5 3 8 64 3112,75 4 16 256 1556,375 5 32 1024 778,1875 6 64 4096 > 389,09375 7 128 16384 194,546875 8 256 65536 97,2734375 9 512 262144 > 48,63671875 10 1024 1048576 24,31835938 11 2048 4194304 12,15917969 12 4096 > 16777216 6,079589844 13 8192 67108864 3,039794922 14 16384 268435456 > 1,519897461 15 32768 1073741824 0,75994873 > > > Then the comparaison table between legacy and new bestFit : > Radius (miles) legacy bestFit legacy bestFit TileLength legacy bestFit max > number of Box to fetch new bestFit new bestFit TileLength new bestFit number > of Box to fetch 1 18 0,75994873 9 14 1,519897461 4 5 16 0,75994873 64 12 > 6,079589844 4 10 15 0,75994873 225 11 12,15917969 4 25 13 3,039794922 100 9 > 24,31835938 9 50 12 6,079589844 100 8 97,2734375 4 100 11 12,15917969 100 7 > 194,546875 4 250 10 24,31835938 144 6 389,09375 4 500 9 48,63671875 144 5 > 778,1875 4 1000 8 97,2734375 144 4 1556,375 4 2500 7 194,546875 196 3 > 3112,75 4 5000 6 389,09375 196 2 6225,5 4 10000 5 778,1875 196 1 12451 4 > > I hope mailers will keep the formating ... > > > > > > If not I shall post on JIRA. > > Formulas : > TileLength is 24902 (earth circumference) / TierLength > bestFit formulas as summarized by Grant in his email. > number of box to fetch : pow(ceil(TileLength/Radius)+1,2) => > TileLength/Radius is for how many tiles are needed to cover the radius, +1 > is because you are not always well aligned, the pow(X,2) because there is > two directions/axis > > Best regards, > > Nicolas > > 2010/4/14 Chris Male <[EMAIL PROTECTED]> > > Hi, >> >> On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll <[EMAIL PROTECTED]>wrote: >> >>> >>> On Apr 14, 2010, at 11:06 AM, Chris Male wrote: >>> >>> > Hi, >>> > >>> > My understanding of the benefits of the new algorithm is that it means >>> a lower tier level resulting in fewer boxes, but more documents inside those >>> boxes that are outside of the search radius. >>> > >>> > While having fewer boxes means fewer term queries to make against the >>> index, more documents means more costly calculations to filter out those >>> extraneous documents. >>> > >>> > For those doing just Cartesian Tier filtering it seems like the new >>> approach is a win, but for those doing distance calculations on those >>> documents passing the filter, it seems to come at a cost. >>> >>> Currently, this is only used for filtering. AIUI, Tiers aren't really >>> that useful for distance calculations, are they? After all, all you have is >>> a box id and you'd have to reverse out the calc of that to be able to calc a >>> distance, no? Perhaps I'm missing something. >>> >>> >> How Spatial Lucene currently works (or at least one of the ways it was >> designed to work), is using a 2 step filtering process. Step 1 is the >> Cartesian Tier filtering. The resulting set of Documents is then passed on >> through to Step 2 which then calculates the distance from each Document to >> the search centre. If the distance is greater than the radius, the Document >> is filtered out. This means that after both filtering steps you have only >> those Documents that are in the search radius. >> >> How this impacts this algorithm choice is that the more Documents the pass >> through Step 1, the more calculations that have to be done in Step 2. >> >> >>> I'm not sure, however, that it is a win for filtering. It seems like you >>> end up including docs in the result set that should be in there. >>> >>> I'll wait for Nicolas' summary table, but I'm inclined to revert and then >>> someone can refactor if they want to offer alternate implementations. >>> >>> -Grant >>> -------------- +
Helleringer, Nicolas 2010-04-14, 16:42
-
Re: [SPATIAL] Best Fit CalculationGrant Ingersoll 2010-04-14, 17:17
Thanks. I added my comment on the issue. I think we should revert and then someone can put up a patch to make this pluggable. As it stands, this Best Fit calculation has nothing to do with the CartesianTierPlotter anyway, so we could refactor it pretty easily.
-Grant On Apr 14, 2010, at 12:42 PM, Helleringer, Nicolas wrote: > Tables are well on JIRA : https://issues.apache.org/jira/browse/LUCENE-2359 > > Nicolas > > 2010/4/14 Helleringer, Nicolas <[EMAIL PROTECTED]> > Here are the summary tables : > > First a table to remind metrics on the Tiers : > Tile Level TierLegnth TierBoxes TileLength (miles) 0 1 1 24902 1 2 4 12451 2 4 16 6225,5 3 8 64 3112,75 4 16 256 1556,375 5 32 1024 778,1875 6 64 4096 389,09375 7 128 16384 194,546875 8 256 65536 97,2734375 9 512 262144 48,63671875 10 1024 1048576 24,31835938 11 2048 4194304 12,15917969 12 4096 16777216 6,079589844 13 8192 67108864 3,039794922 14 16384 268435456 1,519897461 15 32768 1073741824 0,75994873 > > > Then the comparaison table between legacy and new bestFit : > Radius (miles) legacy bestFit legacy bestFit TileLength legacy bestFit max number of Box to fetch new bestFit new bestFit TileLength new bestFit number of Box to fetch 1 18 0,75994873 9 14 1,519897461 4 5 16 0,75994873 64 12 6,079589844 4 10 15 0,75994873 225 11 12,15917969 4 25 13 3,039794922 100 9 24,31835938 9 50 12 6,079589844 100 8 97,2734375 4 100 11 12,15917969 100 7 194,546875 4 250 10 24,31835938 144 6 389,09375 4 500 9 48,63671875 144 5 778,1875 4 1000 8 97,2734375 144 4 1556,375 4 2500 7 194,546875 196 3 3112,75 4 5000 6 389,09375 196 2 6225,5 4 10000 5 778,1875 196 1 12451 4 > > I hope mailers will keep the formating ... > > > > > > If not I shall post on JIRA. > > Formulas : > TileLength is 24902 (earth circumference) / TierLength > bestFit formulas as summarized by Grant in his email. > number of box to fetch : pow(ceil(TileLength/Radius)+1,2) => TileLength/Radius is for how many tiles are needed to cover the radius, +1 is because you are not always well aligned, the pow(X,2) because there is two directions/axis > > Best regards, > > Nicolas > > 2010/4/14 Chris Male <[EMAIL PROTECTED]> > > Hi, > > On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > On Apr 14, 2010, at 11:06 AM, Chris Male wrote: > > > Hi, > > > > My understanding of the benefits of the new algorithm is that it means a lower tier level resulting in fewer boxes, but more documents inside those boxes that are outside of the search radius. > > > > While having fewer boxes means fewer term queries to make against the index, more documents means more costly calculations to filter out those extraneous documents. > > > > For those doing just Cartesian Tier filtering it seems like the new approach is a win, but for those doing distance calculations on those documents passing the filter, it seems to come at a cost. > > Currently, this is only used for filtering. AIUI, Tiers aren't really that useful for distance calculations, are they? After all, all you have is a box id and you'd have to reverse out the calc of that to be able to calc a distance, no? Perhaps I'm missing something. > > > How Spatial Lucene currently works (or at least one of the ways it was designed to work), is using a 2 step filtering process. Step 1 is the Cartesian Tier filtering. The resulting set of Documents is then passed on through to Step 2 which then calculates the distance from each Document to the search centre. If the distance is greater than the radius, the Document is filtered out. This means that after both filtering steps you have only those Documents that are in the search radius. > > How this impacts this algorithm choice is that the more Documents the pass through Step 1, the more calculations that have to be done in Step 2. > > I'm not sure, however, that it is a win for filtering. It seems like you end up including docs in the result set that should be in there. > > I'll wait for Nicolas' summary table, but I'm inclined to revert and then someone can refactor if they want to offer alternate implementations. Grant Ingersoll http://www.lucidimagination.com/ Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search +
Grant Ingersoll 2010-04-14, 17:17
-
Re: [SPATIAL] Best Fit CalculationGrant Ingersoll 2010-04-14, 17:12
On Apr 14, 2010, at 12:12 PM, Chris Male wrote: > Hi, > > On Wed, Apr 14, 2010 at 6:07 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote: > > On Apr 14, 2010, at 11:06 AM, Chris Male wrote: > > > Hi, > > > > My understanding of the benefits of the new algorithm is that it means a lower tier level resulting in fewer boxes, but more documents inside those boxes that are outside of the search radius. > > > > While having fewer boxes means fewer term queries to make against the index, more documents means more costly calculations to filter out those extraneous documents. > > > > For those doing just Cartesian Tier filtering it seems like the new approach is a win, but for those doing distance calculations on those documents passing the filter, it seems to come at a cost. > > Currently, this is only used for filtering. AIUI, Tiers aren't really that useful for distance calculations, are they? After all, all you have is a box id and you'd have to reverse out the calc of that to be able to calc a distance, no? Perhaps I'm missing something. > > > How Spatial Lucene currently works (or at least one of the ways it was designed to work), is using a 2 step filtering process. Step 1 is the Cartesian Tier filtering. The resulting set of Documents is then passed on through to Step 2 which then calculates the distance from each Document to the search centre. If the distance is greater than the radius, the Document is filtered out. This means that after both filtering steps you have only those Documents that are in the search radius. > > How this impacts this algorithm choice is that the more Documents the pass through Step 1, the more calculations that have to be done in Step 2. OK, I see what you mean now. I thought you were implying the box id would be used for calculating a distance, too. +
Grant Ingersoll 2010-04-14, 17:12
|