|
|
I read the paper by Doug "Space optimizations for total ranking",
since it was written a long time ago, I wonder what algorithms lucene uses (regarding postings list traversal and score calculation, ranking) particularly the total ranking algorithm described there needs to traverse down the entire postings list for all the query terms, so in case of very common query terms like "yellow dog", either of the 2 terms may have a very very long postings list in case of web search, are they all really traversed in current lucene/Solr ? or any heuristics to truncate the list are actually employed?
in the case of returning top-k results, I can understand that partitioning the postings list into multiple machines, and then combining the top-k from each would work, but if we are required to return "the 100th result page", i.e. results ranked from 990--1000th, then each partition would still have to find out the top 1000, so partitioning would not help much. overall, is there any up-to-date detailed docs on the internal algorithms of lucene?
Thanks a lot Yang
additionally, anybody knows roughly (of course the details are a secret, but I guess the main ideas should be common enough these days) how google does fast ranking in cases of multi-term queries with AND ? (if their postings are sorted by PageRank order, then it's understandable that a single term query would quickly return the top-k, but if it's multi-term, they would have to traverse the entire lists to find the insersection set, because the lists are not sorted by docId, as in the Lucene paper case)
On Wed, Apr 25, 2012 at 2:13 PM, Yang <[EMAIL PROTECTED]> wrote:
> I read the paper by Doug "Space optimizations for total ranking", > > since it was written a long time ago, I wonder what algorithms lucene uses > (regarding postings list traversal and score calculation, ranking) > > > particularly the total ranking algorithm described there needs to traverse > down the entire postings list for all the query terms, > so in case of very common query terms like "yellow dog", either of the 2 > terms may have a very very long postings list in case of web search, > are they all really traversed in current lucene/Solr ? or any heuristics > to truncate the list are actually employed? > > in the case of returning top-k results, I can understand that partitioning > the postings list into multiple machines, and then combining the top-k > from each would work, > but if we are required to return "the 100th result page", i.e. results > ranked from 990--1000th, then each partition would still have to find out > the top 1000, so > partitioning would not help much. > > > overall, is there any up-to-date detailed docs on the internal algorithms > of lucene? > > Thanks a lot > Yang >
Ralf Heyde 2012-04-26, 07:17
Hi, i dont know the correct implementation. All that I can say is, that you should take a look at query optimization in state-of-the-art database systems. The generell solution is to select this part of a query first, which reduces the resultset most. For example you try to search for a term like "I'm looking for a term" - each token has a selectivity (TF/IDF, histograms (ie. Zipf-Distribution)). Term / Frequency I'm - often looking - normal for - stopword (too often and maybe ignored) a - stopword (too often and maybe ignored) term - rare So - Lucene might to select all documents, which contain "term", then on the "small" resultset looking for documents matching "looking" ... and so on. For that reason Lucene stores information like Histograms and other "things" ... I hope I gave you a simple example, which Lucene might use. Greets from Berlin, Ralf -------- Original-Nachricht -------- > Datum: Wed, 25 Apr 2012 14:20:10 -0700 > Von: Yang <[EMAIL PROTECTED]> > An: [EMAIL PROTECTED] > Betreff: Re: lucene algorithm ? > additionally, anybody knows roughly (of course the details are a secret, > but I guess the main ideas should be > common enough these days) how google does fast ranking in cases of > multi-term queries with AND ? > (if their postings are sorted by PageRank order, then it's understandable > that a single term query would quickly return the top-k, but if it's > multi-term, they would have to traverse the entire lists to find the > insersection set, because the lists are not sorted by docId, as in the > Lucene paper case) > > > > On Wed, Apr 25, 2012 at 2:13 PM, Yang <[EMAIL PROTECTED]> wrote: > > > I read the paper by Doug "Space optimizations for total ranking", > > > > since it was written a long time ago, I wonder what algorithms lucene > uses > > (regarding postings list traversal and score calculation, ranking) > > > > > > particularly the total ranking algorithm described there needs to > traverse > > down the entire postings list for all the query terms, > > so in case of very common query terms like "yellow dog", either of the 2 > > terms may have a very very long postings list in case of web search, > > are they all really traversed in current lucene/Solr ? or any > heuristics > > to truncate the list are actually employed? > > > > in the case of returning top-k results, I can understand that > partitioning > > the postings list into multiple machines, and then combining the top-k > > from each would work, > > but if we are required to return "the 100th result page", i.e. results > > ranked from 990--1000th, then each partition would still have to find > out > > the top 1000, so > > partitioning would not help much. > > > > > > overall, is there any up-to-date detailed docs on the internal > algorithms > > of lucene? > > > > Thanks a lot > > Yang > > -- Empfehlen Sie GMX DSL Ihren Freunden und Bekannten und wir belohnen Sie mit bis zu 50,- Euro! https://freundschaftswerbung.gmx.de---------------------------------------------------------------------
+
Ralf Heyde 2012-04-26, 07:17
Thanks Ralf. basically you are talking about selectivity of columns in a JOIN, right? but in my above example, "yellow dog", both terms are very common, and both have long postings lists. Yang On Thu, Apr 26, 2012 at 12:17 AM, Ralf Heyde <[EMAIL PROTECTED]> wrote: > Hi, > > i dont know the correct implementation. All that I can say is, that you > should take a look at query optimization in state-of-the-art database > systems. The generell solution is to select this part of a query first, > which reduces the resultset most. > > For example you try to search for a term like "I'm looking for a term" - > each token has a selectivity (TF/IDF, histograms (ie. Zipf-Distribution)). > > Term / Frequency > I'm - often > looking - normal > for - stopword (too often and maybe ignored) > a - stopword (too often and maybe ignored) > term - rare > > So - Lucene might to select all documents, which contain "term", then on > the "small" resultset looking for documents matching "looking" ... and so > on. > For that reason Lucene stores information like Histograms and other > "things" ... > > I hope I gave you a simple example, which Lucene might use. > > Greets from Berlin, > > Ralf > > > -------- Original-Nachricht -------- > > Datum: Wed, 25 Apr 2012 14:20:10 -0700 > > Von: Yang <[EMAIL PROTECTED]> > > An: [EMAIL PROTECTED] > > Betreff: Re: lucene algorithm ? > > > additionally, anybody knows roughly (of course the details are a secret, > > but I guess the main ideas should be > > common enough these days) how google does fast ranking in cases of > > multi-term queries with AND ? > > (if their postings are sorted by PageRank order, then it's understandable > > that a single term query would quickly return the top-k, but if it's > > multi-term, they would have to traverse the entire lists to find the > > insersection set, because the lists are not sorted by docId, as in the > > Lucene paper case) > > > > > > > > On Wed, Apr 25, 2012 at 2:13 PM, Yang <[EMAIL PROTECTED]> wrote: > > > > > I read the paper by Doug "Space optimizations for total ranking", > > > > > > since it was written a long time ago, I wonder what algorithms lucene > > uses > > > (regarding postings list traversal and score calculation, ranking) > > > > > > > > > particularly the total ranking algorithm described there needs to > > traverse > > > down the entire postings list for all the query terms, > > > so in case of very common query terms like "yellow dog", either of the > 2 > > > terms may have a very very long postings list in case of web search, > > > are they all really traversed in current lucene/Solr ? or any > > heuristics > > > to truncate the list are actually employed? > > > > > > in the case of returning top-k results, I can understand that > > partitioning > > > the postings list into multiple machines, and then combining the top-k > > > from each would work, > > > but if we are required to return "the 100th result page", i.e. results > > > ranked from 990--1000th, then each partition would still have to find > > out > > > the top 1000, so > > > partitioning would not help much. > > > > > > > > > overall, is there any up-to-date detailed docs on the internal > > algorithms > > > of lucene? > > > > > > Thanks a lot > > > Yang > > > > > -- > Empfehlen Sie GMX DSL Ihren Freunden und Bekannten und wir > belohnen Sie mit bis zu 50,- Euro! https://freundschaftswerbung.gmx.de> > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > >
Uwe Schindler 2012-04-26, 07:49
Hi, > I read the paper by Doug "Space optimizations for total ranking", > > since it was written a long time ago, I wonder what algorithms lucene uses > (regarding postings list traversal and score calculation, ranking) The algorithms described in that paper are still in use to merge posting lists of several terms. The "total ranking" as described here is implemented in several key parts of Lucene: #1 Section 4.1 (Parallel Merge) is done in BooleanScorer2.java #2 Section 4.2 (Block Processing) is done in BooleanScorer.java #3 Maintaining the queue of Top-N-Results is implemented in TopScoreDocCollector.java, the above scorer implementations call for each hit found the "collect(docid, score) method of the collector. > particularly the total ranking algorithm described there needs to traverse down > the entire postings list for all the query terms, so in case of very common query > terms like "yellow dog", either of the 2 terms may have a very very long > postings list in case of web search, are they all really traversed in current > lucene/Solr ? or any heuristics to truncate the list are actually employed? There are possibilities to truncate those lists, but this is not implemented in Lucene core. The main problem with Lucene's segmented index structure is, that you cannot early exit, because the very last document in the posting list could be one with a very high score. Techniques like index pruning or index sorting can optimize all this, but need index preprocessing and loose updatability of the index while in use. The Top-N result collection is optimized in Lucene (item #3), to throw away all collected documents, where the score is never be able to get into the top-n priority queue (too small score, once the queue is full). But the score has to be calculated. > in the case of returning top-k results, I can understand that partitioning the > postings list into multiple machines, and then combining the top-k from each > would work, but if we are required to return "the 100th result page", i.e. results > ranked from 990--1000th, then each partition would still have to find out the > top 1000, so partitioning would not help much. Yes, and because of that almost no search engine support deep paging. Lucene uses lots of memory when trying to do this, although there are new collector implementations in Lucene that allow "optimized" deep paging (included into item #3), if the top-n of the 99th result page were already found. In that case, you can throw away all documents with a higher score in the collection phase than the last one of page 99 when collection page 100. > overall, is there any up-to-date detailed docs on the internal algorithms of > lucene? Reading code is best documentation. But Javadocs also help, we recently updated the documentation of Lucene's internals for the coming version 4: https://builds.apache.org/job/Lucene-trunk/javadoc/> Thanks a lot > Yang Uwe ---------------------------------------------------------------------
+
Uwe Schindler 2012-04-26, 07:49
Andrzej Bialecki 2012-04-26, 14:21
On 26/04/2012 09:49, Uwe Schindler wrote: > There are possibilities to truncate those lists, but this is not implemented > in Lucene core. The main problem with Lucene's segmented index structure is, > that you cannot early exit, because the very last document in the posting > list could be one with a very high score. Techniques like index pruning or > index sorting can optimize all this, but need index preprocessing and loose > updatability of the index while in use. > > The Top-N result collection is optimized in Lucene (item #3), to throw away > all collected documents, where the score is never be able to get into the > top-n priority queue (too small score, once the queue is full). But the > score has to be calculated. This paper presents yet another option: keep a value that represents max. impact of postings in a block and then skip whole blocks if the max impact is too low to get docs within the block in the queue: http://cis.poly.edu/suel/papers/bmw.pdfImplementing this in Lucene would require changing the iterators so that they can take a threshold value, i.e. the current lowest score in the queue, so that nextDoc() and advance() could skip whole blocks when their max impact is lower than the current lowest score. -- Best regards, Andrzej Bialecki <>< ___. ___ ___ ___ _ _ __________________________________ [__ || __|__/|__||\/| Information Retrieval, Semantic Web ___|||__|| \| || | Embedded Unix, System Integration http://www.sigram.com Contact: info at sigram dot com ---------------------------------------------------------------------
+
Andrzej Bialecki 2012-04-26, 14:21
On Thu, Apr 26, 2012 at 5:13 AM, Yang <[EMAIL PROTECTED]> wrote: > > I read the paper by Doug "Space optimizations for total ranking", > > since it was written a long time ago, I wonder what algorithms lucene uses > (regarding postings list traversal and score calculation, ranking) > > > particularly the total ranking algorithm described there needs to traverse > down the entire postings list for all the query terms, > so in case of very common query terms like "yellow dog", either of the 2 > terms may have a very very long postings list in case of web search, > are they all really traversed in current lucene/Solr ? or any heuristics > to truncate the list are actually employed? you can read related papers about early termination, they are closely related to ranking algorithm. Now lucene did little thing of this area. Also is it's ranking algorithm. > > in the case of returning top-k results, I can understand that partitioning > the postings list into multiple machines, and then combining the top-k That's distributed searching, solr has this ability. Even for a single node, for conjunction query(and query), lucene will use skip list in posting to speed up. for disjunction query(or query), lucene will use BooleanScorer rather than BooleanScorer2. BooleanScorer is TAAT(Term at a Time) algorithm while BooleanScorer2 is DAAT(Document at a Time). > from each would work, > but if we are required to return "the 100th result page", i.e. results > ranked from 990--1000th, then each partition would still have to find out > the top 1000, so > partitioning would not help much. > yes, that's why many search engines will not allow user visit page number greater than a threshold. for most application, users usually only visit top results. That's why ranking algorithm is important. if you found your users always turn to next page, I think you should consider your application. you should provide more filter condition or improving ranking algorithm. > > > overall, is there any up-to-date detailed docs on the internal algorithms > of lucene? if you can read Chinese, I recommend http://www.cnblogs.com/forfuture1978/category/300665.htm. you may also find some of my blogs about lucene/solr in blog.csdn.net/fancyerII(I am not a persistent person, and plan of writing blogs of lucene/solr is not continued) anyhow, the source code is the best resource. > > Thanks a lot > Yang ---------------------------------------------------------------------
+
Li Li 2012-04-27, 07:25
yes, that's why many search engines will not allow user visit page > number greater than a threshold. for most application, users usually > only visit top results. That's why ranking algorithm is important. if > you found your users always turn to next page, I think you should > consider your application. you should provide more filter condition or > improving ranking algorithm. > > > > interesting, I just tried google "yellow dog" and ask it to retrieve from the 900's result, the access time is 0.6 seconds, while the first page comes out in 0.33 seconds (well of course the first page result may well be cached, but as I go from the first page to the 8'th page, the time does consistently increase) https://www.google.com/#q=yellow+dog&hl=en&prmd=imvns&ei=Q_maT433OombiQeB3dm1Dg&start=900&fp=a2cbb8d5ae567362
|
|