-inverted index pruning
Li Li 2011-02-15, 10:57
I recently read a paper "Pruning Policies for Two-Tiered Inverted
Index with Correctness Guarantee". It's idea is interesting and I
have some questions and like to share with you.
it's idea is pruning unlikely documents for certain terms
term1 d1 d3 d6 | d9 d7 d8
term2 d1 d6 d8 | d3 d4 d5
we have 2 terms here -- term1 and term2
we perform a "and query" for searching documents that contain both
the 2 terms
if our score function consider 2 types of scores: document static
score and term related document score
for simplicity, the static score is the page_rank of a document
and term related score is tf*idf and final score is linear combination
such as page_rank + tf*idf
the pruning criterion is : we only keep the documents whose
page_rank is top N of all the docList for this term and tf*idf is
take above example d1 d3 d5's page_rank is top 3 of all the 5
documents containing term1 and also d1's tf *idf is top 3 of all the 5
Then we want to get top 3 documents
we evaluate d1 d3 d6 d8 because it's in pruned docList
d1 and d6 's information is complete so we can calculate the
accurate score of d1
d3 d8's information is incomplete, we can only calculate the upper
bound of them.
we select top 3 document by score. suppose the result is d6 d1 d8 d3
because d8 and d3 is upper bound score, we can compare the real
score of d8 and d3. So the search is failed and we have to search
un-pruned index for answer
But if we want to get top 2 documents. we can know d1 and d6 are
the answer because d6, d8 and other documents' score is less
The experiments in this paper shows that we can use pruned
index(30% size of full index) to answer 90%+ queries
This result is excited but we have a problem in using it in lucene.
for "and query" in lucene, we use skipList to speed up queries.
but this pruning algorithm cannot use pruned index.
we can check it by above example.
if we use skipList, we can skip d3. but d3's upper bound score may
larger than d1. if we don't score d3, we cannot know it.
although we need only score docList in pruned index(30%), it may be
slower than "and query" in full index because we can
use skipList to skip many docs.
Anyone has good idea for this problem? if we can solve this
problem, we can improve performance well.