Carlos Gonzalez-Cadenas 2012-04-12, 10:13
Michael McCandless 2012-04-12, 16:19
Carlos Gonzalez-Cadenas 2012-04-12, 22:35
Re: codecs for sorted indexes
Robert Muir 2012-04-13, 00:00
On Thu, Apr 12, 2012 at 6:35 PM, Carlos Gonzalez-Cadenas
<[EMAIL PROTECTED]> wrote:
> Hello Michael,
> Yes, we are pre-sorting the documents before adding them to the index. We
> have a score associated with every document (not an IR score but a
> document-related score that reflects its "importance"). Therefore, the
> document with the biggest score will have the lowest docid (we add it first
> to the index). We do this in order to apply early termination effectively.
> With the current codec, we haven't seen much of a difference in terms of
> space when we have the index sorted vs not sorted.
I wouldn't expect that you will see space savings when you sort this way.
The techniques I was mentioning involve sorting documents by other
factors instead, such as grouping related documents from the same
website together (the idea being that they probably share many of the
same terms). This hopefully creates smaller deltas between consecutive
document IDs in each posting list, and those require fewer bits to
represent.
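
To make the delta argument concrete, here is a rough sketch in Python. It
assumes posting lists are stored as gaps between consecutive docids, each
gap written with a simple variable-byte scheme (7 payload bits per byte).
That scheme, the helper names, and the index sizes are all assumptions
chosen for illustration, not a description of what any particular Lucene
codec does.

```python
import random


def vbyte_len(n):
    """Bytes needed to store non-negative n with 7 payload bits per byte."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def postings_size(doc_ids):
    """Total bytes for a posting list stored as variable-byte encoded gaps."""
    total = 0
    prev = 0
    for doc_id in sorted(doc_ids):
        total += vbyte_len(doc_id - prev)  # store the gap, not the raw docid
        prev = doc_id
    return total


# Hypothetical term that occurs in 1,000 documents of a 1,000,000-doc index.
# Case 1: the documents come from one site and were indexed next to each other.
clustered = range(500_000, 501_000)
# Case 2: the same number of documents spread at random over the whole index.
rng = random.Random(0)
scattered = rng.sample(range(1_000_000), 1_000)

clustered_bytes = postings_size(clustered)
scattered_bytes = postings_size(scattered)
print("clustered:", clustered_bytes, "bytes")
print("scattered:", scattered_bytes, "bytes")
```

In the clustered case almost every gap is 1 and fits in a single byte,
while in the scattered case most gaps are large enough to need a second
byte. Sorting by an importance score gives no such clustering per term,
which is consistent with not seeing a size difference.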