Re: codecs for sorted indexes
Michael McCandless 2012-04-12, 16:19
Do you mean you are pre-sorting the documents (by what criteria?)
yourself, before adding them to the index?
In which case... you should already be seeing some benefit (a smaller
index) compared with adding them "randomly" (ie the vInts should
take fewer bytes), I think. (Probably the savings would be greater
for better intblock codecs like PForDelta, SimpleX, but I'm not
sure.)
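To make the vInt point concrete, here is a rough sketch (not Lucene code) of why clustered docids cost fewer bytes. It assumes a postings list is stored as gaps between consecutive docids, each gap written as a variable-length int with 7 payload bits per byte. The class and method names (VIntDeltaSketch, vIntLength, encodedSize) are made up for illustration.

```java
import java.util.Arrays;

// Hypothetical helper, not part of Lucene: estimates the bytes needed to
// store a postings list as delta-encoded variable-length ints.
final class VIntDeltaSketch {

    // Number of bytes a non-negative int occupies as a vInt
    // (7 payload bits per byte, high bit used as a continuation flag).
    static int vIntLength(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    // Total bytes for one postings list: sort the docids, then store each
    // one as the gap from its predecessor (the first is stored as-is).
    static int encodedSize(int[] docIds) {
        int[] sorted = docIds.clone();
        Arrays.sort(sorted);
        int total = 0;
        int previous = 0;
        for (int docId : sorted) {
            total += vIntLength(docId - previous);
            previous = docId;
        }
        return total;
    }

    public static void main(String[] args) {
        // Same number of postings; only the spacing between docids differs.
        int[] clustered = {1_000_000, 1_000_001, 1_000_002, 1_000_003};
        int[] scattered = {1_000_000, 3_000_000, 5_000_000, 7_000_000};
        System.out.println("clustered: " + encodedSize(clustered) + " bytes");
        System.out.println("scattered: " + encodedSize(scattered) + " bytes");
    }
}
```

If your sort criterion puts documents that share terms next to each other, the gaps shrink and most of them fit in a single byte. How much you actually save depends on your data and on the codec, so treat this only as an illustration of the effect.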
Or do you mean having a codec re-sort the documents (on flush/merge)?
I think this should be possible w/ the Codec API... but nobody has
tried it yet that I know of.
Note that the bulkpostings branch is effectively dead (nobody is
iterating on it, and we've removed the old bulk API from trunk), but
there is likely a GSoC project to add a PForDelta codec to trunk:
On Thu, Apr 12, 2012 at 6:13 AM, Carlos Gonzalez-Cadenas
<[EMAIL PROTECTED]> wrote:
> We're using a sorted index in order to implement early termination
> efficiently over an index of hundreds of millions of documents. As of now,
> we're using the default codecs coming with Lucene 4, but we believe that
> due to the fact that the docids are sorted, we should be able to do much
> better in terms of storage and achieve much better performance, especially
> decompression performance.
> In particular, Robert Muir is commenting on these lines here:
> We're aware that in the bulkpostings branch there are different codecs
> being implemented and different experiments being done. We don't know
> whether we should implement our own codec (i.e. using some RLE-like
> techniques) or we should use one of the codecs implemented there (PFOR,
> Simple64, ...).
> Can you please give us some advice on this?
> Carlos Gonzalez-Cadenas
> CEO, ExperienceOn - New generation search