Solr, mail # user - codecs for sorted indexes


Re: codecs for sorted indexes
Robert Muir 2012-04-13, 00:00
On Thu, Apr 12, 2012 at 6:35 PM, Carlos Gonzalez-Cadenas
<[EMAIL PROTECTED]> wrote:
> Hello Michael,
>
> Yes, we are pre-sorting the documents before adding them to the index. We
> have a score associated with every document (not an IR score but a
> document-related score that reflects its "importance"). Therefore, the
> document with the biggest score will have the lowest docid (we add it first
> to the index). We do this in order to apply early termination effectively.
> With the current codec, we haven't seen much of a difference in terms of
> space when we have the index sorted vs not sorted.

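I wouldn't expect you to see space savings when you sort this way:
ordering by an importance score does not place documents that share
terms next to each other, so the docid gaps within each postings list
stay about as large as before.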

The techniques I was mentioning involve sorting documents by other
factors instead, such as grouping related documents from the same
website together. The idea is that such documents probably share many
of the same terms, so the docids in each term's postings list end up
close together. This hopefully creates smaller docid deltas (the gaps
between consecutive docids), which require fewer bits to represent.
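
To make the delta argument concrete, here is a rough sketch in Python.
It assumes a generic variable-byte scheme (7 payload bits per byte) for
storing the gaps. That is an illustration only, not the exact on-disk
format of any particular Lucene codec, and the names vbyte_len and
postings_bytes as well as the docid ranges are made up for the example.

```python
def vbyte_len(n):
    """Bytes needed to store n with 7 payload bits per byte."""
    length = 1
    while n >= 0x80:
        n >>= 7
        length += 1
    return length


def postings_bytes(docids):
    """Total bytes to store a sorted docid list as variable-byte deltas."""
    total = 0
    previous = 0
    for docid in docids:
        total += vbyte_len(docid - previous)  # store the gap, not the docid
        previous = docid
    return total


# Hypothetical term that occurs in 100 documents of one website.
# Grouped by site: the 100 documents get adjacent docids.
clustered = list(range(1000, 1100))
# Not grouped: the same 100 documents are spread over ~10M docids.
scattered = list(range(50_000, 10_000_000, 100_000))

print(postings_bytes(clustered))  # 101
print(postings_bytes(scattered))  # 300
```

Both lists hold 100 docids, but the clustered one has gaps of 1 that fit
in a single byte each, while the scattered one has gaps of 100000 that
need three bytes each. Sorting by an importance score gives you
something closer to the scattered case for most terms.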

--
lucidimagination.com