Re: Can Apache Solr Handle TeraByte Large Data
Otis Gospodnetic 2012-01-18, 06:30
Could indexing English Wikipedia dump over and over get you there?
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html
>From: Memory Makers <[EMAIL PROTECTED]>
>To: [EMAIL PROTECTED]
>Sent: Tuesday, January 17, 2012 12:15 AM
>Subject: Re: Can Apache Solr Handle TeraByte Large Data
>I've been toying with the idea of setting up an experiment to index a large
>document set 1+ TB -- any thoughts on an open data set that one could use
>for this purpose?
>On Mon, Jan 16, 2012 at 5:00 PM, Burton-West, Tom <[EMAIL PROTECTED]> wrote:
>> Hello,
>> Searching real-time sounds difficult with that amount of data. With large
>> documents, 3 million documents, and 5TB of data the index will be very
>> large. With indexes that large your performance will probably be I/O bound.
>> Do you plan on allowing phrase or proximity searches? If so, your
>> performance will be even more I/O bound as documents that large will have
>> huge positions indexes that will need to be read into memory for processing
>> phrase queries. To reduce I/O you need as much of the index in memory
>> (Lucene/Solr caches, and operating system disk cache). Every commit
>> invalidates the Solr/Lucene caches (unless the newer NRT code has solved
>> this for Solr).
>> If you index and serve on the same server, you are also going to get
>> terrible response time whenever your commits trigger a large merge.
>> If you need to service 10-100 qps or more, you may need to look at putting
>> your index on SSDs or spreading it over enough machines so it can stay in
>> memory.
>> What kind of response times are you looking for and what query rate?
>> We have somewhat smaller documents. We have 10 million documents and about
>> 6-8TB of data in HathiTrust and have spread the index over 12 shards on 4
>> machines (i.e. 3 shards per machine). We get an average of around
>> 200-300ms response time but our 95th percentile times are about 800ms and
>> 99th percentile are around 2 seconds. This is with an average load of less
>> than 1 query/second.
>> As Otis suggested, you may want to implement a strategy that allows users
>> to search within the large documents by breaking the documents up into
>> smaller units. What we do is have two Solr indexes. The first indexes
>> complete documents. When the user clicks on a result, we index the entire
>> document on a page level in a small Solr index on-the-fly. That way they
>> can search within the document and get page level results.
>> More details about our setup:
>> Tom Burton-West
>> University of Michigan Library
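The two-index approach Tom describes (one index of complete documents, plus a small page-level index built when a user opens a result) could be sketched roughly as below. This is only an illustration, not HathiTrust's actual code: the function names and the field names (id, doc_id, page, text) are made up for the example, and it only builds the JSON body of an update request. Sending it to a small Solr core and committing is left out.

```python
import json


def page_docs(doc_id, pages):
    """Turn one large document into one small Solr document per page.

    The field names (id, doc_id, page, text) are hypothetical and would
    have to match whatever schema the page-level index really uses.
    """
    return [
        {"id": f"{doc_id}_p{n}", "doc_id": doc_id, "page": n, "text": text}
        for n, text in enumerate(pages, start=1)
    ]


def update_payload(doc_id, pages):
    """Serialize the page documents as a JSON array for an update request."""
    return json.dumps(page_docs(doc_id, pages))


if __name__ == "__main__":
    # Two short strings stand in for the OCR text of two pages.
    pages = ["first page text", "second page text"]
    print(update_payload("vol123", pages))
```

Because each page becomes its own document with a back-reference to the parent (doc_id), a search within the opened document can return page-level hits, while the big index keeps one entry per complete document.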
>> -----Original Message-----