Burton-West, Tom 2010-10-05, 18:40
Nguyen, Vincent 2010-10-05, 18:46
Michael McCandless 2010-10-05, 23:12
Lance Norskog 2010-10-06, 04:24
RE: Experience with large merge factors
Burton-West, Tom 2010-10-07, 01:57
>>Do you use multiple threads for indexing? Large RAM buffer size is
>>also good, but I think perf peaks out maybe around 512 MB (at least
>>based on past tests)?
We are using Solr, and I'm not sure whether Solr uses multiple threads for indexing. We have 30 "producers", each sending documents to 1 of 12 Solr shards on a round-robin basis, so each shard will get multiple requests.
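To make the round-robin arrangement concrete, here is a minimal sketch of how a single producer might rotate through the shards. The shard names and the `assign_round_robin` helper are hypothetical (the thread only says there are 12 shards), and nothing here talks to Solr; it only shows the rotation.

```python
from collections import Counter
from itertools import cycle

# Hypothetical shard names; the thread only states that there are 12 shards.
SHARDS = [f"shard{n:02d}" for n in range(1, 13)]


def assign_round_robin(doc_ids, shards=SHARDS):
    """Pair each document ID with the next shard in rotation."""
    return list(zip(doc_ids, cycle(shards)))


# One producer sending 120 documents spreads them evenly over the shards.
assignments = assign_round_robin(range(120))
per_shard = Counter(shard for _, shard in assignments)
```

With 30 producers each running a loop like this independently, any one shard can receive requests from several producers at the same time, which is why each shard sees concurrent indexing requests even if a single producer is sequential.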
>>Believe it or not, merging is typically compute bound. It's costly to
>>decode & re-encode all the vInts.
Sounds like we need to do some monitoring during merging to see what the CPU usage and the I/O wait are during large merges.
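One possible way to do that monitoring, assuming the shards run on Linux, is to sample the aggregate `cpu` line of `/proc/stat` twice and compare the counters. The helper names below are illustrative, not part of Solr or Lucene, and the sketch only separates busy time from I/O wait.

```python
def parse_cpu_line(line):
    """Return the counters from the aggregate 'cpu' line of /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(p) for p in parts[1:]]


def cpu_shares(before, after):
    """Return (busy, iowait) as fractions of elapsed CPU time.

    Field order on Linux: user, nice, system, idle, iowait, irq, softirq,
    steal. Only these first eight fields are summed, because guest time is
    already counted inside user time.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    if total == 0:
        return 0.0, 0.0
    idle, iowait = deltas[3], deltas[4]
    busy = total - idle - iowait
    return busy / total, iowait / total


def read_cpu_sample(path="/proc/stat"):
    """Read the current counters (Linux only)."""
    with open(path) as f:
        return parse_cpu_line(f.readline())


# Example with two made-up samples taken some seconds apart.
sample_a = parse_cpu_line("cpu  100 0 50 800 50 0 0 0")
sample_b = parse_cpu_line("cpu  400 0 150 1000 450 0 0 0")
busy, iowait = cpu_shares(sample_a, sample_b)
```

A high busy share with low I/O wait during a merge would be consistent with the merge being compute bound; the reverse would point at the disks.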
>>Larger merge factor is good because it means the postings are copied
>>fewer times, but it's bad because you could risk running out of
>>descriptors, and, if the OS doesn't have enough RAM, you'll start to
>>thin out the readahead that the OS can do (which makes the merge less
>>efficient since the disk heads are seeking more).
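The trade-off described above can be illustrated with some rough arithmetic. The sketch below assumes a simplified model in which every merge combines exactly `merge_factor` equal-sized segments and the index is eventually merged down to one segment; real merge policies are more nuanced, and `files_per_segment` is an assumed parameter, not a figure from this thread.

```python
def merge_levels(flushed_segments, merge_factor):
    """Approximate how many times each posting is copied by merging."""
    if merge_factor < 2:
        raise ValueError("merge_factor must be at least 2")
    levels = 0
    segments = flushed_segments
    while segments > 1:
        # Ceiling division: a partial group still needs one merge.
        segments = -(-segments // merge_factor)
        levels += 1
    return levels


def input_files_open(merge_factor, files_per_segment):
    """Rough count of input files held open by one merge."""
    return merge_factor * files_per_segment


# Compare a few merge factors for 10,000 flushed segments, assuming
# (hypothetically) 10 files per segment.
comparison = {
    mf: (merge_levels(10_000, mf), input_files_open(mf, 10))
    for mf in (10, 40, 100)
}
```

Under these assumptions, raising the merge factor reduces the number of copy passes while multiplying the number of files a single merge keeps open, which is the descriptor and readahead pressure mentioned above.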
Is there a way to estimate the amount of RAM needed for the readahead? Once we start the re-indexing we will be running 12 shards on a 16-processor box with 144 GB of memory.
>>Do you do any deleting?
Deletes would happen as a byproduct of updating a record. This shouldn't happen too frequently during re-indexing, but we update records when a document gets re-scanned and re-OCR'd. This would probably amount to a few thousand deletes.
>>Do you use stored fields and/or term vectors? If so, try to make
>>your docs "uniform" if possible, ie add the same fields in the same
>>order. This enables Lucene to use bulk byte copy merging under the hood.
We use 4 or 5 stored fields. They are very small compared to our huge OCR field. Since we construct our Solr documents programmatically, I'm fairly certain that they are always in the same order. I'll have to look at the code when I get back to make sure.
We aren't using term vectors now, but we plan to add them as well as a number of fields based on MARC (cataloging) metadata in the future.
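One way to guarantee the uniform field order discussed above is to build every document from a single fixed list of field names. This is only a sketch: the field names and the `build_doc` helper are hypothetical and do not reflect the actual schema or indexing code.

```python
# Hypothetical field names, listed once in the order every document will use.
FIELD_ORDER = ("id", "title", "author", "date", "ocr")


def build_doc(values):
    """Return (name, value) pairs in FIELD_ORDER, rejecting incomplete docs."""
    missing = [name for name in FIELD_ORDER if name not in values]
    if missing:
        raise KeyError(f"missing fields: {missing}")
    return [(name, values[name]) for name in FIELD_ORDER]


# The input mapping can be in any order; the output order is always the same.
doc = build_doc({
    "ocr": "full OCR text ...",
    "id": "vol-0001",
    "date": "1905",
    "title": "Example title",
    "author": "Example author",
})
```

Failing on a missing field, rather than silently skipping it, keeps every document carrying the same fields in the same order, which is the condition the quoted advice describes.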
Otis Gospodnetic 2010-10-07, 03:36
Thijs 2010-10-07, 07:45
Jan Høydahl / Cominvent 2010-10-07, 09:07
Michael McCandless 2010-10-07, 08:30