Burton-West, Tom
2010-12-15, 23:52
Upayavira
2010-12-16, 08:59
Michael McCandless
2010-12-16, 10:51
Robert Petersen
2010-12-16, 17:17
Michael McCandless
2010-12-16, 18:23
Burton-West, Tom
2010-12-16, 19:09
Michael McCandless
2010-12-16, 19:33
Robert Muir
2010-12-16, 19:38
Burton-West, Tom
2010-12-16, 21:03
Michael McCandless
2010-12-16, 21:19
Robert Muir
2010-12-16, 21:22
Burton-West, Tom
2010-12-18, 15:55
Robert Petersen
2010-12-16, 19:27
Michael McCandless
2010-12-16, 19:36
Yonik Seeley
2010-12-16, 21:30
Memory use during merges (OOM)
Burton-West, Tom 2010-12-15, 23:52

Hello all,

Are there any general guidelines for determining the main factors in memory use during merges?

We recently changed our indexing configuration to speed up indexing, but in the process of doing a very large merge we are running out of memory. Below is a list of the changes and part of the IndexWriter log. The changes increased the indexing throughput by almost an order of magnitude (from about 600 documents per hour to about 6000 documents per hour; our documents are about 800K).

We are trying to determine which of the changes to tweak to avoid the OOM but still keep the benefit of the increased indexing throughput.

Is it likely that the changes to ramBufferSizeMB are the culprit, or could it be the mergeFactor change from 10 to 20?

Is there any obvious relationship between ramBufferSizeMB and the memory consumed by Solr?
Are there rules of thumb for the memory needed in terms of the number or size of segments?

Our largest segments prior to the failed merge attempt were between 5GB and 30GB. The memory allocated to the Solr/tomcat JVM is 10GB.

Tom Burton-West
-----------------------------------------------------------------

Changes to indexing configuration:
mergeScheduler
    before: serialMergeScheduler
    after: concurrentMergeScheduler
mergeFactor
    before: 10
    after: 20
ramBufferSizeMB
    before: 32
    after: 320

excerpt from indexWriter.log

Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: findMerges: 40 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: level 7.23609 to 7.98609: 20 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: 0 to 20: add this merge
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: level 5.44878 to 6.19878: 20 segments
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: LMP: 20 to 40: add this merge
...
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: applyDeletes
Dec 14, 2010 5:34:10 PM IW 0 [Tue Dec 14 17:34:10 EST 2010; http-8091-Processor70]: DW: apply 1320 buffered deleted terms and 0 deleted docIDs and 0 deleted queries on 40 segments.
Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; http-8091-Processor70]: hit exception flushing deletes
Dec 14, 2010 5:48:17 PM IW 0 [Tue Dec 14 17:48:17 EST 2010; http-8091-Processor70]: hit OutOfMemoryError inside updateDocument

tom

Re: Memory use during merges (OOM)
Upayavira 2010-12-16, 08:59

How long does it take to reach this OOM situation? Is it possible for you to try a merge with each setting in turn, and evaluate what impact each one has on indexing speed and memory consumption?

It might also be interesting to watch garbage collection with jstat while it is running, as that could be your speed bottleneck.

Upayavira

Re: Memory use during merges (OOM)
Michael McCandless 2010-12-16, 10:51

RAM usage for merging is tricky.

First off, merging must hold open a SegmentReader for each segment being merged. However, it's not necessarily a full segment reader; for example, merging doesn't need the terms index or norms. But it will load deleted docs.

But if you are doing deletions (or updateDocument, which is just a delete + add under the hood), then this will force the terms index of the segment readers to be loaded, thus consuming more RAM. Furthermore, if the deletions you do (by Term/Query) in fact result in deleted documents (i.e. they were not "false" deletions), then the merging allocates an int[maxDoc()] for each SegmentReader that has deletions.

Finally, if you have multiple merges running at once (see ConcurrentMergeScheduler.setMaxMergeCount), that means RAM for each currently running merge is tied up.

So I think the gist is: the RAM usage will be in proportion to the net size of the merge (mergeFactor + how big each merged segment is), how many merges you allow concurrently, and whether you do false or true deletions.

If you are doing false deletions (calling .updateDocument when in fact the Term you are replacing cannot exist), it'd be best, if possible, to change the app to not call .updateDocument when you know the Term doesn't exist.

Mike
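
To put a rough number on the int[maxDoc()] point in the message above, here is a minimal Python sketch. It assumes one array of 4-byte ints per merged segment that has real deletions, as described, and that concurrent merges simply add up. The function names and the segment sizes are hypothetical and do not come from the thread or from Lucene.

```python
# Rough estimate of the docID arrays a merge allocates.
# Assumption (taken from the message above): one int[maxDoc()] per
# merged segment that has real deletions; a Java int is 4 bytes.
BYTES_PER_INT = 4


def docmap_bytes(max_docs_of_segments_with_deletes):
    """Sum of 4 * maxDoc over the merged segments that carry deletions."""
    return sum(BYTES_PER_INT * max_doc
               for max_doc in max_docs_of_segments_with_deletes)


def concurrent_docmap_bytes(merges):
    """Total over all merges that run at the same time."""
    return sum(docmap_bytes(merge) for merge in merges)


# Hypothetical numbers: two concurrent merges of 20 segments each,
# every segment holding 35,000 docs and at least one real deletion.
merge = [35_000] * 20
total = concurrent_docmap_bytes([merge, merge])
print(total)  # 5600000 bytes, about 5.3 MiB
```

With document counts in this range the arrays stay small compared with a multi-gigabyte heap, which suggests looking at the terms index first. That is only a back-of-the-envelope reading, not a measurement.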

RE: Memory use during merges (OOM)
Robert Petersen 2010-12-16, 17:17

Hello, we occasionally bump into the OOM issue during merging after propagation too, and from the discussion so far I guess we are doing thousands of 'false deletions' by unique id to make sure certain documents are *not* in the index. Could anyone explain why that is bad? I didn't really understand the conclusion.

Re: Memory use during merges (OOM)
Michael McCandless 2010-12-16, 18:23

It's not that it's "bad"; it's just that Lucene must do extra work to check whether these deletes are real, and that extra work requires loading the terms index, which will consume additional RAM.

For most apps, though, the terms index is relatively small and so this isn't really an issue. But if your terms index is large, this can explain the added RAM usage.

One workaround for a large terms index is to set the terms index divisor that IndexWriter should use whenever it loads a terms index (this is IndexWriter.setReaderTermsIndexDivisor).

Mike
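
As a small worked example of the "extra work" described above, the log excerpt in the first message reports 1320 buffered deleted terms applied on 40 segments. The sketch below assumes, as a simplification, that every buffered term is checked once against every segment; the function name is hypothetical.

```python
def delete_term_lookups(buffered_delete_terms, segments):
    """Upper bound on term lookups, assuming each buffered delete term
    is checked once against each segment (a simplifying assumption)."""
    return buffered_delete_terms * segments


# Figures from the indexWriter.log excerpt in the first message.
lookups = delete_term_lookups(1320, 40)
print(lookups)  # 52800
```

Each of those lookups goes through a segment's terms index, which is why the index has to be in memory even when most of the deletes turn out to be false.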

RE: Memory use during merges (OOM)
Burton-West, Tom 2010-12-16, 19:09

Thanks Mike,

>>But, if you are doing deletions (or updateDocument, which is just a
>>delete + add under-the-hood), then this will force the terms index of
>>the segment readers to be loaded, thus consuming more RAM.

Out of 700,000 docs, by the time we get to doc 600,000, there is a good chance a few documents have been updated, which would cause a delete + add.

>>One workaround for large terms index is to set the terms index divisor
>>that IndexWriter should use whenever it loads a terms index (this is
>>IndexWriter.setReaderTermsIndexDivisor).

I always get confused about the two different divisors and their names in the solrconfig.xml file.

We are setting termInfosIndexDivisor, which I think translates to the Lucene IndexWriter.setReaderTermsIndexDivisor:

<indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
  <int name="termInfosIndexDivisor">8</int>
</indexReaderFactory>

The other one is termIndexInterval, which is set on the writer and determines what gets written to the tii file. I don't remember how to set this in Solr.

Are we setting the right one to reduce RAM usage during merging?

> So I think the gist is... the RAM usage will be in proportion to the
> net size of the merge (mergeFactor + how big each merged segment is),
> how many merges you allow concurrently, and whether you do false or
> true deletions

Does an optimize do something differently?

Tom

Re: Memory use during merges (OOM)
Michael McCandless 2010-12-16, 19:33

On Thu, Dec 16, 2010 at 2:09 PM, Burton-West, Tom <[EMAIL PROTECTED]> wrote:
> Thanks Mike,
>
>>>But, if you are doing deletions (or updateDocument, which is just a
>>>delete + add under-the-hood), then this will force the terms index of
>>>the segment readers to be loaded, thus consuming more RAM.
>
> Out of 700,000 docs, by the time we get to doc 600,000, there is a good chance a few documents have been updated, which would cause a delete +add.

OK so you should do the .updateDocument not .addDocument.

>>>One workaround for large terms index is to set the terms index divisor
>>>that IndexWriter should use whenever it loads a terms index (this is
>>>IndexWriter.setReaderTermsIndexDivisor).
>
> I always get confused about the two different divisors and their names in the solrconfig.xml file
>
> We are setting termInfosIndexDivisor, which I think translates to the Lucene IndexWriter.setReaderTermsIndexDivisor
>
> <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
>   <int name="termInfosIndexDivisor">8</int>
> </indexReaderFactory>
>
> The other one is termIndexInterval which is set on the writer and determines what gets written to the tii file. I don't remember how to set this in Solr.
>
> Are we setting the right one to reduce RAM usage during merging?

It's even more confusing! There are three settings.

The first tells IW how frequent the index terms are (default is 128). The second tells IndexReader whether to sub-sample these on load (default is 1, meaning load all indexed terms; but if you set it to 2, then 2*128 = every 256th term is loaded). Third, IW has the same setting (subsampling) to be used whenever it internally must open a reader (e.g. to apply deletes).

The last two are really the same setting, just that one is passed when you open IndexReader yourself, and the other is passed whenever IW needs to open a reader.

But I'm not sure how these settings are named in solrconfig.xml.

>> So I think the gist is... the RAM usage will be in proportion to the
>> net size of the merge (mergeFactor + how big each merged segment is),
>> how many merges you allow concurrently, and whether you do false or
>> true deletions
>
> Does an optimize do something differently?

No, optimize is the same deal. But because it's a big merge (especially the last one), it has the highest RAM usage of all merges.

Mike
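
The interval and divisor arithmetic in the message above can be sketched in a few lines of Python. This only illustrates the sampling math (interval times divisor), not Lucene code; the function names and the figure of one billion unique terms are hypothetical.

```python
import math


def effective_interval(term_index_interval=128, divisor=1):
    """Every N-th term of the dictionary ends up in RAM, where
    N = interval written by IndexWriter * divisor applied on load."""
    return term_index_interval * divisor


def loaded_index_terms(unique_terms, term_index_interval=128, divisor=1):
    """Approximate number of index terms held in RAM for one reader."""
    return math.ceil(unique_terms /
                     effective_interval(term_index_interval, divisor))


# The example from the message: divisor 2 with the default interval of 128.
print(effective_interval(128, 2))  # 256

# Hypothetical index with one billion unique terms.
for divisor in (1, 2, 8):
    print(divisor, loaded_index_terms(1_000_000_000, 128, divisor))
```

Doubling the divisor roughly halves the number of index terms a reader keeps in RAM, at the cost of scanning more of the terms dictionary per lookup.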

Re: Memory use during merges (OOM)
Robert Muir 2010-12-16, 19:38

On Thu, Dec 16, 2010 at 2:09 PM, Burton-West, Tom <[EMAIL PROTECTED]> wrote:
>
> I always get confused about the two different divisors and their names in the solrconfig.xml file

This one (for the writer) isn't configurable by Solr. Want to open an issue?

>
> We are setting termInfosIndexDivisor, which I think translates to the Lucene IndexWriter.setReaderTermsIndexDivisor
>
> <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
>   <int name="termInfosIndexDivisor">8</int>
> </indexReaderFactory>
>
> The other one is termIndexInterval which is set on the writer and determines what gets written to the tii file. I don't remember how to set this in Solr.
>
> Are we setting the right one to reduce RAM usage during merging?
>

When you write the terms, it creates a terms dictionary and a terms index. The termIndexInterval (default 128) controls how many terms go into the index: for example, every 128th term.

The divisor just samples this at runtime, e.g. with your divisor of 8 it's only reading every 8th term from the index (or, another way to see it, every 8*128th term is read into RAM).

Your setting isn't being applied to the reader IW uses during merging; it's only for readers Solr opens from directories explicitly.

I think you should open a JIRA issue!

RE: Memory use during merges (OOM)
Burton-West, Tom 2010-12-16, 21:03

>>Your setting isn't being applied to the reader IW uses during
>>merging... its only for readers Solr opens from directories
>>explicitly.
>>I think you should open a jira issue!

Do I understand correctly that this setting in theory could be applied to the reader IW uses during merging but is not currently being applied?

<indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
  <int name="termInfosIndexDivisor">8</int>
</indexReaderFactory>

I understand the trade-offs for doing this during searching, but not the trade-offs for doing this during merging. Is the use during merging similar to the use during searching? That is, does some process have to look up data for a particular term, as opposed to having to iterate through all the terms? (I haven't yet dug into the merging/indexing code.)

Tom

Re: Memory use during merges (OOM)
Michael McCandless 2010-12-16, 21:19

On Thu, Dec 16, 2010 at 4:03 PM, Burton-West, Tom <[EMAIL PROTECTED]> wrote:
>>>Your setting isn't being applied to the reader IW uses during
>>>merging... its only for readers Solr opens from directories
>>>explicitly.
>>>I think you should open a jira issue!
>
> Do I understand correctly that this setting in theory could be applied to the reader IW uses during merging but is not currently being applied?
>
> <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
>   <int name="termInfosIndexDivisor">8</int>
> </indexReaderFactory>

Yes.

> I understand the tradeoffs for doing this during searching, but not the trade-offs for doing this during merging. Is the use during merging the similar to the use during searching?
>
> i.e. Some process has to look up data for a particular term as opposed to having to iterate through all the terms?
> (Haven't yet dug into the merging/indexing code).

It's not used during merging, only for applying deletes. But yes, we do a lookup of the Term (or of the Terms inside the Query, if you delete by Query) from the terms index.

Mike

Re: Memory use during merges (OOM)
Robert Muir 2010-12-16, 21:22

On Thu, Dec 16, 2010 at 4:03 PM, Burton-West, Tom <[EMAIL PROTECTED]> wrote:
>>>Your setting isn't being applied to the reader IW uses during
>>>merging... its only for readers Solr opens from directories
>>>explicitly.
>>>I think you should open a jira issue!
>
> Do I understand correctly that this setting in theory could be applied to the reader IW uses during merging but is not currently being applied?

Yes. I'm not really sure (especially given the "name=") whether you can have, or it was planned to have, multiple IR factories in Solr, e.g. a separate one for spellchecking.

So I'm not sure if we should (hackishly) steal this parameter from the IR factory (it is common to all IRFactories, not just StandardIRFactory) and apply it to IW. But we could at least expose the divisor param separately in the IW config so you have some way of setting it.

>
> <indexReaderFactory name="IndexReaderFactory" class="org.apache.solr.core.StandardIndexReaderFactory">
>   <int name="termInfosIndexDivisor">8</int>
> </indexReaderFactory>
>
> I understand the tradeoffs for doing this during searching, but not the trade-offs for doing this during merging. Is the use during merging the similar to the use during searching?
>
> i.e. Some process has to look up data for a particular term as opposed to having to iterate through all the terms?
> (Haven't yet dug into the merging/indexing code).

It needs it for applying deletes.

As a workaround (if you are reindexing), maybe instead of using the Terms Index Divisor = 8 you could set the Terms Index Interval = 1024 (8 * 128)?

This will solve your merging problem and have the same perf characteristics as divisor=8, except you can't "go back down" like you can with the divisor without reindexing with a smaller interval.

If you've already tested that performance with the divisor of 8 is acceptable (or in your case maybe necessary!), it sort of makes sense to 'bake it in' by setting your divisor back to 1 and your interval to 1024 instead.
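
The workaround above can be compared with the current setup in a short Python sketch. It models two kinds of reader: those Solr opens for searching, which honor the configured divisor, and the reader IndexWriter opens internally to apply deletes, which per this thread did not pick up the Solr setting at the time and so is assumed to use a divisor of 1. The class and field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TermsIndexConfig:
    term_index_interval: int        # baked into the index at write time
    search_reader_divisor: int      # divisor for readers Solr opens
    writer_reader_divisor: int = 1  # assumed: not configurable via Solr then

    def search_sampling(self):
        """Every N-th term is loaded by a search reader."""
        return self.term_index_interval * self.search_reader_divisor

    def writer_sampling(self):
        """Every N-th term is loaded by IndexWriter's internal reader."""
        return self.term_index_interval * self.writer_reader_divisor


current = TermsIndexConfig(term_index_interval=128, search_reader_divisor=8)
workaround = TermsIndexConfig(term_index_interval=1024, search_reader_divisor=1)

for name, cfg in (("current", current), ("workaround", workaround)):
    print(name, cfg.search_sampling(), cfg.writer_sampling())
# current 1024 128
# workaround 1024 1024
```

Under these assumptions both setups give search readers the same sampling, but only the larger interval also shrinks the terms index that IndexWriter loads when applying deletes. The price, as noted above, is that going back to a denser index requires reindexing.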

RE: Memory use during merges (OOM)
Burton-West, Tom 2010-12-18, 15:55

Thanks Robert,

We will try the termIndexInterval as a workaround. I have also opened a JIRA issue: https://issues.apache.org/jira/browse/SOLR-2290. I hope I found the right sections of the Lucene code. I'm just now in the process of looking at the Solr IndexReaderFactory, SolrIndexWriter, and SolrIndexConfig, trying to better understand how solrconfig.xml gets instantiated and how it affects the readers and writers.

Tom
RE: Memory use during merges (OOM)Robert Petersen 2010-12-16, 19:27
Thanks Mike! When you say 'term index of the segment readers', are you referring to the term vectors?
In our case our index of 8 million docs holds pretty 'skinny' docs containing searchable product titles and keywords, with the rest of the doc only holding Ids for faceting upon. Docs typically only have unique terms per doc, with a lot of overlap of the terms across categories of docs (all similar products). I'm thinking that our unique terms are low vs the size of our index. The way we spin out deletes and adds should keep the terms loaded all the time. Seems like once in a couple weeks a propagation happens which kills the slave farm with OOMs. We are bumping the heap up a couple gigs every time this happens and hoping it goes away at this point. That is why I jumped into this discussion, sorry for butting in like that. you guys are discussing very interesting settings I had not considered before. Rob -----Original Message----- From: Michael McCandless [mailto:[EMAIL PROTECTED]] Sent: Thursday, December 16, 2010 10:24 AM To: [EMAIL PROTECTED] Subject: Re: Memory use during merges (OOM) It's not that it's "bad", it's just that Lucene must do extra work to check if these deletes are real or not, and that extra work requires loading the terms index which will consume additional RAM. For most apps, though, the terms index is relatively small and so this isn't really an issue. But if your terms index is large this can explain the added RAM usage. One workaround for large terms index is to set the terms index divisor that IndexWriter should use whenever it loads a terms index (this is IndexWriter.setReaderTermsIndexDivisor). Mike On Thu, Dec 16, 2010 at 12:17 PM, Robert Petersen <[EMAIL PROTECTED]> wrote: > Hello we occasionally bump into the OOM issue during merging after propagation too, and from the discussion below I guess we are doing thousands of 'false deletions' by unique id to make sure certain documents are *not* in the index. Could anyone explain why that is bad? I didn't really understand the conclusion below. 
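To put rough numbers on the terms index divisor workaround quoted in this message, here is a back-of-the-envelope sketch. It does not call any Lucene API; it only assumes that the terms index keeps every Nth term (128 is assumed as the interval here) and that a divisor of D keeps every Dth of those. The class name, method name, and the 2 billion term count are hypothetical, and the per-entry byte cost is deliberately not modeled.

```java
// Arithmetic sketch only: no Lucene classes are used or required.
final class TermsIndexEstimate {

    /**
     * Estimates how many terms a reader would hold in RAM for one segment,
     * assuming every (termIndexInterval * divisor)-th term is kept.
     */
    static long termsHeldInRam(long uniqueTerms, int termIndexInterval, int divisor) {
        if (uniqueTerms < 0 || termIndexInterval < 1 || divisor < 1) {
            throw new IllegalArgumentException("arguments must be non-negative / positive");
        }
        long step = (long) termIndexInterval * divisor;
        return (uniqueTerms + step - 1) / step; // ceiling division
    }

    public static void main(String[] args) {
        long uniqueTerms = 2_000_000_000L; // hypothetical segment, not a measured value
        for (int divisor : new int[] {1, 2, 4, 8}) {
            System.out.printf("divisor=%d -> about %,d terms held in RAM%n",
                    divisor, termsHeldInRam(uniqueTerms, 128, divisor));
        }
    }
}
```

Under these assumptions, doubling the divisor roughly halves the number of in-RAM entries, at the cost of slower seeks to a specific term.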
-
Re: Memory use during merges (OOM) Michael McCandless 2010-12-16, 19:36
Actually, the terms index is something different.

If you don't use CFS (the compound file format), go and look at the size of the *.tii files in your index directory -- those are the terms index. The terms index picks a subset of the terms (by default, every 128th term) to hold in RAM (plus some metadata) in order to make seeking to a specific term faster. Unfortunately they are held in a RAM-intensive way, but in the upcoming 4.0 release we've greatly reduced that.

Mike

On Thu, Dec 16, 2010 at 2:27 PM, Robert Petersen <[EMAIL PROTECTED]> wrote:
> Thanks Mike! When you say 'term index of the segment readers', are you referring to the term vectors?
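As a minimal sketch of the "go and look at the size of *.tii" check, the following uses only the standard java.nio.file API to add up the sizes of the *.tii files in one directory. The class and method names are hypothetical, and it assumes a non-compound index whose files sit directly in the given directory. The on-disk total is only a rough indicator; the in-RAM footprint of the loaded terms index can be noticeably larger.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

final class TiiSize {

    /** Sums the on-disk size, in bytes, of all *.tii files directly inside indexDir. */
    static long totalTiiBytes(Path indexDir) throws IOException {
        long total = 0;
        // The glob matches file names only; subdirectories are not searched.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(indexDir, "*.tii")) {
            for (Path file : files) {
                total += Files.size(file);
            }
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        // Pass the index directory as the first argument; defaults to the current directory.
        Path indexDir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(totalTiiBytes(indexDir) + " bytes of *.tii files in " + indexDir);
    }
}
```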
-
Re: Memory use during merges (OOM) Yonik Seeley 2010-12-16, 21:30
On Thu, Dec 16, 2010 at 5:51 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> If you are doing false deletions (calling .updateDocument when in fact
> the Term you are replacing cannot exist) it'd be best if possible to
> change the app to not call .updateDocument if you know the Term
> doesn't exist.

FWIW, if you're going to add a batch of documents you know aren't already in the index, you can use the "overwrite=false" parameter for that Solr update request.

-Yonik
http://www.lucidimagination.com
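One way the "overwrite=false" suggestion might look from a client is sketched below. It only builds the request URL and sends nothing; the host, port, and /solr/update path are assumptions for illustration, and the class and method names are hypothetical. How the parameter is best passed can depend on the Solr version and client library in use.

```java
final class OverwriteFalseUrl {

    /** Appends the overwrite parameter to an update handler URL (no request is sent). */
    static String updateUrl(String baseUpdateUrl, boolean overwrite) {
        // Use '&' if the URL already carries a query string, otherwise start one with '?'.
        String separator = baseUpdateUrl.contains("?") ? "&" : "?";
        return baseUpdateUrl + separator + "overwrite=" + overwrite;
    }

    public static void main(String[] args) {
        // Hypothetical local Solr instance; adjust to your own deployment.
        String base = "http://localhost:8983/solr/update";
        System.out.println(updateUrl(base, false));
    }
}
```

Note that with overwrite=false the uniqueness check is skipped, so this is only safe when the batch really contains no documents whose IDs are already in the index; otherwise duplicates can result.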
|