-
Machine utilization while indexing
Thijs 2010-05-20, 15:02
Hi.
I have a question about how I can get Solr to index more quickly than it does at the moment.
I have to index (and re-index) some 3-5 million documents. These documents are preprocessed by a Java application that effectively combines multiple database tables with each other to form the SolrInputDocument.
What I'm seeing, however, is that the queue of documents that are ready to be sent to the Solr server exceeds my preset limit, which tells me that Solr somehow can't process the documents fast enough.
(I have created my own queue in front of Solrj.StreamingUpdateSolrServer, as it would not process the documents fast enough, causing OutOfMemoryExceptions due to the large number of documents building up in its queue.)
I have an index that consists for 95% of IDs (Long). We don't do any analysis on the fields that are being indexed. The schema is rather straightforward.
Most fields look like:
  <fieldType name="long" class="solr.LongField" omitNorms="true"/>
  <field name="objectId" type="long" stored="true" indexed="true" required="true" />
  <field name="listId" type="long" stored="false" indexed="true" multiValued="true"/>
The relevant solrconfig.xml:
  <indexDefaults>
    <useCompoundFile>false</useCompoundFile>
    <mergeFactor>100</mergeFactor>
    <RAMBufferSizeMB>256</RAMBufferSizeMB>
    <maxMergeDocs>2147483647</maxMergeDocs>
    <maxFieldLength>10000</maxFieldLength>
    <writeLockTimeout>1000</writeLockTimeout>
    <commitLockTimeout>10000</commitLockTimeout>
    <lockType>single</lockType>
  </indexDefaults>
The machines I'm testing on have an Intel(R) Core(TM)2 Quad CPU Q9550 @ 2.83GHz with 4GB of RAM, running Linux, Java version 1.6.0_17, Tomcat 6 and Solr version 1.4.
What I'm seeing is that the network almost never reaches more than 10% of the 1GB/s connection, that the CPU utilization is always below 25% (1 core is used, not the others), and that there is no heavy disk I/O.
Also, while indexing, the memory consumption is:
  Free memory: 212.15 MB
  Total memory: 509.12 MB
  Max memory: 2730.68 MB
In the beginning (with an empty index) I get 2 ms per insert, but this slows to 18-19 ms per insert.
Are there any tips/tricks I can use to speed up my indexing? I have a feeling that my machine is capable of doing more (using more CPUs). I just can't figure out how.
Thijs
-
Re: Machine utilization while indexing
Chris Hostetter 2010-05-20, 19:14
I'm really only guessing here, but based on your description of what you are doing it sounds like you only have one thread streaming documents to Solr (via a single StreamingUpdateSolrServer instance which creates a single HTTP connection).
Have you at all attempted to have parallel threads in your client initiate parallel connections to Solr via multiple instances of StreamingUpdateSolrServer objects?
-Hoss
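The idea of several client threads, each with its own connection, could be sketched roughly as follows. This is only an illustration under stated assumptions: DocSender and ParallelFeeder are hypothetical names, DocSender stands in for one client connection (for example one SolrJ server instance), and no real SolrJ calls are made. It assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for one client connection (e.g. one SolrJ server instance).
interface DocSender {
    void send(long docId);
}

final class ParallelFeeder {

    // Splits ids [0, totalDocs) across `threads` workers; each worker gets its own sender.
    // Returns the number of documents handed to the senders.
    static int feed(int totalDocs, int threads) {
        AtomicInteger sent = new AtomicInteger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<?>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                final int worker = t;
                // One sender per worker, so no connection is shared between threads.
                // Here the sender only counts; a real one would transmit the document.
                final DocSender sender = docId -> sent.incrementAndGet();
                results.add(pool.submit(() -> {
                    // Worker w takes ids w, w + threads, w + 2 * threads, ...
                    for (int id = worker; id < totalDocs; id += threads) {
                        sender.send(id);
                    }
                }));
            }
            for (Future<?> f : results) {
                f.get(); // wait for completion and surface worker failures
            }
        } catch (Exception e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
        return sent.get();
    }

    public static void main(String[] args) {
        System.out.println("sent=" + feed(1000, 4));
    }
}
```

Whether this helps in practice depends on where the bottleneck is; if the client spends most of its time building or serializing documents, more sender threads only help as far as that work is also spread across threads.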
-
RE: Machine utilization while indexing
Nagelberg, Kallin 2010-05-20, 19:17
StreamingUpdateSolrServer already has multiple threads and uses multiple connections under the covers. At least the API says 'Uses an internal MultiThreadedHttpConnectionManager to manage http connections'. The constructor allows you to specify the number of threads used: http://lucene.apache.org/solr/api/org/apache/solr/client/solrj/impl/StreamingUpdateSolrServer.html#StreamingUpdateSolrServer(java.lang.String, int, int)
-Kallin Nagelberg
-
RE: Machine utilization while indexing
Chris Hostetter 2010-05-20, 19:34
: StreamingUpdateSolrServer already has multiple threads and uses multiple
: connections under the covers. At least the api says ' Uses an internal
Hmmm... I think one of us misunderstands the point behind StreamingUpdateSolrServer and its internal threads/queues. (It's very possible that it's me.)
My understanding is that this allows it to manage the batching of multiple operations for you, reusing connections as it goes, so the queueSize is how many individual requests it buffers before sending the batch to Solr, and the threadCount controls how many batches it can send in parallel (in the event that one thread is still waiting for the response when the queue next fills up).
But if you are only using a single thread to feed SolrRequests to a single instance of StreamingUpdateSolrServer then there can still be lots of opportunities for Solr itself to be idle. As I said, it's not clear to me if you are using multiple threads to write to your StreamingUpdateSolrServer ... even if you reuse the same StreamingUpdateSolrServer instance, multiple threads in your client code may increase the throughput (assuming that at the moment the threads in StreamingUpdateSolrServer are largely idle).
But as I said ... this is all mostly a guess. I'm not intimately familiar with SolrJ.
-Hoss
-
Re: Machine utilization while indexing
Thijs 2010-05-25, 11:42
Hi all,
I did some further investigation and (after turning off some filters in YourKit) found that it was actually the machine sending the files to Solr that was slowing things down.
At first I couldn't find this, as it turned out that YourKit hides org.apache.* classes. When I removed this filter, it turned out that at least 50% of the CPU time was taken by org.apache.solr.client.solrj.util.ClientUtils.writeXML(SolrInputDocument, Writer). This was taking so much time that the commit queues were filling up on the client side instead of on the Solr server.
I have now switched back to my custom BlockingQueue with multiple CommonsHttpSolrServers that use the BinaryRequestWriter, and I'm now able to index 800,000 documents in 8 minutes (including optimize) and 2.9 million documents in 32 minutes (incl. optimize). As the StreamingUpdateSolrServer only supports XML, I can't use that.
So now I wonder why BinaryRequestWriter (and BinaryUpdateRequestHandler) aren't turned on by default (esp. considering some threads on the dev list some time ago about setting a default schema for optimum performance). Also, finding out about this performance enhancement wasn't easy, as it's hardly mentioned on the Wiki. I'll see if I can update this.
Thanks for all the advice and esp. the great work on Solr & Lucene.
Thijs
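As a rough illustration of why a text encoding can cost more than a binary one for documents that are mostly long IDs, the sketch below compares the encoded size of the same IDs in an XML-like form and in a plain binary form. It is not a measurement of SolrJ: the XML layout is only an approximation of an update document, the binary layout is java.io.DataOutputStream output and NOT the javabin format, and EncodingCost is a hypothetical name. Size is used here only as a proxy; the profile described above was about CPU time.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

final class EncodingCost {

    // Text form: one XML-like field element per id (illustrative, not SolrJ's exact output).
    static byte[] asXml(long[] ids) {
        StringBuilder sb = new StringBuilder("<doc>");
        for (long id : ids) {
            sb.append("<field name=\"listId\">").append(id).append("</field>");
        }
        return sb.append("</doc>").toString().getBytes(StandardCharsets.UTF_8);
    }

    // Binary form: a 4-byte count followed by 8 raw bytes per id (NOT the javabin format).
    static byte[] asBinary(long[] ids) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bytes);
            out.writeInt(ids.length);
            for (long id : ids) {
                out.writeLong(id);
            }
            out.flush();
            return bytes.toByteArray();
        } catch (IOException e) {
            // Not expected for an in-memory stream.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        long[] ids = new long[1000];
        for (int i = 0; i < ids.length; i++) {
            ids[i] = 1_000_000_000L + i;
        }
        System.out.println("xml bytes:    " + asXml(ids).length);
        System.out.println("binary bytes: " + asBinary(ids).length);
    }
}
```

The text form also has to format every number as digits and wrap it in markup, which is work the binary form skips; how much that matters for a real client would need profiling, as was done in this thread.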
-
Re: Machine utilization while indexing
Chris Hostetter 2010-05-27, 04:41
: So now I wonder why BinaryRequestWriter (and BinaryUpdateRequestHandler)
: aren't turned on by default. (eps considering some threads on the dev-list
I don't really understand this question -- the BinaryUpdateRequestHandler is registered with the path /update/javabin in the example solrconfig.xml -- that's about as close to turning something on by "default" as solr supports.
-Hoss
-
Re: Machine utilization while indexing
Thijs 2010-05-27, 08:12
Sorry I missed it in the solrconfig.xml (my bad). I wasn't looking for it in the right place.
Thijs
-
RE: Machine utilization while indexing
Nagelberg, Kallin 2010-05-20, 15:16
How about throwing a BlockingQueue, http://java.sun.com/j2se/1.5.0/docs/api/java/util/concurrent/BlockingQueue.html, between your document creator and SolrServer? Give it a size of 10,000 or something, with one thread trying to feed it, and one thread waiting for it to get near full then draining it. Take the drained results and add them to the server (maybe try not using StreamingUpdateSolrServer). Something like that worked well for me with about 5,000,000 documents, each ~5k, taking about 8 hours.
-Kallin Nagelberg
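A bounded queue between one producer thread and one draining consumer might look like the sketch below. It is a variation rather than the exact setup described: the consumer drains whatever is ready instead of waiting for the queue to get near full, DrainingLoader and the END sentinel are hypothetical, and the place where a real loader would send the batch to the server is only marked by a comment. It assumes Java 8 or later.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

final class DrainingLoader {

    // Sentinel marking the end of input; assumes real document ids never use this value.
    static final long END = Long.MIN_VALUE;

    // Produces ids [0, totalDocs) on one thread and drains them in batches on the caller's thread.
    // Returns the sizes of the batches that would have been sent.
    static List<Integer> run(int totalDocs, int capacity, int maxBatch) {
        BlockingQueue<Long> queue = new ArrayBlockingQueue<>(capacity);
        List<Integer> batchSizes = new ArrayList<>();

        Thread producer = new Thread(() -> {
            try {
                for (long id = 0; id < totalDocs; id++) {
                    queue.put(id); // blocks while the queue is full (back-pressure)
                }
                queue.put(END);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        try {
            List<Long> batch = new ArrayList<>(maxBatch);
            boolean done = false;
            while (!done) {
                batch.clear();
                batch.add(queue.take());            // wait for at least one document
                queue.drainTo(batch, maxBatch - 1); // then take whatever else is ready
                if (batch.remove(Long.valueOf(END))) {
                    done = true; // the sentinel is always the last element queued
                }
                if (!batch.isEmpty()) {
                    batchSizes.add(batch.size()); // a real loader would send the batch here
                }
            }
            producer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        return batchSizes;
    }

    public static void main(String[] args) {
        int total = 0;
        for (int size : run(10_000, 1_000, 500)) {
            total += size;
        }
        System.out.println("documents drained: " + total);
    }
}
```

The bounded capacity is what keeps memory in check: when the consumer falls behind, put blocks the producer instead of letting documents pile up.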
-
RE: Machine utilization while indexing
Dennis Gearon 2010-05-20, 15:45
Here is a good article from IBM, with code, on how to do hybrid/cloud computing:
http://www.ibm.com/developerworks/library/x-cloudpt1/
Dennis Gearon
Signature Warning
----------------
EARTH has a Right To Life, otherwise we all die.
Read 'Hot, Flat, and Crowded'
Laugh at http://www.yert.com/film.php
-
RE: Machine utilization while indexing
Dennis Gearon 2010-05-20, 15:25
It takes that long to do indexing? I'm HOPING to have a site that has low tens of millions of documents to billions.
Sounds to me like I will DEFINITELY need a cloud account at indexing time. For the original author of this thread, that's what I'd recommend.
1/ Optimize as best as you can on one machine.
2/ Set up an Amazon EC2 (Elastic Compute Cloud) account. Spawn/shard the indexing over to 5-10 machines during indexing. Combine the index, shut down the EC2 instances. Probably could get it down to 1/2 hour, without impacting your current queries.
Dennis Gearon
-
Re: Machine utilization while indexing
Thijs 2010-05-20, 15:29
Why would I need faster hardware if my current hardware isn't reaching its maximum capacity?
I'm already using separate machines for querying and indexing, so while indexing the queries aren't affected. Pulling an optimized snapshot isn't even noticeable on the query machines.
Thijs
-
RE: Machine utilization while indexing
Nagelberg, Kallin 2010-05-20, 15:33
Well, to be fair, I'm indexing on a modest virtualized machine with only 2 GB of RAM, and a doc size of 5-10k may be substantially larger than what you have. They could be substantially smaller too. As another point of reference, my index ends up being about 20 GB with the 5 million docs. I should also point out I only need to do this once; I'm not constantly reindexing everything. My indexed documents rarely change, and when they do we have a process that selectively updates those few that need it. Combine that with a constant trickle of new documents and indexing performance isn't much of a concern.
You should be able to experiment with a small subset of your documents to speedily test new schemas, etc. In my case I selected a representative sample and store them in my project for unit testing.
-Kallin Nagelberg
-
Re: Machine utilization while indexing
Thijs 2010-05-20, 15:25
I already have a BlockingQueue in place (that's my custom queue) and luckily I'm indexing faster than you are. Currently it takes about 2 hours to index the 5M documents I'm talking about. But I still feel as if my machine is underutilized.
Thijs
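To put the two run times mentioned so far on a common scale, the small calculation below converts them into documents per second and milliseconds per document. The inputs are the figures quoted in the thread (5 million documents in about 2 hours, and about 5 million in about 8 hours); Throughput is a hypothetical helper name, and the results are averages over a whole run, not per-request latencies.

```java
final class Throughput {

    // Documents per second for a run of `docs` documents taking `minutes` minutes.
    static double docsPerSecond(long docs, double minutes) {
        return docs / (minutes * 60.0);
    }

    // Average milliseconds spent per document for the same run.
    static double millisPerDoc(long docs, double minutes) {
        return (minutes * 60_000.0) / docs;
    }

    public static void main(String[] args) {
        // Figures quoted in the thread: 5M docs in about 2 hours vs. about 8 hours.
        System.out.printf("2h run: %.0f docs/s, %.2f ms/doc%n",
                docsPerSecond(5_000_000L, 120), millisPerDoc(5_000_000L, 120));
        System.out.printf("8h run: %.0f docs/s, %.2f ms/doc%n",
                docsPerSecond(5_000_000L, 480), millisPerDoc(5_000_000L, 480));
    }
}
```

That works out to roughly 694 documents per second (about 1.44 ms each) for the 2-hour run and roughly 174 per second (about 5.76 ms each) for the 8-hour run.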
-
RE: Machine utilization while indexing
Nagelberg, Kallin 2010-05-20, 15:36
You're sure it's not blocking on indexing IO? If not, then I guess it must be a thread waiting unnecessarily in Solr or your loading program. To get my loader running at full speed I hooked it up to JProfiler's thread views to see where the stalls were and optimized from there.
-Kallin Nagelberg