Per Steffensen 2012-02-03, 08:55
Erick Erickson 2012-02-03, 14:29
Re: Parallel indexing in Solr
Per Steffensen 2012-02-06, 12:53
See responses below.

Erick Erickson wrote:
> Unfortunately, the answer is "it depends(tm)".
>
> First question: How are you indexing things? SolrJ? post.jar?

SolrJ, CommonsHttpSolrServer.

> But some observations:
>
> 1> sure, using multiple cores will have some parallelism. So will
> using a single core but using something like SolrJ and
> StreamingUpdateSolrServer.

So SolrJ with CommonsHttpSolrServer will not support handling several requests concurrently?

> Especially with trunk (4.0)
> and the Document Writer Per Thread stuff.

We are using trunk (4.0). Can you provide me with a little more info on this "Document Writer Per Thread stuff"? A link or something?

> In 3.x, you'll
> see some pauses when segments are merged that you
> can't get around (per core). See:
> http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
> for an excellent writeup. But whether or not you use several
> cores should be determined by your problem space, certainly
> not by trying to increase the throughput. Indexing usually
> takes a back seat to search performance.

We will have few searches, but a lot of indexing.

> 2> general settings are hard to come by. If you're sending
> structured documents that use Tika to parse the data
> behind the scenes, your performance will be much
> different (slower) than sending SolrInputDocuments
> (SolrJ).

We are sending SolrInputDocuments.

> 3> The recommended servlet container is, generally,
> "The one you're most comfortable with". Tomcat is
> certainly popular. That said, use whatever you're
> most comfortable with until you see a performance
> problem. Odds are you'll find your load on Solr is
> at its limit before your servlet container has problems.

So Jetty is not an "easy to use, but non-performant" container?

> 4> Monitor your CPU, fire more requests at it until it
> hits 100%. Note that there are occasions where the
> servlet container limits the number of outstanding
> requests it will allow and queues ones over that
> limit (find the magic setting to increase this if it's a
> problem, it differs by container). If you start to see
> your response times lengthen but the CPU not being
> fully utilized, that may be the cause.

Actually, right now I am trying to find out what my bottleneck is. The setup is more complex than I would bother you with, but basically I have servers with 80-90% IO-wait and only 5-10% "real CPU usage". It might not be a Solr-related problem, and I am investigating different things, but I just wanted to know a little more about how Jetty/Solr works in order to make a qualified guess.

> 5> How high is "high performance"? On a stock solr
> with the Wikipedia dump (11M docs), all running on
> my laptop, I see 7K docs/sec indexed. I know of
> installations that see 60 docs/sec or even less. I'm
> sending simple docs with SolrJ locally and they're
> sending huge documents over the wire that Tika
> handles. There are just so many variables it's hard
> to say anything except "try it and see"......

Well, eventually we need to be able to index and delete about 50 million documents per day. We will need to keep a "history" of 2 years of data in our system, so deletion will not start before we have been in production for 2 years. At that point the system needs to contain 2 years * 365 days/year * 50 million docs/day = 36.5 billion documents. From then on, 50 million documents need to be deleted and 50 million indexed per day; before that we only need to index 50 million documents per day. We are aware that we are probably going to need a certain amount of hardware for this, but the most important thing is that we make a scalable setup so that we can get to these kinds of numbers at all. Right now, though, I am focusing on getting the most out of one Solr instance, potentially with several cores.
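As a rough sanity check of the figures in this message, the following minimal sketch redoes the arithmetic. It assumes the daily load is spread evenly over 24 hours, which the message does not state, and the class and method names are purely illustrative.

```java
public class IndexingCapacity {
    // Total documents retained once the history window is full.
    static long totalDocuments(long docsPerDay, int years, int daysPerYear) {
        return docsPerDay * years * daysPerYear;
    }

    // Average sustained rate needed to handle docsPerDay within 24 hours
    // (assumption: load is spread evenly across the day).
    static double docsPerSecond(long docsPerDay) {
        return docsPerDay / 86_400.0;
    }

    public static void main(String[] args) {
        long docsPerDay = 50_000_000L;
        long total = totalDocuments(docsPerDay, 2, 365);
        double indexRate = docsPerSecond(docsPerDay);
        System.out.println("Documents after 2 years: " + total);
        System.out.println("Average indexing rate (docs/sec): " + indexRate);
        // After 2 years, each day brings as many deletes as new documents.
        System.out.println("Average rate incl. deletes (ops/sec): " + 2 * indexRate);
    }
}
```

Under that even-load assumption, 50 million documents per day works out to roughly 579 documents per second on average, so peak rates would need to be higher if the load is bursty.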
Erick Erickson 2012-02-06, 14:30
Per Steffensen 2012-02-06, 15:49
Sami Siren 2012-02-06, 15:21
Per Steffensen 2012-02-06, 15:55
Sami Siren 2012-02-07, 10:18
Per Steffensen 2012-02-07, 13:27
Erick Erickson 2012-02-06, 16:40