Re: Parallel indexing in Solr
Per Steffensen 2012-02-06, 12:53
See response below
Erick Erickson wrote:
> Unfortunately, the answer is "it depends(tm)".
> First question: How are you indexing things? SolrJ? post.jar?
> But some observations:
> 1> sure, using multiple cores will have some parallelism. So will
> using a single core but using something like SolrJ and
So SolrJ with CommonsHttpSolrServer will not support handling several
> Especially with trunk (4.0)
> and the Document Writer Per Thread stuff.
We are using trunk (4.0). Can you provide me with a little more info on
this "Document Writer Per Thread stuff"? A link or something?
> In 3.x, you'll
> see some pauses when segments are merged that you
> can't get around (per core). See:
> for an excellent writeup. But whether or not you use several
> cores should be determined by your problem space, certainly
> not by trying to increase the throughput. Indexing usually
> takes a back seat to search performance.
We will have few searches, but a lot of indexing.
> 2> general settings are hard to come by. If you're sending
> structured documents that use Tika to parse the data
> behind the scenes, your performance will be much
> different (slower) than sending SolrInputDocuments
We are sending SolrInputDocuments
> 3> The recommended servlet container is, generally,
> "The one you're most comfortable with". Tomcat is
> certainly popular. That said, use whatever you're
> most comfortable with until you see a performance
> problem. Odds are you'll find your load on Solr is
> at its limit before your servlet container has problems.
So Jetty is not an "easy to use, but low-performance" container?
> 4> Monitor your CPU, fire more requests at it until it
> hits 100%. Note that there are occasions where the
> servlet container limits the number of outstanding
> requests it will allow and queues ones over that
> limit (find the magic setting to increase this if it's a
> problem, it differs by container). If you start to see
> your response times lengthen but the CPU not being
> fully utilized, that may be the cause.
Actually, right now I am trying to find out what my bottleneck is. The
setup is more complex than I would bother you with, but basically I
have servers with 80-90% IO-wait and only 5-10% "real CPU usage". It
might not be a Solr-related problem. I am investigating different
things, but just wanted to know a little more about how Jetty/Solr works
in order to make a qualified guess.
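
To make the "fire more requests at it" advice in point 4 concrete, here is a minimal sketch of a ramp-up test: send the same batches with an increasing number of client threads and see where throughput stops improving. The function names and the fake_send_batch stand-in are hypothetical, not part of Solr or SolrJ; in a real test, send_batch would post a batch of documents to the Solr update handler and return the number of documents sent.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(send_batch, batches, workers):
    """Send all batches using `workers` threads and return docs per second."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each call to send_batch returns the number of docs it sent.
        sent = sum(pool.map(send_batch, batches))
    elapsed = time.perf_counter() - start
    return sent / elapsed


def fake_send_batch(batch):
    # Hypothetical stand-in for a real update request to Solr:
    # it only sleeps to simulate request latency.
    time.sleep(0.01)
    return len(batch)


if __name__ == "__main__":
    batches = [[{"id": str(i)}] * 100 for i in range(40)]
    for workers in (1, 2, 4, 8):
        rate = measure_throughput(fake_send_batch, batches, workers)
        print(f"{workers} workers: {rate:,.0f} docs/sec")
```

If throughput flattens while the CPU is still mostly idle, that would be consistent with either the container queueing requests or, as in the numbers above, the disks being the limit.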
> 5> How high is "high performance"? On a stock solr
> with the Wikipedia dump (11M docs), all running on
> my laptop, I see 7K docs/sec indexed. I know of
> installations that see 60 docs/sec or even less. I'm
> sending simple docs with SolrJ locally and they're
> sending huge documents over the wire that Tika
> handles. There are just so many variables it's hard
> to say anything except "try it and see"......
Well, eventually we need to be able to index and delete about 50 million
documents per day. We will need to keep a "history" of 2 years of data
in our system, so deletion will not start before we have been in production
for 2 years. At that point in time the system needs to contain 2 years *
365 days/year * 50 million docs/day = 36.5 billion documents. From then on,
50 million documents need to be deleted and indexed per day; before that we
only need to index 50 million documents per day. We are aware that we are
probably going to need a certain amount of hardware for this, but the most
important thing is that we make a scalable setup so that we can get to
these kinds of numbers at all. Right now I am focusing on getting the most out
of one Solr instance, potentially with several cores, though.
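
As a quick check of the arithmetic above, the sketch below recomputes the steady-state corpus size and converts the daily volume into an average per-second rate. It assumes, purely for illustration, that the load is spread evenly over 24 hours; real traffic will likely have peaks well above the average.

```python
DOCS_PER_DAY = 50_000_000
RETENTION_YEARS = 2
DAYS_PER_YEAR = 365
SECONDS_PER_DAY = 24 * 60 * 60

# Corpus size once the 2-year history is full.
total_docs = RETENTION_YEARS * DAYS_PER_YEAR * DOCS_PER_DAY

# Average indexing rate, assuming an evenly spread load (an assumption).
avg_index_rate = DOCS_PER_DAY / SECONDS_PER_DAY

# After 2 years, every day brings one delete per indexed document.
steady_state_ops = 2 * avg_index_rate

print(f"steady-state corpus: {total_docs:,} docs")
print(f"average indexing rate: {avg_index_rate:.0f} docs/sec")
print(f"index + delete after 2 years: {steady_state_ops:.0f} ops/sec")
```

That works out to 36.5 billion documents and roughly 579 docs/sec on average, or about 1157 operations/sec once deletions start.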