Solr, mail # user - Parallel indexing in Solr


Re: Parallel indexing in Solr
Per Steffensen 2012-02-06, 12:53
See response below

Erick Erickson wrote:
> Unfortunately, the answer is "it depends(tm)".
>
> First question: How are you indexing things? SolrJ? post.jar?
>  
SolrJ, CommonsHttpSolrServer
> But some observations:
>
> 1> sure, using multiple cores will have some parallelism. So will
>     using a single core but using something like SolrJ and
>     StreamingUpdateSolrServer.
So SolrJ with CommonsHttpSolrServer will not support handling several
requests concurrently?
>  Especially with trunk (4.0)
>      and the Document Writer Per Thread stuff.
We are using trunk (4.0). Can you provide me with a little more info on
this "Document Writer Per Thread stuff"? A link or something?
>  In 3.x, you'll
>      see some pauses when segments are merged that you
>      can't get around (per core). See:
>      http://www.searchworkings.org/blog/-/blogs/gimme-all-resources-you-have-i-can-use-them!/
>      for an excellent writeup. But whether or not you use several
>      cores should be determined by your problem space, certainly
>      not by trying to increase the throughput. Indexing usually
>      takes a back seat to search performance.
>  
We will have few searches, but a lot of indexing.
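
For an indexing-heavy workload like this, the client-side parallelism from point 1 can be sketched roughly as below. This is only a minimal sketch, not how SolrJ itself does it: `BatchSender` and `ParallelIndexer` are hypothetical names made up for illustration, and documents are plain strings so the example runs without SolrJ on the classpath. With SolrJ, the assumption is that `send` would wrap a call such as `server.add(docs)` on a client instance shared between the worker threads.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical stand-in for a shared client; not a SolrJ type. */
interface BatchSender {
    void send(List<String> batch) throws Exception;
}

class ParallelIndexer {
    /** Splits docs into batches, sends them from a fixed pool of threads, returns docs sent. */
    static long indexAll(List<String> docs, int batchSize, int threads, BatchSender sender) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong sent = new AtomicLong();
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (int start = 0; start < docs.size(); start += batchSize) {
                // Copy the slice so each task owns its own batch.
                List<String> batch = new ArrayList<>(
                        docs.subList(start, Math.min(start + batchSize, docs.size())));
                pending.add(pool.submit(() -> {
                    sender.send(batch);
                    sent.addAndGet(batch.size());
                    return null; // returning a value makes this a Callable
                }));
            }
            for (Future<?> f : pending) {
                f.get(); // waits for the batch; a failed send surfaces here
            }
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("indexing batch failed", e);
        } finally {
            pool.shutdown();
        }
        return sent.get();
    }
}
```

The batch size and thread count are the knobs to experiment with; suitable values depend on document size and on what the server can absorb.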
> 2> general settings are hard to come by. If you're sending
>       structured documents that use Tika to parse the data
>       behind the scenes, your performance will be much
>       different (slower) than sending SolrInputDocuments
>      (SolrJ).
>  
We are sending SolrInputDocuments
> 3> The recommended servlet container is, generally,
>       "The one you're most comfortable with". Tomcat is
>       certainly popular. That said, use whatever you're
>       most comfortable with until you see a performance
>      problem. Odds are you'll find your load on Solr is
>      at its limit before your servlet container has problems.
>  
So Jetty is not an "easy to use, but non-performance" container?
> 4> Monitor your CPU, fire more requests at it until it
>      hits 100%. Note that there are occasions where the
>     servlet container limits the number of outstanding
>      requests it will allow and queues ones over that
>      limit (find the magic setting to increase this if it's a
>      problem, it differs by container). If you start to see
>      your response times lengthen but the CPU not being
>     fully utilized, that may be the cause.
>  
Actually, right now I am trying to find out what my bottleneck is. The
setup is more complex than I want to bother you with, but basically I
have servers with 80-90% IO-wait and only 5-10% "real CPU usage". It
might not be a Solr-related problem, I am investigating different
things, but just wanted to know a little more about how Jetty/Solr works
in order to make a qualified guess.
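
The "fire more requests at it" approach from point 4 could be sketched as a simple ramp: keep doubling the number of client threads and stop once throughput no longer improves. This is only an illustration under assumptions: `LoadRamp`, `findPlateau` and the `minGain` threshold are hypothetical, and `docsPerSecAt` stands for whatever measurement is actually run against the server at a given thread count.

```java
import java.util.function.IntToDoubleFunction;

class LoadRamp {
    /**
     * Doubles the client thread count until the measured throughput improves by
     * less than minGain (0.10 means 10%) and returns the last count that helped.
     */
    static int findPlateau(IntToDoubleFunction docsPerSecAt, int maxThreads, double minGain) {
        int best = 1;
        double bestRate = docsPerSecAt.applyAsDouble(1);
        for (int threads = 2; threads <= maxThreads; threads *= 2) {
            double rate = docsPerSecAt.applyAsDouble(threads);
            if (rate < bestRate * (1.0 + minGain)) {
                break; // more client threads no longer pay off
            }
            best = threads;
            bestRate = rate;
        }
        return best;
    }
}
```

If the plateau is reached while the CPU is far from 100%, something other than CPU is the limit, for example the container's request queue that Erick mentions or, as in the numbers above, IO-wait.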
> 5> How high is "high performance"? On a stock solr
>      with the Wikipedia dump (11M docs), all running on
>      my laptop, I see 7K docs/sec indexed. I know of
>      installations that see 60 docs/sec or even less. I'm
>     sending simple docs with SolrJ locally and they're
>      sending huge documents over the wire that Tika
>      handles. There are just so many variables it's hard
>      to say anything except "try it and see"......
>  
Well, eventually we need to be able to index and delete about 50 million
documents per day. We will need to keep a "history" of 2 years of data
in our system, so deletion will not start before we have been in
production for 2 years. At that point the system needs to contain
2 years * 365 days/year * 50 million docs/day = 36.5 billion documents.
From then on, 50 million documents need to be deleted and indexed per
day; before that we only need to index 50 million documents per day. We
are aware that we are probably going to need a certain amount of
hardware for this, but the most important thing is that we make a
scalable setup so that we can get to these kinds of numbers at all.
Right now, though, I am focusing on getting the most out of one Solr
instance, potentially with several cores.
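
As a rough cross-check of these numbers, the retained document count and the average rate needed can be computed as below. The class and method names are made up for illustration, and the per-second figures assume load spread evenly over 24 hours, which a real system may not have.

```java
class IndexingLoad {
    static final long DOCS_PER_DAY = 50_000_000L;
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    /** Documents held after the given number of years with no deletes. */
    static long retainedDocs(int years) {
        return years * 365L * DOCS_PER_DAY;
    }

    /** Average operations per second for a daily operation count. */
    static double perSecond(long opsPerDay) {
        return (double) opsPerDay / SECONDS_PER_DAY;
    }

    public static void main(String[] args) {
        System.out.println(retainedDocs(2));              // 36500000000
        System.out.printf("%.0f adds/sec%n", perSecond(DOCS_PER_DAY));
        // Assumption: after 2 years, 50 million deletes on top of 50 million adds.
        System.out.printf("%.0f ops/sec%n", perSecond(2 * DOCS_PER_DAY));
    }
}
```

That works out to roughly 579 adds per second on average, or about 1,157 operations per second once deletes start.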