Solr, mail # user - How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?


Daniel Bruegge 2012-01-18, 22:59
Darren Govoni 2012-01-19, 00:21
Mark Miller 2012-01-18, 23:14
Daniel Bruegge 2012-01-18, 23:44
Otis Gospodnetic 2012-01-19, 03:51
Daniel Bruegge 2012-01-19, 10:49
Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?
Otis Gospodnetic 2012-01-20, 04:40
Hi Daniel,
----- Original Message -----
> From: Daniel Bruegge <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]; Otis Gospodnetic <[EMAIL PROTECTED]>
> Cc:
> Sent: Thursday, January 19, 2012 5:49 AM
> Subject: Re: How can a distributed Solr setup scale to TB-data, if URL limitations are 4000 for distributed shard search?
>
> On Thu, Jan 19, 2012 at 4:51 AM, Otis Gospodnetic <[EMAIL PROTECTED]> wrote:
>>
>>  Huge is relative. ;)
>>  Huge Solr clusters also often have huge hardware. Servers with 16 cores
>>  and 32 GB RAM are becoming very common, for example.
>>  Another thing to keep in mind is that while lots of organizations have
>>  huge indices, only some portions of them may be hot at any one time. We've
>>  had a number of clients who index social media or news data and while all
>>  of them have giant indices, typically only the most recent data is really
>>  actively searched.
>
> So let's say I have, for example, an index of 100 GB with millions of
> documents, but 99% of the queries only hit the latest 200,000 documents in
> the index. Can I easily handle this on a machine which is not so powerful?
> So by 'hot' you mean a subset of the whole index. You don't mean that
> there is, e.g., one huge archive index and an active index in separate Solr
> instances?

That's correct, I'm not referring to one huge archive index and one smaller active index.

Otis

----
Performance Monitoring SaaS for Solr - http://sematext.com/spm/solr-performance-monitoring/index.html
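
A rough way to put numbers on the "hot parts should fit in RAM" point, using Daniel's example of a 100 GB index where most queries hit the latest 200,000 documents. This is only a back-of-the-envelope sketch. The total document count (10 million), the 8 GB machine, the 2 GB heap and the 1 GB reserve are assumptions made up for illustration, and so is the idea that index bytes are spread evenly across documents (real Lucene indexes share structures across documents, so treat the result as a ballpark at best).

```python
def hot_set_estimate(index_gb, total_docs, hot_docs, ram_gb, heap_gb, reserved_gb=1.0):
    """Estimate the hot slice of an index and the RAM left for the OS page cache.

    All inputs are illustrative assumptions, not measurements.
    """
    # Crude assumption: index bytes are spread evenly across documents.
    hot_gb = index_gb * hot_docs / total_docs
    # RAM the OS can use for caching once the JVM heap and a small reserve are taken out.
    cache_gb = ram_gb - heap_gb - reserved_gb
    return hot_gb, cache_gb, hot_gb <= cache_gb


# Hypothetical machine: 8 GB RAM, 2 GB JVM heap; hypothetical index: 10 million docs.
hot_gb, cache_gb, fits = hot_set_estimate(
    index_gb=100, total_docs=10_000_000, hot_docs=200_000, ram_gb=8, heap_gb=2
)
print(f"hot set ~{hot_gb:.1f} GB, page cache ~{cache_gb:.1f} GB, fits: {fits}")
# prints: hot set ~2.0 GB, page cache ~5.0 GB, fits: True
```

Under these assumed numbers the hot slice is far smaller than the full index, which is why a modest heap plus plenty of RAM left to the OS can be enough.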

>>  > Because I also often read that the index size of one shard
>>  > should fit into RAM.
>>
>>  Nah.  Don't take this as "the whole index needs to fit in RAM".  Just
>>  "the hot parts of the index should fit in RAM".  This is related to what
>>  I wrote above.
>>
>
> Ah, ok. Good to know. I always tried to split the index over multiple
> shards, because I noticed a big performance loss when I tried to put it
> on one machine. But maybe this is also connected to the 'hot' and
> 'not hot' parts. Thanks.
>
>
>>
>>  > Or at least the heap size should be as big as the
>>  > index size. So I see a lot of limitations hardware-wise. Or am I on
>>  > the totally wrong track?
>>
>>  Regarding heap - nah, that's not correct.  The heap is usually much
>>  smaller than the index, and the rest of the RAM is left to the OS to use
>>  for data caching.
>>
>
> Oh, ok. Thanks for this information. Maybe I can tweak the settings a
> bit then. But I got several GC errors etc., so I am always trying to modify
> all these heap/GC settings. I haven't found the perfect settings so far.
>
> Thanks.
>
> Daniel
>
>
>>
>>  Otis
>>  ----
>>  Performance Monitoring SaaS for Solr -
>>  http://sematext.com/spm/solr-performance-monitoring/index.html
>>
>>
>>
>>  >On Thu, Jan 19, 2012 at 12:14 AM, Mark Miller <[EMAIL PROTECTED]> wrote:
>>  >
>>  >> You can raise the limit to a point.
>>  >>
>>  >> On Jan 18, 2012, at 5:59 PM, Daniel Bruegge wrote:
>>  >>
>>  >> > Hi,
>>  >> >
>>  >> > I am just wondering how I can 'grow' a distributed Solr setup to an
>>  >> > index size of a couple of terabytes, when one of the distributed Solr
>>  >> > limitations is the max. 4000 character URI limit. See:
>>  >> >
>>  >> >> *The number of shards is limited by number of characters allowed for
>>  >> >> GET method's URI; most Web servers generally support at least 4000
>>  >> >> characters, but many servers limit URI length to reduce their
>>  >> >> vulnerability to Denial of Service (DoS) attacks.*
>>  >> >
>>  >> >> *(via
>>  >> >> http://lucidworks.lucidimagination.com/display/solr/Distributed+Search+with+Index+Sharding
>>  >> >> )*
>>  >> >
>>  >> > Is the only way then to make multiple distributed Solr clusters and
>>  >> > query them independently and merge them in application code?
>>  >> >
>>  >> > Thanks. Daniel
>>  >> >
>>  >> > Thanks. Daniel
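
To get a feel for how quickly the shards parameter eats into a 4000-character URI, the length of a distributed query URL can simply be measured. The sketch below is a rough estimate only: the host names, port and paths are hypothetical, and the actual limit depends on the servlet container configuration (as Mark notes, it can be raised to a point). It builds a `shards=host:port/path,...` list, which is the comma-separated form distributed search expects, and counts how many such entries fit.

```python
from urllib.parse import urlencode


def distributed_query_url(base_url, shards, query="*:*"):
    """Build a GET URL carrying a comma-separated shards parameter."""
    params = {"q": query, "shards": ",".join(shards)}
    # Keep ':', '/', ',' and '*' unescaped so the length matches a hand-written URL;
    # a client that percent-encodes them would produce an even longer URL.
    return base_url + "?" + urlencode(params, safe=":/,*")


def max_shards_within(limit, base_url, shard_template, query="*:*"):
    """Return how many shard entries fit before the URL exceeds `limit` characters."""
    count = 0
    while True:
        candidate = [shard_template.format(i) for i in range(1, count + 2)]
        if len(distributed_query_url(base_url, candidate, query)) > limit:
            return count
        count += 1


# Hypothetical host naming scheme, for illustration only.
BASE = "http://search01.example.com:8983/solr/select"
TEMPLATE = "shard{:03d}.example.com:8983/solr"

n = max_shards_within(4000, BASE, TEMPLATE)
print(n, "shards fit in a 4000-character URL")
```

Shorter host names or a higher container limit raise the count, so the 4000-character figure is a bound on the length of the shard list and not on index size as such.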