Re: How best to handle a reasonable amount of data (25TB+)
I'm all confused. 100M x 13 shards = 1.3G records, not 1.25T.

But I get it: 1.5 x 10^7 records/month x 12 months x 7 years = 1.26 x 10^9 =
1.26 billion, or am I off base again? But yes, at 100M records/shard that
would be 13 servers.
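Just to make that arithmetic concrete, here's a tiny sketch (the monthly
volume and retention figures are the ones quoted further down the thread;
the 100M docs/shard ceiling is only the working assumption being discussed):

# Back-of-envelope shard count from the figures in this thread
records_per_month = 15_000_000        # ~1.5 x 10^7 audit records/month
months = 12 * 7                       # seven years of retention
total_docs = records_per_month * months
print(f"total docs: {total_docs:,}")  # 1,260,000,000 -> ~1.26 billion

docs_per_shard = 100_000_000          # working assumption, 100M docs/shard
shards = -(-total_docs // docs_per_shard)   # ceiling division
print(f"shards needed: {shards}")     # 13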

As for whether 100M documents/shard is reasonable... it depends (tm).
There are so many variables
that the *only* way is to try it with *your* data and *your* queries.
Otherwise it's just guessing. Are you
faceting? Sorting? Do you have 10 unique terms/field? 10M unique
terms? 10B unique terms?
All that stuff goes into the mix to determine how many documents a shard
can hold and still get adequate performance.
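
The "try it" step doesn't have to be elaborate, either. Something along these
lines is usually enough to start: load a prototype shard with a representative
slice of your documents and time the query shapes you actually plan to run.
(This is only a sketch - the host, core name, and field names below are
placeholders, not anything from your setup.)

import json
import time
import urllib.parse
import urllib.request

# Hypothetical prototype shard; core and field names are placeholders.
SOLR = "http://localhost:8983/solr/audit_proto/select"

queries = [
    ("match-all baseline", {"q": "*:*", "rows": 10}),
    ("sorting",            {"q": "*:*", "rows": 10, "sort": "timestamp desc"}),
    ("faceting",           {"q": "*:*", "rows": 0, "facet": "true",
                            "facet.field": "source_system"}),
]

for label, params in queries:
    params["wt"] = "json"
    url = SOLR + "?" + urllib.parse.urlencode(params)
    started = time.time()
    with urllib.request.urlopen(url) as resp:
        body = json.load(resp)
    wall_ms = (time.time() - started) * 1000
    qtime = body["responseHeader"]["QTime"]     # Solr-reported query time, ms
    hits = body["response"]["numFound"]
    print(f"{label:20s} QTime={qtime}ms  wall={wall_ms:.0f}ms  hits={hits:,}")

Run that against shards loaded with 10M, 50M, 100M documents and you'll see
where *your* queries fall over, which is the only number that matters.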

Not to mention the question "what's the hardware?" A MacBook Air with 4G of
memory? A monster piece of metal with a bazillion gigs of memory and SSDs?

All that said, and especially with trunk, 100M documents/shard is quite
possible. So is 10M docs/shard. And it's not really the size of the documents
alone that determines the requirements; it's this weird calculation of how
many docs, how many unique terms/doc, and how you're searching them. I expect
your documents are quite small, so that may help. Some.

Try filling out the spreadsheet here:
http://www.lucidimagination.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/
and you'll swiftly find out how hard abstract estimations are....
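
If you want a really crude back-of-the-envelope pass before opening the
spreadsheet, it's only a couple of lines - but note that every constant here
is an assumption to be replaced with numbers measured from a sample of *your*
index, which is the spreadsheet's whole point:

# Crude size estimate; every figure is an assumed placeholder, not a measurement.
total_docs     = 1_260_000_000   # 15M/month x 12 months x 7 years, from above
avg_doc_kb     = 20              # assumed average size per indexed document
index_overhead = 0.35            # assumed ratio of index structures to raw data

raw_tb   = total_docs * avg_doc_kb / 1024 ** 3    # KB -> TB
index_tb = raw_tb * index_overhead
print(f"raw ~{raw_tb:.1f} TB, index ~{index_tb:.1f} TB, "
      f"total ~{raw_tb + index_tb:.1f} TB")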

Best
Erick

On Tue, Feb 7, 2012 at 9:07 PM, Peter Miller
<[EMAIL PROTECTED]> wrote:
> Oops again! Turns out I got to the right result earlier by the wrong means! I found this reference (http://www.dejavutechnologies.com/faq-solr-lucene.html) which states that shards can hold up to 100,000,000 documents. So, I'm back to 13 shards again. Phew!
>
> Now I'm just wondering if Cassandra/Lucandra would be a better option anyway. If Cassandra offers some of the same advantages as the OpenStack Swift object store does, then it should be the way to go.
>
> Still looking for thoughts...
>
> Thanks, The Captn
>
> -----Original Message-----
> From: Peter Miller [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, 8 February 2012 12:20 PM
> To: [EMAIL PROTECTED]
> Subject: RE: How best to handle a reasonable amount of data (25TB+)
>
> Whoops! Very poor basic maths, I should have written it down. I was thinking 13 shards. But yes, 13,000 is a bit different. Now I'm in even more need of help.
>
> The "how" is easy - 15 million audit records a month, coming from several active systems, and a requirement to keep and search across seven years of data.
>
> <Goes off to do more googling>
>
> Thanks a lot,
> The Captn
>
> -----Original Message-----
> From: Erick Erickson [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, 8 February 2012 12:39 AM
> To: [EMAIL PROTECTED]
> Subject: Re: How best to handle a reasonable amount of data (25TB+)
>
> I'm curious what the nature of your data is such that you have 1.25 trillion documents. Even at 100M/shard, you're still talking 12,500 shards. The "laggard"
> problem will rear its ugly
> head, not to mention the administration of that many machines will be, shall we say, non-trivial...
>
> Best
> Erick
>
> On Mon, Feb 6, 2012 at 11:17 PM, Peter Miller <[EMAIL PROTECTED]> wrote:
>> Thanks for the response. Actually, I am more concerned with trying to use an Object Store for the indexes. The next concern is the use of a local index versus the sharded ones, but I'm more relaxed about that now after thinking about it. I see that index shards could be up to 100 million documents, so that makes the 1.25 trillion number look reasonable.
>>
>> Any other thoughts?
>>
>> Thanks,
>> The Captn.
>>
>> -----Original Message-----
>> From: ppp c [mailto:[EMAIL PROTECTED]]
>> Sent: Monday, 6 February 2012 5:29 PM
>> To: [EMAIL PROTECTED]
>> Subject: Re: How best to handle a reasonable amount of data (25TB+)
>>
>> It sounds like this is not an issue with Lucene but with the logic of your app.
>> If you're worried about having too many docs in one index, you can split them across multiple indexes.
