Lucene >> mail # user >> How best to handle a reasonable amount to data (25TB+)

Re: How best to handle a reasonable amount to data (25TB+)
I'm all confused. 100M x 13 shards = 1.3G records, not 1.25T.

But I get it: 1.5 x 10^7 records/month x 12 months x 7 years = 1.26 x 10^9 = 1.26 billion. Or am I off base again? But yes, at 100M records/shard that would be 13 servers.
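The arithmetic above is easy to check with a quick script (the 100M-docs-per-shard ceiling is the figure assumed elsewhere in this thread, not a hard Lucene limit):

```python
# Back-of-envelope check of the record and shard counts discussed above.
RECORDS_PER_MONTH = 15_000_000   # 1.5 x 10^7 audit records/month
MONTHS = 12 * 7                  # seven years of retention
DOCS_PER_SHARD = 100_000_000     # assumed practical ceiling per shard

total_records = RECORDS_PER_MONTH * MONTHS
print(total_records)             # 1260000000 -> 1.26 billion, not 1.25 trillion

shards = -(-total_records // DOCS_PER_SHARD)   # ceiling division
print(shards)                    # 13
```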

As for whether 100M documents/shard is reasonable... it depends (tm). There are so many variables that the *only* way to find out is to try it with *your* data and *your* queries. Otherwise it's just guessing. Are you faceting? Sorting? Do you have 10 unique terms/field? 10M unique terms? 10B unique terms? All of that goes into the mix in determining how many documents a shard can hold while still giving adequate performance.
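The "try it with your data and your queries" advice can be wrapped in a small timing harness. This is just a sketch: `run_query` here is a hypothetical callable you would replace with a real search against your own Solr/Lucene setup.

```python
import statistics
import time

def benchmark(run_query, queries, warmup=5):
    """Time each query in seconds, after a short warm-up pass."""
    for q in queries[:warmup]:
        run_query(q)                     # warm caches before measuring
    latencies = []
    for q in queries:
        start = time.perf_counter()
        run_query(q)
        latencies.append(time.perf_counter() - start)
    return latencies

def summarize(latencies):
    """Median and 95th-percentile latency -- the numbers that matter for sizing."""
    ordered = sorted(latencies)
    p95 = ordered[max(0, int(0.95 * len(ordered)) - 1)]
    return statistics.median(latencies), p95
```

Run it with a realistic sample of your documents loaded and your actual production queries; rerun at different shard sizes until the p95 crosses your latency budget.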

Not to mention the question of what the hardware is. A MacBook Air with 4G of memory? A monster piece of metal with a bazillion gigs of memory and SSDs?

All that said, and especially with trunk, 100M documents/shard is quite possible. So is 10M docs/shard. And it's not really the size of the documents alone that determines the requirements; it's this weird calculation of how many docs, how many unique terms/doc, and how you're searching them. I expect your documents are quite small, so that may help. Some.

Try filling out the spreadsheet here:
and you'll swiftly find out how hard abstract estimations are....


On Tue, Feb 7, 2012 at 9:07 PM, Peter Miller wrote:
> Oops again! Turns out I got to the right result earlier by the wrong means! I found this reference (http://www.dejavutechnologies.com/faq-solr-lucene.html) that states shards can be up to 100,000,000 documents. So, I'm back to 13 shards again. Phew!
> Now I'm just wondering if Cassandra/Lucandra would be a better option anyway. If Cassandra offers some of the same advantages as the OpenStack Swift object store, then it should be the way to go.
> Still looking for thoughts...
> Thanks, The Captn
> -----Original Message-----
> From: Peter Miller [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, 8 February 2012 12:20 PM
> Subject: RE: How best to handle a reasonable amount to data (25TB+)
> Whoops! Very poor basic maths, I should have written it down. I was thinking 13 shards. But yes, 13,000 is a bit different. Now I'm in even more need of help.
> The "how much" is easy: 15 million audit records a month, coming from several active systems, and a requirement to keep and search across seven years of data.
> <Goes off to do more googling>
> Thanks a lot,
> The Captn
> -----Original Message-----
> From: Erick Erickson [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, 8 February 2012 12:39 AM
> Subject: Re: How best to handle a reasonable amount to data (25TB+)
> I'm curious what the nature of your data is such that you have 1.25 trillion documents. Even at 100M/shard, you're still talking 12,500 shards. The "laggard" problem will rear its ugly head, not to mention that administering that many machines will be, shall we say, non-trivial...
> Best
> Erick
> On Mon, Feb 6, 2012 at 11:17 PM, Peter Miller <[EMAIL PROTECTED]> wrote:
>> Thanks for the response. Actually, I am more concerned with trying to use an Object Store for the indexes. The next concern is the use of a local index versus the sharded ones, but I'm more relaxed about that now after thinking about it. I see that index shards could be up to 100 million documents, so that makes the 1.25 trillion number look reasonable.
>> Any other thoughts?
>> Thanks,
>> The Captn.
>> -----Original Message-----
>> From: ppp c [mailto:[EMAIL PROTECTED]]
>> Sent: Monday, 6 February 2012 5:29 PM
>> Subject: Re: How best to handle a reasonable amount to data (25TB+)
> It sounds like an issue with your app's logic rather than with Lucene.
> If you're afraid of having too many docs in one index, you can make multiple indexes.
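The "make multiple indexes" suggestion above amounts to fanning a query out to several smaller indexes and merging the top hits. A toy sketch of that idea, with plain dicts standing in for per-shard indexes and term frequency as a stand-in score (a real deployment would use Lucene's MultiReader or Solr's distributed search instead):

```python
import heapq

# Toy stand-in for a per-shard index: doc_id -> document text.
def search_one(index, term):
    """Return (score, doc_id) hits from one index; score is just term frequency."""
    return [(text.count(term), doc_id)
            for doc_id, text in index.items() if term in text]

def search_all(indexes, term, top_k=10):
    """Fan the query out to every index and merge the best hits overall."""
    hits = []
    for idx in indexes:
        hits.extend(search_one(idx, term))
    return heapq.nlargest(top_k, hits)

shards = [
    {"a1": "audit login audit", "a2": "payment"},
    {"b1": "audit logout"},
]
print(search_all(shards, "audit", top_k=2))   # [(2, 'a1'), (1, 'b1')]
```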