|
Itamar Syn-Hershko
2011-06-11, 19:37
Erick Erickson
2011-06-12, 00:10
Shai Erera
2011-06-12, 03:43
Itamar Syn-Hershko
2011-06-12, 08:10
Shai Erera
2011-06-12, 08:42
Andrew Kane
2011-06-12, 08:45
Michael McCandless
2011-06-12, 10:16
Itamar Syn-Hershko
2011-06-12, 17:13
Itamar Syn-Hershko
2011-06-12, 17:14
Shai Erera
2011-06-12, 17:29
Itamar Syn-Hershko
2011-06-12, 18:25
Itamar Syn-Hershko
2011-06-12, 18:50
Michael McCandless
2011-06-12, 20:12
Itamar Syn-Hershko
2011-06-12, 20:46
Shai Erera
2011-06-13, 03:23
Michael McCandless
2011-06-13, 16:01
Itamar Syn-Hershko
2011-06-13, 21:02
Itamar Syn-Hershko
2011-06-13, 21:06
Michael McCandless
2011-06-13, 22:00
Jason Rutherglen
2011-06-13, 22:09
Itamar Syn-Hershko
2011-06-13, 22:19
Jason Rutherglen
2011-06-13, 22:25
Shai Erera
2011-06-14, 04:13
Toke Eskildsen
2011-06-14, 07:28
Ganesh
2011-06-14, 07:42
Itamar Syn-Hershko
2011-06-14, 08:03
mark harwood
2011-06-14, 09:02
Stefan Trcek
2011-06-14, 09:56
Ganesh
2011-06-14, 09:58
Michael McCandless
2011-06-14, 12:05
Michael McCandless
2011-06-14, 12:06
Ganesh
2011-06-16, 05:10
Shai Erera
2011-06-16, 05:44
Denis Bazhenov
2011-06-16, 08:42
|
-
Index size and performance degradation
Itamar Syn-Hershko
2011-06-11, 19:37
Hi all,

I understand that Lucene indexes perform best up to a certain size, said to be around several GBs. I haven't found a good discussion of this, but it's my understanding that at some point it's better to split an index into parts (a la sharding) than to continue searching one huge index. I assume this has to do with OS and IO configurations. Can anyone point me to more info on this?

We have a product that uses Lucene for various searches, and at the moment each type of search uses its own Lucene index. We plan to refactor the way it works and combine all indexes into one, making the whole system more robust and giving it a smaller memory footprint, among other things.

Assuming the above is true, we are interested in knowing how to do this correctly. Initially all our data will live in one big index, but if at some index size there is a severe performance degradation, we would like to handle that correctly by starting a new FSDirectory index to flush into, or by re-indexing and moving large indexes into their own Lucene index.

Are there any guidelines for measuring or estimating this correctly? What should we be aware of while considering all that? We can't assume anything about the machine running it, so testing won't really tell us much...

Thanks in advance for any input on this,

Itamar.
-
Re: Index size and performance degradation
Erick Erickson
2011-06-12, 00:10
<<<We can't assume anything about the machine running it, so testing won't really tell us much>>>

Hmmm, then it's pretty hopeless I think. The problem is that anything you say about running on a machine with 2G of available memory and a single processor is completely incomparable to running on a machine with 64G of memory available for Lucene and 16 processors.

There's really no such thing as an "optimum" Lucene index size; it always relates to the characteristics of the underlying hardware.

I think the best you can do is actually test on various configurations, then at least you can say "on configuration X this is the tipping point".

Sorry there isn't a better answer that I know of, but...

Best
Erick
-
Re: Index size and performance degradation
Shai Erera
2011-06-12, 03:43
I agree w/ Erick, there is no cutoff point (index size for that matter) above which you start sharding.

What you can do is create a scheduled job in your system that runs a select list of queries and monitors their performance. Once it degrades, the job shards the index, either by splitting it (you can use IndexSplitter under contrib) or by creating a new shard and directing new documents to it.

I think I read somewhere, not sure if it was in the Solr or ElasticSearch documentation, about a Balancer object, which moves shards around in order to balance the load on the cluster. You can implement something similar which tries to balance the index sizes, creates new shards on-the-fly, and even merges shards if suddenly a whole source is removed from the system etc.

Also, note that the 'largest index size' threshold is really a machine constraint and not Lucene's. So if you decide that 10 GB is your cutoff, it is pointless to create 10x10GB shards on the same machine -- searching them is just like searching a 100GB index w/ 10x10GB segments. Perhaps it's even worse, because you consume more RAM when the indexes are split (e.g., terms index, field infos etc.).

Shai
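A minimal sketch of the scheduled check described above, written in Python rather than against Lucene's Java API so that it stays self-contained. The names `measure_latencies`, `has_degraded`, `run_query` and the 1.5x tolerance are illustrative assumptions, not part of Lucene; `run_query` stands in for whatever call executes one search in your application.

```python
import time
from statistics import median


def measure_latencies(run_query, queries):
    """Run each canary query once and return its wall-clock latency in seconds."""
    latencies = []
    for query in queries:
        start = time.perf_counter()
        run_query(query)  # hypothetical hook: executes one search in your app
        latencies.append(time.perf_counter() - start)
    return latencies


def has_degraded(baseline, current, tolerance=1.5):
    """Return True when the median latency grew by more than `tolerance` times.

    `tolerance` is an assumed, application-specific knob; the thread gives
    no universal value for it.
    """
    return median(current) > tolerance * median(baseline)
```

A scheduler (cron, a timer thread, etc.) would call `measure_latencies` with the same fixed query list each time, compare against a baseline recorded when the system was healthy, and trigger the split or new-shard step only when `has_degraded` returns True.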
-
Re: Index size and performance degradation
Itamar Syn-Hershko
2011-06-12, 08:10
Thanks.

The whole point of my question was to find out if and how to do balancing on the SAME machine. Apparently that's not going to help, and at a certain point we will just have to prompt the user to buy more hardware...

Out of curiosity, isn't there anything that we can do to avoid that? For instance, using memory-mapped files for the indexes? Anything that would help us overcome OS limitations of that sort...

Also, you mention a scheduled job to check for performance degradation; any idea how serious such a drop should be for sharding to be really beneficial? Or is it application specific too?

Itamar.
-
Re: Index size and performance degradation
Shai Erera
2011-06-12, 08:42
> isn't there anything that we can do to avoid that?

That was my point :) --> you can optimize your search application, use mmap files, smart caches etc., until it reaches a point where you need to shard. But it's still application dependent, not much of an OS thing. You can count on the OS to cache what it needs in RAM, and if your index is small enough to fit in RAM, then it will probably be there. We've tried in the past to use RAMDirectory for GBs of indexes (we had the RAM to spare), and the OS cache just did a better job.

On the other hand, you can have a 100GB index, but very smart app-level caches that return results in a few ms ...

> any idea how serious such a drop should be for sharding to be really beneficial? or is it application specific too?

That's application specific too, I'm afraid. For instance, if your system is expected to support 10 queries/sec, and that tool determines that it no longer does, but has dropped to, say, 7, then that is not something you're willing to tolerate and therefore you shard the index.

But I've been working w/ applications that achieved 80 queries/sec on a large index on one machine, and others that were willing to accept response times of 30 seconds and even higher per query (for total recall, usually legal stuff). So, again, it's really hard to come up w/ a magic number :).

People are used to Google's sub-second search response time. So if your app is aiming to give the same experience, then factor in some reasonable statistics like:
* No query takes longer than 5s
* The majority of the queries, say 80%, finish in < 500ms
* The above still holds at a rate of X queries/sec (X is dynamic and depends on what you aim for)

These are just some numbers I've been using recently to benchmark my app.

Shai
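The first two targets listed above can be checked mechanically. This is a small sketch in Python, assuming latencies were collected in seconds while the system was being driven at the target query rate (the third criterion is a property of how the measurements are taken, not of this function). The function name `meets_targets` and its parameter names are hypothetical; the default values are the example numbers from the message, not recommendations.

```python
def meets_targets(latencies_s, max_latency_s=5.0, fast_latency_s=0.5,
                  fast_fraction=0.8):
    """Check a batch of query latencies (seconds) against the example targets.

    - no query takes longer than `max_latency_s`
    - at least `fast_fraction` of queries finish in under `fast_latency_s`
    """
    if not latencies_s:
        raise ValueError("need at least one measurement")
    if max(latencies_s) > max_latency_s:
        return False
    fast = sum(1 for t in latencies_s if t < fast_latency_s)
    return fast / len(latencies_s) >= fast_fraction
```

For example, `meets_targets([0.1, 0.2, 0.3, 0.4, 1.0])` passes (four of five queries are under 500ms and none exceeds 5s), while a batch containing a single 6-second query fails regardless of the rest.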
-
Re: Index size and performance degradation
Andrew Kane
2011-06-12, 08:45
In the literature there is some evidence that sharding of in-memory indexes on multi-core machines might be better. Has anyone tried this lately?

http://ieeexplore.ieee.org/xpl/freeabs_all.jsp?arnumber=4228359

Single disk machines (HDD or SSD) would be slower. Multi-disk or RAID type setups might have some benefits. What's your hardware setup?

Andrew.
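To make the idea of searching several in-memory shards at once concrete, here is a rough scatter-gather sketch in Python. It only illustrates the structure (query every shard, then merge the per-shard hits into one global top-k); it is not how Lucene does it, and in CPython threads will not give real CPU parallelism for pure-Python work. `search_shards` and `search_one` are hypothetical names, and the `(score, doc_id)` hit format is an assumption.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def search_shards(shards, search_one, query, k):
    """Query every shard concurrently and merge the hits into a global top-k.

    `search_one(shard, query, k)` is a caller-supplied function (an assumption
    of this sketch) returning a list of (score, doc_id) pairs for one shard.
    `shards` must contain at least one shard.
    """
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        per_shard = list(pool.map(lambda s: search_one(s, query, k), shards))
    # Each shard returns at most k hits, so the merge touches at most
    # k * len(shards) candidates.
    all_hits = [hit for hits in per_shard for hit in hits]
    return heapq.nlargest(k, all_hits)
```

Each shard only needs to return its own best k hits, because no document outside a shard's local top-k can appear in the global top-k.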
-
Re: Index size and performance degradation
Michael McCandless
2011-06-12, 10:16
Remember that memory-mapping is not a panacea: at the end of the day, if there just isn't enough RAM on the machine to keep your full "working set" hot, then the OS will have to hit the disk, regardless of whether the access is through MMap or a "traditional" IO request.

That said, on Fedora Linux anyway, I generally see better performance from MMap than from NIOFSDir; eg see the 2nd chart here:

http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html

Mike McCandless

http://blog.mikemccandless.com
-
Re: Index size and performance degradation
Itamar Syn-Hershko
2011-06-12, 17:13
Shai, what would you call a smart app-level cache? Remembering frequent searches and keeping their results handy? Or are there more advanced techniques for that? Any pointers appreciated...

Thanks for all the advice!
-
Re: Index size and performance degradation
Itamar Syn-Hershko
2011-06-12, 17:14
Andrew, no particular hardware setup, I'm afraid. This is a general-purpose product, and we can't assume anything about the hardware it will run on.

Thanks for the tip on multi-core though.
-
Re: Index size and performance degradation Shai Erera 2011-06-12, 17:29
> Shai, what would you call a smart app-level cache? remembering frequent
> searches and storing them handy?

Remembering frequent searches is good. If you do this, you can warm up the cache whenever a new IndexSearcher is opened (e.g., if you use SearcherManager from LIA2) and besides keeping the results 'ready', you also keep them up-to-date.

Another thing to consider is session-level cache. This can have several uses:
(1) In non-AJAX apps (probably not too many today, hopefully?), a page reload can issue several queries to the backend, while usually only one portion of the page gets updated, so caching the queries the user submitted during his session will help here.
(2) If users interact w/ your web app by, e.g., repeating the same actions, that will help. An example is a user who frequently clicks the "Back" and "Forward" buttons in the browser. There are also client-side solutions for that, using Dojo, which helps you store the 'previous' pages the user visited.
(3) If your app is ACL-constrained, caching the user ACLs and their matching docs will be very useful.

In another app I'm involved with there are several Filters. Their number varies between deployments, but for each deployment they are fixed and known in advance. Each Filter matches a different set of documents, and one or more Filters are always added to every query. So we cache them (and warm them up when opening new IndexSearchers) using CachingWrapperFilter, which is great since it works at the segment level, so warming up is usually very fast.

Another cache, which is very high-level, is pre-defined queries with their matching result sets. I.e., for queries like "my-company-name" you always return a fixed N results, and only if the user pages through them do you run the actual query.

And the list can go on and on :).

Shai

On Sun, Jun 12, 2011 at 8:13 PM, Itamar Syn-Hershko <[EMAIL PROTECTED]> wrote:

> Shai, what would you call a smart app-level cache? remembering frequent
> searches and storing them handy? or are there more advanced techniques for
> that? any pointers appreciated...
>
> Thanks for all the advice!
>
> On 12/06/2011 11:42, Shai Erera wrote:
>
>>> isn't there anything that we can do to avoid that?
>>
>> That was my point :) --> you can optimize your search application, use mmap
>> files, smart caches etc., until it reaches a point where you need to shard.
>> But it's still application dependent, not much of an OS thing. You can count
>> on the OS to cache what it needs in RAM, and if your index is small enough
>> to exist in RAM, then it will probably be there. We've tried in the past to
>> use RAMDirectory for GBs of indexes (we had the RAM to spare), and the OS
>> cache just did a better job.
>>
>> On the other hand, you can have a 100GB index, but very smart app-level
>> caches that return results in few ms ...
>>
>>> any idea how serious such a drop should be for sharding to be really
>>> beneficial? or is it application specific too?
>>
>> That's application specific too I'm afraid. For instance, if your system is
>> expected to support 10 queries/sec, and that tool determines that it no
>> longer supports it, but dropped to, say, 7, then that is not something
>> you're willing to tolerate and therefore you shard the index.
>>
>> But I've been working w/ applications that achieved 80 queries/sec on a
>> large index on one machine, and others that were willing to accept 30
>> seconds and even higher response time per query (for total recall, usually
>> legal stuff). So, again, it's really hard to come up w/ a magic number :).
>>
>> People are used to Google's sub-second search response time. So if your app
>> is aiming to give the same experience, then factor in some reasonable
>> statistics like:
>> * No query takes longer than 5s
>> * Majority of the queries, say 80%, finish in < 500ms
>> * Above still holds in X queries/sec rate (X is dynamic and depends on what
>> you aim for)
>>
>> These are just some numbers I've been using recently to benchmark my app
-
Re: Index size and performance degradation Itamar Syn-Hershko 2011-06-12, 18:25
Mike,

Speaking of NRT, and completely off-topic, I know: Lucene's NRT apparently isn't fast enough if Zoie was needed, and now that Zoie is around are there any plans to make it Lucene's default? Or: why would one still use NRT when Zoie seems to work much better?

Itamar.

On 12/06/2011 13:16, Michael McCandless wrote:
> Remember that memory-mapping is not a panacea: at the end of the day,
> if there just isn't enough RAM on the machine to keep your full
> "working set" hot, then the OS will have to hit the disk, regardless
> of whether the access is through MMap or a "traditional" IO request.
>
> That said, on Fedora Linux anyway, I generally see better performance
> from MMap than from NIOFSDir; eg see the 2nd chart here:
>
> http://blog.mikemccandless.com/2011/06/lucenes-near-real-time-search-is-fast.html
>
> Mike McCandless
>
> http://blog.mikemccandless.com
-
Re: Index size and performance degradation Itamar Syn-Hershko 2011-06-12, 18:50
Our problem is a bit different. There aren't always common searches, so if we cache blindly we could end up having too much RAM allocated for virtually nothing. And we need to allow for real-time search, so caching will hardly help. We enforce some client-side caching, but again - the real-time requirement is a bit of a problem...

I'm not sure I understood the filters approach you described. Can you give an example?

So we'll have to sit and plan some strategies. Perhaps a Lucene API to allow prioritizing caching based on fields is going to prove useful in such scenarios - so some application logic could make the search core itself a bit more robust based on usage (or have Lucene learn on its own).

BTW, I don't actually recall reading about this topic anywhere, so thanks again for all the good advice, and perhaps it is worth adding to LIA3 :)

Itamar.
-
Re: Index size and performance degradation Michael McCandless 2011-06-12, 20:12
From what I understand of Zoie (and it's been some time since I last looked... so this could be wrong now), the biggest difference vs NRT is that Zoie aims for "immediate consistency", ie index changes are always made visible to the very next query, vs NRT which is "controlled consistency", a blend between immediate and eventual consistency where your app decides when the changes must become visible.

But in exchange for that, Zoie pays a price: each search has a higher cost per collected hit, since it must post-filter for deleted docs. And since Zoie necessarily adds complexity, there's more risk; eg there were some nasty Zoie bugs that took quite some time to track down (under https://issues.apache.org/jira/browse/LUCENE-2729).

Anyway, I don't think that's a good tradeoff, in general, for our users, because very few apps truly require immediate consistency from Lucene (can anyone give an example where their app depends on immediate consistency...?). I think it's better to spend time during reopen so that searches aren't slower.

That said, Lucene has already incorporated one big part of Zoie (caching small segments in RAM) via the new NRTCachingDirectory (in contrib/misc). Also, the upcoming NRTManager (https://issues.apache.org/jira/browse/LUCENE-2955) adds control over visibility of specific indexing changes to queries that need to see the changes.

Finally, even better would be to not have to make any tradeoff whatsoever ;) Twitter's approach (created by Michael Busch) seems to bring immediate consistency with no search performance hit, so if we do anything here likely it'll be similar to what Michael has done (though, those changes are not simple either!).

Mike McCandless

http://blog.mikemccandless.com
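As a rough illustration of the "controlled consistency" idea described in this message, the toy model below keeps indexed changes invisible until a reopen, and lets a query that must see a specific change force that reopen. It is a sketch under loud assumptions: `ControlledVisibilitySketch`, `add`, `reopen` and `searchAtLeast` are hypothetical names, a list of strings stands in for the index, and nothing here is the API of NRTManager or any other Lucene class.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Toy model of "controlled consistency"; all names are hypothetical, not Lucene API. */
class ControlledVisibilitySketch {
    private final List<String> pending = new ArrayList<>();           // indexed, not yet searchable
    private volatile List<String> visible = Collections.emptyList();  // snapshot that searches read
    private long indexedGen = 0;  // generation of the latest indexed change
    private long visibleGen = 0;  // generation covered by the visible snapshot

    /** Index a document; returns the generation a search must reach in order to see it. */
    synchronized long add(String doc) {
        pending.add(doc);
        return ++indexedGen;
    }

    /** The "reopen": the cost is paid here so that ordinary searches stay cheap. */
    synchronized void reopen() {
        if (pending.isEmpty()) {
            return;
        }
        List<String> next = new ArrayList<>(visible);
        next.addAll(pending);
        pending.clear();
        visible = Collections.unmodifiableList(next);
        visibleGen = indexedGen;
    }

    /** Ordinary search: sees only what the last reopen made visible. */
    boolean search(String doc) {
        return visible.contains(doc);
    }

    /** Search that must see a specific change: reopens only if that change is not visible yet. */
    synchronized boolean searchAtLeast(long gen, String doc) {
        if (visibleGen < gen) {
            reopen();
        }
        return visible.contains(doc);
    }

    public static void main(String[] args) {
        ControlledVisibilitySketch index = new ControlledVisibilitySketch();
        long gen = index.add("doc-1");
        System.out.println(index.search("doc-1"));             // change not visible yet
        System.out.println(index.searchAtLeast(gen, "doc-1")); // this caller insists on seeing it
    }
}
```

The point of the design is that the application decides which queries need fresh results: most searches read the existing snapshot at no extra cost, and only callers holding a generation pay for a reopen.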
-
Re: Index size and performance degradation Itamar Syn-Hershko 2011-06-12, 20:46
Thanks for your detailed answer. We'll have to tackle this and see what's more important to us then. I'd definitely love to hear Zoie has overcome all that...

Any pointers to Michael Busch's approach? I take it this has something to do with the core itself or the index format, probably using the Flex version?

Itamar.
-
Re: Index size and performance degradation Shai Erera 2011-06-13, 03:23
> I'm not sure I understood the filters approach you described. Can you give
> an example?

A Language filter is one -- different users search in different languages and want to view pages in those languages only. If you have a field attached to your documents that identifies the language of the document, you can use it to filter the queries to return results only of the requested language.

Another filter is file type.

Shai
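A minimal sketch of the cached language filter described here, assuming a hypothetical per-document language array in place of a real indexed field. `LanguageFilterSketch` and its methods are illustrative names only; Lucene's CachingWrapperFilter caches per segment, whereas this sketch treats the whole array as a single segment to keep the idea visible.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical cached filter: one bit set of matching doc ids per language value. */
class LanguageFilterSketch {
    private final String[] docLanguage;                     // docLanguage[docId] = language of that doc
    private final Map<String, BitSet> cache = new HashMap<>();
    int bitsetBuilds = 0;                                   // counts how often a filter was computed

    LanguageFilterSketch(String[] docLanguage) {
        this.docLanguage = docLanguage;
    }

    /** Returns the cached bit set for a language, computing it on first use only. */
    synchronized BitSet filterFor(String lang) {
        BitSet bits = cache.get(lang);
        if (bits == null) {
            bits = new BitSet(docLanguage.length);
            for (int doc = 0; doc < docLanguage.length; doc++) {
                if (lang.equals(docLanguage[doc])) {
                    bits.set(doc);
                }
            }
            cache.put(lang, bits);
            bitsetBuilds++;
        }
        return bits;
    }

    /** Restricts query hits to the requested language by intersecting the two bit sets. */
    BitSet search(BitSet queryHits, String lang) {
        BitSet result = (BitSet) queryHits.clone(); // leave the caller's hits untouched
        result.and(filterFor(lang));
        return result;
    }

    public static void main(String[] args) {
        LanguageFilterSketch index = new LanguageFilterSketch(new String[] {"en", "fr", "en", "de"});
        BitSet hits = new BitSet();
        hits.set(0, 4); // pretend the query matched docs 0..3
        System.out.println(index.search(hits, "en"));
    }
}
```

Because the set of filters is fixed and known in advance, each bit set is computed once and reused across queries; only the cheap intersection runs per search.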
-
Re: Index size and performance degradation Michael McCandless 2011-06-13, 16:01
Here's a blog post describing some details of Twitter's approach:

http://engineering.twitter.com/2010/10/twitters-new-search-architecture.html

And here's a talk Michael did last October (Lucene Revolutions):

http://www.lucidimagination.com/events/revolution2010/video-Realtime-Search-With-Lucene-presented-by-Michael-Busch-of-Twitter

Twitter's case is simpler since they never delete ;) So we have to fix that to do it in Lucene... there are also various open issues that begin to explore some of the ideas here.

But this ("immediate consistency") would be a deep and complex change, and I don't see many apps that actually require it.

Mike McCandless

http://blog.mikemccandless.com
-
Re: Index size and performance degradation Itamar Syn-Hershko 2011-06-13, 21:02
On 13/06/2011 06:23, Shai Erera wrote:
> A Language filter is one -- different users search in different languages
> and want to view pages in those languages only. If you have a field attach
> to your documents that identifies the language of the document, you can use
> it to filter the queries to return results only of the requested language.
>
> Another filter is file type.

Oh, ok, but you'll still cache the results - so again this isn't viable when RT search, or even an NRT, is a requirement...

Is the usage of such Filters the only reason you don't include the language field in your query?

Itamar.
-
Re: Index size and performance degradation Itamar Syn-Hershko 2011-06-13, 21:06
Thanks Mike, much appreciated.

Wouldn't Twitter's approach fall into the exact same pitfall you described Zoie does (or did) once it handles deletes too? I don't think there is any way of handling deletes other than post-filtering results. But perhaps the IW cache would be smaller than Zoie's RAMDirectory(ies)?

I'll give all that a serious dive and report back with results, or if more input is required...

Itamar.
-
Re: Index size and performance degradation Michael McCandless 2011-06-13, 22:00
Yes, adding deletes to Twitter's approach will be a challenge!

I don't think we'd do the post-filtering solution, but instead maybe resolve the deletes "live" and store them in a transactional data structure of some kind... but even then we will pay a perf hit to look up del docs against it.

So, yeah, there will presumably be a tradeoff with this approach too. However, turning around changes from the adds should be faster (no segment gets flushed).

Mike McCandless

http://blog.mikemccandless.com
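The per-hit cost discussed in this exchange can be pictured with a small sketch: every matching document is checked against a set of deleted doc ids before it is collected. This is only an illustration of that lookup, under the assumption that a plain `BitSet` holds the deletes; `LiveDeletesSketch` is a hypothetical name, and this is not how Zoie, Twitter's system, or Lucene actually implement it.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

/** Hypothetical illustration of checking each hit against a live set of deleted docs. */
class LiveDeletesSketch {
    private final BitSet deleted = new BitSet(); // doc ids deleted since the data was indexed

    /** Record a delete; it affects the very next search. */
    synchronized void delete(int docId) {
        deleted.set(docId);
    }

    /** Collect hits, paying one deleted-docs lookup per matching document. */
    synchronized List<Integer> collect(int[] matchingDocs) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : matchingDocs) {
            if (!deleted.get(doc)) { // the extra per-hit cost
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        LiveDeletesSketch index = new LiveDeletesSketch();
        index.delete(2);
        System.out.println(index.collect(new int[] {1, 2, 3}));
    }
}
```

Whatever structure holds the deletes, the lookup sits in the hot loop of every search, which is why the thread treats it as a tradeoff rather than a free feature.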
-
Re: Index size and performance degradation Jason Rutherglen 2011-06-13, 22:09
> I don't think we'd do the post-filtering solution, but instead maybe
> resolve the deletes "live" and store them in a transactional data

I think Michael B. aptly described the sequence ID approach for 'live' deletes?
-
Re: Index size and performance degradation Itamar Syn-Hershko 2011-06-13, 22:19
Since there should only be one writer, I'm not sure why you'd need transactional storage for that. Deletions made by readers merely mark it for deletion, and once a doc has been marked for deletion it is deleted for all intents and purposes, right? But perhaps I need to refresh my memory on the internals; it has been a while.

Does the N in NRT represent only the cost of reopening a searcher? Meaning, if I could ensure reopening always happens fast and returns a searcher for the correct index revision, would it guarantee real real-time search? Or is there anything else standing in between? The only thing that comes to mind is the IW unflushed buffer, which only Twitter's approach seems to handle (not even Zoie).

Itamar.
-
Re: Index size and performance degradation Jason Rutherglen 2011-06-13, 22:25
> deletions made by readers merely mark it for
> deletion, and once a doc has been marked for deletions it is deleted for all
> intents and purposes, right?

There's the point-in-timeness of a reader to consider.

> Does the N in NRT represent only the cost of reopening a searcher?

Aptly put, and yes basically.

> the only thing that comes to mind is the IW unflushed buffer

This is LUCENE-2312.
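The "point-in-timeness" of a reader and the cost hidden in the "N" of NRT can be illustrated with a toy model. This is only a sketch under simple assumptions: `Writer`, `Reader`, and `reopen` are stand-ins invented for the example and do not mirror Lucene's IndexWriter or IndexReader APIs.

```python
from typing import Dict, List


class Reader:
    """Point-in-time view: its contents never change after creation."""

    def __init__(self, snapshot: Dict[int, str]) -> None:
        self._snapshot = snapshot

    def search(self, term: str) -> List[int]:
        return sorted(d for d, text in self._snapshot.items() if term in text.split())


class Writer:
    """Toy writer with an unflushed buffer that readers cannot see."""

    def __init__(self) -> None:
        self._docs: Dict[int, str] = {}
        self._buffer: Dict[int, str] = {}

    def add(self, doc_id: int, text: str) -> None:
        self._buffer[doc_id] = text

    def reopen(self) -> Reader:
        # Flushing the buffer and building a fresh snapshot is the work
        # that separates "near" real time from real time.
        self._docs.update(self._buffer)
        self._buffer.clear()
        return Reader(dict(self._docs))


writer = Writer()
writer.add(1, "lucene index")
r1 = writer.reopen()
writer.add(2, "lucene shard")      # buffered, not yet visible to anyone
before = r1.search("lucene")
r2 = writer.reopen()
after = r2.search("lucene")
print(before, after, r1.search("lucene"))
```

The first reader keeps answering from its own snapshot even after the second reopen; only a new reader sees the buffered document.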
-
Re: Index size and performance degradation Shai Erera 2011-06-14, 04:13
> but you'll still cache the results - so again this isn't viable when RT
> search, or even an NRT, is a requirement

No, I don't cache the results. The Filter is an OpenBitSet of all docs that match the filter (e.g. have the specified language field's value), and it is refreshed whenever new segments are added / old ones deleted. So I think it actually works well w/ NRT, because the 'warmup' would just update the filter w/ the new segments. I suggest you take a look at CachingWrapperFilter to get an idea of how this works.

Shai
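The per-segment refresh described above can be sketched as follows. This is a toy model, not CachingWrapperFilter itself: it assumes segments are write-once, uses a plain integer as the bit set, and every name in it (`CachedSegmentFilter`, `refresh`, the `lang` field) is hypothetical.

```python
from typing import Callable, Dict, List

Doc = Dict[str, str]


class CachedSegmentFilter:
    """Toy per-segment filter cache; all names here are hypothetical."""

    def __init__(self, predicate: Callable[[Doc], bool]) -> None:
        self._predicate = predicate
        self._bits: Dict[str, int] = {}   # segment name -> bitmask of matching docs
        self.computed: List[str] = []     # segments that actually had to be scanned

    def refresh(self, segments: Dict[str, List[Doc]]) -> Dict[str, int]:
        # Forget segments that were merged away or deleted.
        for name in list(self._bits):
            if name not in segments:
                del self._bits[name]
        # Scan only the segments that have not been seen before.
        for name, docs in segments.items():
            if name not in self._bits:
                mask = 0
                for i, doc in enumerate(docs):
                    if self._predicate(doc):
                        mask |= 1 << i
                self._bits[name] = mask
                self.computed.append(name)
        return dict(self._bits)


english = CachedSegmentFilter(lambda doc: doc["lang"] == "en")
segments = {"_0": [{"lang": "en"}, {"lang": "de"}, {"lang": "en"}]}
first = english.refresh(segments)

segments["_1"] = [{"lang": "en"}]   # a new segment appears after a reopen
second = english.refresh(segments)
print(first, second, english.computed)
```

Segment `_0` is scanned once; the second refresh only pays for `_1`, which is why this kind of filter can coexist with frequent reopens.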
-
Re: Index size and performance degradation Toke Eskildsen 2011-06-14, 07:28
On Sun, 2011-06-12 at 10:10 +0200, Itamar Syn-Hershko wrote:
> The whole point of my question was to find out if and how to make
> balancing on the SAME machine. Apparently thats not going to help and at
> a certain point we will just have to prompt the user to buy more hardware...

It really depends on your scenario. If you have few concurrent requests and are looking to minimize latency, sharding might help, assuming you have fast IO and multiple cores. You basically want to saturate all available resources for all requests.

On the other hand, if throughput is the issue, sharding on a single machine is counter-productive due to increased duplication and merging.

> Out of curiosity, isn't there anything that we can do to avoid that? for
> instance using memory-mapped files for the indexes? anything that would
> help us overcome OS limitations of that sort...

One standard piece of advice for speeding up searches is to use SSDs. Our (admittedly old) experiments put SSD performance near RAM. With the prices we have now, SSDs seem like an obvious choice for most setups.

We tried a few performance tests at different index sizes, and for us index size vs. performance looked like a power law: heavy performance degradation in the beginning, less later. It makes sense when we look at caching, and it means that if you do not require stellar performance, you can have very large indexes on few machines (cue Hathi Trust).

- Toke Eskildsen
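The latency case above (one request fanned out over several shards on a multi-core machine) might look roughly like this. It is a sketch only: the shards are plain dictionaries, the score is a naive term count, and `search_shard` / `search_all` are invented names. In CPython the thread pool shows the shape of the fan-out rather than real CPU parallelism.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Hit = Tuple[int, int]  # (score, doc id)


def search_shard(shard: Dict[int, str], term: str, k: int) -> List[Hit]:
    # Naive scoring: number of occurrences of the term in the document.
    scored = ((text.split().count(term), doc_id) for doc_id, text in shard.items())
    return heapq.nlargest(k, (hit for hit in scored if hit[0] > 0))


def search_all(shards: List[Dict[int, str]], term: str, k: int,
               pool: ThreadPoolExecutor) -> List[Hit]:
    # Scatter: each shard computes its own top-k concurrently.
    partials = pool.map(lambda shard: search_shard(shard, term, k), shards)
    # Gather: merging the partial lists is the extra work sharding adds.
    return heapq.nlargest(k, (hit for part in partials for hit in part))


shards = [
    {1: "ssd ssd disk", 2: "ram"},
    {3: "ssd", 4: "ssd ssd ssd"},
]
with ThreadPoolExecutor(max_workers=len(shards)) as pool:
    top = search_all(shards, "ssd", 2, pool)
print(top)
```

With few concurrent requests the fan-out can use otherwise idle cores; under heavy load every request still touches every shard and pays for the merge, which matches the throughput caveat above.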
-
Re: Index size and performance degradation Ganesh 2011-06-14, 07:42
We tried more than 50 shards on a single system. With multiple small indexes, indexing and optimizing the content is faster. We use ParallelMultiSearcher to search across the indexes, and the performance is really good. Now we plan to move to 64-bit so that we can use more RAM.

Regards
Ganesh
-
Re: Index size and performance degradation Itamar Syn-Hershko 2011-06-14, 08:03
Thanks. Our product is pretty generic and we can't assume much about the hardware or the usage. Some users will want low latency, others will prefer throughput. My job is to compromise as little as possible...

As for SSDs, that's generally good advice, except they seem to be failing quite a lot. For example, see:

http://www.codinghorror.com/blog/2011/05/the-hot-crazy-solid-state-drive-scale.html
-
Re: Index size and performance degradation mark harwood 2011-06-14, 09:02
Partitioning and replication are the keys to handling data and user volumes respectively. However, this approach introduces some other concerns over consistency and availability of content which I've tried to capture here:

http://www.slideshare.net/MarkHarwood/patterns-for-large-scale-search

These consistency concerns may not be an issue for you but I know they are for some organisations. Many organisations want everything (large data, many users, fast searches, quick updates and always-consistent views of the very latest content) and the above slide-deck tries to outline why this is hard/impossible and the necessary trade-offs in a system's qualities of service.

I'd be interested in maintaining this with any other suggestions the community have to offer so that we can use it to explain the qualities of any particular engine/configuration and the justifications for that design choice.

Cheers
Mark
-
Re: Index size and performance degradation Stefan Trcek 2011-06-14, 09:56
On Sunday 12 June 2011 22:12:01 Michael McCandless wrote:
> Anyway, I don't think that's a good tradeoff, in general, for our
> users, because very few apps truly require immediate consistency from
> Lucene (can anyone give an example where their app depends on
> immediate consistency...?

For database (enterprise resource planning) applications we are in the process of replacing native database queries with Lucene queries, because Lucene offers an efficient combination of fulltext and structured (including facets) queries. Until now, clients have expected that if they change something, it will be immediately reflected in their "lists", aka queries. If Lucene's reopen() does not perform well enough (I didn't measure), we would solve it with an additional short-lived index for new objects, which complicates the architecture, or by modifying the merge strategy.

> I think it's better to spend time during reopen so that searches
> aren't slower.

Absolutely, if you build an internet search engine. For our closed world with a limited number of clients, search speed doesn't have that impact. It must scale for one client on one CPU core, and we buy as many cores as necessary.

Stefan
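The "additional short-lived index for new objects" could be sketched like this. It is a toy model with in-memory dictionaries standing in for the two indexes; `TwoTierIndex`, `update`, and `fold` are hypothetical names chosen for the example, and the matching is a plain word lookup.

```python
from typing import Dict, List


class TwoTierIndex:
    """Large main index plus a small, short-lived index for recent changes."""

    def __init__(self, main: Dict[int, str]) -> None:
        self.main = dict(main)            # big, reopened rarely
        self.delta: Dict[int, str] = {}   # small, visible immediately

    def update(self, doc_id: int, text: str) -> None:
        # New or changed objects go to the delta and are searchable at once.
        self.delta[doc_id] = text

    def search(self, term: str) -> List[int]:
        hits = {d for d, text in self.delta.items() if term in text.split()}
        # A doc present in the delta shadows its stale copy in the main index.
        hits |= {d for d, text in self.main.items()
                 if d not in self.delta and term in text.split()}
        return sorted(hits)

    def fold(self) -> None:
        # Periodically move the delta into the main index and start over.
        self.main.update(self.delta)
        self.delta.clear()


idx = TwoTierIndex({1: "invoice open", 2: "invoice paid"})
idx.update(1, "invoice paid")   # client edits an existing object
idx.update(3, "invoice open")   # client creates a new object
open_now = idx.search("open")
paid_now = idx.search("paid")
idx.fold()
print(open_now, paid_now, idx.search("paid"), idx.delta)
```

The shadowing check in `search` is the kind of extra bookkeeping that "complicates the architecture": every query has to reconcile two sources until the delta is folded back in.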
-
Re: Index size and performance degradation Ganesh 2011-06-14, 09:58
Is it a bad idea to keep multiple shards in a single system?

Regards
Ganesh
-
Re: Index size and performance degradation Michael McCandless 2011-06-14, 12:05
Hmm, this sounds hairy :)

Are you sure NRTCachingDir won't work for you?

Mike McCandless

http://blog.mikemccandless.com
-
Re: Index size and performance degradation Michael McCandless 2011-06-14, 12:06
Sorry, wrong email ;)

Mike McCandless

http://blog.mikemccandless.com
-
Re: Index size and performance degradation
Ganesh 2011-06-16, 05:10
Could anyone throw some light on this? Is it a bad idea to keep multiple shards in a single system?

Below are my reasons; please correct me if I am wrong.

1. If a single large index goes beyond a GB in size, it may take more time to merge and optimize.
2. Suppose the total index size is around 10 GB; the fdt file might then be 3 - 4 GB. To display result summaries we may need to fetch field values from the fdt file, and IO might increase because it has to skip a large number of bytes to reach the exact location. In other words, retrieving the search summaries might be slow.
3. Multiple shards work well when only a few users search at the same time.

Regards
Ganesh
-
Re: Index size and performance degradation
Shai Erera 2011-06-16, 05:44
> If single large index goes beyond GB, It may take more time to merge and
> optimize.

Bounded merge times can be achieved w/ a single index too. LogMergePolicy allows setting a maxMergeMB and maxMergeMBForOptimize, which are thresholds that define the largest segment size to be merged. TieredMergePolicy, as far as I understand, goes one step further and lets you specify the largest segment size you'd wish to see as a result of a merge.

> IO might be more as it needs to skip large amount bytes to locate the exact
> location

I don't believe there's much difference between a single index and multiple ones. Lucene does not control how the low-level IO subsystem works. It could be that the larger index would perform better, or vice versa. I would keep the stored fields separate from the content index in two cases:

(1) You can ensure that the stored fields are stored on a different physical disk from the content index, in which case you're more likely to get concurrency for fetching results and doing searches.

(2) The stored fields are kept in their own cluster. But this is really for large deployments, w/ many shards and high query volume. In those scenarios, queries are executed on one (logical) cluster, and then results are fed off to another logical cluster to produce the summaries. I doubt it fits more than a handful of systems though.

Shai
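The size threshold described in this reply can be pictured with a small toy model. This is only a sketch under stated assumptions: it uses no Lucene classes, and `MergeCapSketch`, `mergeCandidates` and `mergeCostMb` are hypothetical names invented for illustration. It assumes that segments at or above the cap are simply left out of merges, which is a simplification of what a real merge policy does.

```java
import java.util.Arrays;

/** Toy model of a size-capped merge policy. Hypothetical; not Lucene code. */
final class MergeCapSketch {

    /** Returns the segment sizes (in MB) that are below the cap and so may still be merged. */
    static double[] mergeCandidates(double[] segmentSizesMb, double maxMergeMb) {
        return Arrays.stream(segmentSizesMb)
                .filter(size -> size < maxMergeMb)
                .toArray();
    }

    /** Upper bound on the data one merge rewrites: the sum of all candidate segments. */
    static double mergeCostMb(double[] segmentSizesMb, double maxMergeMb) {
        return Arrays.stream(mergeCandidates(segmentSizesMb, maxMergeMb)).sum();
    }

    public static void main(String[] args) {
        // One large segment plus several small ones, sizes in MB (made-up numbers).
        double[] segments = {4096, 1500, 300, 40, 5};
        double capMb = 2048;
        System.out.println(Arrays.toString(mergeCandidates(segments, capMb)));
        System.out.println(mergeCostMb(segments, capMb));
    }
}
```

In this model the 4 GB segment is never rewritten, so the cost of any single merge stays bounded however large the index as a whole becomes, which is the effect the reply attributes to the threshold settings.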
-
Re: Index size and performance degradation
Denis Bazhenov 2011-06-16, 08:42
To summarize what was said before.
In general, sharding on a single machine makes sense only if a single Lucene instance cannot utilize all the hardware on that machine. Lucene does a good job in this area on its own, so I think it is very rare that you really need to shard a Lucene index on a single machine.

One particular case where you might want to shard comes to mind: if you have a large JVM heap (> 4-8 GB) and a machine with a very large amount of RAM, it makes sense to shard to keep GC overhead small (AFAIK the JVM has some trouble with large heap sizes).

I also think there is some potential in sharding if you do a lot of heavy sorting. But again, it makes sense only if there are not many concurrent requests in the system.

Denis Bazhenov <[EMAIL PROTECTED]> |
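The single-machine sharding case discussed in this thread (few concurrent requests, several idle cores) amounts to fanning one query out to every shard in parallel and merging the partial results. The following is a minimal sketch of that pattern using only the Java standard library. It assumes each shard is just an array of hit scores, and `ShardFanOutSketch`, `topK` and `search` are hypothetical names, not Lucene API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Toy scatter-gather over in-memory "shards". Hypothetical; not Lucene code. */
final class ShardFanOutSketch {

    /** Returns the k highest scores of one shard, best first. */
    static double[] topK(double[] scores, int k) {
        return Arrays.stream(scores)
                .boxed()
                .sorted(Comparator.reverseOrder())
                .limit(k)
                .mapToDouble(Double::doubleValue)
                .toArray();
    }

    /** Runs topK on every shard in its own thread, then merges the partial lists. */
    static double[] search(double[][] shards, int k) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, shards.length));
        try {
            List<Future<double[]>> partials = new ArrayList<>();
            for (double[] shard : shards) {
                Callable<double[]> task = () -> topK(shard, k);
                partials.add(pool.submit(task));
            }
            // Gather: each shard contributes at most k hits, so the merge input stays small.
            List<Double> gathered = new ArrayList<>();
            for (Future<double[]> partial : partials) {
                for (double score : partial.get()) {
                    gathered.add(score);
                }
            }
            return topK(gathered.stream().mapToDouble(Double::doubleValue).toArray(), k);
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException("shard search failed", e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        double[][] shards = {{0.2, 0.9}, {0.7, 0.1}, {0.8}};
        System.out.println(Arrays.toString(search(shards, 2)));
    }
}
```

One query here occupies a thread per shard, so the pattern can lower latency only while cores are otherwise idle. With many concurrent requests the same cores are already busy, and the extra merge step becomes overhead, which matches the caveat raised in the thread.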