-
Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
jchang 2010-01-12, 23:10

Lucene 2.9.0 has near real time indexing, writing to a RAMDir which gets flushed to disk when you do a search.

Does anybody know how this works out with service restarts (both orderly shutdown and a crash)? If the service goes down while indexed items are in RAMDir but not on disk, are they lost? Or is there some kind of log recovery?

Also, does anybody know the impact of this with clustered Lucene servers? If you have numerous servers running off one index, I assume there is no way for the other services to pick up the newly indexed items until they are flushed to disk, correct? I'd be happy if that is not so, but I suspect it is so.

Thanks,
John
--
View this message in context: http://old.nabble.com/Lucene-2.9.0-Near-Real-Time-Indexing-and-Service-Crashes-restarts-tp27136539p27136539.html
Sent from the Lucene - Java Developer mailing list archive at Nabble.com.
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
Jason Rutherglen 2010-01-12, 23:37

Greetin's John,

2.9 and 3.0 don't use a RAMDir... Deletes are held in RAM however so on power off, those would be lost.

Jason
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
Jason Rutherglen 2010-01-13, 04:57

Actually, unless IW.commit is called, all changes after the last commit will be lost (because the segment infos file will not have been written).
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
jchang 2010-01-13, 19:16

> Actually, unless IW.commit is called, all changes after the last commit will be lost (because the segment infos file will not have been written).
>
> On Tue, Jan 12, 2010 at 3:37 PM, Jason Rutherglen wrote:
>> Greetin's John,
>>
>> 2.9 and 3.0 don't use a RAMDir... Deletes are held in RAM however so on power off, those would be lost.

I'm confused; at first you said that on power off only deletes are lost, but the later posting said unless "IW.commit is called, all changes after the last commit will be lost." So, do I lose only deletes, or also writes and updates in a crash?

BTW, I'm using straight Lucene (or actually Lucene + Compass), not Zoie at the moment. For some of the responses, I'm not clear if the information applies to Zoie specifically, or also to straight Lucene.
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
Michael McCandless 2010-01-13, 20:01

For Lucene, everything (adds & deletes) done after the last successful commit is lost on crash/power loss/etc.

Mike
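The rule Mike states can be illustrated with a toy model. This is only a sketch and not Lucene code: `ToyCommitIndex` and all of its methods are hypothetical names invented for illustration, and the "durable" state is just an in-memory list standing in for what the last commit wrote to disk. In real Lucene the corresponding durable point is the `IndexWriter.commit()` call (the "IW.commit" mentioned earlier in the thread).

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of commit-point durability; hypothetical, NOT the Lucene API. */
class ToyCommitIndex {
    // Stands in for the state recorded by the last successful commit.
    private List<String> committed = new ArrayList<>();
    // Stands in for the in-progress state (adds and deletes since the last commit).
    private List<String> working = new ArrayList<>();

    void add(String doc) { working.add(doc); }

    void delete(String doc) { working.remove(doc); }

    /** Makes every add and delete done so far part of the durable state. */
    void commit() { committed = new ArrayList<>(working); }

    /** Simulates a crash followed by a restart: only the last commit survives. */
    void crashAndRestart() { working = new ArrayList<>(committed); }

    List<String> docs() { return new ArrayList<>(working); }

    public static void main(String[] args) {
        ToyCommitIndex index = new ToyCommitIndex();
        index.add("a");
        index.add("b");
        index.commit();          // "a" and "b" are now durable
        index.add("c");          // uncommitted add
        index.delete("a");       // uncommitted delete
        index.crashAndRestart(); // both uncommitted changes are discarded
        System.out.println(index.docs());
    }
}
```

In this model the uncommitted add and the uncommitted delete are both rolled back, which matches the answer to the question: it is not only deletes that are at risk.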
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
Otis Gospodnetic 2010-01-13, 04:15

John, you should have a look at Zoie. I just finished adding LinkedIn's case study about Zoie to Lucene in Action 2, so this is fresh in my mind. :)

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
Jake Mannix 2010-01-13, 04:43

On Tue, Jan 12, 2010 at 8:15 PM, Otis Gospodnetic wrote:
> John, you should have a look at Zoie. I just finished adding LinkedIn's case study about Zoie to Lucene in Action 2, so this is fresh in my mind. :)

Yep, Zoie ( http://zoie.googlecode.com ) will handle the server restart part, in that while yes, you lose what is in RAM, Zoie keeps track of an "index version" on disk alongside the Lucene index, which it uses to decide where it must reindex from to "catch up" if there have been incoming indexing events while the server was out of commission.

Zoie does not support multiple servers using the same index, because each Zoie instance has IndexWriter instances, and you'll get locking problems trying to do that. You could have one Zoie instance effectively as the "master/writer/realtime reader", and a bunch of raw Lucene "slaves" which could read off of that index, but as you say, they could not get access to the RAMDirectory information until it was flushed to disk.

Why do you need a "cluster" of servers hitting the same index? Are they different applications (with different search logic, so they need to be different instances), or is it just to try and utilize your hardware efficiently? If it's for performance reasons, you might find you get better use of your CPU cores by just sharding your one index into smaller ones, each having its own Zoie instance, and putting a "broker" on top of them searching across all and mergesorting the results. Often even this isn't necessary, because Zoie will be opening the disk-backed IndexReader in readonly mode, and thus all the synchronized blocks are gone, and one single Zoie instance will easily saturate your CPU cores through simple multi-threading by your appserver.

If you really needed to do many different kinds of writes (from different applications) and also have applications not involved in the writing seeing these writes in real time, then you could still do it with Zoie, but it would take some interesting architectural juggling: write your own StreamDataProvider class which takes input from a variety of sources and merges them together to feed to one Zoie instance, then put a broker on top of Zoie which serves out IndexReaders to the different applications living on top, which can wrap them up in their own business logic as they see fit (as long as it is ok to have all the applications in the same JVM, of course).

-jake
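The "broker on top of shards" idea Jake describes can be sketched as a k-way merge of per-shard result lists. This is an assumption-laden illustration, not Zoie or Lucene code: `ShardBrokerSketch`, `Hit`, and `mergeTopK` are hypothetical names, and each shard is assumed to have already returned its hits sorted by descending score.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical broker-side merge of per-shard search results. */
class ShardBrokerSketch {
    /** One search hit: a document id and its relevance score. */
    static final class Hit {
        final String docId;
        final double score;
        Hit(String docId, double score) { this.docId = docId; this.score = score; }
    }

    /** Read position within one shard's result list. */
    private static final class Cursor {
        final List<Hit> hits;
        int pos = 0;
        Cursor(List<Hit> hits) { this.hits = hits; }
        Hit current() { return hits.get(pos); }
    }

    /** Merges result lists (each sorted by descending score) into a global top-k. */
    static List<Hit> mergeTopK(List<List<Hit>> shardResults, int k) {
        // The queue always exposes the shard whose next unread hit scores highest.
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
                Comparator.comparingDouble((Cursor c) -> c.current().score).reversed());
        for (List<Hit> hits : shardResults) {
            if (!hits.isEmpty()) {
                queue.add(new Cursor(hits));
            }
        }
        List<Hit> merged = new ArrayList<>();
        while (merged.size() < k && !queue.isEmpty()) {
            Cursor best = queue.poll();
            merged.add(best.current());
            best.pos++;
            if (best.pos < best.hits.size()) {
                queue.add(best); // re-insert with its next hit as the sort key
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        List<Hit> shardA = List.of(new Hit("a1", 0.9), new Hit("a2", 0.4));
        List<Hit> shardB = List.of(new Hit("b1", 0.7), new Hit("b2", 0.5));
        for (Hit hit : mergeTopK(List.of(shardA, shardB), 3)) {
            System.out.println(hit.docId + " " + hit.score);
        }
    }
}
```

A real broker would also have to make scores comparable across shards; the sketch simply assumes they are.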
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
Jason Rutherglen 2010-01-13, 04:55

> Zoie keeps track of an "index version" on disk alongside the Lucene index which it uses to decide where it must reindex from to "catch up" if there have been incoming indexing events while the server was out of commission.

This begs a little more clarity... Sounds like a transaction log. Oh right, with Zoie there's the assumption of an external transaction log, but it doesn't provide one out of the box?
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
Jake Mannix 2010-01-13, 05:49

On Tue, Jan 12, 2010 at 8:55 PM, Jason Rutherglen wrote:
> This begs a little more clarity... Sounds like a transaction log. Oh right, with Zoie there's the assumption of an external transaction log however it doesn't provide one out of the box?

The index versioning scheme Zoie uses is independent of what mechanism you use to implement it. If your indexing technique is to talk to a database directly, you don't need a transaction log; something as simple as a "created_at" column will suffice in many situations. I gave a short talk to demo Zoie yesterday, and for it I wrote up a simple file-based indexing event log in an afternoon. Similarly, if you listen on a JMS queue or basically any other message-queue based system that is not "push only", you'll have some notion of "replay since [timestamp / version / incrementing counter]", but they're all vendor dependent.

It's not the kind of thing you can just provide out of the box due to this vendor dependence. On the other hand, if someone came along and said they wanted to use Zoie with RabbitMQ or whatever, we'd certainly accept a patch for a StreamDataProvider implementation which does that (and maybe one of the Zoie committers would even write it themselves if it seemed like a common enough use case).

-jake
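The "replay since [version]" notion can be sketched in a few lines. This is a hypothetical in-memory model, not Zoie's implementation and not the file-based log Jake mentions: `ReplayLogSketch`, `EventLog`, and `Indexer` are invented names, the version is assumed to be a simple incrementing counter, and `durableVersion` stands in for the "index version" that would be stored on disk next to the index.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of catching up from an event source by version. */
class ReplayLogSketch {
    /** An indexing event tagged with a monotonically increasing version. */
    static final class Event {
        final long version;
        final String payload;
        Event(long version, String payload) { this.version = version; this.payload = payload; }
    }

    /** Append-only event source that can replay everything after a given version. */
    static final class EventLog {
        private final List<Event> events = new ArrayList<>();
        private long nextVersion = 1;

        synchronized long append(String payload) {
            long version = nextVersion++;
            events.add(new Event(version, payload));
            return version;
        }

        synchronized List<Event> replaySince(long version) {
            List<Event> missed = new ArrayList<>();
            for (Event e : events) {
                if (e.version > version) {
                    missed.add(e);
                }
            }
            return missed;
        }
    }

    /** Toy indexer that remembers the version of the last event it applied. */
    static final class Indexer {
        private final List<String> docs = new ArrayList<>();
        private long durableVersion = 0; // stand-in for the version kept on disk

        /** Applies every event newer than the durable version; returns how many were applied. */
        int catchUp(EventLog log) {
            List<Event> missed = log.replaySince(durableVersion);
            for (Event e : missed) {
                docs.add(e.payload);
                durableVersion = e.version;
            }
            return missed.size();
        }

        long version() { return durableVersion; }

        List<String> docs() { return new ArrayList<>(docs); }
    }

    public static void main(String[] args) {
        EventLog log = new EventLog();
        Indexer indexer = new Indexer();
        log.append("doc-1");
        indexer.catchUp(log);
        // Events that arrive while the indexer is "down" are picked up on the next catch-up.
        log.append("doc-2");
        log.append("doc-3");
        System.out.println("replayed " + indexer.catchUp(log) + " events, now at version " + indexer.version());
    }
}
```

The same shape applies whether the event source is a database polled on a "created_at" column or a replayable message queue; only `replaySince` would change.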
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
Jason Rutherglen 2010-01-13, 06:14

Jake,

I wonder how often people need reliable transactions for realtime search? Maybe MySQL's t-log could be used sans the database part?

The created_at column for near realtime seems like it could hurt the database due to excessive polling? Has anyone tried it yet?

> I wrote up a simple file-based indexing event log in an afternoon

Right, however it's probably a long perilous leap from this to a t-log that's production ready.

I'm waiting for someone to dive in and mess with BookKeeper ( http://wiki.apache.org/hadoop/BookKeeper ) and report back!

Jason
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
Jake Mannix 2010-01-13, 06:41

On Tue, Jan 12, 2010 at 10:14 PM, Jason Rutherglen wrote:
> Jake,
>
> I wonder how often people need reliable transactions for realtime search? Maybe Mysql's t-log could be used sans the database part?

A reliable message queue - I'd imagine all the time! Transactions... that depends on how much you care about "ACID". For social media, news, log monitoring, you can miss some events, and transactionality isn't necessarily the key part - just the ability to replay from some point in time, so you can elastically replicate and handle server crashes (as well as doing background batch indexing followed by incremental realtime catchup).

> The created_at column for near realtime seems like it could hurt the database due to excessive polling? Has anyone tried it yet?

I haven't tried it in a production system, but in testing it only seemed to be bad if you have a single DB that is not replicated or sharded, together with a fully replicated search system (so having no separate message queue would mean you're spamming your DB with polling queries). If your ratio of search shards to DB shards isn't too high (like, say, the non-distributed case where you have a handful of replica indexes talking to one centralized DB), then this wouldn't be as much of a problem unless you go crazy and query every couple of milliseconds.

>> I wrote up a simple file-based indexing event log in an afternoon
>
> Right, however it's probably a long perilous leap from this to a t-log that's production ready.

Certainly.

> I'm waiting for someone to dive in and mess with Bookkeeper http://wiki.apache.org/hadoop/BookKeeper and report back!

That could be great for this kind of thing. Add some custom adapters to write ledger entries which are easily translatable into Lucene Documents, and that would hook right into Zoie rather easily. BookKeeperStreamDataProvider!

-jake
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restartsjchang 2010-01-13, 19:08
I don't specifically need a cluster of servers writing indexes. Actually, at the moment, I only have one server, but multiple message consuming threads, so I still land back at the same problem of contention for the index lock. Why do I have multiple message consumers? Speed...I wanted to dequeue my items to be indexed fast. However, I'm getting the impression that may have been a foolish effort. I find that only having one writer thread is not much slower than having 20, which makes sense if they are all waiting on one file. If only one writer thread can be fast enough (which gets rid of timeout exceptions that I asked about in a different thread), that that is good enough for me. Do you know what kind of index writes per second I can hope to hit with one writer thread? I guess it depends on many factors. Also, I know 2.9.0 is faster than 2.4.0 (which I'm on), but I'm not sure I can move up to 2.9.0 really easily because all my Lucene usage is wrapped in Compass, which does not yet support 2.9.0. I think I'd have to rewrite my service to use straight Lucene, which might be a good idea, but I can't do quickly. We don't use Solr. Thanks for your help thus far and thanks in advance for any more responses. Jake Mannix wrote: > > On Tue, Jan 12, 2010 at 8:15 PM, Otis Gospodnetic < > [EMAIL PROTECTED]> wrote: > >> John, you should have a look at Zoie. I just finished adding LinkedIn's >> case study about Zoie to Lucene in Action 2, so this is fresh in my mind. > > :) >> > > Yep, Zoie ( http://zoie.googlecode.com ) will handle the server restart > part, in that while yes, you lose what is in RAM, Zoie keeps track of an > "index version" on disk alongside the Lucene index which it uses to decide > where it must reindex from to "catch up" if it there have been incoming > indexing events while the server was out of commission. 
> > Zoie does not support multiple servers using the same index, because each > zoie instance has IndexWriter instances, and you'll get locking problems > trying to do that. You could have one Zoie instance effectively as the > "master/writer/realtime reader", and a bunch of raw Lucene "slaves" which > could read off of that index, but as you say, could not get access to the > RAMDirectory information until it was flushed to disk. > > Why do you need a "cluster" of servers hitting the same index? Are they > different applications (with different search logic, so they need to be > different instances), or is it just to try and utilize your hardware > efficiently? If it's for performance reasons, you might find you get > better > use of your CPU cores by just sharding your one index into smaller ones, > each having their own Zoie instance, and putting a "broker" on top of them > searching across all and mergesorting the results. Often even this isn't > necessary, because Zoie will be opening the disk-backed IndexReader in > readonly mode, and thus all the synchronized blocks are gone, and one > single > Zoie instance will easily saturate your cpu cores by simple > multi-threading > by your appserver. > > If you really needed to do many different kinds of writes (from different > applications) and also have applications not involved in the writing also > seeing (in real-time) these writes, then you could still do it with Zoie, > but it would take some interesting architectural juggling (write your own > StreamDataProvider class which takes input from a variety of sources and > merges them together to feed to one Zoie instance, then a broker on top of > zoie which serves out IndexReaders to different applications living on top > which can wrap them up in their own business logic as they saw fit... as > long as it was ok to have all the applications in the same JVM, of > course). 
> > -jake > > >> >> Otis >> -- >> Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch >> >> >> >> ----- Original Message ---- >> > From: jchang <[EMAIL PROTECTED]> >> > To: [EMAIL PROTECTED] View this message in context: http://old.nabble.com/Lucene-2.9.0-Near-Real-Time-Indexing-and-Service-Crashes-restarts-tp27136539p27148813.html Sent from the Lucene - Java Developer mailing list archive at Nabble.com. +
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
Michael McCandless 2010-01-13, 20:00

IndexWriter should show good concurrency, i.e., as you add threads you should see indexing speed up, assuming you have no external synchronization, your hardware has free concurrency, you use a large enough RAM buffer, and you don't commit too frequently.

But you should use a single IndexWriter, which the threads share. Trying to open a different IW per thread will lead to the lock timeout exception.

Mike

On Wed, Jan 13, 2010 at 2:08 PM, jchang <[EMAIL PROTECTED]> wrote:
> I don't specifically need a cluster of servers writing indexes. Actually, at
> the moment, I only have one server, but multiple message consuming threads,
> so I still land back at the same problem of contention for the index lock.
> Why do I have multiple message consumers? Speed...I wanted to dequeue my
> items to be indexed fast. However, I'm getting the impression that may have
> been a foolish effort. I find that only having one writer thread is not
> much slower than having 20, which makes sense if they are all waiting on one
> file. If only one writer thread can be fast enough (which gets rid of
> timeout exceptions that I asked about in a different thread), that that is
> good enough for me.
>
> Do you know what kind of index writes per second I can hope to hit with one
> writer thread? I guess it depends on many factors.
>
> Also, I know 2.9.0 is faster than 2.4.0 (which I'm on), but I'm not sure I
> can move up to 2.9.0 really easily because all my Lucene usage is wrapped in
> Compass, which does not yet support 2.9.0. I think I'd have to rewrite my
> service to use straight Lucene, which might be a good idea, but I can't do
> quickly. We don't use Solr.
>
> Thanks for your help thus far and thanks in advance for any more responses.
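
A minimal sketch of the "one IndexWriter shared by all threads" advice, under stated assumptions. It does not use Lucene: `FakeDirectory` and `FakeWriter` are hypothetical stand-ins, and the `AtomicBoolean` only models the exclusive write lock that a second writer on the same index would fail to obtain. The point is the shape of the design: open one writer, hand it to every worker, close it once.

```java
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

class SharedWriterSketch {
    /** Hypothetical stand-in for an index directory guarded by one write lock. */
    static class FakeDirectory {
        final AtomicBoolean writeLock = new AtomicBoolean(false);
        final ConcurrentLinkedQueue<String> docs = new ConcurrentLinkedQueue<>();
    }

    /** Hypothetical stand-in writer: thread-safe adds, one writer per directory. */
    static class FakeWriter implements AutoCloseable {
        private final FakeDirectory dir;

        FakeWriter(FakeDirectory dir) {
            // Models the exclusive write lock: a second open writer is rejected.
            if (!dir.writeLock.compareAndSet(false, true)) {
                throw new IllegalStateException("write lock already held");
            }
            this.dir = dir;
        }

        void addDocument(String doc) {
            dir.docs.add(doc);
        }

        @Override
        public void close() {
            dir.writeLock.set(false);
        }
    }

    /** Indexes `total` documents from `threads` workers that share ONE writer. */
    static int indexWithSharedWriter(int threads, int total) {
        FakeDirectory dir = new FakeDirectory();
        try (FakeWriter writer = new FakeWriter(dir)) {
            ExecutorService pool = Executors.newFixedThreadPool(threads);
            for (int i = 0; i < total; i++) {
                final int id = i;
                pool.execute(() -> writer.addDocument("doc-" + id));
            }
            pool.shutdown();
            try {
                pool.awaitTermination(30, TimeUnit.SECONDS);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return dir.docs.size();
    }

    /** Returns true if opening a second writer on the same directory fails. */
    static boolean secondWriterIsRejected() {
        FakeDirectory dir = new FakeDirectory();
        try (FakeWriter first = new FakeWriter(dir)) {
            new FakeWriter(dir).close(); // the constructor throws: lock is taken
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }
}
```

With a writer per thread, every worker but one would hit the rejection path, which is the stand-in's analogue of the lock timeout discussed above.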
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
Michael McCandless 2010-01-13, 10:20

On Tue, Jan 12, 2010 at 6:10 PM, jchang <[EMAIL PROTECTED]> wrote:
> Does anybody know how this works out with service restarts (both orderly
> shutdown and a crash)? If the service goes down while indexed items are in
> RAMDir but not on disk, are they lost? Or is there some kind of log
> recovery?

Lucene exposes commit() for exactly this reason: on ungraceful shutdown (i.e., you didn't succeed in calling IndexWriter.close), the index will be at the last successful commit() (or close(), which calls commit internally). As with Zoie, it's still the app's job to replay updates after the last commit.

Note this is fully orthogonal to whether you use an NRT reader or not. NRT reader "simply" lets you search the full index, including un-committed changes.

Mike
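
A minimal sketch of the "replay updates after the last commit" responsibility described above. Everything here is an assumption made for illustration: `FakeIndex` is a hypothetical stand-in rather than a Lucene class, and the idea of recording a sequence number with each commit is one possible way for an app to know where to resume, not something this thread prescribes.

```java
import java.util.ArrayList;
import java.util.List;

class CommitReplaySketch {
    /** Hypothetical stand-in index: `committed` survives a crash, `pending` does not. */
    static class FakeIndex {
        final List<String> committed = new ArrayList<>();
        final List<String> pending = new ArrayList<>();
        long committedSeq = 0; // sequence number recorded with the last commit
        long pendingSeq = 0;   // sequence number of the latest buffered update

        void add(long seq, String doc) {
            pending.add(doc);
            pendingSeq = seq;
        }

        void commit() {
            committed.addAll(pending);
            pending.clear();
            committedSeq = pendingSeq;
        }

        /** Ungraceful shutdown: everything since the last commit is gone. */
        void crash() {
            pending.clear();
            pendingSeq = committedSeq;
        }
    }

    /** Re-applies every logged update numbered above the last committed one. */
    static int replay(FakeIndex index, List<String> log) {
        int replayed = 0;
        for (long seq = index.committedSeq + 1; seq <= log.size(); seq++) {
            index.add(seq, log.get((int) (seq - 1)));
            replayed++;
        }
        index.commit();
        return replayed;
    }
}
```

The app-side log (a message queue, a database table, a file) is what makes recovery possible; the index itself only promises to come back at the last commit.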
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
John Wang 2010-01-13, 12:33

> NRT reader "simply" lets you search the full index, including
> un-committed changes.

I am not sure I understand:

I think the context of the discussion is what happens when the indexer crashes before IW.commit. At that point, it does not really matter whether you are using NRT (e.g. IW.getReader) or IndexReader.open: the uncommitted changes are NOT searchable.

-John
-
Re: Lucene 2.9.0 Near Real Time Indexing and Service Crashes/restarts
Michael McCandless 2010-01-13, 12:57

On Wed, Jan 13, 2010 at 7:33 AM, John Wang <[EMAIL PROTECTED]> wrote:
> "NRT reader "simply" lets you search the full index, including
> un-committed changes."
>
> I am not sure I understand:
>
> I think the context of the discussion is for when the indexer crashes before
> IW.commit. At which point, does not really matter if you are using NRT, e.g.
> IW.getReader, or IndexReader.open, the uncommitted changes are NOT
> searchable.

Right, that's what I meant by "orthogonal".

It used to be (before NRT) that if you wanted to search newly added/deleted docs, you had to first commit, then reopen your reader, which was costly.

Whereas now, with NRT, you're no longer forced to commit in order for a reader to see changes to the index. This makes the decision of "how often to commit" (safety) completely independent from "how often to reopen the reader" (freshness).

How often to commit is entirely a safety vs performance tradeoff, to be made by the app. Commit often and you lose (have to replay) very little on crash, but have worse indexing throughput.

How often to reopen is a freshness vs performance tradeoff.

The two (safety & freshness) are now fully decoupled, but were not in past releases.

Mike
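
A small simulation of the decoupling described above, offered as a sketch rather than a measurement. It assumes a steady stream of adds and two independent schedules; `commitEvery` and `reopenEvery` are hypothetical parameter names, and no Lucene calls are involved. The commit schedule alone bounds how much must be replayed after a crash, and the reopen schedule alone bounds how stale a search can be.

```java
class SafetyFreshnessSketch {
    /** Outcome of running a stream of adds under two independent schedules. */
    static class Result {
        int worstCaseLost;  // most uncommitted docs at any moment (replay cost on crash)
        int worstCaseStale; // most docs not yet visible to the reader at any moment
        int commits;
        int reopens;
    }

    static Result simulate(int totalAdds, int commitEvery, int reopenEvery) {
        Result r = new Result();
        int uncommitted = 0;
        int unseen = 0;
        for (int i = 1; i <= totalAdds; i++) {
            uncommitted++;
            unseen++;
            r.worstCaseLost = Math.max(r.worstCaseLost, uncommitted);
            r.worstCaseStale = Math.max(r.worstCaseStale, unseen);
            if (i % commitEvery == 0) { // safety knob
                uncommitted = 0;
                r.commits++;
            }
            if (i % reopenEvery == 0) { // freshness knob
                unseen = 0;
                r.reopens++;
            }
        }
        return r;
    }
}
```

For example, `simulate(1000, 500, 10)` performs only 2 commits but 100 reopens: searches lag by at most 10 documents while a crash could cost up to 500 replayed updates. Changing one argument leaves the other bound untouched.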
-
Lucene 2.9.0 Near Real Time Indexing and lock timeouts
jchang 2010-01-12, 23:27

Hello,

I am using Lucene 2.4.0 and am getting org.apache.lucene.store.LockObtainFailedException's when I have a backed-up queue of items to index (with multiple concurrent writers). Of course, if I throttle all my writer threads to 1, I don't get the exception, but I'm hoping to write faster than that. Also, when I have multiple index writer threads, it's not that much faster anyhow, I assume because the bottleneck is the one index file.

Will 2.9.0 help? I've read about the near real time indexing. Will this allow me to:
1) Have multiple threads indexing items all at the same time, achieving appreciably faster throughput than only one thread?
2) Avoid getting org.apache.lucene.store.LockObtainFailedException's?
3) Have this all near real time?

And generally, if anybody has any advice on high-throughput indexing with Lucene and what kind of numbers I can achieve, I'd welcome the feedback.

Thanks,
John
-
Re: Lucene 2.9.0 Near Real Time Indexing and lock timeouts
Jason Rutherglen 2010-01-12, 23:39

> And generally, if anybody has any advice on high-throughput indexing with
> Lucene and what kind of numbers I can achieve, I'd welcome the feedback.

I believe it's directly related to how often IW.getReader is called. The longer the duration between calls, the larger the resultant new segment is (which reduces subsequent merging).
-
Re: Lucene 2.9.0 Near Real Time Indexing and lock timeouts
Otis Gospodnetic 2010-01-13, 04:13

John,

Yes, you should get 2.9.0 or 3.0.0, their indexing is faster. Still, even with 2.4.0 you shouldn't run into problems if you are really using just 1 IndexWriter. Still, I'd try upgrading first.

Oh, and java-user is the place to ask.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
-
Re: Lucene 2.9.0 Near Real Time Indexing and lock timeouts
jchang 2010-01-14, 17:15

With only 10 concurrent consumers, I do get lock problems. However, I am calling commit() at the end of each addition. Could I expect better concurrency without timeouts if I did not commit as often?
-
Re: Lucene 2.9.0 Near Real Time Indexing and lock timeouts
Michael McCandless 2010-01-14, 18:09

Calling commit after every addition will drastically slow down your indexing throughput and concurrency (commit is internally synchronized), but it should not create lock timeouts, unless you are also opening a new IndexWriter for every addition?

Mike
-
Re: Lucene 2.9.0 Near Real Time Indexing and lock timeouts
Sanne Grinovero 2010-01-15, 12:30

A common error I see is that people assume the IndexWriter is not thread-safe, and open several different instances. You should use just one IndexWriter, keep it open and flush periodically (not commit at each add operation), and read the Lucene wiki pages about the IndexWriter settings like ramBufferSize. That way there's only one lock, and no contention from different threads.

There's an explanation of the fastest design I could get here:
http://in.relation.to/Bloggers/HibernateSearch32FastIndexRebuild

It describes the procedure used by Hibernate Search for rebuilding the Lucene index from a Hibernate-mapped database. While I recommend it as reading for newcomers, I'd also appreciate feedback and comments from Lucene experts and developers :-)

Regards,
Sanne
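
A minimal sketch of "commit periodically, not at each add", assuming a simple count-based policy. `Committable` and `BatchingCommitter` are hypothetical names introduced here for illustration and are not part of Lucene or Hibernate Search; in a real application the `commit()` callback would delegate to the one long-lived writer.

```java
class BatchedCommitSketch {
    /** Hypothetical hook for whatever performs the real commit. */
    interface Committable {
        void commit();
    }

    /** Commits once every `batchSize` adds, plus once on close for the remainder. */
    static class BatchingCommitter {
        private final Committable writer;
        private final int batchSize;
        private int sinceLastCommit = 0;

        BatchingCommitter(Committable writer, int batchSize) {
            this.writer = writer;
            this.batchSize = batchSize;
        }

        /** Call after each add; synchronized so indexing threads can share it. */
        synchronized void onAdd() {
            if (++sinceLastCommit >= batchSize) {
                writer.commit();
                sinceLastCommit = 0;
            }
        }

        synchronized void close() {
            if (sinceLastCommit > 0) {
                writer.commit();
                sinceLastCommit = 0;
            }
        }
    }

    /** Counts how many commits a run of `adds` additions triggers. */
    static int commitsFor(int adds, int batchSize) {
        int[] commits = {0};
        BatchingCommitter committer = new BatchingCommitter(() -> commits[0]++, batchSize);
        for (int i = 0; i < adds; i++) {
            committer.onAdd();
        }
        committer.close();
        return commits[0];
    }
}
```

With a batch size of 1 this degenerates into the commit-per-add pattern from earlier in the thread (1000 adds, 1000 commits); a batch size of 100 brings the same run down to 10 commits, at the price of having up to 99 updates to replay after a crash.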
|