Guido Bartolucci
2010-01-20, 03:58
Otis Gospodnetic
2010-01-20, 05:15
Ganesh
2010-01-20, 06:13
Otis Gospodnetic
2010-01-20, 06:24
Ganesh
2010-01-20, 07:27
Király Péter
2010-01-20, 08:11
Chris Harris
2010-01-20, 10:30
Erick Erickson
2010-01-20, 13:59
Darren Hartford
2010-01-20, 14:11
Karl Wettin
2010-01-20, 17:35
Chris Lu
2010-01-20, 19:40
Guido Bartolucci
2010-01-20, 21:25
Jacob Rhoden
2010-01-20, 22:09
Erick Erickson
2010-01-20, 22:16
Otis Gospodnetic
2010-01-21, 05:23
Aaron McCurry
2010-01-22, 23:52
-
Lucene as a primary datastore
Guido Bartolucci
2010-01-20, 03:58

I know that the primary use case for Lucene is as an index of data that can be reconstructed (e.g., from a relational database or from spidering your corporate intranet).

But, I'm curious if anyone uses Lucene as their primary datastore for their gold data. Is it good enough?

Would anyone consider (or do people already) store data in Lucene that, if it was lost, would destroy their business? And no, I'm not suggesting that you don't back up this data, I'm just curious if there are problems with using Lucene in this way. Are there subtle corruptions that might show up in Lucene that wouldn't show up in Oracle or MySQL?

I'm considering using Lucene in this way but I haven't been able to find any documentation describing this use case. Are there any studies of Lucene vs MySQL running for N years comparing the corruptions and recovery times?

Am I just ignorant and scared of Lucene and too trusting of Oracle and MySQL?

Thanks.

-guido.

(BTW, I did find a similar question asked back in 2007 in the archives but it doesn't really answer my question)
-
Re: Lucene as a primary datastore
Otis Gospodnetic
2010-01-20, 05:15
You are not alone, Guido. It's a good question. In my experience, Lucene is as stable as MySQL/PostgreSQL in terms of its ability to hold your data and not corrupt it. Of course, even with the most expensive databases, you'd want to make backups. The same goes with Lucene. Nowadays, one way people make "backups" is via replication. :) Solr users thus often get backups for free, as do people who put copies of their data on file systems like HDFS, which tend to have replication turned on.
Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
-
Re: Lucene as a primary datastore
Ganesh
2010-01-20, 06:13

We have data in compound files and we use Lucene as our primary database. It's working great and is much faster with millions of records. The only issue I face is with sorting: Lucene sorting consumes a good amount of memory. I don't know much about MySQL/PostgreSQL or how they behave with millions of records, but I guess their sorting memory consumption would be lower.

It would be great if Lucene had the ability to do backups / replication. I don't know how to modify/use the Solr script.

Regards,
Ganesh
-
Re: Lucene as a primary datastore
Otis Gospodnetic
2010-01-20, 06:24
Have you seen the "Hot Backups with Lucene" paper available via http://www.manning.com/hatcher3/ ?
Otis
-
Re: Lucene as a primary datastore
Ganesh
2010-01-20, 07:27

Thanks, Otis. The download link sent via email points to a file called cemail, with no extension. I tried opening it as HTML and as PDF, but it does not open properly.
Regards,
Ganesh
-
Re: Lucene as a primary datastore
Király Péter
2010-01-20, 08:11

Hi, I have been using Lucene for this same purpose for years.

I import XML files containing records, and in Lucene there is a special field that stores the original XML (this is used for display via XSLT); the other fields are for searching. There is a web form where users can modify the data. When a user clicks "submit", the application re-creates the XML and the normal fields. I have an 'exporter' utility, which can dump the XML fields into one file; this is used for creating a backup of the data. (For archival purposes I prefer to store the non-binary data rather than an older Lucene index (or MySQL table). I have had very bad experiences with incompatibility issues in MS Word, PDF, and similar formats.)

Péter
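An illustrative sketch of the pattern described above, not Péter's actual code: each record keeps its untouched original XML next to the searchable fields, and an exporter dumps only those originals into one file from which everything can be rebuilt. Plain Python dictionaries stand in for Lucene documents here, and the names `make_document`, `export_originals`, `rebuild`, and `original_xml` are hypothetical.

```python
import xml.etree.ElementTree as ET

def make_document(xml_text):
    """Build an index-like record: searchable fields plus the untouched original XML."""
    root = ET.fromstring(xml_text)
    fields = {child.tag: (child.text or "") for child in root}
    fields["original_xml"] = xml_text  # kept only for display and export, never searched
    return fields

def export_originals(documents):
    """Concatenate every stored original into a single XML dump (the backup)."""
    dump = ET.Element("records")
    for doc in documents:
        dump.append(ET.fromstring(doc["original_xml"]))
    return ET.tostring(dump, encoding="unicode")

def rebuild(dump_text):
    """Recreate all records from the dump alone, without the old index."""
    return [make_document(ET.tostring(rec, encoding="unicode"))
            for rec in ET.fromstring(dump_text)]

docs = [
    make_document("<record><title>Lucene in Action</title><year>2010</year></record>"),
    make_document("<record><title>Hot Backups</title><year>2009</year></record>"),
]
dump = export_originals(docs)
restored = rebuild(dump)
```

The point of the design is that the dump is plain text in the source format, so it stays readable even if the index format changes between versions.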
-
Re: Lucene as a primary datastore
Chris Harris
2010-01-20, 10:30
I don't do a lot of work with straight Lucene right now, but I do use Solr, and from time to time the Lucene index inside my master Solr server gets corrupted; in particular, some of the Lucene segment files that are still in use somehow get deleted, resulting in Lucene throwing FileNotFoundExceptions. Once this happens, I have to either rebuild the whole index, or else run the Lucene CheckIndex tool in "fix" mode, which renders the index operable again, but at the expense of throwing away some of the data. This happens rarely, and I haven't been able to diagnose it yet. In the meantime, though, I find it somewhat reassuring to know that my source data is in a SQL table.

I don't know that this experience is relevant to you; my problem could come from a variety of sources outside Lucene, including a potential bug in Solr, and user error on my part. All the same, perhaps it would be worth searching the mailing list archives for FileNotFound, to see what else comes up?
-
Re: Lucene as a primary datastore
Erick Erickson
2010-01-20, 13:59
My preference is to put the effort into preserving the original source on the theory that I'm sure no information is lost that way. So the suitability of Lucene to store it varies depending upon the source IMO.

If it's raw text, then storing all the raw text in an un-indexed field in Lucene might suit you well. But if it's a database, there may be lots of meta-data codified in the design (e.g. foreign keys, required fields, etc) that's hard to preserve outside a DB, and that you may need someday....

So that's the first question I'd explore....

Erick
-
RE: Lucene as a primary datastore
Darren Hartford
2010-01-20, 14:11

My two cents: no, do not use Lucene as a primary datastore. Although there are some datastores that look similar to Lucene and define themselves as primary datastores (the 'nosql' style datastores), I would put Lucene beside the likes of RRD and other specifically purposed information stores that are about providing information and functionality, but are not necessarily the gatekeeper of your raw (gold) data.

Caveat: My definition of raw, or 'gold', data is the detailed data plus the audit/transaction history needed to identify the origin of that detailed data. So it's not just the end result (often called 'current') data; it's all the data on how it got to the current state and how it fluctuates over time.

My two coppers,
-D
-
Re: Lucene as a primary datastore
Karl Wettin
2010-01-20, 17:35

On 20 Jan 2010, at 04:58, Guido Bartolucci wrote:
> Am I just ignorant and scared of Lucene and too trusting of Oracle and MySQL?

Since all your comparisons are with relational databases, I feel obligated to say what has been said so many times on this list: Lucene is an index and not a relational database. There are many things you can do with a relational database that you don't even want to try to do with a Lucene index. What Lucene can do without a problem is act as a key-value store.

My indices do, however, tend to evolve. I find ways to improve the index, and that often results in rebuilding the index from scratch. That can be problematic if the index also acts as the primary persistence layer.

Therefore I've occasionally used Lucene as a secondary persistence layer, i.e. pushed data from the primary persistence layer (usually a Berkeley DB) to Lucene (Solr) for easy and up-to-date distribution of the domain data the index points at. But to be quite honest, I don't like it, and I can't explain why in more detail than that it feels wrong. It's always been a hack to save time. (Since BDB JE came with distribution, I don't do this anymore.)

karl
-
Re: Lucene as a primary datastore
Chris Lu
2010-01-20, 19:40

I have three concerns about using Lucene as a primary database.

1) Lucene is stable when it's stable. But you will have Java exceptions. What would you do when a FileNotFoundException or "Lucene 2.9.1 'read past EOF' IOException under system load" happens? I don't think the data is safe this way, unless you understand all the Lucene APIs and never make any mistakes. Some databases, such as some versions of MySQL, can corrupt data too, so they are not perfect either, but they are still more robust.
2) As the name suggests, a Lucene index is just an index. Like a database index, it's an auxiliary data structure. It's only fast in one way, and could be slow in other ways.
3) The more robust approach is to pull data out of the database and create a Lucene index. In case something goes wrong, you can always pull the data out again and create the index again.

--
Chris Lu
http://www.dbsight.net
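Point 3 above treats the index as derived data that can always be regenerated from the database. A minimal sketch of that idea, under loud assumptions: an in-memory SQLite table stands in for the real database, a toy inverted index (term to set of row ids) stands in for Lucene, and the names `build_index`, `search`, and the `docs` table are hypothetical.

```python
import sqlite3
from collections import defaultdict

def build_index(conn):
    """Derive a tiny inverted index (term -> set of row ids) from the database rows."""
    index = defaultdict(set)
    for row_id, body in conn.execute("SELECT id, body FROM docs"):
        for term in body.lower().split():
            index[term].add(row_id)
    return index

def search(index, term):
    """Return the sorted ids of rows containing the term."""
    return sorted(index.get(term.lower(), set()))

# The database is the source of truth.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO docs (id, body) VALUES (?, ?)",
    [(1, "Lucene is an index"), (2, "MySQL is a database"), (3, "Rebuild the index")],
)
conn.commit()

index = build_index(conn)
index = None               # pretend the index was lost or corrupted
index = build_index(conn)  # recovery is just another full build
```

The cost of this approach is the rebuild time, which grows with the size of the source data; the sketch says nothing about how long that takes for a real index.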
-
Re: Lucene as a primary datastore
Guido Bartolucci
2010-01-20, 21:25

Thanks for the response. I understand all of what you wrote, but what I care about, and what I had a little trouble describing exactly in my previous question, is:

- Are all problems with Lucene obvious (e.g., you get an exception and you know your data is now bad), or are there subtle corruptions that just happen, so that it makes sense to constantly rebuild the index?

I ask because if the problems are not obvious, then replication isn't going to help: they will probably get copied over to the other instances (unless I'm missing something).

guido.
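One generic way to make a silent file-level change visible before it is replicated is to keep a checksum manifest of the index directory and compare it on each run. The sketch below is not a Lucene feature and uses no Lucene API; the file names are made up, and it assumes that index files are never modified in place once written, which should be verified for the Lucene version in use. It also cannot catch data that was written incorrectly in the first place.

```python
import hashlib
import tempfile
from pathlib import Path

def manifest(directory):
    """Map each file name in the directory to the SHA-256 of its contents."""
    return {
        p.name: hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(Path(directory).iterdir())
        if p.is_file()
    }

def changed_files(old, new):
    """Names present in both manifests whose contents differ."""
    return sorted(name for name in old if name in new and old[name] != new[name])

with tempfile.TemporaryDirectory() as tmp:
    index_dir = Path(tmp)
    (index_dir / "segment_a.dat").write_bytes(b"postings for segment a")
    (index_dir / "segment_b.dat").write_bytes(b"postings for segment b")
    before = manifest(index_dir)

    # Simulate a silent single-byte corruption in one file.
    (index_dir / "segment_b.dat").write_bytes(b"postings for segment B")
    after = manifest(index_dir)

    suspicious = changed_files(before, after)
```

A non-empty `suspicious` list would be a reason to stop replication and investigate before the damaged file reaches the other copies.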
-
Re: Lucene as a primary datastoreJacob Rhoden 2010-01-20, 22:09
In the same way that you should take regular exports/dumps of your MySQL databases, you could follow the same strategy with Lucene.

As long as you have code that exports your data daily, and code that can rebuild your index from that export, then in the event of a problem the most you will lose is 24 hours of data, yes?

The whole concept of using Lucene as the data store has also been on my mind, simply because I have some systems where the Lucene index is simply a copy of all of the MySQL data, which makes me wonder why I even bother with the MySQL part (:

Kind regards,
Jacob Rhoden
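The export-and-rebuild strategy described above can be sketched without touching any Lucene API. The example below is a minimal illustration under stated assumptions: each record is a map of field names to string values, the dump is one tab-separated line per record, and the class `DailyExport` with its methods is hypothetical. In a real system the records would be read from the index's stored fields, and the rebuild step would feed the loaded records to an index writer; neither part is shown here.

```java
import java.io.BufferedReader;
import java.io.PrintWriter;
import java.io.Reader;
import java.io.StringWriter;
import java.io.Writer;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Hypothetical export format: one record per line, cells are name, value, name, value, ...
class DailyExport {

    // Escapes the characters that would break the one-record-per-line layout.
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\t", "\\t")
                .replace("\n", "\\n").replace("\r", "\\r");
    }

    // Reverses escape(): a backslash followed by t, n or r becomes the control character.
    static String unescape(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '\\' && i + 1 < s.length()) {
                char next = s.charAt(++i);
                out.append(next == 't' ? '\t' : next == 'n' ? '\n' : next == 'r' ? '\r' : next);
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    // Writes every record as a single line. PrintWriter swallows I/O errors,
    // so a production exporter should call checkError() or use a throwing Writer.
    static void export(List<Map<String, String>> records, Writer out) {
        PrintWriter w = new PrintWriter(out);
        for (Map<String, String> record : records) {
            List<String> cells = new ArrayList<>();
            record.forEach((name, value) -> {
                cells.add(escape(name));
                cells.add(escape(value));
            });
            w.print(String.join("\t", cells));
            w.print('\n');
        }
        w.flush();
    }

    // Reads the dump back into records, preserving field order.
    static List<Map<String, String>> load(Reader in) {
        return new BufferedReader(in).lines().map(line -> {
            Map<String, String> record = new LinkedHashMap<>();
            String[] cells = line.split("\t", -1);
            for (int i = 0; i + 1 < cells.length; i += 2) {
                record.put(unescape(cells[i]), unescape(cells[i + 1]));
            }
            return record;
        }).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "42");
        doc.put("body", "first line\nsecond line");
        StringWriter dump = new StringWriter();
        export(Collections.singletonList(doc), dump);
        System.out.print(dump);
    }
}
```

The point of a plain-text dump is that it can be read without the index that produced it, which is what makes the 24-hour bound on data loss hold even if the index itself is unreadable.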
-
Re: Lucene as a primary datastore Erick Erickson 2010-01-20, 22:16
It depends (tm). From what I've seen on this list, *if* the index gets corrupted, you'll see some exceptions somewhere. They may be head-scratchers, but you'll get exceptions.

But when I've seen this kind of thing reported, it's been because of coding errors. Manually unlocking the IndexWriter and opening a second one in a second process while the first is still running comes to mind. In other words, you have to work at it to corrupt your index.

The Lucene guys understand thoroughly that having an index just decide to corrupt itself is completely unacceptable, and take great pains to ensure it doesn't happen. We typically have very infrequent updates (every quarter or more), and we've never had one just go wonky.

So I wouldn't rebuild my indexes unless
1> the application started going west and you saw exceptions
2> you want to periodically rebuild it to test that you *can*.

Of course having your disk go bad hurts, but frequent rebuilding isn't going to help with that....

FWIW
Erick@IHopeWereOnTrackNow
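The coding error mentioned above, two writers open on the same index at once, is the situation a write lock exists to prevent. The sketch below is an illustration only: it uses `java.nio` file locks to show the idea of refusing a second writer rather than forcibly unlocking the first. `SingleWriterGuard`, its methods, and the lock file path are hypothetical names, and this is not Lucene's own locking code.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical guard: at most one holder per lock file, within and across processes.
class SingleWriterGuard implements AutoCloseable {

    // OS file locks are held per process, so writers inside one JVM need their own check.
    private static final Set<Path> HELD = ConcurrentHashMap.newKeySet();

    private final Path key;
    private final FileChannel channel;
    private final FileLock lock;

    private SingleWriterGuard(Path key, FileChannel channel, FileLock lock) {
        this.key = key;
        this.channel = channel;
        this.lock = lock;
    }

    // Returns a guard if the caller now owns the lock, or null if a writer already holds it.
    static SingleWriterGuard tryAcquire(Path lockFile) {
        Path key = lockFile.toAbsolutePath().normalize();
        if (!HELD.add(key)) {
            return null; // a writer in this JVM already owns the lock
        }
        FileChannel ch = null;
        try {
            ch = FileChannel.open(key, StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            FileLock l = ch.tryLock(); // null when another process holds the lock
            if (l != null) {
                return new SingleWriterGuard(key, ch, l);
            }
            ch.close();
            HELD.remove(key);
            return null;
        } catch (IOException e) {
            HELD.remove(key);
            if (ch != null) {
                try { ch.close(); } catch (IOException ignored) { }
            }
            throw new UncheckedIOException(e);
        }
    }

    // Releases the lock; calling close() a second time does nothing.
    @Override
    public void close() {
        if (!channel.isOpen()) {
            return;
        }
        try {
            lock.release();
            channel.close();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } finally {
            HELD.remove(key);
        }
    }

    public static void main(String[] args) {
        Path lockFile = Paths.get(System.getProperty("java.io.tmpdir"), "example-write.lock");
        try (SingleWriterGuard first = tryAcquire(lockFile)) {
            System.out.println("first writer acquired: " + (first != null));
            System.out.println("second writer acquired: " + (tryAcquire(lockFile) != null));
        }
    }
}
```

The design choice worth noting is that a refused writer gets `null` back and has to wait or give up. Deleting the lock file by hand to get past that refusal is the kind of manual unlocking the message warns about.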
-
Re: Lucene as a primary datastore Otis Gospodnetic 2010-01-21, 05:23
Guido,
No, you should absolutely not need to constantly rebuild the index. If you find you have to do that, you'll know you are doing something wrong.

Otis
--
Sematext -- http://sematext.com/ -- Solr - Lucene - Nutch
-
Re: Lucene as a primary datastore Aaron McCurry 2010-01-22, 23:52
I know that our situation is fairly unique, but we rebuild our indexes weekly. The source of our indexes is a set of data marts generated from flat files. We do this because our data changes too rapidly for us to keep up with the changes. We do update the indexes at runtime, but only with about 10% of the changes. The other changes are reprocessed weekly. So Lucene is our runtime data store for search and data retrieval. However, it is not the system of record.
Aaron
|