Itamar Syn-Hershko
2012-06-13, 00:20
Michael McCandless
2012-06-13, 16:31
Christopher Currens
2012-06-13, 22:41
Itamar Syn-Hershko
2012-06-14, 00:45
Itamar Syn-Hershko
2012-06-14, 00:54
Itamar Syn-Hershko
2012-06-14, 01:13
Michael McCandless
2012-06-14, 12:36
Christopher Currens
2012-06-14, 17:03
Itamar Syn-Hershko
2012-06-14, 17:41
Troy Howard
2012-06-14, 21:36
Michael McCandless
2012-06-15, 00:10
Itamar Syn-Hershko
2012-06-15, 00:14
Itamar Syn-Hershko
2012-06-15, 00:40
Michael McCandless
2012-06-15, 11:32
-
Corrupt index
Itamar Syn-Hershko 2012-06-13, 00:20

Hi Java devs,

I'm a Lucene.Net committer, and there is a chance we have a bug in our FSDirectory implementation that causes indexes to get corrupted when indexing is cut while the IW is still open. As it roots from some retroactive fixes you made, I'd appreciate your feedback.

Correct me if I'm wrong, but by design Lucene should be able to recover rather quickly from power failures or app crashes. Since existing segment files are read only, only new segments that are still being written can get corrupted. Hence, recovering from worst-case scenarios is done by simply removing the write.lock file. The worst that could happen then is having the last segment damaged, and that can be fixed by removing those files, possibly by running CheckIndex on the index.

Last week I have been playing with rather large indexes and crashed my app while it was indexing. I wasn't able to open the index, and Luke was even kind enough to wipe the index folder clean even though I opened it in read-only mode. I re-ran this, and after another crash running CheckIndex revealed nothing - the index was detected to be an empty one. I am not entirely sure what could be the cause for this, but I suspect it has been corrupted by the crash.

I've been looking at these:

https://issues.apache.org/jira/browse/LUCENE-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

And it seems like this is what I was experiencing. Mike and Mark will probably be able to tell if this is what they saw or not, but as far as I can tell this is not an expected behavior of a Lucene index.

What I'm looking for at the moment is some advice on what FSDirectory implementation to use to make sure no corruption can happen. The 3.4 version (which is where LUCENE-3418 was committed to) seems to handle a lot of things the 3.0 doesn't, but on the other hand LUCENE-3418 was introduced by changes made to the 3.0 codebase.

Also, is there any test in the suite checking for those scenarios?

Will appreciate any help on this,
Itamar.
-
Re: Corrupt index
Michael McCandless 2012-06-13, 16:31

Hi Itamar,

One quick question: does Lucene.Net include the fixes done for LUCENE-1044 (to fsync files on commit)? Those are very important for an index to be intact after OS/JVM crash or power loss.

More responses below:

On Tue, Jun 12, 2012 at 8:20 PM, Itamar Syn-Hershko <[EMAIL PROTECTED]> wrote:

> I'm a Lucene.Net committer, and there is a chance we have a bug in our FSDirectory implementation that causes indexes to get corrupted when indexing is cut while the IW is still open. As it roots from some retroactive fixes you made, I'd appreciate your feedback.
>
> Correct me if I'm wrong, but by design Lucene should be able to recover rather quickly from power failures or app crashes. Since existing segment files are read only, only new segments that are still being written can get corrupted. Hence, recovering from worst-case scenarios is done by simply removing the write.lock file. The worst that could happen then is having the last segment damaged, and that can be fixed by removing those files, possibly by running CheckIndex on the index.

You shouldn't even have to run CheckIndex ... because (as of LUCENE-1044) we now fsync all segment files before writing the new segments_N file, and only then remove old segments_N files (and any segments that are no longer referenced).

You do have to remove the write.lock if you aren't using NativeFSLockFactory (but this has been the default lock impl for a while now).

> Last week I have been playing with rather large indexes and crashed my app while it was indexing. I wasn't able to open the index, and Luke was even kind enough to wipe the index folder clean even though I opened it in read-only mode. I re-ran this, and after another crash running CheckIndex revealed nothing - the index was detected to be an empty one. I am not entirely sure what could be the cause for this, but I suspect it has been corrupted by the crash.

Had no commit completed (no segments file written)?

If you don't fsync then all sorts of crazy things are possible...

> I've been looking at these:
>
> https://issues.apache.org/jira/browse/LUCENE-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

(And LUCENE-1044 before that ... it was LUCENE-1044 that LUCENE-2328 broke...).

> And it seems like this is what I was experiencing. Mike and Mark will probably be able to tell if this is what they saw or not, but as far as I can tell this is not an expected behavior of a Lucene index.

Definitely not expected behavior: assuming nothing is flipping bits, then on OS/JVM crash or power loss your index should be fine, just reverted to the last successful commit.

> What I'm looking for at the moment is some advice on what FSDirectory implementation to use to make sure no corruption can happen. The 3.4 version (which is where LUCENE-3418 was committed to) seems to handle a lot of things the 3.0 doesn't, but on the other hand LUCENE-3418 was introduced by changes made to the 3.0 codebase.

Hopefully it's just that you are missing fsync!

> Also, is there any test in the suite checking for those scenarios?

Our test framework has a sneaky MockDirectoryWrapper that, after a test finishes, goes and corrupts any unsync'd files and then verifies the index is still OK. It's good because it will catch any place where we are missing a call to sync. However, it is not low-level enough to catch the case where FSDir fails to actually call fsync (that was the bug in LUCENE-3418).

Mike McCandless
http://blog.mikemccandless.com
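
The commit ordering described above (sync the segment files, then write the new segments_N, then delete the old one) can be sketched with plain java.nio. This is only an illustration under stated assumptions, not Lucene's or Lucene.Net's actual code: the class name `DurableCommitSketch`, the `.dat` file names, and the plain decimal generation suffix are all hypothetical, and syncing the directory entry itself is left out.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of an "fsync, then publish" commit. Not Lucene code.
class DurableCommitSketch {

    // Ask the OS to flush the file's content and metadata to the storage device.
    static void sync(Path file) throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.WRITE)) {
            ch.force(true);
        }
    }

    // Publish commit number `gen`, which references the given segment files.
    static void commit(Path dir, long gen, List<String> liveSegments) throws IOException {
        // 1. Make every referenced segment file durable first.
        for (String name : liveSegments) {
            sync(dir.resolve(name));
        }
        // 2. Only now write and sync the new commit point.
        Path commitPoint = dir.resolve("segments_" + gen);
        Files.write(commitPoint, String.join("\n", liveSegments).getBytes(StandardCharsets.UTF_8));
        sync(commitPoint);
        // 3. The previous commit point is no longer needed.
        Files.deleteIfExists(dir.resolve("segments_" + (gen - 1)));
    }

    // Recovery after a crash: pick the highest commit point present, or -1 if none.
    static long latestCommit(Path dir) throws IOException {
        long latest = -1;
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, "segments_*")) {
            for (Path p : stream) {
                String name = p.getFileName().toString();
                latest = Math.max(latest, Long.parseLong(name.substring("segments_".length())));
            }
        }
        return latest;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("commit-sketch");
        Files.write(dir.resolve("_0.dat"), "first segment".getBytes(StandardCharsets.UTF_8));
        commit(dir, 1, Arrays.asList("_0.dat"));
        Files.write(dir.resolve("_1.dat"), "second segment".getBytes(StandardCharsets.UTF_8));
        commit(dir, 2, Arrays.asList("_0.dat", "_1.dat"));
        System.out.println("latest commit: " + latestCommit(dir));
    }
}
```

With this ordering, a crash before step 2 completes leaves the previous segments_N as the newest commit point, which matches the "reverted to the last successful commit" behavior described in the message. If `sync` silently did nothing, that guarantee would be lost.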
-
Re: Corrupt index
Christopher Currens 2012-06-13, 22:41

Mike, the codebase for Lucene.Net should be almost identical to Java's 3.0.3 release, and LUCENE-1044 is included in that.

Itamar, are you committing the index regularly? I only ask because I can't reproduce it myself by forcibly terminating the process while it's indexing. I've tried both 3.0.3 and 2.9.4. If I don't commit at all and terminate the process (even with 10,000 4K documents created), there will be no documents in the index when I open it in Luke, which I expect. If I commit at 10,000 documents, and terminate it a few thousand after that, the index has the first ten thousand that were committed. I've even terminated it *while* a second commit was taking place, and it still had all of the documents I expected.

It may be that I'm not trying to reproduce it correctly. Do you have a minimal amount of code that can reproduce it?

Thanks,
Christopher
-
Re: Corrupt index
Itamar Syn-Hershko 2012-06-14, 00:45

Mike,

On Wed, Jun 13, 2012 at 7:31 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:

> Hi Itamar,
>
> One quick question: does Lucene.Net include the fixes done for LUCENE-1044 (to fsync files on commit)? Those are very important for an index to be intact after OS/JVM crash or power loss.

Definitely, as Christopher noted we are about to release a 3.0.3 compatible version, which is a line-by-line port of the Java version.

> You shouldn't even have to run CheckIndex ... because (as of LUCENE-1044) we now fsync all segment files before writing the new segments_N file, and then removing old segments_N files (and any segments that are no longer referenced).
>
> You do have to remove the write.lock if you aren't using NativeFSLockFactory (but this has been the default lock impl for a while now).

Somewhat unrelated to this thread, but what should I expect to see? From time to time we do see write.lock present after an app crash or power failure. Also, what are the steps that are expected to be performed in such cases?

> > Last week I have been playing with rather large indexes and crashed my app while it was indexing. I wasn't able to open the index, and Luke was even kind enough to wipe the index folder clean even though I opened it in read-only mode. I re-ran this, and after another crash running CheckIndex revealed nothing - the index was detected to be an empty one. I am not entirely sure what could be the cause for this, but I suspect it has been corrupted by the crash.
>
> Had no commit completed (no segments file written)?
>
> If you don't fsync then all sorts of crazy things are possible...

OK, so we do have fsync since LUCENE-1044 is present, and there were segments present from previous commits. Any idea what went wrong?

> > I've been looking at these:
> >
> > https://issues.apache.org/jira/browse/LUCENE-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
> > https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>
> (And LUCENE-1044 before that ... it was LUCENE-1044 that LUCENE-2328 broke...).

So 2328 broke 1044, and this was fixed only in 3.4, right? So 2328 made it to a 3.0.x release while the fix for it (3418) was only released in 3.4. Am I right?

If this is the case, 2328 probably made its way to Lucene.Net since we are using the released sources for porting, and we now need to apply 3418 in the current version.

Does it make sense to just port FSDirectory from 3.4 to 3.0.3? Or were there API or other changes that will make our life miserable if we do that?

> > And it seems like this is what I was experiencing. Mike and Mark will probably be able to tell if this is what they saw or not, but as far as I can tell this is not an expected behavior of a Lucene index.
>
> Definitely not expected behavior: assuming nothing is flipping bits, then on OS/JVM crash or power loss your index should be fine, just reverted to the last successful commit.

What I suspected. Will try to reproduce reliably - any recommendations? Not really feeling like reinventing the wheel here...

MockDirectoryWrapper wasn't ported yet as it appears to exist only in 3.4, and as you said it won't really help here anyway.
-
Re: Corrupt index
Itamar Syn-Hershko 2012-06-14, 00:54

Christopher,

I used the IndexBuilder app from here https://github.com/synhershko/Talks/tree/master/LuceneNeatThings with an 8.5GB Wikipedia dump.

After running for 2.5 days I had to forcefully close it (infinite loop in the wiki-markdown parser at 92%, go figure), and the 40-something GB index I had by then was unusable. I then was able to reproduce this.

Please note I now added a few safe-guards you might want to remove to make sure the app really crashes on process kill.

I'll try to come up with a better way to reproduce this - hopefully Mike will be able to suggest better ways than manual process kill...
-
Re: Corrupt index
Itamar Syn-Hershko 2012-06-14, 01:13

Yes, reproduced on the first try. See attached program - I referenced it to current trunk.
Re: Corrupt index
Michael McCandless 2012-06-14, 12:36

On Wed, Jun 13, 2012 at 8:45 PM, Itamar Syn-Hershko <[EMAIL PROTECTED]> wrote:

> Mike,
>
> On Wed, Jun 13, 2012 at 7:31 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
>>
>> Hi Itamar,
>>
>> One quick question: does Lucene.Net include the fixes done for LUCENE-1044 (to fsync files on commit)? Those are very important for an index to be intact after OS/JVM crash or power loss.
>
> Definitely, as Christopher noted we are about to release a 3.0.3 compatible version, which is line-by-line port of the Java version.

Hmm OK. Then we still need to explain the corruption...

>> You shouldn't even have to run CheckIndex ... because (as of LUCENE-1044) we now fsync all segment files before writing the new segments_N file, and then removing old segments_N files (and any segments that are no longer referenced).
>>
>> You do have to remove the write.lock if you aren't using NativeFSLockFactory (but this has been the default lock impl for a while now).
>
> Somewhat unrelated to this thread, but what should I expect to see? from time to time we do see write.lock present after an app-crash or power failure. Also, what are the steps that are expected to be performed in such cases?

If you are using NativeFSLockFactory, you will see a write.lock but it will not actually be locked (according to the OS); so, it's fine.

If you are using SimpleFSLockFactory then the presence of write.lock means the index is still locked and you'll have to remove it.

>> > Last week I have been playing with rather large indexes and crashed my app while it was indexing. I wasn't able to open the index, and Luke was even kind enough to wipe the index folder clean even though I opened it in read-only mode. I re-ran this, and after another crash running CheckIndex revealed nothing - the index was detected to be an empty one. I am not entirely sure what could be the cause for this, but I suspect it has been corrupted by the crash.
>>
>> Had no commit completed (no segments file written)?
>>
>> If you don't fsync then all sorts of crazy things are possible...
>
> Ok, so we do have fsync since LUCENE-1044 is present, and there were segments present from previous commits. Any idea what went wrong?

I don't know!

>> > I've been looking at these:
>> >
>> > https://issues.apache.org/jira/browse/LUCENE-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> > https://issues.apache.org/jira/browse/LUCENE-2328?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>
>> (And LUCENE-1044 before that ... it was LUCENE-1044 that LUCENE-2328 broke...).
>
> So 2328 broke 1044, and this was fixed only in 3.4, right? so 2328 made it to a 3.0.x release while the fix for it (3418) was only released in 3.4. Am I right?
>
> If this is the case, 2328 probably made it's way to Lucene.Net since we are using the released sources for porting, and we now need to apply 3418 in the current version.

OK that makes sense: 2328 broke things as of 3.0.3, and 3418 fixed things in 3.4.

> Does it make sense to just port FSDirectory from 3.4 to 3.0.3? or were there API or other changes that will make our life miserable if we do that?

Hmmm I'm not certain offhand: maybe diff the two sources? The fix in 3418 was trivial in the end, so maybe just backport that.

>> > And it seems like this is what I was experiencing. Mike and Mark will probably be able to tell if this is what they saw or not, but as far as I can tell this is not an expected behavior of a Lucene index.
>>
>> Definitely not expected behavior: assuming nothing is flipping bits, then on OS/JVM crash or power loss your index should be fine, just reverted to the last successful commit.
>
> What I suspected. Will try to reproduce reliably - any recommendations? not really feeling like reinventing the wheel here...
>
> MockDirectoryWrapper wasn't ported yet as it appears to only appear in 3.4,

Use a spare computer and try pulling the plug on it ... or pull a (hot swappable/pluggable) hard drive while indexing onto it ...

You can also use a virtual machine and power it off ungracefully / kill the process.

If any of these events can corrupt the index then there's a bug somewhere (or: the IO system ignores fsync).

Mike McCandless
http://blog.mikemccandless.com
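
The difference between the two lock styles can be illustrated with plain java.nio. This is a hedged sketch of the general idea only, not the code of NativeFSLockFactory or SimpleFSLockFactory: the class `LockStyleSketch` and both method names are hypothetical. It assumes a marker-file lock treats the mere existence of write.lock as "locked", while an OS-level lock is tied to an open file handle and disappears when the owning process dies.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.FileAlreadyExistsException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch contrasting marker-file locking with OS-level locking.
class LockStyleSketch {

    // Marker-file style: the lock counts as held whenever the file exists,
    // so a file left behind by a crashed process blocks everyone until removed.
    static boolean tryMarkerLock(Path lockFile) throws IOException {
        try {
            Files.createFile(lockFile);
            return true;
        } catch (FileAlreadyExistsException e) {
            return false;
        }
    }

    // OS-level style: the file may already exist; what matters is whether
    // some live process currently holds a lock on it. Returns the open
    // channel that owns the lock, or null if another process holds it.
    static FileChannel tryOsLock(Path lockFile) throws IOException {
        FileChannel ch = FileChannel.open(lockFile,
                StandardOpenOption.CREATE, StandardOpenOption.WRITE);
        FileLock lock = ch.tryLock();
        if (lock == null) {
            ch.close();
            return null;
        }
        return ch; // closing the channel releases the lock
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("lock-sketch");
        Path lockFile = dir.resolve("write.lock");
        Files.createFile(lockFile); // simulate a file left behind by a crash

        System.out.println("marker lock acquired: " + tryMarkerLock(lockFile));
        FileChannel held = tryOsLock(lockFile);
        System.out.println("OS lock acquired: " + (held != null));
        if (held != null) {
            held.close();
        }
    }
}
```

In this sketch the leftover file defeats the marker-style lock but not the OS-level one, which is consistent with the advice above that a stale write.lock only needs manual removal in the marker-file case.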
-
Re: Corrupt index
Christopher Currens 2012-06-14, 17:03

Well, the only thing I see is that there is no place where writer.Commit() is called in the delegate assigned to corpusReader.OnDocument. I know that Lucene is very transactional, and at least in 3.x, the writer will never auto commit to the index. You can write millions of documents, but if commit is never called, those documents aren't actually part of the index. Committing isn't a cheap operation, so you definitely don't want to do it on every document.

You can test it yourself with this (naive) solution. Right below the writer.SetUseCompoundFile(false) line, add "int numDocsAdded = 0;". At the end of the corpusReader.OnDocument delegate add:

    // Example only. I wouldn't suggest committing this often
    if (++numDocsAdded % 5 == 0)
    {
        writer.Commit();
    }

I had the application crash for real on this file: http://dumps.wikimedia.org/gawiktionary/20120613/gawiktionary-20120613-pages-meta-history.xml.bz2, about 20% into the operation. Without the commit, the index is empty. Add it in, and I get 755 files in the index after it crashes.

Thanks,
Christopher
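
The periodic-commit idea from this message can be written as a small standalone loop. This is a minimal sketch under stated assumptions: `DocWriter` is a hypothetical stand-in interface, not the Lucene or Lucene.Net IndexWriter API, and the batch size is a tuning knob to be chosen by measurement.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of committing every N documents instead of never
// (nothing survives a crash) or on every document (too expensive).
class BatchedCommitSketch {

    // Stand-in for an index writer; only the two calls the loop needs.
    interface DocWriter {
        void addDocument(String doc);
        void commit();
    }

    // Adds all documents, committing after each full batch and once more
    // at the end for any remainder. Returns the number of commits issued.
    static int indexAll(Iterable<String> docs, DocWriter writer, int batchSize) {
        int added = 0;
        int commits = 0;
        for (String doc : docs) {
            writer.addDocument(doc);
            if (++added % batchSize == 0) {
                writer.commit();
                commits++;
            }
        }
        if (added % batchSize != 0) {
            writer.commit(); // make the final partial batch durable too
            commits++;
        }
        return commits;
    }

    public static void main(String[] args) {
        List<String> docs = new ArrayList<>();
        for (int i = 0; i < 12; i++) {
            docs.add("doc-" + i);
        }
        DocWriter printing = new DocWriter() {
            public void addDocument(String doc) { /* a real writer would buffer the document */ }
            public void commit() { System.out.println("commit"); }
        };
        System.out.println("commits issued: " + indexAll(docs, printing, 5));
    }
}
```

With a loop like this, a crash would lose at most the documents added since the last commit, provided the commits themselves are durable.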
-
Re: Corrupt indexItamar Syn-Hershko 2012-06-14, 17:41
I'm quite certain this shouldn't happen even when Commit wasn't called.

Mike, can you comment on that?

On Thu, Jun 14, 2012 at 8:03 PM, Christopher Currens <[EMAIL PROTECTED]> wrote:
> Well, the only thing I see is that there is no place where writer.Commit()
> is called in the delegate assigned to corpusReader.OnDocument. I know that
> Lucene is very transactional, and at least in 3.x, the writer will never
> auto commit to the index. You can write millions of documents, but if
> commit is never called, those documents aren't actually part of the index.
> Committing isn't a cheap operation, so you definitely don't want to do it
> on every document.
>
> You can test it yourself with this (naive) solution. Right below the
> writer.SetUseCompoundFile(false) line, add "int numDocsAdded = 0;". At the
> end of the corpusReader.OnDocument delegate add:
>
>     // Example only. I wouldn't suggest committing this often
>     if (++numDocsAdded % 5 == 0)
>     {
>         writer.Commit();
>     }
>
> I had the application crash for real on this file:
>
> http://dumps.wikimedia.org/gawiktionary/20120613/gawiktionary-20120613-pages-meta-history.xml.bz2,
> about 20% into the operation. Without the commit, the index is empty. Add
> it in, and I get 755 files in the index after it crashes.
>
> Thanks,
> Christopher
-
Re: Corrupt index Troy Howard 2012-06-14, 21:36
> If this is the case, 2328 probably made its way to Lucene.Net since we are
> using the released sources for porting, and we now need to apply 3418 in
> the current version.

Itamar: I confirmed that 2328 is in the latest code.

Thanks,
Troy
-
Re: Corrupt index Michael McCandless 2012-06-15, 00:10
Right: Lucene never autocommits anymore ...

If you create a new index, add a bunch of docs, and things crash before you have a chance to commit, then there is no index (not even a 0 doc one) in that directory.

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jun 14, 2012 at 1:41 PM, Itamar Syn-Hershko <[EMAIL PROTECTED]> wrote:
> I'm quite certain this shouldn't happen also when Commit wasn't called.
>
> Mike, can you comment on that?
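A minimal sketch of the periodic-commit pattern Christopher suggests in this thread, written in plain Java with no Lucene dependency. `DocumentSink`, `BatchingIndexer` and `InMemorySink` are hypothetical names invented for this illustration; they are not Lucene or Lucene.Net types. The in-memory sink only models the behaviour stated in this thread: documents added after the last commit are not part of the index after a crash.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical stand-in for the two writer operations discussed in the
 * thread (add a document, commit). NOT a Lucene or Lucene.Net type.
 */
interface DocumentSink {
    void addDocument(String doc);
    void commit();
}

/** Commits once every batchSize documents instead of after each one. */
final class BatchingIndexer {
    private final DocumentSink sink;
    private final int batchSize;
    private int pending = 0; // documents added since the last commit

    BatchingIndexer(DocumentSink sink, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        this.sink = sink;
        this.batchSize = batchSize;
    }

    void add(String doc) {
        sink.addDocument(doc);
        if (++pending >= batchSize) {
            sink.commit();
            pending = 0;
        }
    }

    /** Commits any remaining documents; call once at the end of a run. */
    void finish() {
        if (pending > 0) {
            sink.commit();
            pending = 0;
        }
    }
}

/** In-memory model (an assumption for illustration): only committed documents survive. */
final class InMemorySink implements DocumentSink {
    private final List<String> buffered = new ArrayList<>();
    private final List<String> committed = new ArrayList<>();
    int commitCount = 0;

    @Override
    public void addDocument(String doc) {
        buffered.add(doc);
    }

    @Override
    public void commit() {
        committed.addAll(buffered);
        buffered.clear();
        commitCount++;
    }

    /** What a reader would see if the process were killed right now. */
    List<String> survivorsAfterCrash() {
        return new ArrayList<>(committed);
    }
}
```

With a batch size of 5 and 12 documents added, two commits happen and 10 documents would survive a kill at that point; calling `finish()` commits the remaining two. The batch size is a trade-off between commit cost and how much work is lost on a crash.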
-
Re: Corrupt index Itamar Syn-Hershko 2012-06-15, 00:14
Not what I'm seeing. I actually see a lot of segments created and merged while it operates. Expected?

Reminding you, this is 2.9.4 / 3.0.3.

On Fri, Jun 15, 2012 at 3:10 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> Right: Lucene never autocommits anymore ...
>
> If you create a new index, add a bunch of docs, and things crash
> before you have a chance to commit, then there is no index (not even a
> 0 doc one) in that directory.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
-
Re: Corrupt index Itamar Syn-Hershko 2012-06-15, 00:40
I can confirm 2.9.4 had autoCommit, but it is gone in 3.0.3 already, so Lucene.Net doesn't have autoCommit.

So I don't have autoCommit set to true, but I can clearly see a segments_1 file there along with the other files. If that helps, it always keeps the name segments_1 and stays at 32 bytes; it never changes.

And again, if I kill the process and try to open the index with Luke 3.3, the index folder is wiped out.

Not sure what to make of all that.

On Fri, Jun 15, 2012 at 3:21 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> Hmm, OK: in 2.9.4 / 3.0.x, if you open IW on a new directory, it will
> make a zero-segment commit. This was changed/fixed in 3.1 with
> LUCENE-2386.
>
> In 2.9.x (not 3.0.x) there is still an autoCommit parameter,
> defaulting to false, but if you set it to true then IndexWriter will
> periodically commit.
>
> Seeing segment files created and merged is definitely expected, but
> it's not expected to see segments_N files unless you pass
> autoCommit=true.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
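To check which commit points actually exist in an index directory after a crash, one option is to list the segments_N files and read the generation from the suffix. The sketch below assumes the suffix is the commit generation written in base 36, which is how Lucene names these files. `CommitPointInspector` and `latestGeneration` are hypothetical names for this illustration, not part of Lucene or Lucene.Net.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class CommitPointInspector {
    private static final String PREFIX = "segments_";

    /**
     * Returns the highest generation found among segments_N files in dir,
     * or -1 if the directory contains no such file.
     * Assumption: N is the generation encoded in base 36.
     */
    static long latestGeneration(Path dir) throws IOException {
        long latest = -1;
        // The glob matches segments_1, segments_a, ... but not segments.gen.
        try (DirectoryStream<Path> files = Files.newDirectoryStream(dir, PREFIX + "*")) {
            for (Path file : files) {
                String suffix = file.getFileName().toString().substring(PREFIX.length());
                try {
                    latest = Math.max(latest, Long.parseLong(suffix, Character.MAX_RADIX));
                } catch (NumberFormatException ignored) {
                    // Suffix is not a base-36 number; not a commit file, skip it.
                }
            }
        }
        return latest;
    }
}
```

A result of 1, as with the unchanging segments_1 file described above, would be consistent with only the initial zero-segment commit having been written; -1 means no segments_N file was found at all.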
-
Re: Corrupt index Michael McCandless 2012-06-15, 11:32
I think the 0-segment segments_1 file is expected in Lucene.Net, since we only changed that later, in Lucene 3.1 (LUCENE-2386)?

Mike McCandless

http://blog.mikemccandless.com

On Thu, Jun 14, 2012 at 8:40 PM, Itamar Syn-Hershko <[EMAIL PROTECTED]> wrote:
> I can confirm 2.9.4 had autoCommit, but it is gone in 3.0.3 already, so
> Lucene.Net doesn't have autoCommit.
>
> So I don't have autoCommit set to true, but I can clearly see a segments_1
> file there along with the other files. If that helps, it always keeps
> the name segments_1 and stays at 32 bytes; it never changes.
>
> And again, if I kill the process and try to open the index with Luke 3.3,
> the index folder is wiped out.
>
> Not sure what to make of all that.
|