|
Ning Li
2006-08-22, 19:45
jason rutherglen
2006-08-22, 21:29
Ning Li
2006-08-29, 20:18
Ning Li
2006-08-30, 00:00
jason rutherglen
2006-08-30, 00:11
Ning Li
2006-08-30, 00:28
Ning Li
2006-09-02, 04:46
Yonik Seeley
2006-09-05, 20:20
Ning Li
2006-09-05, 22:28
Yonik Seeley
2006-09-06, 01:50
Ning Li
2006-09-06, 15:26
Marvin Humphrey
2006-09-06, 15:40
Yonik Seeley
2006-09-06, 16:15
Ning Li
2006-09-06, 16:49
Yonik Seeley
2006-09-06, 17:30
Marvin Humphrey
2006-09-06, 18:35
Yonik Seeley
2006-09-06, 19:06
jason rutherglen
2006-09-06, 19:08
Marvin Humphrey
2006-09-06, 19:40
Yonik Seeley
2006-09-06, 20:37
Ning Li
2006-09-06, 22:54
Ning Li
2006-09-06, 23:23
Marvin Humphrey
2006-09-06, 23:57
Yonik Seeley
2006-09-12, 03:17
Ning Li
2006-09-12, 14:21
Ning Li
2006-12-12, 17:33
|
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Ning Li 2006-08-22, 19:45
> I tested just the IndexWriter from this code base, it does not seem to work.
> NewIndexModifier does work. I simply used IndexWriter to create several > documents and then search for them. Nothing came back even though it seems > something was written to disk. The patch worked until several days ago, when a change was committed to IndexWriter which probably broke the patch. I didn't submit another patch to avoid flooding this issue with too many patches. But if you or anyone else are interested in using/reviewing/committing this patch, I will submit one that works with the latest version. Ning ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)jason rutherglen 2006-08-22, 21:29
Yes I am including this patch as it is very useful for increasing the efficiency of updates as you described. I will be conducting more tests and will post any results. Yes a patch for IndexWriter will be useful so that the entirety of this build will work. Thanks!
----- Original Message ---- From: Ning Li <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Tuesday, August 22, 2006 12:45:00 PM Subject: Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided) > I tested just the IndexWriter from this code base, it does not seem to work. > NewIndexModifier does work. I simply used IndexWriter to create several > documents and then search for them. Nothing came back even though it seems > something was written to disk. The patch worked until several days ago, when a change was committed to IndexWriter which probably broke the patch. I didn't submit another patch to avoid flooding this issue with too many patches. But if you or anyone else are interested in using/reviewing/committing this patch, I will submit one that works with the latest version. Ning ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Ning Li 2006-08-29, 20:18
Could you elaborate?
> Jason Rutherglen commented on LUCENE-565: > ----------------------------------------- > > It seems this writer works, but then some mysterious happens to the index and the searcher can no longer read it. I am using this in conjunction with Solr. The index files look ok, however a search will return nothing. I have seen this repeatedly over about 1 weeks time. ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Ning Li 2006-08-30, 00:00
> (reopen), then perform a batch addDocuments. Then when a search is executed
> nothing is returned, and after an optimize the index goes down to 1K. Seems What did you set maxBufferedDocs to? If it is bigger than the number of documents you inserted, the newly added documents haven't reached disk, therefore a search will return nothing. ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)jason rutherglen 2006-08-30, 00:11
The documents reached disk as a close was performed on the NewIndexModifier and the index size grows, seem like the deleteable files registers the documents as deleted though, so a search returns nothing and an optimize deletes all of the documents. Maybe the new documents have the same docid as the deleted documents and so are registered as deleted.
----- Original Message ---- From: Ning Li <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Tuesday, August 29, 2006 5:00:17 PM Subject: Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided) > (reopen), then perform a batch addDocuments. Then when a search is executed > nothing is returned, and after an optimize the index goes down to 1K. Seems What did you set maxBufferedDocs to? If it is bigger than the number of documents you inserted, the newly added documents haven't reached disk, therefore a search will return nothing. ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Ning Li 2006-08-30, 00:28
> DirectUpdateHandler2. I will create a non-Solr reproduction of the issue.
I'm still not clear how you used the patch. So this will definitely help. ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Ning Li 2006-09-02, 04:46
> I believe this patch probably also changes the merge behavior.
> I think we need to discuss what exactly the new merge behavior is, if it's OK, > what we think the index invariants should be (no more than x segments of y size, > etc), and I'd like to see some code to test those invariants. Yes, the patch does change the merge behaviour. One example was discussed in one of my comments on TestIndexModifier. I think it's a very good idea to think of the index invariants. The index invariants can help check the robustness of any new merge policy. They can also serve as a guideline when we come up with a good merge policy for a version of addIndexes() where optimize() won't be called (LUCENE-528). Here is an outline of the rest of the email: 1) Current Lucene merge policy. 2) Its strengths and weaknesses. The strengths are good candidates for index invariants. 3) Changed merge behaviour in the patch. 1) Current Lucene merge policy TargetMergeDocs is a threshold above which a merge will happen. Initially, it's set to maxBufferedDocs. Starting from the newest segment, sum the doc counts of consecutive segments whose doc count is less than targetMergeDocs. If the sum is less than targetMergeDocs, no merge. Otherwise, merge those segments, multiply targetMergeDocs by mergeFactor, and go through the process again. Another part of a merge policy is how ram segments are flushed during close(). Currently, the doc counts of all the ram segments plus one on-disk segment are summed. If the sum is less than or equal to mergeFactor, those segments are merged and written to disk. Otherwise, only ram segments are merged and written to disk. (Does it make more sense to use maxBufferedDocs as the threshold?) 2) Strengths and weakness of the current merge policy The following notations will be used: B = maxBufferedDocs M = mergeFactor f(n) = floor(log_M (n / B)). If f(n) = c, it means B*(M^c) <= n < B*(M^(c+1)). The current merge policy has two nice properties: 1 If i and i+1 (i older, i+1 newer) are two consecutive segments of doc counts x and y with y >= B, then f(x) >= f(y). In other words, if B*(M^c) <= y < B*(M^(c+1)), then x >= B*(M^c). 2 Less than M number of segments whose doc count y satisfies B*(M^c) <= y < B*(M^(c+1)) for any c >= 0. From property 1 we know segments with the same f(y) are consecutive. They can be proved by induction. It also has two weaknesses: 1 It doesn't take deletes into consideration. DocCount was used above, not numDocs. So it's possible that a large segment with a lot of deleted documents won't get cleaned up until much later. 2 Property 2 above only says about segments whose doc count is >= B. The number of on-disk segments whose doc count is < B could be well above M because of how ram segments are flushed. For example, if B=1000, M=10, and we close the index every time 10 documents are inserted, we could have 99 on-disk segments each with doc count 10. Does it make more sense to use B as the threshold during ram flush instead of M? Or is it better to simply set a max number of on-disk segments whose doc count is < B? 3) Changed merge behaviour in the patch In the patch, when a merge happens, the segments being merged are either all in ram or all on disk, but not both. Because of this, it has a similar weakness as the second weakness of the current merge policy, but worse. The rest are the same: the same two strengths and the same first weakness. I'm looking forward to more discussions on the index invarients and how the current merge policy could be improved etc. So although the patch could be modified to match the exact behaviour of the current merge policy, I'll wait until we reach some agreement. Ning ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Yonik Seeley 2006-09-05, 20:20
On 9/2/06, Ning Li <[EMAIL PROTECTED]> wrote:
> Here is an outline of the rest of the email: > 1) Current Lucene merge policy. > 2) Its strengths and weaknesses. The strengths are good candidates for > index invariants. > 3) Changed merge behaviour in the patch. > > 1) Current Lucene merge policy > TargetMergeDocs is a threshold above which a merge will happen. > Initially, it's set to maxBufferedDocs. Starting from the newest > segment, sum the doc counts of consecutive segments whose doc count is > less than targetMergeDocs. If the sum is less than targetMergeDocs, no > merge. Otherwise, merge those segments, multiply targetMergeDocs by > mergeFactor, and go through the process again. > > Another part of a merge policy is how ram segments are flushed during > close(). Currently, the doc counts of all the ram segments plus one > on-disk segment are summed. If the sum is less than or equal to > mergeFactor, those segments are merged and written to disk. Otherwise, > only ram segments are merged and written to disk. (Does it make more > sense to use maxBufferedDocs as the threshold?) Yes, I think it does make more sense to use maxBufferedDocs. > 2) Strengths and weakness of the current merge policy > The following notations will be used: > B = maxBufferedDocs > M = mergeFactor > f(n) = floor(log_M (n / B)). If f(n) = c, it means B*(M^c) <= n < B*(M^(c+1)). > > The current merge policy has two nice properties: > 1 If i and i+1 (i older, i+1 newer) are two consecutive segments of > doc counts x and y with y >= B, then f(x) >= f(y). In other words, if > B*(M^c) <= y < B*(M^(c+1)), then x >= B*(M^c). > 2 Less than M number of segments whose doc count y satisfies B*(M^c) > <= y < B*(M^(c+1)) for any c >= 0. From property 1 we know segments > with the same f(y) are consecutive. > They can be proved by induction. > > It also has two weaknesses: > 1 It doesn't take deletes into consideration. DocCount was used > above, not numDocs. So it's possible that a large segment with a lot > of deleted documents won't get cleaned up until much later. > 2 Property 2 above only says about segments whose doc count is >= B. > The number of on-disk segments whose doc count is < B could be well > above M because of how ram segments are flushed. For example, if > B=1000, M=10, and we close the index every time 10 documents are > inserted, we could have 99 on-disk segments each with doc count 10. > Does it make more sense to use B as the threshold during ram flush > instead of M? Or is it better to simply set a max number of on-disk > segments whose doc count is < B? What about an invariant that says the number of main index segments with the same level (f(n)) should be less than M. I am concerned about corner cases causing tons of segments and slowing search or causing errors due to file descriptor exhaustion. When merging, maybe we should count the number of segments at a particular index level f(n), rather than adding up the number of documents. In the presence of deletions, this should lead to faster indexing (due to less frequent merges) I think. > 3) Changed merge behaviour in the patch > In the patch, when a merge happens, the segments being merged are > either all in ram or all on disk, but not both. Because of this, it > has a similar weakness as the second weakness of the current merge > policy, but worse. The rest are the same: the same two strengths and > the same first weakness. > > I'm looking forward to more discussions on the index invarients and > how the current merge policy could be improved etc. So although the > patch could be modified to match the exact behaviour of the current > merge policy, I'll wait until we reach some agreement. What is the behavior of your patch under the current scenario: M=10, B=1000 open writer, add 3 docs, close writer open writer, add 1000 docs, close writer Do you avoid the situation of having segments with docs=3 and 1000 (hence f(n) increases as you increase segment numbers... a no-no)? -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Ning Li 2006-09-05, 22:28
> What about an invariant that says the number of main index segments
> with the same level (f(n)) should be less than M. That is exactly what the second property says: "Less than M number of segments whose doc count n satisfies B*(M^c) <n < B*(M^(c+1)) for any c >= 0." In other words, less than M number of segments with the same f(n). > I am concerned about corner cases causing tons of segments and slowing > search or causing errors due to file descriptor exhaustion. > > When merging, maybe we should count the number of segments at a > particular index level f(n), rather than adding up the number of > documents. In the presence of deletions, this should lead to faster > indexing (due to less frequent merges) I think. Given M, B and an index which has L (0 < L < M) segments with docs less than B, how many ram docs should be accumulated before a merge is triggered? B is not good. B-sum(L) is the old strategy which has problems. So between B-sum(L) and B? Once there are M segments with docs less than B, they'll be merged. But what if L=0? Should B ram docs be accumulated before flushed in that case? In any case, if flushing ram docs causes the the number of segments with <B docs to reach M in close(), a merge with those segments should be triggered. > What is the behavior of your patch under the current scenario: > > M=10, B=1000 > open writer, add 3 docs, close writer > open writer, add 1000 docs, close writer > > Do you avoid the situation of having segments with docs=3 and 1000 > (hence f(n) increases as you increase segment numbers... a no-no)? Currently, it does result in segments with docs=3 and 1000. I'll modify the patch so that it completely complies with all the index invariants once an agreement is reached. Ning ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Yonik Seeley 2006-09-06, 01:50
On 9/5/06, Ning Li <[EMAIL PROTECTED]> wrote:
> > What about an invariant that says the number of main index segments > > with the same level (f(n)) should be less than M. > > That is exactly what the second property says: > "Less than M number of segments whose doc count n satisfies B*(M^c) <> n < B*(M^(c+1)) for any c >= 0." > > In other words, less than M number of segments with the same f(n). Ah, I had missed that. But I don't believe that lucene currently obeys this in all cases. > > I am concerned about corner cases causing tons of segments and slowing > > search or causing errors due to file descriptor exhaustion. > > > > When merging, maybe we should count the number of segments at a > > particular index level f(n), rather than adding up the number of > > documents. In the presence of deletions, this should lead to faster > > indexing (due to less frequent merges) I think. > > Given M, B and an index which has L (0 < L < M) segments with docs > less than B, how many ram docs should be accumulated before a merge is > triggered? B is not good. B-sum(L) is the old strategy which has > problems. The new IndexWriter changes ad an additional constraint: to delete documents efficiently, the first merge must be on buffered documents only to ensure that ids don't change. We should also explore changing the index invariants to accommodate this. Do you have any ideas in this area? Is a monotonically decreasing segment level (your f(n)) really required? > So between B-sum(L) and B? Once there are M segments with > docs less than B, they'll be merged. But what if L=0? Should B ram > docs be accumulated before flushed in that case? It seems like it. Examples are easier to visualize sometimes... do you have an example where this wouldn't be advisable? > In any case, if flushing ram docs causes the the number of segments > with <B docs to reach M in close(), a merge with those segments should > be triggered. Right. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Ning Li 2006-09-06, 15:26
> > "Less than M number of segments whose doc count n satisfies B*(M^c) <> > n < B*(M^(c+1)) for any c >= 0."
> > In other words, less than M number of segments with the same f(n). > > Ah, I had missed that. But I don't believe that lucene currently > obeys this in all cases. I think it does hold for n >= B, i.e. c >= 0. But not for n < B. > The new IndexWriter changes ad an additional constraint: to delete > documents efficiently, the first merge must be on buffered documents > only to ensure that ids don't change. We should also explore changing > the index invariants to accommodate this. > > Do you have any ideas in this area? Is a monotonically decreasing > segment level (your f(n)) really required? Currently, the first merge always starts on buffered documents. Do you want this constraint to be reflected in the index invariants, or do you want to remove this constraint? In any case, a monotonically decreasing f(n) is definitely a good thing. Otherwise, cases like a sandwich (segments with small f(n) sandwiched by two segments with large f(n)) make it even harder to come up with a robust merge policy. > > So between B-sum(L) and B? Once there are M segments with > > docs less than B, they'll be merged. But what if L=0? Should B ram > > docs be accumulated before flushed in that case? > > It seems like it. Examples are easier to visualize sometimes... do > you have an example where this wouldn't be advisable? I'm ok with it. I simply wish there were one strategy that would work for both cases. ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Marvin Humphrey 2006-09-06, 15:40
On Sep 5, 2006, at 3:28 PM, Ning Li wrote: > Given M, B and an index which has L (0 < L < M) segments with docs > less than B, how many ram docs should be accumulated before a merge is > triggered? B is not good. B-sum(L) is the old strategy which has > problems. So between B-sum(L) and B? Once there are M segments with > docs less than B, they'll be merged. But what if L=0? Should B ram > docs be accumulated before flushed in that case? So cut the Gordian Knot? http://wiki.apache.org/jakarta-lucene/KinoSearchMergeModel Marvin Humphrey Rectangular Research http://www.rectangular.com/ ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Yonik Seeley 2006-09-06, 16:15
Just brainstorming a little...
Assuming B=1000, M=10 (I think better with concrete examples) It seems like we should avoid unnecessary merging, allowing up to 9 segments of 1000 documents or less w/o merging. When we reach 10 segments, they should be merged into a single segment. Let's assume a segment of size 8500 is created by the merge. Assume we write another 10 full segments that are merged into a bigger segment of size 10,000. It *feels* like: 1) we should be able to write full segments of 1000 docs, or less than that if closing the writer. 2) we should be able to write a full segment of 1000 docs *after* a non-full segment w/o having to merge 3) 10,000 and 8,500 should be at the same index level, not different levels 4) 1000 and 999 docs should be at the same index level So, I *think* most of our hypothetical problems go away with a simple adjustment to f(n): f(n) = floor(log_M((n-1)/B)) Right? That allows us to write all buffered docs separately (necessary for easy deletions), allows us to only merge M segments at a time (decreases number of merges), and allows us to maintain a monotonically decreasing f(n). -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Ning Li 2006-09-06, 16:49
> So, I *think* most of our hypothetical problems go away with a simple
> adjustment to f(n): > > f(n) = floor(log_M((n-1)/B)) Correct. And nice. :-) Equivalently, f(n) = ceil(log_M (n / B)). If f(n) = c, it means B*(M^(c-1)) < n <= B*(M^(c)). So f(n) = 0 means n <= B. ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Yonik Seeley 2006-09-06, 17:30
On 9/6/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> So cut the Gordian Knot? > > http://wiki.apache.org/jakarta-lucene/KinoSearchMergeModel :-) Interesting stuff... So it looks like you have intermediate things that aren't lucene segments, but end up producing valid lucene segments at the end of a session? For Java lucene, I think the biggest indexing gain could be had by not buffering using single doc segments, but something optimized for in-memory single segment creation. The downside is complexity... two sets of "merge" code. It would be interesting to see an IndexWriter2 for full Gordian Knot cutting like you do :-) -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Marvin Humphrey 2006-09-06, 18:35
On Sep 6, 2006, at 10:30 AM, Yonik Seeley wrote: > So it looks like you have intermediate things that aren't lucene > segments, but end up producing valid lucene segments at the end of a > session? That's one way of thinking about it. There's only one "thing" though: a big bucket of serialized index entries. At the end of a session, those are sorted, pulled apart, and used to write the tis, tii, frq, and prx files. Everything else (e.g. stored fields) gets written incrementally as documents get added. The fact that stored fields don't get shuffled around is one of this algorithm's advantages (along with much lower memory requirements, etc). > For Java lucene, I think the biggest indexing gain could be had by not > buffering using single doc segments, but something optimized for > in-memory single segment creation. In theory, you could apply this technique only to a limited number of docs and create segments, say, 10 docs at a time rather than 1 at a time. But then you still have to do something with each 10 doc segment, and you don't get the benefits of less disk shuffling and lower RAM usage. Better to just create 1 segment per session. > The downside is complexity... two > sets of "merge" code. KS doesn't have SegmentMerger. :) > It would be interesting to see an IndexWriter2 for full Gordian Knot > cutting like you do :-) I've already contributed a Java port of KinoSearch's external sorter (along with its tests), which is the crucial piece. The rest isn't easy, but stay tuned. ;) Marvin Humphrey Rectangular Research http://www.rectangular.com/ ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Yonik Seeley 2006-09-06, 19:06
On 9/6/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> On Sep 6, 2006, at 10:30 AM, Yonik Seeley wrote: > > > So it looks like you have intermediate things that aren't lucene > > segments, but end up producing valid lucene segments at the end of a > > session? > > That's one way of thinking about it. There's only one "thing" > though: a big bucket of serialized index entries. At the end of a > session, those are sorted, pulled apart, and used to write the tis, > tii, frq, and prx files. > > Everything else (e.g. stored fields) gets written incrementally as > documents get added. The fact that stored fields don't get shuffled > around is one of this algorithm's advantages (along with much lower > memory requirements, etc). Hmmm, not rewriting stored fields is nice. I guess that could apply to anything that's strictly document specific, such as term vectors. > > For Java lucene, I think the biggest indexing gain could be had by not > > buffering using single doc segments, but something optimized for > > in-memory single segment creation. > > In theory, you could apply this technique only to a limited number of > docs and create segments, say, 10 docs at a time rather than 1 at a > time. But then you still have to do something with each 10 doc > segment, and you don't get the benefits of less disk shuffling and > lower RAM usage. Better to just create 1 segment per session. One should be able to get the bulk of the benefit by buffered 1000 or 10,000 docs at a time though (with increased mem usage of course). One problem with extending it to any number of documents is that the complexity goes up because you can't assume it will all fit in memory. That's fine if you're starting from scratch.... but if one looks at lucene, which already has working segment merging to use when things can't fit in memory, the simplest path toward greater indexing performance is changing all the first-level merging. Of course, if someone like you is willing to take on reworking indexing in general, who am I to complain about the additional effort involved ;-) > > The downside is complexity... two > > sets of "merge" code. > > KS doesn't have SegmentMerger. :) Yeah, I was talking about the downside of my incremental plan... only using a different strategy for the buffered docs, and use segment merging for everything else. Still, how do you deal with multiple sessions w/o being able to merge segments? Do you just keep creating more and more segments? It seems like if you had a way to read a segment into an existing "big bucket", then that's a segment merger. > > It would be interesting to see an IndexWriter2 for full Gordian Knot > > cutting like you do :-) > > I've already contributed a Java port of KinoSearch's external sorter > (along with its tests), which is the crucial piece. The rest isn't > easy, but stay tuned. ;) I definitely will. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)jason rutherglen 2006-09-06, 19:08
Sounds interesting Marvin, I would be willing to test out what you create. I am working on trying creating a rapidly updating index and it sounds like this may help that. I've noticed even using a ramdisk that the whole merging process is quite slow. Maybe also because of the locking that occurs the CPU is not maxed out either. Seems like there is a lot of room for optimization. Cheers.
----- Original Message ---- From: Marvin Humphrey <[EMAIL PROTECTED]> To: [EMAIL PROTECTED] Sent: Wednesday, September 6, 2006 11:35:59 AM Subject: Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided) On Sep 6, 2006, at 10:30 AM, Yonik Seeley wrote: > So it looks like you have intermediate things that aren't lucene > segments, but end up producing valid lucene segments at the end of a > session? That's one way of thinking about it. There's only one "thing" though: a big bucket of serialized index entries. At the end of a session, those are sorted, pulled apart, and used to write the tis, tii, frq, and prx files. Everything else (e.g. stored fields) gets written incrementally as documents get added. The fact that stored fields don't get shuffled around is one of this algorithm's advantages (along with much lower memory requirements, etc). > For Java lucene, I think the biggest indexing gain could be had by not > buffering using single doc segments, but something optimized for > in-memory single segment creation. In theory, you could apply this technique only to a limited number of docs and create segments, say, 10 docs at a time rather than 1 at a time. But then you still have to do something with each 10 doc segment, and you don't get the benefits of less disk shuffling and lower RAM usage. Better to just create 1 segment per session. > The downside is complexity... two > sets of "merge" code. KS doesn't have SegmentMerger. :) > It would be interesting to see an IndexWriter2 for full Gordian Knot > cutting like you do :-) I've already contributed a Java port of KinoSearch's external sorter (along with its tests), which is the crucial piece. The rest isn't easy, but stay tuned. ;) Marvin Humphrey Rectangular Research http://www.rectangular.com/ ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Marvin Humphrey 2006-09-06, 19:40
On Sep 6, 2006, at 12:06 PM, Yonik Seeley wrote: > Hmmm, not rewriting stored fields is nice. > I guess that could apply to anything that's strictly document > specific, such as term vectors. Yes. Remember the old benchmarks I posted a few months ago? KinoSearch's performance was much closer to Lucene when fields and term vectors where turned on. This is why. > One problem with extending it to any number of documents is that the > complexity goes up because you can't assume it will all fit in memory. That problem is 100% solved by the external sorter. In fact, that's the whole reason that the external sorter is needed. > Still, how do you deal with multiple sessions w/o being able to > merge segments? > Do you just keep creating more and more segments? It seems like if > you had a way to read a segment into an existing "big bucket", then > that's a segment merger. The relevant classes have an add_segment() method. Aspects are similar to SegmentMerger. Marvin Humphrey Rectangular Research http://www.rectangular.com/ ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Yonik Seeley 2006-09-06, 20:37
On 9/6/06, Ning Li <[EMAIL PROTECTED]> wrote:
> > So, I *think* most of our hypothetical problems go away with a simple > > adjustment to f(n): > > > > f(n) = floor(log_M((n-1)/B)) > > Correct. And nice. :-) > > Equivalently, > f(n) = ceil(log_M (n / B)). If f(n) = c, it means B*(M^(c-1)) < n <= B*(M^(c)). > > So f(n) = 0 means n <= B. Cool. or for all n>0, f(n) = ceil(log_M(ceil(n/B))) to avoid negative f(n) So what's left... maxMergeDocs I guess. Capping the segment size breaks the simple invariants a bit. If the first M segments of any given f(n) total more than maxMergeDocs, then the total number of segments with that same f(n) may be >= M We also need to be able to handle changes to M and maxMergeDocs between different IndexWriter sessions. When checking for a merge for a particular f(n) level, always checking and merging leftmost* segments first solves some potential problems here. *when I say leftmost, I envision the index with the largest segments on the left. -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Ning Li 2006-09-06, 22:54
> So what's left... maxMergeDocs I guess.
> Capping the segment size breaks the simple invariants a bit. Correct. > We also need to be able to handle changes to M and maxMergeDocs > between different IndexWriter sessions. When checking for a merge for Hmmm. A change of M could easily break the invariants. > a particular f(n) level, always checking and merging leftmost* > segments first solves some potential problems here. Yes, this will help. This means multiple merges may be necessary for a particular f(n) level. ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Ning Li 2006-09-06, 23:23
On 9/6/06, Marvin Humphrey <[EMAIL PROTECTED]> wrote:
> That's one way of thinking about it. There's only one "thing" > though: a big bucket of serialized index entries. At the end of a > session, those are sorted, pulled apart, and used to write the tis, > tii, frq, and prx files. Interesting. When do you add "merge-worthy" segments? I'd guess at the end of a session, when it's easy to decide which segments are "merge-worthy". If so, however, a newer doc could get a smaller docid than an older doc, right? It's a nice property of Lucene that an older doc always has a smaller docid. I think some applications use this to decide newer/older versions of a document. > In theory, you could apply this technique only to a limited number of > docs and create segments, say, 10 docs at a time rather than 1 at a > time. But then you still have to do something with each 10 doc > segment, and you don't get the benefits of less disk shuffling and > lower RAM usage. Better to just create 1 segment per session. This means no new documents are visible to IndexReader until a session is over. In some sense, "1 segment/commit per session" lets an application decide when a "merge" happens. ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Marvin Humphrey 2006-09-06, 23:57
On Sep 6, 2006, at 4:23 PM, Ning Li wrote: > When do you add "merge-worthy" segments? I'd guess at the end of a > session, when it's easy to decide which segments are "merge-worthy". Right. KS sorts the segments by size, then tries to merge the smallest away. The calculation uses the fibonacci series, the idea being to perform the least number of merges while keeping the number of segments manageable. > If so, however, a newer doc could get a smaller docid than an older > doc, right? It's a nice property of Lucene that an older doc always > has a smaller docid. I think some applications use this to decide > newer/older versions of a document. Correct. That information is not preserved with this algorithm. > This means no new documents are visible to IndexReader until a session > is over. In some sense, "1 segment/commit per session" lets an > application decide when a "merge" happens. Yes. And since there's only one class in KinoSearch which modifies the index (InvIndexer), all adds and deletes are committed at the same time. Marvin Humphrey Rectangular Research http://www.rectangular.com/ ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Yonik Seeley 2006-09-12, 03:17
A strange case I just thought of.
Does the new code handle the case where a merge can drop the resulting segment "down a level" (due to deletions)? Example: M=3, B=10, maxMergeDocs=30 1) segment sizes = 30, 30, 30, 30 2) set maxMergeDocs=1000000 3) add enough docs to cause a merge 4) the leftmost 3 segments will be merged first, resulting in segment sizes = 2, 30, n -Yonik http://incubator.apache.org/solr Solr, the open-source Lucene search server ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Ning Li 2006-09-12, 14:21
The new code does handle the case.
After mergeSegments(...) in maybeMergeSegments(), there is the following code: numSegments -= mergeFactor; if (docCount > upperBound) { minSegment++; exceedsUpperLimit = true; } else if (docCount > 0) { numSegments++; } Assume segment sizes = 30, 30, 30, 30, 30 just before the merge of the leftmost 3 segments: - If the merge results in segment sizes = 60, 30, 30, i.e. the merged size is greater than upperBound, continue with the rest of the worthy segments on this level. In this case, only two left, no further merge on this level. - If the merge results in segment sizes = 2, 30, 30, i.e. the merged size is less than or equal to upperBound, consider this segment again for further merges on this same level. In this case, there are three segments, they are merged. Also worth noticing is the while loop which is used to find merge-worthy segments in maybeMergeSegments(). Merge-worthy segments start from the rightmost segment whose doc count is in bounds, and end before the rightmost segment whose doc count is greater than upperBound. Although merge-worthy segments start from one in bounds and the others are <= upperBound, but the other are not necessarily > lowerBound. This only happens if parameters such as mergeFactor and maxBufferedDocs change. So if segment sizes = 2, 30, 30 when the loop is executed with lowerBound 10 and upperBound 30, all three will be considered merge-worthy and be merged. In TestIndexWriterMergePolicy, testMaxBufferedDocsChange() (in fact, both mergeFactor and maxBufferedDocs are changed in this test case) tests both scenarios. Ning On 9/11/06, Yonik Seeley <[EMAIL PROTECTED]> wrote: > A strange case I just thought of. > Does the new code handle the case where a merge can drop the resulting > segment "down a level" (due to deletions)? > > Example: M=3, B=10, maxMergeDocs=30 > > 1) segment sizes = 30, 30, 30, 30 > 2) set maxMergeDocs=1000000 > 3) add enough docs to cause a merge > 4) the leftmost 3 segments will be merged first, resulting in segment > sizes = 2, 30, n > > -Yonik > http://incubator.apache.org/solr Solr, the open-source Lucene search server > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [EMAIL PROTECTED] > For additional commands, e-mail: [EMAIL PROTECTED] > > ---------------------------------------------------------------------
-
Re: [jira] Commented: (LUCENE-565) Supporting deleteDocuments in IndexWriter (Code and Performance Results Provided)Ning Li 2006-12-12, 17:33
Thanks for the comments Yonik!
> To minimize the number of reader open/closes on large persistent segments, I think the ability to apply deletes only before a merge is important. That might add a 4th method: doBeforeMerge() I'm not sure I get this. Buffered deletes are only applied(flushed) during ram flush. No buffered deletes are applied in the merges of on-disk segments. > It would be nice to not have to continually open and close readers on segments that aren't involved in a merge. Is there a way to do this? > If SegmentInfos had a cached reader, that seems like it would solve both problems. > I haven't thought about it enough to figure out how doable it is though. This is a good idea! One concern, however, is that caching readers will cause a larger memory footprint. Is it acceptable? > Ning, please do produce another patch to the latest trunk (but you might want to wait until after LUCENE-702 is sorted out. Will do. --------------------------------------------------------------------- |