-
DocumentsWriter.checkMaxTermLength issues
Grant Ingersoll 2007-12-20, 14:41

I am getting the following exception when running against trunk:

    java.lang.IllegalArgumentException: at least one term (length 20079) exceeds max term length 16383; these terms were skipped
        at org.apache.lucene.index.IndexWriter.checkMaxTermLength(IndexWriter.java:1545)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1451)
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1411)
        ....

I'm wondering if the IndexWriter should throw an explicit exception in this case as opposed to a RuntimeException, as it seems to me really long tokens should be handled more gracefully. It seems strange that the message says the terms were skipped (which the code does in fact do), but then there is a RuntimeException thrown, which usually indicates to me the issue is not recoverable. I am using the StandardTokenizer, but I don't think that much matters.

Any thoughts on this?

-Grant
-
Re: DocumentsWriter.checkMaxTermLength issues
Yonik Seeley 2007-12-20, 14:44

On Dec 20, 2007 9:41 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> I'm wondering if the IndexWriter should throw an explicit exception in
> this case as opposed to a RuntimeException, as it seems to me really
> long tokens should be handled more gracefully. It seems strange that
> the message says the terms were skipped (which the code does in fact
> do), but then there is a RuntimeException thrown which usually
> indicates to me the issue is not recoverable. I am using the
> StandardTokenizer, but I don't think that much matters.
>
> Any thoughts on this?

I think it's good to bring attention to it and not sweep it under the rug. It indicates potential issues or problems with analysis or the data. The user can use a LengthFilter to explicitly throw long tokens away.

-Yonik
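To illustrate the LengthFilter suggestion above, here is a minimal standalone sketch in plain Java. It only models the idea (drop tokens outside a length range before they reach the index). `LengthFilterSketch` and `filterByLength` are hypothetical names, not Lucene's API; the real LengthFilter is a TokenFilter, and its constructor signature depends on the Lucene version.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone model of length filtering; this is NOT Lucene's LengthFilter class.
final class LengthFilterSketch {

    // Keep only tokens whose length lies in [min, max]; everything else is dropped silently.
    static List<String> filterByLength(List<String> tokens, int min, int max) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            int len = token.length();
            if (len >= min && len <= max) {
                kept.add(token);
            }
        }
        return kept;
    }
}
```

Applied ahead of indexing, a filter like this means an over-long token never reaches the writer, so the exception discussed in this thread would not be triggered by it.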
-
Re: DocumentsWriter.checkMaxTermLength issues
Yonik Seeley 2007-12-20, 15:55

On Dec 20, 2007 9:41 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> I'm wondering if the IndexWriter should throw an explicit exception in
> this case as opposed to a RuntimeException,

RuntimeExceptions can happen in analysis components during indexing anyway, so it seems like indexing code should deal with exceptions just to be safe. As long as exceptions happening during indexing don't mess up the indexing code, everything should be OK.

> as it seems to me really
> long tokens should be handled more gracefully. It seems strange that
> the message says the terms were skipped (which the code does in fact
> do), but then there is a RuntimeException thrown which usually
> indicates to me the issue is not recoverable.

It does seem like the document shouldn't be added at all if it caused an exception. Is that what happens if one of the analyzers causes an exception to be thrown?

The other option is to simply ignore tokens above 16K... I'm not sure what's right here.

-Yonik
-
Re: DocumentsWriter.checkMaxTermLength issues
Michael McCandless 2007-12-20, 16:13

Yonik Seeley wrote:
>> as it seems to me really
>> long tokens should be handled more gracefully. It seems strange that
>> the message says the terms were skipped (which the code does in fact
>> do), but then there is a RuntimeException thrown which usually
>> indicates to me the issue is not recoverable.
>
> It does seem like the document shouldn't be added at all if it caused
> an exception.
> Is that what happens if one of the analyzers causes an exception to
> be thrown?
>
> The other option is to simply ignore tokens above 16K... I'm not sure
> what's right here.

Right now we are ignoring the too-long tokens and adding the rest. Unfortunately, because DocumentsWriter directly updates the posting lists in RAM, it's very difficult to "undo" those tokens we have already successfully processed & added to the posting lists.

Mike
-
Re: DocumentsWriter.checkMaxTermLength issues
Michael McCandless 2007-12-20, 16:15

Yonik Seeley wrote:
> It does seem like the document shouldn't be added at all if it caused
> an exception.
> Is that what happens if one of the analyzers causes an exception to
> be thrown?
>
> The other option is to simply ignore tokens above 16K... I'm not sure
> what's right here.

Though ... we could simply immediately delete the document when any exception occurs during its processing. So if we think whenever any doc hits an exception, then it should be deleted, it's not so hard to implement that policy...

Mike
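The "delete the document when any exception occurs" policy can be sketched with a toy in-memory index. This is an illustrative model only, not DocumentsWriter: `DeleteOnErrorSketch`, its fields, and the use of `>` against 16383 as the limit check are assumptions made for the example. Postings are updated as tokens arrive (so they cannot easily be undone), and a failed document is marked deleted instead.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Toy model: postings are written eagerly; a document that fails is marked deleted.
final class DeleteOnErrorSketch {
    static final int MAX_TERM_LENGTH = 16383; // limit quoted in the exception message

    private final Map<String, List<Integer>> postings = new HashMap<>();
    private final Set<Integer> deleted = new HashSet<>();
    private int nextDocId = 0;

    // Adds a document; on any runtime failure the partially added doc is deleted.
    int addDocument(List<String> tokens) {
        int docId = nextDocId++;
        try {
            for (String token : tokens) {
                if (token.length() > MAX_TERM_LENGTH) {
                    throw new IllegalArgumentException(
                            "term length " + token.length() + " exceeds " + MAX_TERM_LENGTH);
                }
                postings.computeIfAbsent(token, k -> new ArrayList<>()).add(docId);
            }
            return docId;
        } catch (RuntimeException e) {
            deleted.add(docId); // earlier tokens stay in the postings but are hidden
            throw e;
        }
    }

    // Returns ids of live (non-deleted) documents containing the term.
    List<Integer> search(String term) {
        List<Integer> hits = new ArrayList<>();
        for (int id : postings.getOrDefault(term, Collections.<Integer>emptyList())) {
            if (!deleted.contains(id)) {
                hits.add(id);
            }
        }
        return hits;
    }
}
```

The design choice mirrored here is that the caller still sees the exception, but a search never returns a document that was only partially indexed.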
-
Re: DocumentsWriter.checkMaxTermLength issues
Yonik Seeley 2007-12-20, 16:27

On Dec 20, 2007 11:15 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> Though ... we could simply immediately delete the document when any
> exception occurs during its processing. So if we think whenever any
> doc hits an exception, then it should be deleted, it's not so hard to
> implement that policy...

It does seem like you only want documents in the index that didn't generate exceptions... otherwise it doesn't seem like you would know exactly what got indexed.

-Yonik
-
Re: DocumentsWriter.checkMaxTermLength issues
Gabi Steinberg 2007-12-20, 16:33

It might be a bit harsh to drop the document if it has a very long token in it. I can imagine documents with embedded binary data, where the text around the binary data is still useful for search.

My feeling is that long tokens (longer than 128 or 256 bytes) are not useful for search, and should be truncated or dropped.

Gabi.
-
Re: DocumentsWriter.checkMaxTermLength issues
Michael McCandless 2007-12-20, 16:33

Yonik Seeley wrote:
> On Dec 20, 2007 11:15 AM, Michael McCandless
> <[EMAIL PROTECTED]> wrote:
>> Though ... we could simply immediately delete the document when any
>> exception occurs during its processing. So if we think whenever any
>> doc hits an exception, then it should be deleted, it's not so hard to
>> implement that policy...
>
> It does seem like you only want documents in the index that didn't
> generate exceptions... otherwise it doesn't seem like you would know
> exactly what got indexed.

I agree -- I'll work on this.

Mike
-
Re: DocumentsWriter.checkMaxTermLength issues
Grant Ingersoll 2007-12-20, 16:36

On Dec 20, 2007, at 10:55 AM, Yonik Seeley wrote:
> It does seem like the document shouldn't be added at all if it caused
> an exception.
> Is that what happens if one of the analyzers causes an exception to
> be thrown?
>
> The other option is to simply ignore tokens above 16K... I'm not sure
> what's right here.

+1. The code already does ignore them, that is why the exception seems so weird. DocsWriter gracefully handles the problem, but then throws up after the fact. I would vote to just log it or let the user decide somehow.
-
Re: DocumentsWriter.checkMaxTermLength issues
Yonik Seeley 2007-12-20, 16:39

On Dec 20, 2007 11:33 AM, Gabi Steinberg <[EMAIL PROTECTED]> wrote:
> It might be a bit harsh to drop the document if it has a very long token
> in it.

There are really two issues here.
For long tokens, one could either ignore them or generate an exception.

For all exceptions generated while indexing a document (that are passed through to the user) it seems like that document should not be in the index.

-Yonik
-
Re: DocumentsWriter.checkMaxTermLength issues
Michael McCandless 2007-12-20, 16:57

Yonik Seeley wrote:
> On Dec 20, 2007 11:33 AM, Gabi Steinberg
> <[EMAIL PROTECTED]> wrote:
>> It might be a bit harsh to drop the document if it has a very long
>> token in it.
>
> There are really two issues here.
> For long tokens, one could either ignore them or generate an
> exception.

I can see the argument both ways. On the one hand, we want indexing to be robust/resilient, such that massive terms are quietly skipped (maybe w/ a log to infoStream if it's set).

On the other hand, clearly there is something seriously wrong when your analyzer is producing a single 16+ KB term, and so it would be nice to be brittle/in-your-face so the user is forced to deal with/correct the situation.

Also, it's really bad once these terms pollute your index. E.g., suddenly the TermInfos index can easily take tremendous amounts of RAM, slow down indexing/merging/searching, etc. This is why LUCENE-1052 was created. It's a lot better if you catch this up front than letting it pollute your index.

If we want to take the "in your face" solution, I think the cutoff should be less than 16 KB (16 KB is just the hard limit inside DW).

> For all exceptions generated while indexing a document (that are
> passed through to the user)
> it seems like that document should not be in the index.

I like this disposition because it means the index is in a known state. It's bad to have partial docs in the index: it can only lead to more confusion as people try to figure out why some terms work for retrieving the doc but others don't.

Mike
-
Re: DocumentsWriter.checkMaxTermLength issues
Yonik Seeley 2007-12-20, 17:04

On Dec 20, 2007 11:57 AM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> Yonik Seeley wrote:
> > There are really two issues here.
> > For long tokens, one could either ignore them or generate an
> > exception.
>
> I can see the argument both ways.

Me too.

> On the one hand, we want indexing
> to be robust/resilient, such that massive terms are quietly skipped
> (maybe w/ a log to infoStream if it's set).
>
> On the other hand, clearly there is something seriously wrong when
> your analyzer is producing a single 16+ KB term, and so it would be
> nice to be brittle/in-your-face so the user is forced to deal with/
> correct the situation.
>
> Also, it's really bad once these terms pollute your index. E.g.,
> suddenly the TermInfos index can easily take tremendous amounts of
> RAM, slow down indexing/merging/searching, etc. This is why
> LUCENE-1052 was created. It's a lot better if you catch this up front
> than letting it pollute your index.
>
> If we want to take the "in your face" solution, I think the cutoff
> should be less than 16 KB (16 KB is just the hard limit inside DW).
>
> > For all exceptions generated while indexing a document (that are
> > passed through to the user)
> > it seems like that document should not be in the index.
>
> I like this disposition because it means the index is in a known
> state. It's bad to have partial docs in the index: it can only lead
> to more confusion as people try to figure out why some terms work for
> retrieving the doc but others don't.

Right... and I think that was the behavior before the indexing code was rewritten, since the new single-doc segment was only added after the complete document was inverted (hence any exception would prevent it from being added).

-Yonik
-
Re: DocumentsWriter.checkMaxTermLength issues
Gabi Steinberg 2007-12-20, 17:58

On balance, I think that dropping the document makes sense. I think Yonik is right in that ensuring that keys are useful - and indexable - is the tokenizer's job.

StandardTokenizer, in my opinion, should behave similarly to a person looking at a document and deciding which tokens should be indexed. Few people would argue that a 16K block of binary data is useful for searching, but it's reasonable to suggest that the text around it is useful.

I know that one can add the LengthFilter to avoid this problem, but this is not really intuitive; one does not expect the standard tokenizer to generate tokens that IndexWriter chokes on.

My vote is to:
- drop documents with tokens longer than 16K, as Mike and Yonik suggested
- because an uninformed user would start with StandardTokenizer, I think it should limit token size to 128 bytes, and add options to change that size, choose between truncating or dropping longer tokens, and in no case produce tokens longer than what IndexWriter can digest.
- perhaps come up with a clear policy on when a tokenizer should throw an exception?

Gabi Steinberg.
-
Re: DocumentsWriter.checkMaxTermLength issues
Grant Ingersoll 2007-12-20, 18:36

On Dec 20, 2007, at 11:57 AM, Michael McCandless wrote:
> Yonik Seeley wrote:
>> There are really two issues here.
>> For long tokens, one could either ignore them or generate an
>> exception.
>
> I can see the argument both ways. On the one hand, we want indexing
> to be robust/resilient, such that massive terms are quietly skipped
> (maybe w/ a log to infoStream if it's set).

This would be fine for me. In some sense, it is just like applying the LengthFilter, which removes tokens silently, too, but works for all analyzers. But I can see the value in the throw-the-exception case too, except I think the API should declare the exception is being thrown. It could throw an extension of IOException.
-
Re: DocumentsWriter.checkMaxTermLength issues
Yonik Seeley 2007-12-20, 18:47

On Dec 20, 2007 1:36 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> But, I can see the value in the throw the exception
> case too, except I think the API should declare the exception is being
> thrown. It could throw an extension of IOException.

To be robust, user indexing code needs to catch other types of exceptions that could be thrown from Analyzers anyway.

I don't think this exception (if we choose to keep it as an exception) fits in the class of IOException, where something is normally really wrong.

We could declare addDocument() to throw something inherited from RuntimeException though, right?

-Yonik
-
Re: DocumentsWriter.checkMaxTermLength issues
Grant Ingersoll 2007-12-20, 19:25

Makes sense. I wasn't sure if declaring new exceptions to be thrown is violating back-compat. issues or not (even if they are runtime exceptions)
-
Re: DocumentsWriter.checkMaxTermLength issues
Yonik Seeley 2007-12-20, 19:43

On Dec 20, 2007 2:25 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Makes sense. I wasn't sure if declaring new exceptions to be thrown
> is violating back-compat. issues or not (even if they are runtime
> exceptions)

That's a good question... I know that declared RuntimeExceptions are contained in the bytecode (the method signature)... but I don't know if they need to match up exactly for things to work.

To be safe I guess we should start out with it commented out (or just documented in the JavaDoc).

-Yonik
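On the question of declaring a RuntimeException in a throws clause, the sketch below (all names hypothetical) shows three things: Java accepts an unchecked exception in a throws clause, callers are not forced to catch it, and reflection reports it as declared on the method. It does not by itself demonstrate binary compatibility with already-compiled callers, so treat it as an illustration of compile-time behavior only.

```java
import java.lang.reflect.Method;

// Illustration only: an unchecked exception named in a throws clause.
final class DeclaredUncheckedSketch {

    // Legal Java: the throws clause lists an unchecked exception, mainly as documentation.
    static void addTerm(String term) throws IllegalArgumentException {
        if (term.isEmpty()) {
            throw new IllegalArgumentException("empty term");
        }
    }

    // Compiles without try/catch because IllegalArgumentException is unchecked.
    static void caller() {
        addTerm("lucene");
    }

    // Returns the exception types recorded as declared for addTerm.
    static Class<?>[] declaredExceptions() {
        try {
            Method m = DeclaredUncheckedSketch.class.getDeclaredMethod("addTerm", String.class);
            return m.getExceptionTypes();
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException(e);
        }
    }
}
```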
-
Re: DocumentsWriter.checkMaxTermLength issues
Michael McCandless 2007-12-20, 20:08

OK I will take this approach... create TermTooLongException (subclasses RuntimeException), listed in the javadocs but not the throws clause of add/updateDocument. DW throws this if it encounters any term >= 16383 chars in length. Whenever that exception (or any other) is thrown from within DW, it means that document will not be added to your index (well, perhaps partially added and then deleted).

Probably won't get going on this one until early next year ... I'm mostly offline from 12/22 - 1/1.

Mike
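A minimal sketch of the proposal, assuming nothing about how it was eventually implemented in Lucene: `TermTooLongException` follows the name used in the thread, while `TermLengthCheckSketch`, `checkTerm`, and the exact comparison against 16383 are assumptions for illustration (the comparison follows the "exceeds max term length 16383" wording of the original exception message). The exception is unchecked and documented in the Javadoc rather than in the throws clause.

```java
// Hypothetical unchecked exception, named after the proposal in this thread.
final class TermTooLongException extends RuntimeException {
    TermTooLongException(int length, int max) {
        super("term length " + length + " exceeds max term length " + max);
    }
}

final class TermLengthCheckSketch {
    static final int MAX_TERM_LENGTH = 16383; // limit quoted in the exception message

    /**
     * Validates a single term before it is indexed.
     *
     * @throws TermTooLongException if the term is longer than MAX_TERM_LENGTH
     *         (documented here only; deliberately absent from the throws clause)
     */
    static void checkTerm(String term) {
        if (term.length() > MAX_TERM_LENGTH) {
            throw new TermTooLongException(term.length(), MAX_TERM_LENGTH);
        }
    }
}
```

Because the exception extends RuntimeException, existing callers of an add/update method would keep compiling unchanged, and callers that care can catch the specific type.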
-
Re: DocumentsWriter.checkMaxTermLength issues
Michael McCandless 2007-12-20, 20:13

Gabi Steinberg wrote:
> On balance, I think that dropping the document makes sense. I
> think Yonik is right in that ensuring that keys are useful - and
> indexable - is the tokenizer's job.
>
> StandardTokenizer, in my opinion, should behave similarly to a
> person looking at a document and deciding which tokens should be
> indexed. Few people would argue that a 16K block of binary data is
> useful for searching, but it's reasonable to suggest that the text
> around it is useful.
>
> I know that one can add the LengthFilter to avoid this problem, but
> this is not really intuitive; one does not expect the standard
> tokenizer to generate tokens that IndexWriter chokes on.
>
> My vote is to:
> - drop documents with tokens longer than 16K, as Mike and Yonik
> suggested
> - because an uninformed user would start with StandardTokenizer, I
> think it should limit token size to 128 bytes, and add options to
> change that size, choose between truncating or dropping longer
> tokens, and in no case produce tokens longer than what IndexWriter
> can digest.

I like this idea, though we probably can't do that until 3.0 so we don't break backwards compatibility?

> - perhaps come up with a clear policy on when a tokenizer should throw
> an exception?
-
Re: DocumentsWriter.checkMaxTermLength issues
Gabi Steinberg 2007-12-20, 21:52

How about defaulting to a max token size of 16K in StandardTokenizer, so that it never causes an IndexWriter exception, with an option to reduce that size?

The backward incompatibility is limited then - tokens exceeding 16K will NOT cause an IndexWriter exception. In 3.0 we can reduce that default to a useful size.

The option to truncate the token can be useful, I think. It will index the max size prefix of the long tokens. You can still find them, pretty accurately - this becomes a prefix search, but is unlikely to return multiple values because it's a long prefix. It allows you to choose a relatively small max, such as 32 or 64, reducing the overhead caused by junk in the documents while minimizing the chance of not finding something.

Gabi.
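The truncate-or-drop option described above can be sketched as a small standalone function. This is not StandardTokenizer's API: `LongTokenPolicySketch`, `Policy`, and `apply` are hypothetical names, and the sketch counts Java chars rather than bytes.

```java
import java.util.ArrayList;
import java.util.List;

// Standalone model of a configurable policy for over-long tokens.
final class LongTokenPolicySketch {

    enum Policy { TRUNCATE, DROP }

    // Tokens within maxLen pass through; longer ones are cut to a prefix or removed.
    static List<String> apply(List<String> tokens, int maxLen, Policy policy) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() <= maxLen) {
                out.add(token);
            } else if (policy == Policy.TRUNCATE) {
                out.add(token.substring(0, maxLen)); // index only the max-size prefix
            }
            // Policy.DROP: the over-long token is skipped entirely
        }
        return out;
    }
}
```

With TRUNCATE, a query for the full long token would have to be truncated the same way (or run as a prefix search) to match, which is the trade-off described in the message above.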
-
Re: DocumentsWriter.checkMaxTermLength issues
Michael McCandless 2007-12-21, 20:46
I think this is a good approach -- any objections?

This way, IndexWriter is in-your-face (throws TermTooLongException on seeing a massive term), but StandardAnalyzer is robust (silently skips or prefixes the too-long terms).

Mike

Gabi Steinberg wrote:
> How about defaulting to a max token size of 16K in
> StandardTokenizer, so that it never causes an IndexWriter
> exception, with an option to reduce that size?
-
Re: DocumentsWriter.checkMaxTermLength issues
Doron Cohen 2007-12-24, 06:10
I like the approach of configuring this behavior in Analysis (so that IndexWriter can throw an exception on such errors).

It seems that this should be a property of Analyzer rather than just StandardAnalyzer, right?

It can probably be a "policy" property, with two parameters:
1) maxLength,
2) action: chop/split/ignore/raiseException when generating too-long tokens.

Doron

On Dec 21, 2007 10:46 PM, Michael McCandless wrote:
> I think this is a good approach -- any objections?
>
> This way, IndexWriter is in-your-face (throws TermTooLongException on
> seeing a massive term), but StandardAnalyzer is robust (silently
> skips or prefix's the too-long terms).
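The two-parameter "policy" proposed in this message (a maxLength plus an action) can be sketched as follows. This is an illustration in plain Java under stated assumptions, not an existing Lucene API: `LongTokenPolicySketch`, the `Action` enum, and `apply` are hypothetical names, and tokens are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration of a "maxLength + action" policy; not part of Lucene.
class LongTokenPolicySketch {

    // The four actions named in the message: chop/split/ignore/raiseException.
    enum Action { CHOP, SPLIT, IGNORE, RAISE }

    static List<String> apply(List<String> tokens, int maxLength, Action action) {
        if (maxLength <= 0) {
            throw new IllegalArgumentException("maxLength must be positive");
        }
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.length() <= maxLength) {
                out.add(t);                               // short enough: pass through
                continue;
            }
            switch (action) {
                case CHOP:                                // keep only the leading prefix
                    out.add(t.substring(0, maxLength));
                    break;
                case SPLIT:                               // cut into maxLength-sized pieces
                    for (int i = 0; i < t.length(); i += maxLength) {
                        out.add(t.substring(i, Math.min(t.length(), i + maxLength)));
                    }
                    break;
                case IGNORE:                              // drop the token silently
                    break;
                case RAISE:                               // treat it as an error
                    throw new IllegalArgumentException(
                            "token of length " + t.length() + " exceeds max " + maxLength);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("ab", "abcdefg");
        for (Action a : new Action[] { Action.CHOP, Action.SPLIT, Action.IGNORE }) {
            System.out.println(a + " -> " + apply(tokens, 3, a));
        }
    }
}
```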
-
Re: DocumentsWriter.checkMaxTermLength issues
Michael McCandless 2007-12-31, 10:53
Doron Cohen wrote:
> I like the approach of configuration of this behavior in Analysis
> (and so IndexWriter can throw an exception on such errors).
>
> It seems that this should be a property of Analyzer vs.
> just StandardAnalyzer, right?
>
> It can probably be a "policy" property, with two parameters:
> 1) maxLength, 2) action: chop/split/ignore/raiseException when
> generating too long tokens.

Agreed, this should be generic and shared across all analyzers.

But maybe for 2.3, we just truncate any too-long term to the max allowed size, and then after 2.3 we make this a settable "policy"?
-
Re: DocumentsWriter.checkMaxTermLength issues
Yonik Seeley 2007-12-31, 16:10
On Dec 31, 2007 5:53 AM, Michael McCandless wrote:
> Agreed, this should be generic/shared to all analyzers.
>
> But maybe for 2.3, we just truncate any too-long term to the max
> allowed size, and then after 2.3 we make this a settable "policy"?

But we already have a nice component model for analyzers... why not just encapsulate truncation/discarding in a TokenFilter?

-Yonik
-
Re: DocumentsWriter.checkMaxTermLength issues
Doron Cohen 2007-12-31, 16:37
On Dec 31, 2007 6:10 PM, Yonik Seeley wrote:
> But we already have a nice component model for analyzers...
> why not just encapsulate truncation/discarding in a TokenFilter?

Makes sense, especially for the implementation aspect. I'm not sure what API you have in mind:

(1) leave that for applications, to append such a TokenFilter to their Analyzer (== no change),

(2) have DocumentsWriter create such a TokenFilter under the covers, to force behavior that is defined (where?), or

(3) have an IndexingTokenFilter assigned to IndexWriter, make the default such filter trim/ignore/whatever as discussed, and then applications can set a different IndexingTokenFilter to change the default behavior?

I think I like the third option - is this what you meant?

Doron
-
Re: DocumentsWriter.checkMaxTermLength issues
Yonik Seeley 2007-12-31, 16:44
On Dec 31, 2007 11:37 AM, Doron Cohen wrote:
> I think I like the 3'rd option - is this what you meant?

I meant (1)... it leaves the core smaller. I don't see any reason to have logic to truncate or discard tokens in the core indexing code (except to handle tokens >16k as an error condition).

Most of the time you want to catch those large tokens early on in the chain anyway (put the filter right after the tokenizer). Doing it later could cause exceptions or issues with other token filters that might not be expecting huge tokens.

-Yonik
-
Re: DocumentsWriter.checkMaxTermLength issues
Grant Ingersoll 2007-12-31, 16:59
On Dec 31, 2007, at 11:44 AM, Yonik Seeley wrote:
> I meant (1)... it leaves the core smaller.
> I don't see any reason to have logic to truncate or discard tokens in
> the core indexing code (except to handle tokens >16k as an error
> condition).

I would agree here, with the exception that I want the option for it to be treated as an error. In some cases, I would be just as happy for it to silently ignore the token, or to log it.
-
Re: DocumentsWriter.checkMaxTermLength issues
Yonik Seeley 2007-12-31, 17:11
On Dec 31, 2007 11:59 AM, Grant Ingersoll wrote:
> I would agree here, with the exception that I want the option for it
> to be treated as an error.

That should also be possible via an analyzer component throwing an exception.

-Yonik
-
Re: DocumentsWriter.checkMaxTermLength issues
Grant Ingersoll 2007-12-31, 17:25
On Dec 31, 2007, at 12:11 PM, Yonik Seeley wrote:
> That should also be possible via an analyzer component throwing an
> exception.

Sure, but I mean in the >16K case (in other words, the case where DocsWriter fails, which presumably only DocsWriter knows about). I want the option to ignore tokens larger than that instead of failing/throwing an exception.

Imagine I am charged with indexing some data that I don't know anything about (e.g. computer forensics). My goal would be to index as much as possible in my first raw pass, so that I can then begin to explore the dataset. Having it completely discard the document is not a good thing, but throwing away some large binary tokens would be acceptable (especially if I get warnings about said tokens) and robust.

-Grant
-
Re: DocumentsWriter.checkMaxTermLength issues
Yonik Seeley 2007-12-31, 17:47
On Dec 31, 2007 12:25 PM, Grant Ingersoll wrote:
> Sure, but I mean in the >16K (in other words, in the case where
> DocsWriter fails, which presumably only DocsWriter knows about) case.
> I want the option to ignore tokens larger than that instead of failing/
> throwing an exception.

I think the issue here is what the default behavior for IndexWriter should be.

If configuration is required because something other than the default is desired, then one could use a TokenFilter to change the behavior rather than changing options on IndexWriter. Using a TokenFilter is much more flexible.

-Yonik
-
Re: DocumentsWriter.checkMaxTermLength issues
Michael McCandless 2007-12-31, 17:54
I actually think indexing should try to be as robust as possible. You could test like crazy and never hit a massive term, go into production (say, ship your app to lots of your customers' computers) only to suddenly see this exception. In general it could be a long time before you (or your users) "accidentally" see this.

So I'm thinking we should have the default behavior, in IndexWriter, be to skip immense terms?

Then people can use TokenFilter to change this behavior if they want.

Mike
-
Re: DocumentsWriter.checkMaxTermLength issues
Yonik Seeley 2007-12-31, 17:57
On Dec 31, 2007 12:54 PM, Michael McCandless wrote:
> So I'm thinking we should have the default behavior, in IndexWriter,
> be to skip immense terms?
>
> Then people can use TokenFilter to change this behavior if they want.

+1

-Yonik
-
Re: DocumentsWriter.checkMaxTermLength issues
Grant Ingersoll 2007-12-31, 18:49
On Dec 31, 2007, at 12:54 PM, Michael McCandless wrote:
> So I'm thinking we should have the default behavior, in IndexWriter,
> be to skip immense terms?
>
> Then people can use TokenFilter to change this behavior if they want.

+1. We could log it, right?
-
Re: DocumentsWriter.checkMaxTermLength issues
Michael McCandless 2007-12-31, 23:40
Grant Ingersoll wrote:
> On Dec 31, 2007, at 12:54 PM, Michael McCandless wrote:
>> So I'm thinking we should have the default behavior, in IndexWriter,
>> be to skip immense terms?
>>
>> Then people can use TokenFilter to change this behavior if they want.
>
> +1. We could log it, right?

Yes, to IndexWriter's infoStream, if it's set. I'll do that...

Mike
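The behavior agreed on here (skip immense terms, and report them to an info stream only if one is set) might look roughly like the sketch below. It is plain Java and not the actual IndexWriter code: `SkipAndLogSketch` and `skipImmense` are hypothetical names, the nullable `PrintStream` parameter stands in for IndexWriter's infoStream, and the 16383 limit is taken from the exception message quoted at the start of the thread.

```java
import java.io.PrintStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not the actual IndexWriter implementation.
class SkipAndLogSketch {

    // Limit taken from the exception message quoted at the start of the thread.
    static final int MAX_TERM_LENGTH = 16383;

    // Returns the terms that fit; immense terms are skipped and, if infoStream
    // is non-null, reported with their length and a short prefix.
    static List<String> skipImmense(List<String> terms, PrintStream infoStream) {
        List<String> kept = new ArrayList<>();
        for (String term : terms) {
            if (term.length() > MAX_TERM_LENGTH) {
                if (infoStream != null) {
                    infoStream.println("skipped immense term (length " + term.length()
                            + "), prefix: '" + term.substring(0, 30) + "...'");
                }
                continue;
            }
            kept.add(term);
        }
        return kept;
    }

    public static void main(String[] args) {
        char[] big = new char[20079];          // same length as in the original report
        Arrays.fill(big, 'x');
        List<String> terms = Arrays.asList("lucene", new String(big), "index");
        System.out.println(skipImmense(terms, System.out));
    }
}
```

Passing `null` for the stream gives the silent variant; the document is still indexed either way, which is the robustness argument made above.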
-
Re: DocumentsWriter.checkMaxTermLength issues
Doron Cohen 2008-01-01, 06:55
On Dec 31, 2007 7:54 PM, Michael McCandless wrote:
> So I'm thinking we should have the default behavior, in IndexWriter,
> be to skip immense terms?
>
> Then people can use TokenFilter to change this behavior if they want.

+1

At first I saw this as similar to IndexWriter.setMaxFieldLength(), but that was the wrong comparison, because #terms is a "real" indexing/search characteristic that many applications can benefit from being able to modify, whereas a huge token is in most cases a bug.

Just to make sure on the scenario - the only change is to skip too-long tokens, while any other exception is thrown (not ignored).

Also, for a skipped token I think the position increment of the following token should be incremented.
-
Re: DocumentsWriter.checkMaxTermLength issues
Michael McCandless 2008-01-01, 10:50
Doron Cohen wrote:
> On Dec 31, 2007 7:54 PM, Michael McCandless wrote:
>> So I'm thinking we should have the default behavior, in IndexWriter,
>> be to skip immense terms?
>>
>> Then people can use TokenFilter to change this behavior if they want.
>
> +1

OK I will take this approach.

> At first I saw this similar to IndexWriter.setMaxFieldLength(), but it was
> a wrong comparison, because #terms is a "real" indexing/serarch
> characteristic that many applications can benefit from being able
> to modify, whereas a huge token is in most cases a bug.
>
> Just to make sure on the scenario - the only change is to skip too long
> tokens, while any other exception is thrown (not ignored.)

Exactly. And, on any exception, we will immediately mark any partially indexed doc as deleted.

> Also, for a skipped token I think the position increment of the
> following token should be incremented.

Good point; I'll make sure we do.

Mike
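The position-increment point can be made concrete with a small sketch. It is plain Java, not Lucene code: `PositionIncrementSketch`, `Tok`, and `skipLong` are hypothetical names, and the sketch assumes that the increment of each skipped token is carried over to the next kept token, so the kept tokens stay at the positions they had before the skip.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; not part of Lucene.
class PositionIncrementSketch {

    // Minimal stand-in for a token: its text plus its position increment.
    static final class Tok {
        final String text;
        final int posInc;

        Tok(String text, int posInc) {
            this.text = text;
            this.posInc = posInc;
        }

        @Override
        public String toString() {
            return text + "(+" + posInc + ")";
        }
    }

    // Drops tokens longer than maxLen and adds their increments to the next
    // kept token, so a gap is left where each skipped token used to be.
    static List<Tok> skipLong(List<Tok> in, int maxLen) {
        List<Tok> out = new ArrayList<>();
        int pending = 0;                       // increments of tokens skipped so far
        for (Tok t : in) {
            if (t.text.length() > maxLen) {
                pending += t.posInc;
                continue;
            }
            out.add(new Tok(t.text, t.posInc + pending));
            pending = 0;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tok> in = Arrays.asList(
                new Tok("quick", 1), new Tok("waytoolongtoken", 1), new Tok("fox", 1));
        // "fox" ends up with increment 2, keeping the gap left by the skipped token.
        System.out.println(skipLong(in, 8));
    }
}
```

Keeping the gap means a phrase query for "quick fox" would not match across the skipped token, which is presumably the effect Doron is after.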