Lucene, mail # dev - DocumentsWriter.checkMaxTermLength issues


Re: DocumentsWriter.checkMaxTermLength issues
Yonik Seeley 2007-12-31, 17:47
On Dec 31, 2007 12:25 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Sure, but I mean in the >16K (in other words, in the case where
> DocsWriter fails, which presumably only DocsWriter knows about) case.
> I want the option to ignore tokens larger than that instead of failing/
> throwing an exception.

I think the issue here is what the default behavior for IndexWriter should be.

If configuration is required because something other than the default
is desired, then one could use a TokenFilter to change the behavior
rather than changing options on IndexWriter.  Using a TokenFilter is
much more flexible.
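
To illustrate the filter approach, here is a minimal sketch in plain Java. It does not use the Lucene TokenFilter API: the class name MaxLengthTokenFilter, its methods, and the idea of modelling a token stream as an Iterator<String> are all assumptions made so the example stays self-contained. The point is only the shape of the idea: wrap the incoming stream, silently skip tokens longer than a configured limit, and keep a count so the caller can issue warnings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/**
 * Hypothetical stand-in for a token filter (NOT the Lucene API):
 * wraps a stream of tokens and drops any token longer than maxLength.
 */
class MaxLengthTokenFilter implements Iterator<String> {
    private final Iterator<String> input; // upstream token source
    private final int maxLength;          // longest token allowed through
    private int skipped = 0;              // number of tokens dropped so far
    private String pending;               // next acceptable token, if already found

    MaxLengthTokenFilter(Iterator<String> input, int maxLength) {
        this.input = input;
        this.maxLength = maxLength;
    }

    @Override
    public boolean hasNext() {
        // Pull from upstream until an acceptable token is found or input ends.
        while (pending == null && input.hasNext()) {
            String token = input.next();
            if (token.length() <= maxLength) {
                pending = token;
            } else {
                skipped++; // oversized token: drop it instead of failing
            }
        }
        return pending != null;
    }

    @Override
    public String next() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        String token = pending;
        pending = null;
        return token;
    }

    /** How many oversized tokens were dropped; useful for logging a warning. */
    int skippedCount() {
        return skipped;
    }

    public static void main(String[] args) {
        // Tiny limit of 4 characters purely for demonstration.
        Iterator<String> raw = Arrays.asList("ab", "abcdefgh", "cd").iterator();
        MaxLengthTokenFilter filter = new MaxLengthTokenFilter(raw, 4);
        List<String> kept = new ArrayList<>();
        while (filter.hasNext()) {
            kept.add(filter.next());
        }
        System.out.println(kept + " kept, " + filter.skippedCount() + " skipped");
    }
}
```

In a real analyzer chain the limit would be set at or below whatever maximum the indexer enforces (roughly 16K in the case discussed here), so oversized tokens never reach the point where the exception is thrown.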

> Imagine I am charged w/ indexing some data
> that I don't know anything about (i.e. computer forensics), my goal
> would be to index as much as possible in my first raw pass, so that I
> can then begin to explore the dataset.  Having it completely discard
> the document is not a good thing, but throwing away some large binary
> tokens would be acceptable (especially if I get warnings about said
> tokens) and robust.

-Yonik
