Tika, mail # user - Problem with overriding built-in parser


Re: Problem with overriding built-in parser
Stephan Mühlstrasser 2012-02-16, 16:07
Hi Nick,

thanks for your reply.

On 16.02.12 16:51, Nick Burch wrote:
> On Tue, 14 Feb 2012, Stephan Mühlstrasser wrote:
>> https://issues.apache.org/jira/browse/TIKA-527
>...
>
>> The problem is that using the proposed method does not work for me.
>> Any use of the configuration file apparently sends Tika into an
>> endless recursion, even without overriding a built-in parser in the
>> configuration file.
>
> Are you able to produce a unit test that shows the problem?

That's what I was trying to provide with the example in my previous message:

>
>> If I understand it correctly, the following configuration file should
>> have the same effect as the built-in configuration:
>>
>>> $ cat tika-config.xml
>>> <properties>
>>> <parsers>
>>> <parser class="org.apache.tika.parser.DefaultParser"/>
>>> </parsers>
>>> </properties>

If you invoke the Tika CLI application with this configuration file, the
error occurs. To reproduce it, run: "java
-Dtika.config=tika-config.xml -jar tika-app-1.0.jar --list-parsers".

> Ah, I'm not sure that's correct. I think you also need to specify
> mimetypes and a detector. Looking at lines 145 to 172 of TikaConfig, it
> seems that you either get the defaults with no config, or specify them
> all with your own config.
>

OK, I now see in the source what you mean. So the example in TIKA-527
is incomplete, as it specifies neither mimetypes nor a detector.

Since yesterday I have got my override working by packaging a
META-INF/services/org.apache.tika.parser.Parser file into the JAR file
together with my parser. So I no longer need the configuration file
approach. But I think it should still be considered a bug if an
incorrect or insufficient configuration file sends Tika into an endless
recursion instead of producing a meaningful error message.
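
The service-file override described above can be illustrated with a
minimal sketch. It uses the JDK's java.util.ServiceLoader rather than
Tika itself, on the assumption that Tika's own lookup follows the same
META-INF/services convention. The Parser interface and the MyPdfParser
class are hypothetical stand-ins, not Tika classes. In a real JAR the
services file would be named
META-INF/services/org.apache.tika.parser.Parser and would contain the
fully qualified class name of the custom parser on a single line.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

public class ServiceFileSketch {

    /** Hypothetical stand-in for org.apache.tika.parser.Parser. */
    public interface Parser {
        String name();
    }

    /** Hypothetical custom parser that the services file registers. */
    public static class MyPdfParser implements Parser {
        public MyPdfParser() { }

        @Override
        public String name() {
            return "MyPdfParser";
        }
    }

    /**
     * Writes a META-INF/services entry into a temporary directory (playing
     * the role of the JAR) and lets ServiceLoader discover the provider.
     */
    public static List<String> discoverParsers() {
        try {
            Path root = Files.createTempDirectory("services-sketch");
            Path services = Files.createDirectories(
                    root.resolve("META-INF").resolve("services"));

            // File name = binary name of the service interface;
            // file content = binary name of the implementing class.
            Files.write(services.resolve(Parser.class.getName()),
                    MyPdfParser.class.getName().getBytes(StandardCharsets.UTF_8));

            List<String> names = new ArrayList<>();
            try (URLClassLoader loader = new URLClassLoader(
                    new URL[] { root.toUri().toURL() },
                    ServiceFileSketch.class.getClassLoader())) {
                for (Parser parser : ServiceLoader.load(Parser.class, loader)) {
                    names.add(parser.name());
                }
            }
            return names;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(discoverParsers());
    }
}
```

No configuration file is involved here: the provider is found purely
because the services entry is visible on the class path, which is why
this route sidesteps the tika-config.xml problem.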

Thanks
Stephan

--
_______________________________________________________________
Stephan Mühlstrasser   [EMAIL PROTECTED]            www.pdflib.com
   PDFlib GmbH, Franziska-Bilek-Weg 9, 80339 München,  Germany
        Court of registry/Amtsgericht München HRB 129497
  Managing Directors/Geschäftsführer: Thomas Merz, Petra Porst
---------------------------------------------------------------
     PDFlib: powerful toolkits for PDF developers since 1997
_______ See www.pdflib.com/products for product details________