|
|
-
Problem with overriding built-in parser
Stephan Mühlstrasser 2012-02-14, 12:20
Hi,

I'm trying to override the built-in PDF parser with another one. I looked through the mailing list archive and found the following hints on how to override a built-in parser:

http://mail-archives.apache.org/mod_mbox/tika-user/201105.mbox/%3CBANLkTimp4omHywv_ptOmqEX9v-%2BW4e7fVA%40mail.gmail.com%3E
https://issues.apache.org/jira/browse/TIKA-527

Is there any documentation of the syntax of the configuration file available?

The problem is that using the proposed method does not work for me. Any use of the configuration file apparently sends Tika into an endless recursion, even without overriding a built-in parser in the configuration file.

If I understand it correctly, the following configuration file should have the same effect as the built-in configuration:

> $ cat tika-config.xml
> <properties>
>   <parsers>
>     <parser class="org.apache.tika.parser.DefaultParser"/>
>   </parsers>
> </properties>

But if I provide that to Tika, after a while the command line application is terminated with an exception:

> $ java -Dtika.config=tika-config.xml -jar tika-app-1.0.jar --list-parsers
> Exception in thread "main" java.lang.OutOfMemoryError: GC overhead limit exceeded
>     at java.util.Arrays.copyOfRange(Arrays.java:3209)
>     at java.lang.String.<init>(String.java:216)
>     at java.lang.StringBuilder.toString(StringBuilder.java:430)
>     at org.apache.tika.mime.MediaType.toString(MediaType.java:237)
>     at org.apache.tika.detect.MagicDetector.<init>(MagicDetector.java:142)
>     at org.apache.tika.mime.MimeTypesReader.readMatch(MimeTypesReader.java:254)
>     at org.apache.tika.mime.MimeTypesReader.readMatches(MimeTypesReader.java:202)
>     at org.apache.tika.mime.MimeTypesReader.readMagic(MimeTypesReader.java:186)
>     at org.apache.tika.mime.MimeTypesReader.readMimeType(MimeTypesReader.java:152)
>     at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:124)
>     at org.apache.tika.mime.MimeTypesReader.read(MimeTypesReader.java:107)
>     at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:63)
>     at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:91)
>     at org.apache.tika.mime.MimeTypesFactory.create(MimeTypesFactory.java:147)
>     at org.apache.tika.mime.MimeTypes.getDefaultMimeTypes(MimeTypes.java:455)
>     at org.apache.tika.config.TikaConfig.typesFromDomElement(TikaConfig.java:273)
>     at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:161)
>     at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:237)
>     at org.apache.tika.mime.MediaTypeRegistry.getDefaultRegistry(MediaTypeRegistry.java:42)
>     at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:52)
>     at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
>     at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
>     at java.lang.Class.newInstance0(Class.java:355)
>     at java.lang.Class.newInstance(Class.java:308)
>     at org.apache.tika.config.TikaConfig.parserFromDomElement(TikaConfig.java:288)
>     at org.apache.tika.config.TikaConfig.<init>(TikaConfig.java:162)
>     at org.apache.tika.config.TikaConfig.getDefaultConfig(TikaConfig.java:237)
>     at org.apache.tika.mime.MediaTypeRegistry.getDefaultRegistry(MediaTypeRegistry.java:42)
>     at org.apache.tika.parser.DefaultParser.<init>(DefaultParser.java:52)
>     at sun.reflect.GeneratedConstructorAccessor4.newInstance(Unknown Source)
>     at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)

Is this a bug in Tika, or am I doing something wrong?

Thanks
Stephan

--
Stephan Mühlstrasser [EMAIL PROTECTED] www.pdflib.com
PDFlib GmbH, Franziska-Bilek-Weg 9, 80339 München, Germany
Court of registry/Amtsgericht München HRB 129497
Managing Directors/Geschäftsführer: Thomas Merz, Petra Porst
PDFlib: powerful toolkits for PDF developers since 1997
See www.pdflib.com/products for product details
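The stack trace in this message shows a cycle: TikaConfig.getDefaultConfig constructs a TikaConfig, which instantiates DefaultParser, whose constructor calls MediaTypeRegistry.getDefaultRegistry, which calls TikaConfig.getDefaultConfig again. The self-contained Java sketch below reproduces only that shape with hypothetical stand-ins (`Config`, `getDefaultConfig`, `createDefaultParser` and `tryLoad` are illustrative names, not Tika's code). It shows one possible way a re-entrancy flag could turn such a cycle into an immediate error; it is an assumption about how the failure could be reported, not a description of what Tika does.

```java
import java.util.ArrayList;
import java.util.List;

final class ReentrancyGuardDemo {

    /** Hypothetical stand-in for a lazily built default configuration. */
    static final class Config {
        final List<String> parsers = new ArrayList<>();
    }

    private static Config defaultConfig;
    private static boolean loading;

    /** Builds the default configuration once; fails fast if re-entered while building. */
    static synchronized Config getDefaultConfig() {
        if (defaultConfig != null) {
            return defaultConfig;
        }
        if (loading) {
            throw new IllegalStateException(
                "default configuration requested while it is still being loaded");
        }
        loading = true;
        try {
            Config config = new Config();
            config.parsers.add(createDefaultParser());
            defaultConfig = config;
            return config;
        } finally {
            loading = false; // reset so a later attempt reports the same error
        }
    }

    /** Mirrors the cycle in the trace: building the parser asks for the default config. */
    static String createDefaultParser() {
        getDefaultConfig();
        return "DefaultParser";
    }

    /** Returns "ok" or the error message, so the outcome is easy to inspect. */
    static String tryLoad() {
        try {
            getDefaultConfig();
            return "ok";
        } catch (IllegalStateException e) {
            return "error: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryLoad());
    }
}
```

Without the `loading` check, the same two methods would call each other until the JVM runs out of stack or, as in the trace, memory.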
-
Re: Problem with overriding built-in parser
Nick Burch 2012-02-16, 15:51
On Tue, 14 Feb 2012, Stephan Mühlstrasser wrote:

> https://issues.apache.org/jira/browse/TIKA-527
>
> Is there any documentation of the syntax of the configuration file
> available?

You could look at the code that processes the file, but the example in that JIRA ought to cover most use cases.

> The problem is that using the proposed method does not work for me. Any
> use of the configuration file apparently sends Tika into an endless
> recursion, even without overriding a built-in parser in the
> configuration file.

Are you able to produce a unit test that shows the problem?

> If I understand it correctly, the following configuration file should have
> the same effect as the built-in configuration:
>
>> $ cat tika-config.xml
>> <properties>
>>   <parsers>
>>     <parser class="org.apache.tika.parser.DefaultParser"/>
>>   </parsers>
>> </properties>

Ah, I'm not sure that's correct. I think you also need to specify mimetypes and a detector. Looking at lines 145 to 172 of TikaConfig, it seems that you either get the defaults with no config, or specify them all with your own config.

Nick
-
Re: Problem with overriding built-in parser
Stephan Mühlstrasser 2012-02-16, 16:07
Hi Nick,

thanks for your reply.

On 16.02.12 16:51, Nick Burch wrote:
> On Tue, 14 Feb 2012, Stephan Mühlstrasser wrote:
>> https://issues.apache.org/jira/browse/TIKA-527
> ...
>
>> The problem is that using the proposed method does not work for me.
>> Any use of the configuration file apparently sends Tika into an
>> endless recursion, even without overriding a built-in parser in the
>> configuration file.
>
> Are you able to produce a unit test that shows the problem?

That's what I was trying to provide with the example in my previous message:

>> If I understand it correctly, the following configuration file should
>> have the same effect as the built-in configuration:
>>
>>> $ cat tika-config.xml
>>> <properties>
>>>   <parsers>
>>>     <parser class="org.apache.tika.parser.DefaultParser"/>
>>>   </parsers>
>>> </properties>

If you invoke the Tika CLI application with this configuration file, the error happens. Just start it like this:

"java -Dtika.config=tika-config.xml -jar tika-app-1.0.jar --list-parsers"

> Ah, I'm not sure that's correct. I think you also need to give a
> mimetypes and a detector. Looking at lines 145 to 172 of TikaConfig, it
> seems that you either get the defaults with no config, or specify them
> all with your own config

Ok, I see now in the source what you mean. Then the example in TIKA-527 is not complete, as it does not have mimetypes and a detector.

In the meantime, since yesterday, I have my override working: I packaged a META-INF/services/org.apache.tika.parser.Parser file into the JAR file together with my parser. So I don't need the configuration file approach anymore.

But I think it could still be considered a bug if an incorrect or insufficient configuration file sends Tika into an endless recursion instead of producing a meaningful error message.

Thanks
Stephan
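The workaround described in this message relies on the standard Java service-provider mechanism: a text file under META-INF/services, named after the service interface, lists implementation class names, and java.util.ServiceLoader instantiates them. The sketch below illustrates only that mechanism with hypothetical stand-ins (`DocParser` plays the role of org.apache.tika.parser.Parser and `MyPdfParser` the role of the replacement parser; neither is a Tika class). To stay self-contained it writes the service file into a temporary directory at run time, whereas in a real JAR the file would be a static resource. It makes no claim about how Tika orders or selects the parsers it discovers.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.ServiceLoader;

final class ServiceFileDemo {

    /** Hypothetical stand-in for the service interface. */
    public interface DocParser {
        String describe();
    }

    /** Hypothetical provider; ServiceLoader needs a public no-argument constructor. */
    public static class MyPdfParser implements DocParser {
        public MyPdfParser() {
        }

        @Override
        public String describe() {
            return "custom PDF parser";
        }
    }

    /** Writes a service file, then lets ServiceLoader discover the provider it names. */
    static List<String> discover() {
        try {
            Path root = Files.createTempDirectory("services-demo");
            Path services = Files.createDirectories(root.resolve("META-INF").resolve("services"));
            // File name = binary name of the interface; content = one provider class per line.
            Files.write(services.resolve(DocParser.class.getName()),
                    (MyPdfParser.class.getName() + "\n").getBytes(StandardCharsets.UTF_8));

            List<String> found = new ArrayList<>();
            try (URLClassLoader loader = new URLClassLoader(
                    new URL[] {root.toUri().toURL()}, DocParser.class.getClassLoader())) {
                for (DocParser parser : ServiceLoader.load(DocParser.class, loader)) {
                    found.add(parser.describe());
                }
            }
            return found;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(discover());
    }
}
```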
-
Re: Problem with overriding built-in parser
Nick Burch 2012-02-16, 16:22
On Thu, 16 Feb 2012, Stephan Mühlstrasser wrote:

>> Are you able to produce a unit test that shows the problem?
>
> That's what I was trying to provide with the example in my previous message:

That's not a unit test though - yours needs to be run manually. If we can run it automatically, we can add it to the test suite to make sure it doesn't get broken in future.

Nick
-
Re: Problem with overriding built-in parser
Stephan Mühlstrasser 2012-02-17, 07:54
On 16.02.12 17:22, Nick Burch wrote:

> On Thu, 16 Feb 2012, Stephan Mühlstrasser wrote:
> That's not a unit test though - yours needs to be run manually. If we
> can run it automatically, we can add it to the test suite to make sure
> it doesn't get broken in future.

I understand, here is the reproduction as a unit test:

    package org.apache.tika;

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.OutputStreamWriter;

    import junit.framework.TestCase;

    import org.junit.Before;

    /**
     * Provoke endless recursion with small configuration file that loads the
     * default parser, but omits mimetypes and detector. If this is an insufficient
     * configuration file, Tika should report an error. Instead it terminates with
     * an OutOfMemoryError.
     *
     * @author [EMAIL PROTECTED]
     */
    public class ConfigFile extends TestCase {
        File configFile;

        @Before
        public void setUp() throws Exception {
            // Write the minimal configuration file to a temporary location.
            configFile = File.createTempFile("tika-config", ".xml");
            configFile.deleteOnExit();
            OutputStreamWriter osw = new OutputStreamWriter(new FileOutputStream(configFile), "UTF-8");
            osw.write("<properties><parsers><parser class=\"org.apache.tika.parser.DefaultParser\"/></parsers></properties>\n");
            osw.close();
        }

        public void test() throws IOException {
            // Point Tika at the configuration file; constructing Tika triggers the recursion.
            System.setProperty("tika.config", configFile.getAbsolutePath());
            new Tika();
        }
    }

Best Regards
Stephan
-
Re: Problem with overriding built-in parser
Nick Burch 2012-02-17, 12:22
On Fri, 17 Feb 2012, Stephan Mühlstrasser wrote:

>> That's not a unit test though - yours needs to be run manually. If we
>> can run it automatically, we can add it to the test suite to make sure
>> it doesn't get broken in future.
>
> I understand, here is the reproduction as a unit test:

Looks good, thanks!

Any chance you could open a new issue in JIRA, and attach it there? We'll need to decide what to do in the case of missing entries in the config file (abort vs silently put in the default), by having it in JIRA we won't forget it :)

Cheers
Nick
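The two options named here, aborting or silently putting in the default, can be compared side by side. The sketch below is a hedged illustration using only the JDK's XML parser: `Policy`, `resolve` and the section names `mimetypes` and `detectors` are hypothetical (only `<parsers>` appears in this thread), and it does not reflect how TikaConfig actually reads its file.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class MissingSectionPolicyDemo {

    enum Policy { ABORT, USE_DEFAULTS }

    // Hypothetical section names; only <parsers> is confirmed by the thread.
    static final String[] SECTIONS = {"mimetypes", "detectors", "parsers"};

    /** Reports, per section, whether it was configured or defaulted; may abort instead. */
    static List<String> resolve(String xml, Policy policy) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("unreadable configuration", e);
        }

        List<String> result = new ArrayList<>();
        for (String section : SECTIONS) {
            boolean present = doc.getElementsByTagName(section).getLength() > 0;
            if (present) {
                result.add(section + "=configured");
            } else if (policy == Policy.USE_DEFAULTS) {
                result.add(section + "=default"); // silently fall back
            } else {
                throw new IllegalArgumentException(
                        "configuration is missing <" + section + ">"); // abort with a clear message
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<properties><parsers>"
                + "<parser class=\"org.apache.tika.parser.DefaultParser\"/>"
                + "</parsers></properties>";
        System.out.println(resolve(xml, Policy.USE_DEFAULTS));
    }
}
```

Either policy gives the user a defined outcome for the configuration file from this thread, which is the point of tracking the decision in JIRA.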
-
Re: Problem with overriding built-in parser
Stephan Mühlstrasser 2012-02-17, 12:41
On 17.02.12 13:22, Nick Burch wrote:

> On Fri, 17 Feb 2012, Stephan Mühlstrasser wrote:
>>> That's not a unit test though - yours needs to be run manually. If we
>>> can run it automatically, we can add it to the test suite to make
>>> sure it doesn't get broken in future.
>>
>> I understand, here is the reproduction as a unit test:
>
> Looks good, thanks!
>
> Any chance you could open a new issue in JIRA, and attach it there?
> We'll need to decide what to do in the case of missing entries in the
> config file (abort vs silently put in the default), by having it in JIRA
> we won't forget it :)

I created TIKA-866 and attached the unit test.

Best Regards
Stephan
|