Re: PDF parser exception
Ken Krugler 2010-01-12, 22:18
Hi Doug,
On Jan 12, 2010, at 11:37am, Doug Carter wrote:

> Hi all,
>
> I'm new to Tika and to this mailing list, so I hope this is the right
> place to ask this question.
>
> I've just downloaded, built and installed Tika 0.5. I've been able to
> translate Microsoft Office documents without any problems. However, when
> I try to translate a PDF file, I get a parser exception.

Is this the case with any and all PDF files?

Based on the stack trace below, it sure looks like a busted file, but I've mostly been working with the HTML parser.

-- Ken

> The command line I'm running is:
>
> % java -jar tika-app/target/tika-app-0.5.jar foo.pdf
>
> The resulting exception output is:
>
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11e1e67
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:126)
>     at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>     at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:175)
>     at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:62)
> Caused by: org.apache.pdfbox.exceptions.WrappedIOException
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
>     at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
>     at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
>     at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>     ... 3 more
> Caused by: java.util.NoSuchElementException
>     at java.util.AbstractList$Itr.next(AbstractList.java:350)
>     at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
>     at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
>     at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
>     ... 7 more
>
> ---
>
> Can someone help point me to a way to solve this problem? I'm familiar
> with Java but not the PDF format or how Tika parses a document.
>
> Please let me know if there is a better forum to ask this question, or
> if I need to provide more information.
>
> TIA,
>
> Doug

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
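
A side note on reading the trace: the interesting part is usually the innermost "Caused by" entry, here the NoSuchElementException raised while PDFBox was parsing the xref streams. When running many files through a parser, it can help to log that innermost cause for each failure. The sketch below uses only the standard library; the class name CauseChain and its two helper methods are made up for illustration, and the exceptions built in main are stand-ins for the Tika and PDFBox ones, not the real classes.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.NoSuchElementException;

public class CauseChain {
    /** Returns exception class names from the outermost wrapper to the innermost cause. */
    static List<String> describe(Throwable t) {
        List<String> names = new ArrayList<>();
        // The size cap is a defensive limit in case a cause chain is cyclic.
        while (t != null && names.size() < 32) {
            names.add(t.getClass().getName());
            t = t.getCause();
        }
        return names;
    }

    /** Follows getCause() until there is no further cause and returns that exception. */
    static Throwable rootCause(Throwable t) {
        Throwable root = t;
        int depth = 0;
        while (root.getCause() != null && depth++ < 32) {
            root = root.getCause();
        }
        return root;
    }

    public static void main(String[] args) {
        // Stand-ins that mimic the three-level nesting in the trace above.
        Exception inner = new NoSuchElementException();
        Exception middle = new IOException("wrapped", inner);
        Exception outer = new Exception("Illegal IOException from parser", middle);

        for (String name : describe(outer)) {
            System.out.println(name);
        }
        System.out.println("root cause: " + rootCause(outer).getClass().getName());
    }
}
```

In a real run, the same helpers could be called from a catch block around the parse call, so that each failing file is logged together with its innermost cause.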