Tika, mail # dev - PDF parser exception


Re: PDF parser exception
Ken Krugler 2010-01-12, 22:18
Hi Doug,

On Jan 12, 2010, at 11:37am, Doug Carter wrote:

>
> Hi all,
>
> I'm new to Tika and to this mailing list, so I hope this is the right
> place to ask this question.
>
> I've just downloaded, built, and installed Tika 0.5. I've been able to
> translate Microsoft Office documents without any problems. However, when
> I try to translate a PDF file, I get a parser exception.

Is this the case with any and all PDF files?

Based on the stack trace below, it sure looks like a busted file (the
innermost exception comes from PDFBox's xref stream parser), but I've
mostly been working with the HTML parser, so I can't say for sure.
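
One quick way to rule out a grossly damaged or truncated file is to check for the "%PDF-" header at the start and the "%%EOF" marker near the end. Below is a minimal sketch using only the Java standard library. The class and method names (PdfSanityCheck, hasPdfHeader, hasEofMarker) are made up for illustration and are not part of Tika or PDFBox, and the 1024-byte tail window is an assumption. A file can pass both checks and still have a malformed xref stream, so treat this only as a first filter.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

// Hypothetical helper, not part of Tika or PDFBox.
class PdfSanityCheck {

    // A well-formed PDF begins with "%PDF-" followed by a version number.
    static boolean hasPdfHeader(byte[] data) {
        byte[] magic = "%PDF-".getBytes(StandardCharsets.US_ASCII);
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // A complete PDF normally has "%%EOF" close to the end of the file.
    // The 1024-byte window is an assumption chosen for this sketch.
    static boolean hasEofMarker(byte[] data) {
        int from = Math.max(0, data.length - 1024);
        String tail = new String(data, from, data.length - from,
                StandardCharsets.ISO_8859_1);
        return tail.contains("%%EOF");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java PdfSanityCheck <file.pdf>");
            return;
        }
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        System.out.println("header present:     " + hasPdfHeader(data));
        System.out.println("EOF marker present: " + hasEofMarker(data));
    }
}
```

If both checks pass for foo.pdf and other PDFs parse fine, the problem is more likely specific to how this file's cross-reference stream is written.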

-- Ken

>
> The command line I'm running is:
>
>  % java -jar tika-app/target/tika-app-0.5.jar foo.pdf
>
> The resulting exception output is:
>
> Exception in thread "main" org.apache.tika.exception.TikaException: TIKA-198: Illegal IOException from org.apache.tika.parser.pdf.PDFParser@11e1e67
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:126)
>        at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:101)
>        at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:175)
>        at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:62)
> Caused by: org.apache.pdfbox.exceptions.WrappedIOException
>        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:237)
>        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:841)
>        at org.apache.pdfbox.pdmodel.PDDocument.load(PDDocument.java:808)
>        at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:53)
>        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:120)
>        ... 3 more
> Caused by: java.util.NoSuchElementException
>        at java.util.AbstractList$Itr.next(AbstractList.java:350)
>        at org.apache.pdfbox.pdfparser.PDFXrefStreamParser.parse(PDFXrefStreamParser.java:115)
>        at org.apache.pdfbox.cos.COSDocument.parseXrefStreams(COSDocument.java:538)
>        at org.apache.pdfbox.pdfparser.PDFParser.parse(PDFParser.java:203)
>        ... 7 more
>
> ---
>
> Can someone help point me to a way to solve this problem? I'm familiar
> with Java but not the PDF format or how Tika parses a document.
>
> Please let me know if there is a better forum to ask this question, or
> if I need to provide more information.
>
>
> TIA,
>
> Doug

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g