Hi Chris,
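Are you running with the max content size limited (http.content.limit != -1, or file.content.limit for file: URLs like yours)?
If so, I think you can run into this issue with any ppt/pptx file that's bigger than the limit: the content gets truncated, so the zip container ends early and the parser hits an unexpected EOF.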
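To show why a size limit would produce this kind of failure, here is a minimal standalone sketch. It is not Nutch or Tika code: the class name TruncatedZipDemo, its helper methods, and the entry name are all made up for illustration, and 65536 is just an example limit. It builds a zip in memory (pptx files are zip containers), cuts it off at the limit, and then reads it with java.util.zip.ZipInputStream, the same class that appears at the top of your stack trace.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical demo class, not part of Nutch, Tika, or POI.
public class TruncatedZipDemo {

    // Build an in-memory zip holding one entry of pseudo-random bytes,
    // standing in for a pptx file (which is a zip container).
    public static byte[] buildZip(int payloadSize) throws IOException {
        byte[] payload = new byte[payloadSize];
        new Random(42).nextBytes(payload);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        ZipOutputStream zos = new ZipOutputStream(bos);
        zos.putNextEntry(new ZipEntry("ppt/slides/slide1.xml"));
        zos.write(payload);
        zos.closeEntry();
        zos.close();
        return bos.toByteArray();
    }

    // Mimic a content limit: keep only the first `limit` bytes.
    public static byte[] truncate(byte[] content, int limit) {
        return content.length <= limit ? content : Arrays.copyOf(content, limit);
    }

    // Read every entry to the end; report whether an IOException occurred
    // (ZipException and EOFException are both IOExceptions).
    public static boolean failsToRead(byte[] content) {
        byte[] buf = new byte[4096];
        try {
            ZipInputStream zis = new ZipInputStream(new ByteArrayInputStream(content));
            while (zis.getNextEntry() != null) {
                while (zis.read(buf) != -1) {
                    // discard the data, we only care whether reading succeeds
                }
            }
            zis.close();
            return false;
        } catch (IOException e) {
            return true;
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] full = buildZip(200000);
        byte[] cut = truncate(full, 65536); // example limit only
        System.out.println("full zip fails: " + failsToRead(full));
        System.out.println("truncated zip fails: " + failsToRead(cut));
    }
}
```

I'd expect the full archive to read cleanly and the truncated one to fail partway through the entry. The exact exception type and message can differ from the one in your log depending on how the entry was stored and where the cut falls, which is why the sketch catches IOException in general.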
-- Ken
On Apr 16, 2012, at 11:22am, Christopher Gross wrote:
> Hi all.
>
> I'm running Nutch 1.4 with Java 1.6.0_30. I'm trying to have it
> crawl a directory with test files and I'm getting an error on ppt and
> pptx files. It can get pdf, doc/docx, xls/xlsx, but for whatever
> reason it flips out on powerpoint. I can attach the document if need
> be. Below is a snippet from the log:
>
> 2012-04-16 17:59:46,300 DEBUG parse.ParseUtil - Parsing
> [file:/data/test/crawldocs/Sales Call Agenda.pptx] with
> [org.apache.nutch.parse.tika.TikaParser@26945b95]
> 2012-04-16 17:59:46,300 DEBUG tika.TikaParser - Using Tika parser
> org.apache.tika.parser.microsoft.ooxml.OOXMLParser for mime-type
> application/x-tika-ooxml
> 2012-04-16 17:59:46,305 ERROR tika.TikaParser - Error parsing
> file:/data/test/crawldocs/Sales Call Agenda.pptx
> java.util.zip.ZipException: unexpected EOF
> at java.util.zip.ZipInputStream.read(ZipInputStream.java:174)
> at java.io.FilterInputStream.read(FilterInputStream.java:90)
> at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource$FakeZipEntry.<init>(ZipInputStreamZipEntrySource.java:127)
> at org.apache.poi.openxml4j.util.ZipInputStreamZipEntrySource.<init>(ZipInputStreamZipEntrySource.java:55)
> at org.apache.poi.openxml4j.opc.ZipPackage.<init>(ZipPackage.java:83)
> at org.apache.poi.openxml4j.opc.OPCPackage.open(OPCPackage.java:228)
> at org.apache.poi.extractor.ExtractorFactory.createExtractor(ExtractorFactory.java:152)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLExtractorFactory.parse(OOXMLExtractorFactory.java:70)
> at org.apache.tika.parser.microsoft.ooxml.OOXMLParser.parse(OOXMLParser.java:70)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.lang.Thread.run(Thread.java:662)
> 2012-04-16 17:59:46,305 INFO parse.ParseSegment - Parsing:
> file:/data/test/crawldocs/Sales Call Agenda.pptx
> 2012-04-16 17:59:46,306 WARN parse.ParseSegment - Error parsing:
> file:/data/test/crawldocs/Sales Call Agenda.pptx: failed(2,0):
> unexpected EOF
>
> I've tried a few google searches and I can't seem to find anyone else
> with this error. I'm out of ideas as to what to try to do in order to
> fix this. Any help would be appreciated!
>
> -- Chris
--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr