Hey Sebastian,
as far as I can tell, the Tika CHM parser is far from perfect,
but I would expect the included test files to produce correct
results.
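One way to check a test file automatically, rather than by eye, is to look at whether the XHTML that tika-app prints has any text in its body. The following is only a minimal sketch: the helper name `body_text` and the `REPORTED` constant are made up for illustration, and the sample is the output quoted below for testChm2.chm. It uses nothing but the Python standard library and does not call Tika itself.

```python
import xml.etree.ElementTree as ET

# Namespace used by the XHTML output quoted in this thread.
XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def body_text(xhtml: str) -> str:
    """Return the whitespace-normalized text inside <body>, or "" if none."""
    # Parse from bytes so the UTF-8 XML declaration matches the input.
    root = ET.fromstring(xhtml.encode("utf-8"))
    body = root.find(f"{XHTML_NS}body")
    if body is None:
        return ""
    return " ".join("".join(body.itertext()).split())


# Output reported for testChm2.chm in the quoted message (illustrative copy).
REPORTED = """<?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta name="Content-Length" content="10807437"/>
<meta name="Content-Type" content="application/vnd.ms-htmlhelp"/>
<meta name="resourceName" content="testChm2.chm"/>
<title/>
</head>
<body/></html>"""

if __name__ == "__main__":
    if body_text(REPORTED):
        print("body has content")
    else:
        print("body is empty: only metadata was extracted")
```

An empty result here means only metadata came back, which matches what is described for testChm2.chm.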
There is an alternative library (http://sourceforge.net/projects/chm4j/),
but I don't think there are enough potential users to justify switching
to a different parser for this file type.
Jan
On Tuesday, 14.08.2012, 22:28 +0200, Sebastian Nagel wrote:
> Hi Jan,
>
> opened a Jira issue: https://issues.apache.org/jira/browse/NUTCH-1454
> Thanks!
>
> Beyond the "can't retrieve parser" error:
> I've tried a couple of chm files (among them the test files from Tika)
> but I wasn't able to get Tika to extract content.
>
> % java -jar tika-app/target/tika-app-1.3-SNAPSHOT.jar -v \
> tika-parsers/src/test/resources/test-documents/testChm2.chm
>
> only extracts:
>
> <?xml version="1.0" encoding="UTF-8"?><html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="Content-Length" content="10807437"/>
> <meta name="Content-Type" content="application/vnd.ms-htmlhelp"/>
> <meta name="resourceName" content="testChm2.chm"/>
> <title/>
> </head>
> <body/></html>
>
> A CHM-viewer shows much more content. What's wrong?
>
> Sebastian
>
> On 08/10/2012 09:32 AM, Julien Nioche wrote:
> > new JIRA?
> >
> > On 9 August 2012 23:30, Markus Jelsma <[EMAIL PROTECTED]> wrote:
> >
> >> Hmm, I'm not sure, but maybe we don't include all Tika parser deps in our
> >> build.xml?
> >>
> >>
> >>
> >> -----Original message-----
> >>> From:Sebastian Nagel <[EMAIL PROTECTED]>
> >>> Sent: Thu 09-Aug-2012 23:18
> >>> To: [EMAIL PROTECTED]
> >>> Subject: Re: CHM Files and Tika
> >>>
> >>> Hi Jan,
> >>>
> >>> confirmed: Nutch cannot parse, while Tika (same version used by Nutch)
> >>> can parse chm. The chm parsers are in tika-parser*.jar which is contained
> >>> in the Nutch package.
> >>>
> >>> Any ideas?
> >>>
> >>> Sebastian
> >>>
> >>> On 08/08/2012 12:03 PM, Jan Riewe wrote:
> >>>> Hey there,
> >>>>
> >>>> I am trying to parse CHM (Microsoft Help Files) with Nutch, but I get:
> >>>>
> >>>> Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp
> >>>>
> >>>> I've tried Nutch 1.4 (Tika 0.10) and Nutch 1.5.1 (Tika 1.1), which
> >>>> should be able to parse those files
> >>>> https://issues.apache.org/jira/browse/TIKA-245
> >>>>
> >>>> In tika-mimetypes.xml I do find an entry related to
> >>>> application/vnd.ms-htmlhelp
> >>>>
> >>>> Has anyone run into the same issue and knows how to fix it?
> >>>>
> >>>> Bye
> >>>> Jan
> >>>>
> >>>
> >>>
> >>
> >
> >
> >
>