Nutch, mail # user - CHM Files and Tika


RE: CHM Files and Tika
Markus Jelsma 2012-08-09, 22:30
Hmm, I'm not sure, but maybe we don't include all of the Tika parser dependencies in our build.xml?
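
A rough way to check which jars actually ship the CHM parser classes is to scan a directory of jars for entries under the Tika CHM package. This is only a sketch, not anything from the Nutch code base: the class name `FindChmParser`, the default directory `plugins/parse-tika`, and the package path `org/apache/tika/parser/chm/` are assumptions, so adjust them to your installation. It only tells you whether the classes are present in a jar, not whether the plugin puts that jar (and its dependencies) on its classpath.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.jar.JarFile;

public class FindChmParser {

    // Assumed package path of Tika's CHM parser classes.
    static final String CHM_PACKAGE = "org/apache/tika/parser/chm/";

    /** Returns the jars directly inside dir that contain a .class entry under prefix. */
    static List<Path> jarsContaining(Path dir, String prefix) throws IOException {
        List<Path> hits = new ArrayList<>();
        try (DirectoryStream<Path> jars = Files.newDirectoryStream(dir, "*.jar")) {
            for (Path jar : jars) {
                try (JarFile jf = new JarFile(jar.toFile())) {
                    boolean found = jf.stream().anyMatch(e ->
                            e.getName().startsWith(prefix)
                                    && e.getName().endsWith(".class"));
                    if (found) {
                        hits.add(jar);
                    }
                }
            }
        }
        return hits;
    }

    public static void main(String[] args) throws IOException {
        // Hypothetical default; pass the real jar directory as the first argument.
        Path dir = Paths.get(args.length > 0 ? args[0] : "plugins/parse-tika");
        if (!Files.isDirectory(dir)) {
            System.err.println("Not a directory: " + dir);
            return;
        }
        List<Path> hits = jarsContaining(dir, CHM_PACKAGE);
        if (hits.isEmpty()) {
            System.out.println("No jar in " + dir + " contains " + CHM_PACKAGE);
        } else {
            hits.forEach(jar -> System.out.println("CHM parser classes found in " + jar));
        }
    }
}
```

If the classes show up in the jar but Nutch still cannot retrieve a parser, the problem is more likely in how the plugin is wired up than in a missing parser jar.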

 
 
-----Original message-----
> From:Sebastian Nagel <[EMAIL PROTECTED]>
> Sent: Thu 09-Aug-2012 23:18
> To: [EMAIL PROTECTED]
> Subject: Re: CHM Files and Tika
>
> Hi Jan,
>
> Confirmed: Nutch cannot parse CHM, while Tika (the same version used by
> Nutch) can. The CHM parsers are in tika-parser*.jar, which is included
> in the Nutch package.
>
> Any ideas?
>
> Sebastian
>
> On 08/08/2012 12:03 PM, Jan Riewe wrote:
> > Hey there,
> >
> > I am trying to parse CHM (Microsoft Help Files) with Nutch, but I get:
> >
> > Can't retrieve Tika parser for mime-type application/vnd.ms-htmlhelp
> >
> > I've tried Nutch 1.4 (Tika 0.10) and Nutch 1.5.1 (Tika 1.1), both of
> > which should be able to parse those files:
> > https://issues.apache.org/jira/browse/TIKA-245
> >
> > In tika-mimetypes.xml I do find an entry for
> > application/vnd.ms-htmlhelp.
> >
> > Has anyone run into the same issue and knows how to fix it?
> >
> > Bye
> > Jan
> >
>
>