Nutch, mail # user - parse.ParserFactory


Re: parse.ParserFactory
Tolga 2012-05-29, 06:31
Hi,

I know this issue should have been closed, but I thought I'd continue
this rather than starting a new thread.

Anyway, I'm still getting this warning:

parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to contentType
application/pdf via parse-plugins.xml, but not enabled via plugin.includes
in nutch-default.xml

even though I have tika in plugin.includes in my nutch-default.xml:

<value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>

What's the point of showing this warning if I already have tika enabled?
IMHO it should be removed.

Regards,
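
For illustration, here is a minimal Python sketch of how the plugin.includes value above behaves as a regular expression. It assumes that a plugin counts as enabled only when its whole id matches the pattern; that is an assumption about Nutch's matching, not something stated in this thread. The helper name is_enabled is hypothetical.

```python
import re

# plugin.includes value quoted in the message above.
PLUGIN_INCLUDES = (
    "protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)"
    "|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)"
)


def is_enabled(plugin_id: str, includes: str = PLUGIN_INCLUDES) -> bool:
    """Hypothetical helper: True if the whole plugin id matches the pattern.

    Assumption: the id must match the entire expression, not a substring.
    """
    return re.fullmatch(includes, plugin_id) is not None


if __name__ == "__main__":
    # Compare the plugin named in the warning with the one that is listed.
    for plugin_id in ("parse-tika", "parse-pdf"):
        print(plugin_id, is_enabled(plugin_id))
```

Under that assumption parse-tika matches and parse-pdf does not, which is consistent with the warning text: parse-plugins.xml still names parse-pdf for application/pdf, while plugin.includes only lists parse-tika.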

On 5/23/12 12:27 AM, Lewis John Mcgibbney wrote:
> Unless you're using Nutch <= 1.2, you should not have
> msexcel|mspowerpoint|msword|oo|pdf in your plugin.includes. All
> of these document formats are (and have been for some time)
> handled by Apache Tika parsers.
>
> hth
>
>
>
> On Tue, May 22, 2012 at 9:20 PM, Tolga <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I crawl / index PDF files just fine, but I get the following warning.
>>
>> parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to contentType
>> application/pdf via parse-plugins.xml, but not enabled via plugin.includes
>> in nutch-default.xml.
>>
>> I've got the value
>> protocol-http|urlfilter-regex|parse-(html|tika|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
>> for plugin.includes property in nutch-default.xml. What am I missing?
>>
>> Regards,
>
>