Nutch, mail # user - parse.ParserFactory


Re: parse.ParserFactory
Tolga 2012-05-29, 07:52
I am doing it in NUTCH_HOME/runtime/local/conf. I thought I could edit
nutch-default.xml directly, and that nutch-site.xml merely overrode it.

On 5/29/12 9:48 AM, Julien Nioche wrote:
> If you are seeing this warning, then parse-pdf IS being
> used. You should modify nutch-site.xml, not nutch-default.xml, and my bet is
> that you are doing this in NUTCH_HOME/conf and not in
> NUTCH_HOME/runtime/local/conf (see the tutorial on the wiki).
>
>
>
> On 29 May 2012 07:31, Tolga <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I know this issue should have been closed, but I thought I'd continue this
>> rather than starting a new thread.
>>
>> Anyway, I'm getting this: "parse.ParserFactory - ParserFactory: Plugin:
>> parse-pdf mapped to contentType application/pdf via parse-plugins.xml, but
>> not enabled via plugin.includes in nutch-default.xml", and I have tika in my
>> nutch-default.xml:
>> <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>> What's the point of seeing this warning if I already have tika? This
>> should be removed IMHO.
>>
>> Regards,
>>
>>
>> On 5/23/12 12:27 AM, Lewis John Mcgibbney wrote:
>>
>>> Unless you're using Nutch <= 1.2, you should not be using
>>> msexcel|mspowerpoint|msword|oo|pdf within your plugin.includes... all
>>> of these document formats are (and have been for some time)
>>> implemented as Apache Tika parsers.
>>>
>>> hth
>>>
>>>
>>>
>>> On Tue, May 22, 2012 at 9:20 PM, Tolga <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I crawl / index PDF files just fine, but I get the following warning.
>>>>
>>>> parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to
>>>> contentType application/pdf via parse-plugins.xml, but not enabled via
>>>> plugin.includes in nutch-default.xml.
>>>>
>>>> I've got the value
>>>> protocol-http|urlfilter-regex|parse-(html|tika|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
>>>> for plugin.includes property in nutch-default.xml. What am I missing?
>>>>
>>>> Regards,
>>>>
>>>
>>>
>
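
For reference, the override Julien describes could look roughly like the fragment below. It is only a sketch: the file location is the one named in the thread, the value is copied from Tolga's second message, and the assumption is that nutch-site.xml uses the same <configuration>/<property> layout as nutch-default.xml. Adjust the plugin list to your own setup.

```xml
<?xml version="1.0"?>
<!-- Assumed location, per the thread: NUTCH_HOME/runtime/local/conf/nutch-site.xml -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Value copied from Tolga's message; parse-tika handles PDF here,
         so parse-pdf is intentionally not listed. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

With a property like this in nutch-site.xml, nutch-default.xml can be left untouched. Whether the ParserFactory warning itself disappears presumably depends on which plugin parse-plugins.xml maps application/pdf to, which the thread does not settle.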