Nutch, mail # user - parse.ParserFactory


Re: parse.ParserFactory
Tolga 2012-05-29, 08:37
...and also, nutch-site.xml is blank here, so I'm sure it's not being
used at all.

On 5/29/12 11:34 AM, Julien Nioche wrote:
>> I am doing it in NUTCH_HOME/runtime/local/conf. I thought I could use
>> nutch-default.xml, and nutch-site.xml just overrode nutch-default.xml.
>
> that's the case. I was just mentioning a recommended practice, not a strict
> requirement
>
>
>
>>
>> On 5/29/12 9:48 AM, Julien Nioche wrote:
>>
>>> if you are seeing this warning then this means that parse-pdf IS being
>>> used. You should modify nutch-site.xml and not nutch-default.xml, and my
>>> bet is that you are doing this in NUTCH_HOME/conf and not in
>>> NUTCH_HOME/runtime/local/conf (see the tutorial on the wiki)
>>>
>>>
>>>
>>> On 29 May 2012 07:31, Tolga<[EMAIL PROTECTED]>   wrote:
>>>
>>>   Hi,
>>>> I know this issue should have been closed, but I thought I'd continue
>>>> this rather than starting a new thread.
>>>>
>>>> Anyway, I'm getting this: parse.ParserFactory - ParserFactory: Plugin:
>>>> parse-pdf mapped to contentType application/pdf via parse-plugins.xml,
>>>> but not enabled via plugin.includes in nutch-default.xml and I have tika
>>>> in my nutch-default.xml:
>>>> <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>.
>>>> What's the point of seeing this warning if I already have tika? This
>>>> should be removed IMHO.
>>>>
>>>> Regards,
>>>>
>>>>
>>>> On 5/23/12 12:27 AM, Lewis John Mcgibbney wrote:
>>>>
>>>>   Unless you're using <= Nutch 1.2 you should not be using
>>>>> msexcel|mspowerpoint|msword|oo|pdf| within your plugin.includes...
>>>>> all of these document formats are (and have been for some time)
>>>>> implemented as Apache Tika parsers.
>>>>>
>>>>> hth
>>>>>
>>>>>
>>>>>
>>>>> On Tue, May 22, 2012 at 9:20 PM, Tolga<[EMAIL PROTECTED]>    wrote:
>>>>>
>>>>>   Hi,
>>>>>> I crawl / index PDF files just fine, but I get the following warning.
>>>>>>
>>>>>> parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to
>>>>>> contentType application/pdf via parse-plugins.xml, but not enabled via
>>>>>> plugin.includes in nutch-default.xml.
>>>>>>
>>>>>> I've got the value
>>>>>> protocol-http|urlfilter-regex|parse-(html|tika|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
>>>>>>
>>>>>> for plugin.includes property in nutch-default.xml. What am I missing?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>>
>>>>>
>
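
For reference, the practice Julien recommends (overriding in nutch-site.xml rather than editing nutch-default.xml) might look roughly like the sketch below, placed in NUTCH_HOME/runtime/local/conf/nutch-site.xml. The value is simply the plugin.includes string quoted in the 29 May message, with Tika handling PDF and no parse-pdf entry. It illustrates the file layout only and is not a recommended plugin set. Whether it silences the ParserFactory warning is not confirmed anywhere in this thread.

```xml
<?xml version="1.0"?>
<!-- Sketch of NUTCH_HOME/runtime/local/conf/nutch-site.xml.
     Properties defined here override the same-named ones in nutch-default.xml. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Value copied from the thread; parse-tika covers PDF, so parse-pdf is not listed. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

The warning text itself points at parse-plugins.xml, which is where application/pdf is mapped to parse-pdf, so that file is the other place to look if the message persists.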