Re: parse.ParserFactory
Tolga 2012-05-29, 07:52
I am doing it in NUTCH_HOME/runtime/local/conf. I thought I could use
nutch-default.xml, and that nutch-site.xml just overrode nutch-default.xml.

On 5/29/12 9:48 AM, Julien Nioche wrote:
> If you are seeing this warning then this means that parse-pdf IS being
> used. You should modify nutch-site.xml and not nutch-default.xml, and my
> bet is that you are doing this in NUTCH_HOME/conf and not in
> NUTCH_HOME/runtime/local/conf (see the tutorial on the wiki).
>
> On 29 May 2012 07:31, Tolga <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> I know this issue should have been closed, but I thought I'd continue
>> this rather than starting a new thread.
>>
>> Anyway, I'm getting this:
>>
>> parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to
>> contentType application/pdf via parse-plugins.xml, but not enabled via
>> plugin.includes in nutch-default.xml
>>
>> and I have tika in my nutch-default.xml:
>>
>> <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
>>
>> What's the point of seeing this warning if I already have tika? This
>> should be removed IMHO.
>>
>> Regards,
>>
>> On 5/23/12 12:27 AM, Lewis John Mcgibbney wrote:
>>
>>> Unless you're using <= Nutch 1.2 you should not be using
>>> msexcel|mspowerpoint|msword|oo|pdf| within your plugin.includes... all
>>> of these document formats are (and have been for some time)
>>> implemented as Apache Tika parsers.
>>>
>>> hth
>>>
>>> On Tue, May 22, 2012 at 9:20 PM, Tolga <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I crawl / index PDF files just fine, but I get the following warning.
>>>>
>>>> parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to
>>>> contentType application/pdf via parse-plugins.xml, but not enabled
>>>> via plugin.includes in nutch-default.xml.
>>>>
>>>> I've got the value
>>>> protocol-http|urlfilter-regex|parse-(html|tika|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
>>>> for plugin.includes property in nutch-default.xml. What am I missing?
>>>>
>>>> Regards,
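For reference, the override Julien describes could look roughly like the fragment below. This is only a sketch: it assumes the standard <configuration>/<property> layout that nutch-default.xml itself uses, places the file under NUTCH_HOME/runtime/local/conf as suggested in the thread, and reuses the plugin.includes value quoted above unchanged. The thread does not confirm whether this exact value makes the warning go away.

```xml
<?xml version="1.0"?>
<!-- Sketch of NUTCH_HOME/runtime/local/conf/nutch-site.xml.
     Assumption: properties set here take precedence over the same
     property in nutch-default.xml, which is left unedited. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <!-- Value copied from the thread: tika is listed, parse-pdf is not. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Keeping the change in nutch-site.xml means nutch-default.xml can stay as shipped, which is the arrangement Julien recommends.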