Re: parse.ParserFactory
Tolga 2012-05-29, 08:37
...and also, nutch-site.xml is blank here, so I'm sure it's not being
used at all.

On 5/29/12 11:34 AM, Julien Nioche wrote:
>> I am doing it in NUTCH_HOME/runtime/local/conf. I thought I could use
>> nutch-default.xml, and nutch-site.xml just overrode nutch-default.xml.
>
> that's the case. I was just mentioning a recommended practice, not a strict
> requirement
>
>> On 5/29/12 9:48 AM, Julien Nioche wrote:
>>
>>> if you are seeing this warning then this means that parse-pdf IS being
>>> used. You should modify nutch-site.xml and not nutch-default and my bet is
>>> that you are doing this in NUTCH_HOME/conf and not in
>>> NUTCH_HOME/runtime/local/conf (see tutorial on WIKI)
>>>
>>> On 29 May 2012 07:31, Tolga <[EMAIL PROTECTED]> wrote:
>>>
>>>> Hi,
>>>> I know this issue should have been closed, but I thought I'd continue
>>>> this rather than starting a new thread.
>>>>
>>>> Anyway, I'm getting this: parse.ParserFactory - ParserFactory: Plugin:
>>>> parse-pdf mapped to contentType application/pdf via parse-plugins.xml,
>>>> but not enabled via plugin.includes in nutch-default.xml and I have
>>>> tika in my nutch-default.xml:
>>>> <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>.
>>>> What's the point of seeing this warning if I already have tika? This
>>>> should be removed IMHO.
>>>>
>>>> Regards,
>>>>
>>>> On 5/23/12 12:27 AM, Lewis John Mcgibbney wrote:
>>>>
>>>>> Unless you're using <= Nutch 1.2 you should not be using
>>>>> msexcel|mspowerpoint|msword|oo|pdf| within your plugin.includes...
>>>>> all of these document formats are (and have been for some time)
>>>>> implemented as Apache Tika parsers.
>>>>>
>>>>> hth
>>>>>
>>>>> On Tue, May 22, 2012 at 9:20 PM, Tolga <[EMAIL PROTECTED]> wrote:
>>>>>
>>>>>> Hi,
>>>>>> I crawl / index PDF files just fine, but I get the following warning.
>>>>>>
>>>>>> parse.ParserFactory - ParserFactory: Plugin: parse-pdf mapped to
>>>>>> contentType application/pdf via parse-plugins.xml, but not enabled
>>>>>> via plugin.includes in nutch-default.xml.
>>>>>>
>>>>>> I've got the value
>>>>>> protocol-http|urlfilter-regex|parse-(html|tika|js|msexcel|mspowerpoint|msword|oo|pdf|swf|zip)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)
>>>>>>
>>>>>> for plugin.includes property in nutch-default.xml. What am I missing?
>>>>>>
>>>>>> Regards,
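
For reference, here is a minimal sketch of the override discussed in this thread, as it might appear in NUTCH_HOME/runtime/local/conf/nutch-site.xml. The plugin.includes value is the tika-based string quoted above. The surrounding <configuration>/<property> layout is assumed to be the same Hadoop-style format that nutch-default.xml uses. It illustrates where the setting goes and is not a recommended plugin set.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Assumed layout: same property format as nutch-default.xml. -->
  <!-- This entry overrides plugin.includes from nutch-default.xml. -->
  <property>
    <name>plugin.includes</name>
    <!-- Value copied from the thread: tika is enabled, parse-pdf is not listed. -->
    <value>protocol-http|urlfilter-regex|parse-(html|tika|js|swf|zip|xml)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

As noted in the thread, nutch-site.xml overrides nutch-default.xml, so a blank nutch-site.xml leaves only the defaults in effect.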