Vikas Hazrati 2012-04-03, 07:48
Julien Nioche 2012-04-03, 10:05
Vikas Hazrati 2012-04-03, 15:58
shlomi java 2012-04-03, 10:28
Re: Order of plugins, regex-urlfilter being ignored
Vikas Hazrati 2012-04-03, 15:59
Thanks SJ, response appreciated!
On Tue, Apr 3, 2012 at 3:58 PM, shlomi java <[EMAIL PROTECTED]> wrote:

> Also available (in Nutch 1.4) are the following properties:
> indexingfilter.order, urlnormalizer.order,
> htmlparsefilter.order, scoring.filter.order.
> SJ
>
> On Tue, Apr 3, 2012 at 1:05 PM, Julien Nioche <[EMAIL PROTECTED]> wrote:
>
> > see nutch-default.xml
> >
> > <property>
> >   <name>urlfilter.order</name>
> >   <value></value>
> >   <description>The order by which url filters are applied.
> >   If empty, all available url filters (as dictated by properties
> >   plugin-includes and plugin-excludes above) are loaded and applied in
> >   system defined order. If not empty, only named filters are loaded and
> >   applied in given order. For example, if this property has value:
> >   org.apache.nutch.urlfilter.regex.RegexURLFilter
> >   org.apache.nutch.urlfilter.prefix.PrefixURLFilter
> >   then RegexURLFilter is applied first, and PrefixURLFilter second.
> >   Since all filters are AND'ed, filter ordering does not have impact
> >   on end result, but it may have performance implication, depending
> >   on relative expensiveness of filters.
> >   </description>
> > </property>
> >
> > On 3 April 2012 08:48, Vikas Hazrati <[EMAIL PROTECTED]> wrote:
> >
> > > When we specify the plugins in nutch-site.xml does their order matter?
> > >
> > > I have the plugins defined as
> > >
> > > <value>protocol-http|urlfilter-regex|*Myaggregator*|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > >
> > > Myaggregator is a plugin which extends URLFilter. Does it mean that when
> > > Myaggregator is invoked, it would get URLs which have already been
> > > filtered by urlfilter-regex?
> > >
> > > Also, I am adding the following extensions to the regex-urlfilter.txt
> > > file to ignore, however these links do appear in my custom URL filter
> > > which is mentioned
> > >
> > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|*ics|kml|atom*)$
> > >
> > > Any thoughts?
> > >
> > > Regards | Vikas
> > > www.knoldus.com
> >
> > --
> > *
> > *Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
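
For illustration, the urlfilter.order property quoted above could be overridden in nutch-site.xml roughly as follows. This is only a sketch: com.example.MyAggregatorURLFilter is a hypothetical class name standing in for the custom Myaggregator filter, whose real class name is not given in the thread. The value lists fully qualified filter class names separated by whitespace, following the example in the property description.

```xml
<!-- Sketch of a nutch-site.xml override; not taken from the thread. -->
<property>
  <name>urlfilter.order</name>
  <!-- RegexURLFilter is named first so it is applied first.
       com.example.MyAggregatorURLFilter is a hypothetical placeholder
       for the custom filter's actual class name. -->
  <value>org.apache.nutch.urlfilter.regex.RegexURLFilter com.example.MyAggregatorURLFilter</value>
</property>
```

As the description notes, when this property is non-empty only the named filters are loaded, so any other URL filter that should remain active would also need to be listed.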