Nutch, mail # user - Order of plugins, regex-urlfilter being ignored


Vikas Hazrati 2012-04-03, 07:48
Julien Nioche 2012-04-03, 10:05
Vikas Hazrati 2012-04-03, 15:58
shlomi java 2012-04-03, 10:28
Re: Order of plugins, regex-urlfilter being ignored
Vikas Hazrati 2012-04-03, 15:59
Thanks SJ, response appreciated!

On Tue, Apr 3, 2012 at 3:58 PM, shlomi java <[EMAIL PROTECTED]> wrote:

> Also available (in Nutch 1.4) are the following properties:
> indexingfilter.order, urlnormalizer.order,
> htmlparsefilter.order, scoring.filter.order.
>  SJ
>
> On Tue, Apr 3, 2012 at 1:05 PM, Julien Nioche <[EMAIL PROTECTED]> wrote:
>
> > see nutch-default.xml
> >
> >
> > <property>
> >  <name>urlfilter.order</name>
> >  <value></value>
> >  <description>The order by which url filters are applied.
> >  If empty, all available url filters (as dictated by properties
> >  plugin-includes and plugin-excludes above) are loaded and applied in system
> >  defined order. If not empty, only named filters are loaded and applied
> >  in given order. For example, if this property has value:
> >  org.apache.nutch.urlfilter.regex.RegexURLFilter org.apache.nutch.urlfilter.prefix.PrefixURLFilter
> >  then RegexURLFilter is applied first, and PrefixURLFilter second.
> >  Since all filters are AND'ed, filter ordering does not have impact
> >  on end result, but it may have performance implication, depending
> >  on relative expensiveness of filters.
> >  </description>
> > </property>
> >
> > On 3 April 2012 08:48, Vikas Hazrati <[EMAIL PROTECTED]> wrote:
> >
> > > When we specify the plugins in nutch-site.xml does their order matter?
> > >
> > > I have the plugins defined as
> > >
> > > <value>protocol-http|urlfilter-regex|*Myaggregator*|parse-(html|tika)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
> > >
> > > Myaggregator is a plugin which extends URLFilter. Does it mean that when
> > > Myaggregator is invoked, it would get URLs which have already been filtered
> > > by urlfilter-regex?
> > >
> > > Also, I am adding the following extensions to the regex-urlfilter.txt file
> > > to ignore; however, these links do appear in my custom URL filter
> > > mentioned above:
> > >
> > > -\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS|*ics|kml|atom*)$
> > >
> > > Any thoughts?
> > >
> > > Regards | Vikas
> > > www.knoldus.com
> > >
> >
> >
> >
> > --
> > Open Source Solutions for Text Engineering
> >
> > http://digitalpebble.blogspot.com/
> > http://www.digitalpebble.com
> > http://twitter.com/digitalpebble
> >
>
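
For reference, here is a minimal sketch of how the urlfilter.order property quoted above might be overridden in nutch-site.xml so that the regex filter is applied before a custom one. The class name com.example.MyAggregatorFilter is hypothetical (the thread never gives the real class name of the Myaggregator plugin), and the custom plugin is assumed to remain enabled through the plugin includes setting. Per the quoted description, once this property is non-empty only the filters named in it are loaded, so the custom filter has to be listed explicitly.

```xml
<!-- nutch-site.xml fragment: an illustrative sketch, not taken from the thread -->
<property>
  <name>urlfilter.order</name>
  <!-- Class names separated by whitespace, applied in the order given.
       com.example.MyAggregatorFilter is a hypothetical name for the custom filter. -->
  <value>org.apache.nutch.urlfilter.regex.RegexURLFilter com.example.MyAggregatorFilter</value>
</property>
```

The position of a plugin inside the plugin includes value is unlikely to control ordering, since that value is a regular expression matched against plugin ids rather than an ordered list; the order properties named in this thread are the documented way to fix the sequence.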