Nutch, mail # dev - Component fetching during parsing. (vertical crawling)


Re: Component fetching during parsing. (vertical crawling)
Andrzej Bialecki 2010-07-20, 13:13
On 2010-07-20 14:30, Ferdy wrote:
> Hello,
>
> We are currently using a heavily modified version of nutch. The main
> reason for this is the fact that we do not only fetch the urls that the
> QueueFeeder submits, but also additional resources from urls that are
> constructed during parsing. So for example let's say the QueueFeeder
> submits a html page to the fetcher, and after the fetch the page gets
> parsed. Nothing special so far. However the parser decides it also needs
> some images on the page. Perhaps these images link to other html pages,
> and we might want to fetch these too. All this is needed to parse
> information about this particular url we started with. These extra fetch
> urls we like to call Components, because they are additional resources
> required to do the parsing of our initial html page that was selected
> for fetching.
>
> At first we tried to solve this "vertical crawling" problem by using
> multiple crawl cycles. Each crawl simply selects outlinks that are
> needed for the parsing of the initial html page. A single inspection can
> possibly overlap 2, 3 or 4 cycles (depending on the inspection's graph
> depth). There are several problems with this approach, for one that the
> crawldb is cluttered with all these component urls and secondly that
> inspection completion times can be very long.
>
> As an alternative we decided to let the parser fetch needed components
> on-the-fly, so that additional urls are instantly added to the fetcher
> lists. Every fetched url is either a non-component (the QueueFeeder
> fed it; start parsing this resource) or a component (the fetcher
> hands the resource over to the parser that requested it). In order to
> keep parsers alive we always try to fetch components first, with respect
> to fetch politeness. A downside of this solution is that the total
> running time of a fetch task becomes harder to predict. For example,
> if you inject and generate 100 urls and they are fetched in a single
> task, you might end up fetching a total of 1100 urls (assuming
> each inspection needs 10 components). We found this behaviour to be
> acceptable.
>
> Because of our custom version of nutch we cannot upgrade easily to newer
> versions (we're still using modified fetcher classes from nutch 0.9).
> Often we end up fixing bugs that have already been fixed by the
> community. Also, other users might benefit from our changes too.
>
> Therefore we propose to redesign our vertical crawling system from
> scratch for the newer nutch versions, should there be any interest from
> the community. Perhaps we are not the only ones to implement such a
> system with nutch. So, what are your thoughts about this?

If I understand your use case properly, this is really a custom Fetcher
that you are talking about - a strategy to fetch complete pages
(together with the resources needed to display them)
should be possible to implement in a custom fetcher without changing
other Nutch areas.
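
To make the idea concrete, here is a minimal sketch of the
"components first" ordering described in the quoted message. It is
only an illustration under stated assumptions: ComponentFirstQueue,
Item, simulate and totalFetches are hypothetical names, not Nutch
classes, and per-host politeness delays and real network fetching are
left out entirely.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch; none of these names exist in Nutch. */
final class ComponentFirstQueue {

    /** A url to fetch: either fed by the queue feeder or requested by a parser. */
    static final class Item {
        final String url;
        final boolean component;

        Item(String url, boolean component) {
            this.url = url;
            this.component = component;
        }
    }

    private final Deque<Item> components = new ArrayDeque<>();
    private final Deque<Item> seeds = new ArrayDeque<>();

    void addSeed(String url) {
        seeds.addLast(new Item(url, false));
    }

    void addComponent(String url) {
        components.addLast(new Item(url, true));
    }

    /** Components are served before seeds, so waiting parsers finish sooner. */
    Item next() {
        Item item = components.pollFirst();
        return item != null ? item : seeds.pollFirst();
    }

    /**
     * Drains the queue and returns the fetch order. Assumption: every seed
     * requests exactly componentsPerSeed components once it has been fetched,
     * and components request nothing further.
     */
    static List<String> simulate(List<String> seedUrls, int componentsPerSeed) {
        ComponentFirstQueue queue = new ComponentFirstQueue();
        for (String url : seedUrls) {
            queue.addSeed(url);
        }
        List<String> order = new ArrayList<>();
        Item item;
        while ((item = queue.next()) != null) {
            order.add(item.url);
            if (!item.component) {
                for (int i = 0; i < componentsPerSeed; i++) {
                    queue.addComponent(item.url + "#c" + i);
                }
            }
        }
        return order;
    }

    /** Total fetches for a task: the seeds plus the components each one needs. */
    static int totalFetches(int seedCount, int componentsPerSeed) {
        return seedCount + seedCount * componentsPerSeed;
    }
}
```

With 100 seeds and 10 components each, totalFetches(100, 10) gives the
1100 fetches mentioned in the quoted message. A real fetcher would also
have to apply politeness limits before handing out the next item.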
--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com