Re: Component fetching during parsing. (vertical crawling)
Andrzej Bialecki 2010-07-20, 13:13
On 2010-07-20 14:30, Ferdy wrote:
> Hello,
>
> We are currently using a heavily modified version of nutch. The main
> reason for this is the fact that we do not only fetch the urls that the
> QueueFeeder submits, but also additional resources from urls that are
> constructed during parsing. So for example let's say the QueueFeeder
> submits an html page to the fetcher, and after the fetch the page gets
> parsed. Nothing special so far. However the parser decides it also needs
> some images on the page. Perhaps these images link to other html pages,
> and we might want to fetch these too. All this is needed to parse
> information about this particular url we started with. These extra fetch
> urls we like to call Components, because they are additional resources
> required to do the parsing of our initial html page that was selected
> for fetching.
>
> At first we tried to solve this "vertical crawling" problem by using
> multiple crawl cycles. Each crawl simply selects outlinks that are
> needed for the parsing of the initial html page. A single inspection can
> possibly overlap 2, 3 or 4 cycles (depending on the inspection's graph
> depth). There are several problems with this approach: first, the
> crawldb is cluttered with all these component urls, and second,
> inspection completion times can be very long.
>
> As an alternative we decided to let the parser fetch needed components
> on-the-fly, so that additional urls are instantly added to the fetcher
> lists. Every fetched url can be either a non-component (the QueueFeeder
> fed it; start parsing this resource) or a component (the fetcher
> hands the resource over to the parser that requested it). In order to
> keep parsers alive we always try to fetch components first, with respect
> to fetch politeness. A downside of this solution is that your fetch task
> total running time will be more difficult to anticipate. For example,
> if you inject and generate 100 urls and they will be fetched in a single
> task, you might end up fetching a total of 1100 urls (on the assumption
> that each inspection needs 10 components). We found this behaviour to be
> acceptable.
>
> Because of our custom version of nutch we cannot upgrade easily to newer
> versions (we're still using modified fetcher classes from nutch 0.9).
> Often we end up fixing bugs that have already been fixed by the
> community. Also, other users might benefit from our changes too.
>
> Therefore we propose to redesign our vertical crawling system from
> scratch for the newer nutch versions, should there be any interest from
> the community. Perhaps we are not the only ones to implement such a
> system with nutch. So, what are your thoughts about this?

If I understand your use case properly, this is really a custom Fetcher
that you are talking about - a strategy to fetch complete pages (together
with the resources that relate to the display of each page) should be
possible to implement in a custom fetcher without changing other Nutch
areas.

--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
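
For illustration only, the component-first scheduling described in the quoted message could be sketched roughly as below. This is not Nutch code: the class `ComponentFirstQueue`, its methods, and the example.com URLs are all hypothetical names invented for this sketch, and fetch politeness (per-host delays) is deliberately left out. It only shows the ordering rule (components before feeder-submitted urls) and the url count that results.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of component-first scheduling; not part of Nutch. */
public class ComponentFirstQueue {

    /** A url plus a flag saying whether a parser requested it as a component. */
    public static final class FetchItem {
        public final String url;
        public final boolean component;

        public FetchItem(String url, boolean component) {
            this.url = url;
            this.component = component;
        }
    }

    private final Deque<FetchItem> components = new ArrayDeque<>();
    private final Deque<FetchItem> feederItems = new ArrayDeque<>();

    /** Queues a url that the feeder selected for fetching. */
    public void addFromFeeder(String url) {
        feederItems.addLast(new FetchItem(url, false));
    }

    /** Queues a url that a parser requested while parsing another page. */
    public void addComponent(String url) {
        components.addLast(new FetchItem(url, true));
    }

    /** Returns pending components first so waiting parsers can finish; null when empty. */
    public FetchItem next() {
        FetchItem item = components.pollFirst();
        return item != null ? item : feederItems.pollFirst();
    }

    /** Simulates one fetch task in which every feeder page requests a fixed number of components. */
    public static int simulate(int pages, int componentsPerPage) {
        ComponentFirstQueue queue = new ComponentFirstQueue();
        for (int i = 0; i < pages; i++) {
            queue.addFromFeeder("http://example.com/page" + i);
        }
        int fetched = 0;
        FetchItem item;
        while ((item = queue.next()) != null) {
            fetched++;
            if (!item.component) {
                // Stand-in for the parser asking for extra resources on the fly.
                for (int c = 0; c < componentsPerPage; c++) {
                    queue.addComponent(item.url + "/component" + c);
                }
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        // 100 feeder urls + 100 * 10 components = 1100 fetches in total.
        System.out.println(simulate(100, 10));
    }
}
```

Keeping two separate queues makes the priority rule explicit; a real fetcher would additionally have to apply per-host politeness before handing out the next item.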