Seed urls not getting crawled.
Sudip Datta 2012-02-09, 07:26
Hi,
I am using Nutch 1.4 to crawl about 50 sites. Since I need to ensure that certain URLs on these sites are crawled, and I know the URLs of these pages, I seed the crawl with them. At times, the number of seeds per host runs into the thousands.
Initially the crawl was very slow despite having 20 threads at its disposal, and it ran almost sequentially. I was able to parallelize the crawl to a large degree, and so speed it up, by setting a bound on the generate.max.count parameter (say 400) and setting generate.count.mode to host. Unfortunately, this means that pages beyond the generate.max.count limit are never crawled, even though the crawldb is aware of them. For example, here is a crawldb entry that is never actually crawled:
<url> Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Feb 09 11:53:29 GMT+05:30 2012
Modified time: Thu Jan 01 05:30:00 GMT+05:30 1970
Retries since fetch: 0
Retry interval: 86400 seconds (1 days)
Score: 1.0
Signature: null
Metadata:
While this indicates that a reattempt will be made in 1 day, the URL never actually reaches the db_fetched state. On the other hand, if I set generate.max.count = -1, the page is indeed crawled, but the crawl is painfully slow.
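For reference, the two generator settings described above would look roughly like this as overrides in conf/nutch-site.xml. This is only a sketch: 400 is the example bound mentioned above, and the file location assumes a standard Nutch 1.x layout.

```xml
<!-- Sketch of overrides in conf/nutch-site.xml (assumed location), placed inside <configuration> -->
<property>
  <name>generate.max.count</name>
  <!-- Example bound from this message; -1 removes the limit -->
  <value>400</value>
</property>
<property>
  <name>generate.count.mode</name>
  <!-- Apply the bound per host -->
  <value>host</value>
</property>
```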
How can I keep the crawl parallelized without missing such URLs?
Thanks,
--Sudip.
Re: Seed urls not getting crawled.
Lewis John Mcgibbney 2012-02-10, 21:00
Hi,
On Thu, Feb 9, 2012 at 7:26 AM, Sudip Datta <[EMAIL PROTECTED]> wrote:
> While, this indicates that a reattempt will be made in 1 day, the
> 'url' never really gets the state db_fetched. On the other hand, if I
> set generate.max.count = -1, the page is indeed crawled but the crawl
> is painfully slow.

Do you have any idea about which part of the crawl is painfully slow?
How are you running your crawls?
Thanks
-- *Lewis*