Re: using less resources
remi tassing 2012-05-23, 12:58
I was wondering how you would know whether the page has changed without actually
On Wednesday, May 23, 2012, wrote:
> As far as I understand, Nutch recrawls URLs once their fetch time has passed,
> regardless of whether those URLs were modified or not.
> Is there any initiative to restrict recrawls to only those URLs whose
> modified time (MT) is greater than the old MT?
> In detail: if Nutch has crawled a URL with the next fetch time in 30 days,
> then on the second recrawl Nutch must visit this URL, retrieve its modified
> time, compare it with the modified time we have in the crawldb, and
> recrawl it if the new MT is greater than the old one; otherwise skip it.
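
To make the comparison described above concrete, here is a rough sketch in Python. It is not Nutch code: the names `parse_last_modified` and `should_refetch` are made up for illustration, and it assumes the server reports a modification time in the HTTP `Last-Modified` header (for example in the response to a HEAD request). When either timestamp is missing or unparseable, the sketch falls back to refetching, since there is no way to tell whether the page changed.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_last_modified(header_value: Optional[str]) -> Optional[datetime]:
    """Parse an HTTP Last-Modified header value (hypothetical helper).

    Returns None when the header is absent or cannot be parsed.
    """
    if not header_value:
        return None
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        return None
    if parsed.tzinfo is None:
        # Treat a timestamp without zone information as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def should_refetch(stored_mt: Optional[datetime],
                   last_modified_header: Optional[str]) -> bool:
    """Decide whether to recrawl a URL (hypothetical helper).

    stored_mt is the modified time recorded at the previous crawl
    (timezone-aware); last_modified_header is the header value the
    server reports now.
    """
    new_mt = parse_last_modified(last_modified_header)
    if stored_mt is None or new_mt is None:
        # Without both timestamps we cannot prove the page is unchanged.
        return True
    # Recrawl only if the new MT is greater than the old one.
    return new_mt > stored_mt
```

For example, with a stored MT of 1 May 2012, a header of `Wed, 23 May 2012 12:58:00 GMT` would trigger a recrawl, while `Sun, 01 Apr 2012 00:00:00 GMT` would not. Note that this still costs one request per URL; it only avoids downloading and reparsing the page body.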