Nutch, mail # user - using less resources


Re: using less resources
remi tassing 2012-05-23, 12:58
I was wondering: how do you know whether the page has changed without
actually fetching it?

On Wednesday, May 23, 2012, wrote:

> Hello,
>
> As far as I understand, Nutch recrawls URLs once their fetch time has passed
> the current time, regardless of whether those URLs were modified or not.
> Is there any initiative to restrict recrawls to only those URLs whose
> modified time (MT) is greater than the old MT?
> In detail: if Nutch has crawled a URL with its next fetch time in 30 days,
> then on the second recrawl Nutch must visit this URL, retrieve its modified
> time, compare it with the modified time that we have in the crawldb, and
> recrawl it if the new MT is greater than the old one; otherwise skip it.
>
> Thanks.
> Alex.
>
>
>
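
One general way to learn whether a page changed without downloading its body is an HTTP conditional request: send the stored modified time in an If-Modified-Since header, and a server that supports it answers 304 Not Modified when nothing changed. The sketch below shows only the decision logic described in the quoted message. It is not Nutch code; the function names conditional_headers and should_refetch are hypothetical, and it assumes the stored MT is a timezone-aware datetime and that the server reports a usable Last-Modified value.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime
from typing import Optional


def conditional_headers(stored_mt: datetime) -> dict:
    """Build an If-Modified-Since header from the MT kept for a URL (hypothetical helper)."""
    as_utc = stored_mt.astimezone(timezone.utc)
    return {"If-Modified-Since": format_datetime(as_utc, usegmt=True)}


def should_refetch(stored_mt: datetime, status: int, last_modified: Optional[str]) -> bool:
    """Decide whether to recrawl, given the response status and Last-Modified header."""
    if status == 304:
        # The server says the page has not changed since stored_mt.
        return False
    if last_modified is None:
        # No modified time reported: we cannot prove the page is unchanged.
        return True
    try:
        new_mt = parsedate_to_datetime(last_modified)
    except (TypeError, ValueError):
        # Unparseable header: fall back to refetching.
        return True
    if new_mt.tzinfo is None:
        # Treat a naive result as UTC so the comparison below is valid.
        new_mt = new_mt.replace(tzinfo=timezone.utc)
    # Recrawl only if the new MT is greater than the old one.
    return new_mt > stored_mt
```

The headers from conditional_headers would be sent with a GET or HEAD request for the URL, and the response status and Last-Modified header passed to should_refetch. Servers that ignore If-Modified-Since or omit Last-Modified fall through to a normal refetch, so the check can only save work, never skip a page that might have changed.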