Nutch, mail # user - Getting Started with NUTCH


Getting Started with NUTCH
eliea 2012-04-24, 23:02
I have just started picking up Nutch for a project that involves building a
framework for QA teams to check site pages for broken links, expired content,
etc. I have successfully set up Nutch on Windows and run an initial crawl with
a single URL in the seed file, configured to go 3 levels deep. I have a couple
of rudimentary initial questions.

One: is there a way to launch the crawler and have it crawl a given site or
set of sites at a configurable interval? I realize that Nutch has the interval
option; however, a number of posts that I have read point to the need for a
script that triggers the crawl. Is that the only way?
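
For reference, the kind of trigger script those posts describe might look like
the sketch below. It is only a minimal Python sketch, not a recommended setup:
the command line (bin/nutch crawl with a urls seed directory, a crawl output
directory and -depth 3) is my assumption of the one-shot crawl invocation and
may differ per Nutch version and install, and run_periodically and
CRAWL_COMMAND are hypothetical names.

```python
import subprocess
import time

# Assumed command line for a one-shot crawl; adjust paths and options to the
# actual installation (on Windows this would typically run under Cygwin).
CRAWL_COMMAND = ["bin/nutch", "crawl", "urls", "-dir", "crawl", "-depth", "3"]


def run_periodically(command, interval_seconds, max_runs=None,
                     runner=subprocess.call, sleep=time.sleep):
    """Run `command` repeatedly, waiting `interval_seconds` between runs.

    `max_runs=None` means loop forever. `runner` and `sleep` are injectable
    so the loop can be exercised without starting a real crawl.
    Returns the list of exit codes, one per run.
    """
    exit_codes = []
    runs = 0
    while max_runs is None or runs < max_runs:
        exit_codes.append(runner(command))
        runs += 1
        # Do not sleep after the final run.
        if max_runs is None or runs < max_runs:
            sleep(interval_seconds)
    return exit_codes
```

Calling run_periodically(CRAWL_COMMAND, interval_seconds=6 * 60 * 60) would
start a crawl every six hours after the previous one finishes. An OS scheduler
(cron, Windows Task Scheduler) running the crawl command directly would serve
the same purpose.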
 
Two: the db_unfetched state. I don't really understand what it means. Again,
given a site that I need to crawl x levels deep, I need to provide a means of
detecting broken links. I know that db_gone is one of the statuses that I can
use to report on those links, but what about db_unfetched? A URL marked with
this status is not necessarily broken; it just has not been fetched yet. Since
detecting broken links quickly is crucial, how can I minimize the number of
db_unfetched URLs and make sure that in each cycle every URL is fetched
(successfully or not)?
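
To make the reporting side concrete, here is a rough Python sketch of how I
imagine separating db_gone from db_unfetched URLs. It assumes a plain-text
CrawlDb dump (for example from the readdb tool with its -dump option) in which
each record starts with a line of the form "<url><TAB>Version: N" followed by
a line like "Status: 1 (db_unfetched)". That layout is an assumption on my
part and should be checked against the actual dump; group_urls_by_status is a
hypothetical helper name.

```python
import re
from collections import defaultdict

# Matches lines such as "Status: 3 (db_gone)" and captures the status name.
STATUS_RE = re.compile(r"^Status:\s*\d+\s*\((\w+)\)")


def group_urls_by_status(lines):
    """Group URLs by CrawlDb status name from an assumed text dump layout."""
    groups = defaultdict(list)
    current_url = None
    for line in lines:
        line = line.rstrip("\n")
        # Assumed record header: "<url>\tVersion: N"
        if "\tVersion:" in line:
            current_url = line.split("\t", 1)[0]
            continue
        match = STATUS_RE.match(line)
        if match and current_url is not None:
            groups[match.group(1)].append(current_url)
            current_url = None  # ignore further lines until the next record
    return dict(groups)
```

With such a grouping, the db_gone list would be the broken-link report, while
the db_unfetched list would be the URLs that still need another fetch cycle
before anything can be said about them.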
 

Any advice will be greatly appreciated.

--
View this message in context: http://lucene.472066.n3.nabble.com/Getting-Started-with-NUTCH-tp3936913p3936913.html
Sent from the Nutch - User mailing list archive at Nabble.com.