Getting Started with NUTCH
eliea 2012-04-24, 23:02
I have just started picking up Nutch for a project that involves building a framework for QA teams to check site pages for broken links, expired content, etc. I have successfully set up Nutch on Windows and run an initial crawl with a single URL in the seed file. I have set the crawl to go 3 levels deep. I have a couple of initial, rudimentary questions.

One: is there a way to launch the crawler and have it crawl a given site or set of sites every so often (at a configurable interval)? I realize that Nutch has the interval option; however, a number of posts that I have read suggest that a script is needed to trigger the crawl. Is that the only way?

Two: the db_unfetched state. I don't really understand what it means. Given a site that I need to crawl x levels deep, I need to provide a means of detecting broken links. I know that db_gone is one of the statuses that I can use to report on those links, but what about db_unfetched? As I understand it, a URL marked with this status is not necessarily broken; it simply has not been fetched yet. Since the time taken to detect broken links is crucial, how can I minimize the number of db_unfetched URLs and make sure that all URLs are fetched (successfully or not) in each cycle?

Any advice will be greatly appreciated.

--
View this message in context: http://lucene.472066.n3.nabble.com/Getting-Started-with-NUTCH-tp3936913p3936913.html
Sent from the Nutch - User mailing list archive at Nabble.com.
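
To make the db_unfetched versus db_gone distinction in the question concrete, here is a minimal Python sketch that tallies status counts from CrawlDb statistics text and separates "not fetched yet" from "confirmed gone". It assumes the text comes from `bin/nutch readdb <crawldb> -stats` and that status lines look like `status 1 (db_unfetched): 40`; that line shape, the helper names `parse_status_counts` and `summarize`, and the sample numbers are all assumptions for illustration, not real crawl output.

```python
import re

# Assumed shape of the status lines in the CrawlDb statistics text; adjust the
# pattern if your Nutch version prints them differently.
STATUS_LINE = re.compile(r"status\s+\d+\s+\((db_\w+)\):\s*(\d+)")


def parse_status_counts(stats_text):
    """Return {status_name: count} for every status line found in the text."""
    return {name: int(count) for name, count in STATUS_LINE.findall(stats_text)}


def summarize(counts):
    """Split the counts into 'still pending' and 'confirmed broken' buckets."""
    return {
        "pending": counts.get("db_unfetched", 0),  # not fetched yet, not known broken
        "broken": counts.get("db_gone", 0),        # fetched and found to be gone
    }


# Hypothetical sample, not real crawl output.
SAMPLE = """\
status 1 (db_unfetched):    40
status 2 (db_fetched):      55
status 3 (db_gone): 5
"""

counts = parse_status_counts(SAMPLE)
summary = summarize(counts)
print(summary)
```

A report built this way could list only the "broken" bucket as broken links and treat a non-zero "pending" count as a sign that another fetch cycle is needed before the results are complete.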