I've been trying to get Nutch to crawl my entire site (let's call it my_domain_name.com) for a while now, but it's not working. These are my settings:
db.ignore.external.links = true
db.ignore.external.links.mode = byDomain
db.max.outlinks.per.page = -1
http, file and ftp content fetch limits = -1
http.redirect.max = 2
# skip file: ftp: and mailto: urls
# skip image and other suffixes we can't yet parse
# Accept everything else
bin/crawl -i -D solr.server.url=http://localhost:8983/solr/my_core_name urls_seed_directory/ my_crawl_name/ -1
When I run readdb, I find 29,000 pages in the db_unfetched state. I have run several crawls, but the number of unfetched documents only seems to increase.
There is no pattern as to which documents stay unfetched. Some documents of the exact same type and in the same portion of the sitemap get fetched correctly, but others don't. Some PDFs get fetched correctly, but others don't. (And it's not a size limit problem - I checked.) There's nothing in robots.txt that would disallow them from being fetched.
I took one of the PDF documents that is in the db_unfetched state and ran parsechecker on it. It parsed the contents correctly.
I looked at the crawl dump generated by readdb and couldn't find any errors or detailed information about why something wasn't fetched.
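For anyone trying to reproduce this, here is a minimal sketch of how such a dump could be tallied by status. It assumes the dump is the plain-text output of something like bin/nutch readdb my_crawl_name/crawldb -dump dump_dir, and that each record contains a line of the form "Status: 1 (db_unfetched)". The function name count_statuses and the directory name dump_dir are made up for illustration.

```shell
# Tally CrawlDb records by status from a plain-text readdb dump.
# Assumption: every record has a line like "Status: 1 (db_unfetched)".
# $1 is the dump file to read (hypothetical example: a part file in dump_dir).
count_statuses() {
  awk '/^Status:/ {
         gsub(/[()]/, "", $3)   # "(db_unfetched)" becomes "db_unfetched"
         n[$3]++                # one more record with this status
       }
       END { for (s in n) print s, n[s] }' "$1" | sort
}
```

Calling count_statuses on a part file inside dump_dir prints one line per status with its record count, which makes it easier to compare the totals between crawl rounds.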
I'm at a loss here. How can I make Nutch crawl the entire site and fetch all the pages/documents? I'm talking about a site with about 40,000 pages, not millions.