Dean Pullen 2012-02-02, 16:44
Dean Pullen 2012-02-02, 17:11
Dean Pullen 2012-02-02, 17:22
Lewis John Mcgibbney 2012-02-02, 18:01
Dean Pullen 2012-02-03, 11:06
tiagorcs 2012-02-06, 03:31
tiagorcs 2012-02-06, 04:37
Lewis John Mcgibbney 2012-02-10, 21:18
remi tassing 2012-02-14, 18:03
Lewis John Mcgibbney 2012-02-14, 18:08
tiagorcs 2012-02-15, 01:46
remi tassing 2012-02-15, 09:50
tiagorcs 2012-02-22, 01:11
Re: Failed fetching
Markus Jelsma 2012-02-02, 18:17

It is default but you override it in nutch-site. Use protocol-http if you can
and stay away from protocol-httpclient.

> What I see in logs/userlogs/myfetchjobxx/syslog is:
>
> 2012-02-02 17:15:25,045 INFO org.apache.nutch.fetcher.Fetcher: fetch of
> http://nutch.apache.org/ failed with:
> org.apache.nutch.protocol.ProtocolNotFound: protocol not found for url=http
>
> I look at the nutch-site.xml file and see:
>
> <property>
> <name>plugin.includes</name>
> <value>
> protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|pdf|rss)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|metatag|scoring-opic|urlnormalizer-(pass|regex|basic)|url-query-normalizer
> </value>
> </property>
>
> Do we have to manually add the protocol-http to it?! Surely this should
> be there by default?
>
> Dean.
>
> On 02/02/2012 17:11, Dean Pullen wrote:
> > I've added:
> >
> > <property>
> > <name>http.verbose</name>
> > <value>true</value>
> > <description>If true, HTTP will log more verbosely.</description>
> > </property>
> > <property>
> > <name>fetcher.verbose</name>
> > <value>true</value>
> > <description>If true, fetcher will log more verbosely.</description>
> > </property>
> >
> >
> > To the nutch-site.xml in an attempt for more info....
> >
> > On 02/02/2012 16:44, Dean Pullen wrote:
> >> Hi all,
> >>
> >> I'm trying to fetch from http://nutch.apache.org
> >>
> >> But after fetching, parsing, and updating the DB I examine the DB for
> >> 'http://nutch.apache.org/' (oddly I must include the last slash) and
> >> get:
> >>
> >> /URL: http://nutch.apache.org/
> >> Version: 7
> >> Status: 1 (*db_unfetched*)
> >> Fetch time: Fri Feb 03 16:33:13 GMT 2012
> >> Modified time: Thu Jan 01 01:00:00 GMT 1970
> >> Retries since fetch: 1
> >> Retry interval: 2592000 seconds (30 days)
> >> Score: 500.0
> >> Signature: null
> >> Metadata: _pst_: *failed*(2), lastModified=0/
> >>
> >> Why is the fetch failing and how can I show more nutch logging so as
> >> to view the failure attempt/message?
> >> Nothing is seen in my access logs when I try to crawl my own external
> >> site.
> >>
> >> To ensure all URLs are permitted I've changed the regex-urlfilter.txt
> >> to:
> >>
> >> /# accept anything else
> >> +./
> >>
> >> This has been puzzling me all day, I'm hoping someone can help!
> >>
> >> Dean.
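
For anyone hitting the same ProtocolNotFound error, the advice in this reply could be applied roughly as follows. This is only a sketch: it assumes the plugin.includes value quoted from Dean's nutch-site.xml, with protocol-httpclient swapped for protocol-http, and it has not been checked against any particular Nutch release. The remaining plugin names are copied from the quoted config unchanged, so some of them may not exist in a given install.

```xml
<!-- nutch-site.xml fragment (sketch): overrides the plugin.includes default -->
<property>
  <name>plugin.includes</name>
  <!-- Assumption: Dean's quoted list with protocol-httpclient replaced by
       protocol-http, as suggested in the reply. Keep the value on one line;
       a wrapped or broken value can stop a plugin name from matching. -->
  <value>protocol-http|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|pdf|rss)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|metatag|scoring-opic|urlnormalizer-(pass|regex|basic)|url-query-normalizer</value>
</property>
```

Removing the plugin.includes override from nutch-site.xml altogether should also work if the default list is sufficient, since the reply states that protocol-http is in the default.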
tiagorcs 2012-02-03, 10:01
tiagorcs 2012-02-03, 10:06
Lewis John Mcgibbney 2012-02-03, 10:11
tiagorcs 2012-02-03, 10:22
Markus Jelsma 2012-02-03, 10:22
tiagorcs 2012-02-03, 10:48
Markus Jelsma 2012-02-03, 10:49
tiagorcs 2012-02-03, 10:57
Markus Jelsma 2012-02-03, 11:02