Nutch, mail # user - Failed fetching


Re: Failed fetching
Lewis John Mcgibbney 2012-02-02, 18:01
Looks like you're using an old version of Nutch here.

Please try upgrading to 1.4, Dean.

hth

On Thu, Feb 2, 2012 at 5:22 PM, Dean Pullen <[EMAIL PROTECTED]> wrote:

> What I see in logs/userlogs/myfetchjobxx/syslog is:
>
> 2012-02-02 17:15:25,045 INFO org.apache.nutch.fetcher.Fetcher: fetch of
> http://nutch.apache.org/ failed with: org.apache.nutch.protocol.ProtocolNotFound:
> protocol not found for url=http
>
> I look at the nutch-site.xml file and see:
>
> <property>
> <name>plugin.includes</name>
> <value>
>            protocol-httpclient|urlfilter-regex|parse-(text|html|js|msexcel|mspowerpoint|msword|pdf|rss)|index-(basic|anchor|more)|query-(basic|site|url)|response-(json|xml)|summary-basic|metatag|scoring-opic|urlnormalizer-(pass|regex|basic)|url-query-normalizer
> </value>
> </property>
>
> Do we have to manually add the protocol-http to it?! Surely this should be
> there by default?
>
> Dean.
>
>
> On 02/02/2012 17:11, Dean Pullen wrote:
>
>> I've added:
>>
>> <property>
>> <name>http.verbose</name>
>> <value>true</value>
>> <description>If true, HTTP will log more verbosely.</description>
>> </property>
>> <property>
>> <name>fetcher.verbose</name>
>> <value>true</value>
>> <description>If true, fetcher will log more verbosely.</description>
>> </property>
>>
>>
>> To the nutch-site.xml in an attempt for more info....
>>
>> On 02/02/2012 16:44, Dean Pullen wrote:
>>
>>> Hi all,
>>>
>>> I'm trying to fetch from http://nutch.apache.org
>>>
>>> But after fetching, parsing, and updating the DB I examine the DB for '
>>> http://nutch.apache.org/' (oddly I must include the last slash) and get:
>>>
>>> URL: http://nutch.apache.org/
>>> Version: 7
>>> Status: 1 (db_unfetched)
>>> Fetch time: Fri Feb 03 16:33:13 GMT 2012
>>> Modified time: Thu Jan 01 01:00:00 GMT 1970
>>> Retries since fetch: 1
>>> Retry interval: 2592000 seconds (30 days)
>>> Score: 500.0
>>> Signature: null
>>> Metadata: _pst_: failed(2), lastModified=0
>>>
>>> Why is the fetch failing and how can I show more nutch logging so as to
>>> view the failure attempt/message?
>>> Nothing is seen in my access logs when I try to crawl my own external
>>> site.
>>>
>>> To ensure all URLs are permitted I've changed the regex-urlfilter.txt to:
>>>
>>> # accept anything else
>>> +.
>>>
>>> This has been puzzling me all day, I'm hoping someone can help!
>>>
>>> Dean.
>>>
>>>
>>
>
--
*Lewis*
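
On the question of whether protocol-http has to be listed by hand: plugin.includes is a regular expression over plugin ids, so one quick way to see what the quoted value admits is to test ids against it. The sketch below is only an illustration in Python, not Nutch code. It assumes a plugin is enabled only when its whole id matches the expression, and the helper name is_included is made up for this example. It also checks the case where the value keeps the line breaks and indentation seen inside <value>; whether Nutch trims that whitespace is not verified here.

```python
import re

# plugin.includes value as quoted in the thread, joined onto one line.
PLUGIN_INCLUDES = (
    "protocol-httpclient|urlfilter-regex|"
    "parse-(text|html|js|msexcel|mspowerpoint|msword|pdf|rss)|"
    "index-(basic|anchor|more)|query-(basic|site|url)|"
    "response-(json|xml)|summary-basic|metatag|scoring-opic|"
    "urlnormalizer-(pass|regex|basic)|url-query-normalizer"
)


def is_included(plugin_id, includes=PLUGIN_INCLUDES):
    """Hypothetical helper: True if the whole plugin id matches the regex.

    Assumption: the plugin loader requires a full match, not a substring match.
    """
    return re.fullmatch(includes, plugin_id) is not None


# Same value, but with the leading/trailing whitespace left in place
# (assumption: the configuration value is used untrimmed).
PADDED_INCLUDES = "\n           " + PLUGIN_INCLUDES + "\n"

if __name__ == "__main__":
    for plugin_id in ("protocol-httpclient", "protocol-http", "urlfilter-regex"):
        print(plugin_id,
              is_included(plugin_id),
              is_included(plugin_id, PADDED_INCLUDES))
```

Under these assumptions the quoted value admits protocol-httpclient but not protocol-http, and with the padding left in only the first and last alternatives stop matching, which would drop protocol-httpclient while leaving urlfilter-regex enabled. That is a possible explanation to check, not a confirmed diagnosis.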