Results 61 to 70 of 155 (0.114s).
Re: can nutch output xml? - Nutch - [mail # user]
...Hi Mike,  afaik, it can't. But it would be really useful for archiving, post-processing, data mining, etc. Have a look at NUTCH-1047 and NUTCH-1088. Currently, you would need to write a...
   Author: Sebastian Nagel, 2012-10-24, 19:26
Re: Same pages crawled more than once and slow crawling - Nutch - [mail # user]
...There is a difference whether you run Nutch 1.x from the bin or src package: the former does not contain a runtime/local folder.  I'll add a section how to compile and run Nutch from th...
   Author: Sebastian Nagel, 2012-10-24, 16:14
Re: Same pages crawled more than once and slow crawling - Nutch - [mail # user]
...Hi Luca,  Um... I failed to reproduce Pierre's problem with - a simpler configuration - HBase as back-end (Pierre and Luca both use MySQL).  But after the 5th cycle the crawler ...
   Author: Sebastian Nagel, 2012-10-18, 19:07
Re: same page fetched severals times in one crawl - Nutch - [mail # user]
...It's not the crawl command alone. It worked for me. Can you try with a minimal nutch-site.xml?  Have a look at the patches of NUTCH-1087: there is also a patch for 2.x (but see Julien's ...
   Author: Sebastian Nagel, 2012-10-16, 07:09
Re: same page fetched severals times in one crawl - Nutch - [mail # user]
...Hi Pierre,  I tried almost the same with just the default settings (only the http-agent is set in nutch-site.xml: it's not Googlebot :-O). All went ok, no documents were crawled twice. ...
   Author: Sebastian Nagel, 2012-10-15, 20:11
[NUTCH-1476] SegmentReader getStats should set parsed = -1 if no parsing took place - Nutch - [issue]
...The method getStats in SegmentReader sets the number of parsed documents (and also the number of parseErrors) to 0 if no parsing took place for a segment. The values should be set to -1 anal...
http://issues.apache.org/jira/browse/NUTCH-1476    Author: Sebastian Nagel, 2012-10-12, 05:39
[NUTCH-1383] IndexingFiltersChecker to show error message instead of null pointer exception - Nutch - [issue]
...IndexingFiltersChecker may throw null pointer exceptions if content returned by protocol implementation is null (artifact of NUTCH-1293) if one of the indexing filters sets doc to null (the ...
http://issues.apache.org/jira/browse/NUTCH-1383    Author: Sebastian Nagel, 2012-10-12, 05:39
[NUTCH-1252] SegmentReader -get shows wrong data - Nutch - [issue]
...The command/option -get of the SegmentReader may show wrong data associated with the given URL. To reproduce: % mkdir -p test_readseg/urls % echo -e "http://nutch.apache.org/\ttest=ApacheNutch...
http://issues.apache.org/jira/browse/NUTCH-1252    Author: Sebastian Nagel, 2012-10-12, 05:39
Re: Error parsing html - Nutch - [mail # user]
...It's possible though it's hard.  See http://wiki.apache.org/nutch/RunNutchInEclipse#Debugging_and_Timeouts (default timeout is 30 sec., you cannot seriously debug within this time) ...
   Author: Sebastian Nagel, 2012-10-09, 22:30
Re: detecting robots.txt aborts - Nutch - [mail # user]
...Hi Stefan,  there is a protocol status code ROBOTS_DENIED. It is stored in the CrawlDb (1.x) or WebTable (2.x). See "nutch readdb" for how to query the table for a single URL.  Seb...
   Author: Sebastian Nagel, 2012-10-05, 21:05