Results 61 to 70 of 155 (0.114s).

Re: can nutch output xml? - Nutch - [mail # user]
...Hi Mike, afaik, it can't. But it would be really useful for archiving, post-processing, data mining, etc. Have a look at NUTCH-1047 and NUTCH-1088. Currently, you would need to write a...
Author: Sebastian Nagel, 2012-10-24, 19:26

Re: Same pages crawled more than once and slow crawling - Nutch - [mail # user]
...There is a difference whether you run Nutch 1.x from the bin or src package: the former does not contain a runtime/local folder. I'll add a section how to compile and run Nutch from th...
Author: Sebastian Nagel, 2012-10-24, 16:14

Re: Same pages crawled more than once and slow crawling - Nutch - [mail # user]
...Hi Luca, Um... I failed to reproduce the Pierre's problem with - a simpler configuration - HBase as back-end (Pierre and Luca both use mysql) But after the 5th cycle the crawler ...
Author: Sebastian Nagel, 2012-10-18, 19:07

Re: same page fetched severals times in one crawl - Nutch - [mail # user]
...It's not the crawl command alone. It worked for me. Can you try with a minimal nutch-site.xml? Have a look at the patches of NUTCH-1087 there is also a patch for 2.x (but see Julien's ...
Author: Sebastian Nagel, 2012-10-16, 07:09

Re: same page fetched severals times in one crawl - Nutch - [mail # user]
...Hi Pierre, I tried almost the same with just the default settings (only the http-agent is set in nutch-site.xml: it's not Googlebot :-O). All went ok, no documents were crawled twice. ...
Author: Sebastian Nagel, 2012-10-15, 20:11

[NUTCH-1476] SegmentReader getStats should set parsed = -1 if no parsing took place - Nutch - [issue]
...The method getStats in SegmentReader sets the number of parsed documents (and also the number of parseErrors) to 0 if no parsing took place for a segment. The values should be set to -1 anal...
http://issues.apache.org/jira/browse/NUTCH-1476
Author: Sebastian Nagel, 2012-10-12, 05:39

[NUTCH-1383] IndexingFiltersChecker to show error message instead of null pointer exception - Nutch - [issue]
...IndexingFiltersChecker may throw null pointer exceptions if content returned by protocol implementation is null (artifact of NUTCH-1293) if one of the indexing filters sets doc to null (the ...
http://issues.apache.org/jira/browse/NUTCH-1383
Author: Sebastian Nagel, 2012-10-12, 05:39

[NUTCH-1252] SegmentReader -get shows wrong data - Nutch - [issue]
...The command/option -get of the SegmentReader may show wrong data associated with the given URL. To reproduce: % mkdir -p test_readseg/urls % echo -e "http://nutch.apache.org/\ttest=ApacheNutch...
http://issues.apache.org/jira/browse/NUTCH-1252
Author: Sebastian Nagel, 2012-10-12, 05:39

Re: Error parsing html - Nutch - [mail # user]
...It's possible though it's hard. See http://wiki.apache.org/nutch/RunNutchInEclipse#Debugging_and_Timeouts (default timeout is 30 sec., you cannot seriously debug within this time) ...
Author: Sebastian Nagel, 2012-10-09, 22:30

Re: detecting robots.txt aborts - Nutch - [mail # user]
...Hi Stefan, there is a protocol status code ROBOTS_DENIED. It is stored in the CrawlDb (1.x) or WebTable (2.x). See "nutch readdb" for how to query the table for a single URL. Seb...
Author: Sebastian Nagel, 2012-10-05, 21:05