Results 1 to 10 of 1767 (0.152s).
RE: rewriting urls that are index - Nutch - [mail # user]
...Hi, the 1.x indexer takes a -normalize parameter, and there you can rewrite your URLs. Judging from your patterns, the RegexURLNormalizer should be sufficient. Make sure you use the con...
   Author: Markus Jelsma, 2013-04-22, 15:56
RE: Period-terminated hostnames - Nutch - [mail # user]
...Rodney, those are valid URLs but you clearly don't need them. You can either use filters to get rid of them or normalize them away. Use the org.apache.nutch.net.URLNormalizerChecker o...
   Author: Markus Jelsma, 2013-04-18, 21:26
[NUTCH-1038] Port IndexingFiltersChecker to 2.0 - Nutch - [issue]
http://issues.apache.org/jira/browse/NUTCH-1038    Author: Markus Jelsma, 2013-03-27, 20:27
RE: Does Nutch Checks Whether A Page crawled before or not - Nutch - [mail # user]
...The CrawlDB contains information on all URLs and their status, e.g. which HTTP code they returned, the fetch interval, some metadata and their fetch time. Use the readdb command to inspect a specifi...
   Author: Markus Jelsma, 2013-03-20, 22:54
RE: How to Continue to Crawl with Nutch Even An Error Occurs? - Nutch - [mail # user]
...If Nutch exits with an error then the segment is bad; a failing thread is not an error that leads to a failed segment. This means the segment is properly fetched but just that some records ...
   Author: Markus Jelsma, 2013-03-20, 22:53
RE: Does Nutch Checks Whether A Page crawled before or not - Nutch - [mail # user]
...Nutch selects records that are eligible for fetch. It's either due to a transient failure or because the fetch interval has expired. This means that failed fetches due to network issues are ...
   Author: Markus Jelsma, 2013-03-20, 22:49
RE: [WELCOME] Feng Lu as Apache Nutch PMC and Committer - Nutch - [mail # user]
...Feng Lu, welcome! :)      ...
   Author: Markus Jelsma, 2013-03-18, 22:07
[NUTCH-961] Expose Tika's boilerpipe support - Nutch - [issue]
...Tika 0.8 comes with the Boilerpipe content handler which can be used to extract boilerplate content from HTML pages. We should see how we can expose Boilerpipe in the Nutch configuration....
http://issues.apache.org/jira/browse/NUTCH-961    Author: Markus Jelsma, 2013-03-07, 04:46
RE: keep all pages from a domain in one slice - Nutch - [mail # user]
...Hi  You can't do this with -slice but you can merge segments and filter them. This would mean you'd have to merge the segments for each domain. But that's far too much work. Why do you ...
   Author: Markus Jelsma, 2013-03-05, 22:02
RE: [DISCUSS] Google Summer of Code - Nutch - [mail # dev]
...Ah yes! Please open an issue and, if you can, attach anything that matters, such as a description of the algorithm, how it should work with Nutch/MapReduce or even code/tests.  If there's ...
   Author: Markus Jelsma, 2013-03-04, 21:43