[NUTCH-2609] urlnormalizer-basic to normalize path of file: URLs
...% echo "file:/var/www/html/foo/../bar/index.html" \ | nutch normalizerchecker -normalizer urlnormalizer-basic -stdin Checking combination of these URLNormalizers: BasicURLNormalizer file:/va...
, 2018-06-22, 13:49
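To illustrate what normalizing the path of a file: URL means, here is a minimal Python sketch. It is not Nutch's BasicURLNormalizer; the function name `normalize_file_url` is hypothetical, and the sketch assumes that query strings and fragments can be dropped and that a trailing slash need not be preserved.

```python
import posixpath
from urllib.parse import urlsplit


def normalize_file_url(url: str) -> str:
    """Collapse '.' and '..' segments in the path of a file: URL (illustrative only)."""
    parts = urlsplit(url)
    if parts.scheme != "file":
        # Other schemes are left untouched in this sketch.
        return url
    # posixpath.normpath resolves 'foo/../bar' to 'bar' and removes '.' segments.
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # Re-attach a host only if the input had one (e.g. file://host/path).
    authority = f"//{parts.netloc}" if parts.netloc else ""
    return f"file:{authority}{path}"


if __name__ == "__main__":
    print(normalize_file_url("file:/var/www/html/foo/../bar/index.html"))
```

Applied to the URL from the issue snippet, the sketch resolves `foo/../bar` to `bar`; the actual output of the Nutch tool is truncated in the snippet above and is not reproduced here.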
[NUTCH-2547] urlnormalizer-basic fails on special characters in path/query
...If a URL contains one of the characters |"<>^` or a single % (not followed by a 2-character hex-value), BasicURLNormalizer fails to normalize the URL path (here: remove /c/..):% for c...
, 2018-06-22, 13:48
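The characters named in this issue can be detected with a short check before a URL is handed to a normalizer. The following Python sketch is an assumption-laden illustration, not code from Nutch: the names `PROBLEM_CHARS` and `find_problem_chars` are hypothetical, and the pattern covers only the characters listed in the snippet above.

```python
import re

# One of | " < > ^ ` or a '%' that is not followed by two hex digits.
PROBLEM_CHARS = re.compile(r'[|"<>^`]|%(?![0-9A-Fa-f]{2})')


def find_problem_chars(url: str) -> list[tuple[int, str]]:
    """Return (position, character) pairs for each problematic character in url."""
    return [(m.start(), m.group()) for m in PROBLEM_CHARS.finditer(url)]


if __name__ == "__main__":
    for candidate in ("http://example.com/a%20b", "http://example.com/a/c/../b?x=50%"):
        print(candidate, find_problem_chars(candidate))
```

A valid percent-escape such as `%20` is not reported, while a lone `%` is. How such characters should then be handled (escaped, rejected, or passed through) is a separate decision that the snippet does not settle.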
[NUTCH-2576] HTTP protocol plugin based on okhttp
...Okhttp is an Apache2-licensed http library which supports HTTP/2. Julien Nioche's implementation storm-crawler#443 proves that it should be straightforward to implement a Nutch protocol plug...
, 2018-06-21, 12:16
[NUTCH-2608] Reduce size of Nutch job file and package
...The Nutch 1.15 binary package and the Nutch job file will reach or even exceed 300 MB. A huge job file isn't ideal as it needs to be distributed in the Hadoop cluster. There are several reas...
, 2018-06-21, 09:56
[NUTCH-2607] ParserChecker should call ScoringFilters.passScoreAfterParsing() on all parses
...A ParseResult may contain multiple parses, e.g., the feed parser adds one for every item in the RSS/Atom feed. The tool ParseSegment calls the method ScoringFilters.passScoreAfterParsing() f...
, 2018-06-21, 09:28
[ANNOUNCE] New Nutch committer and PMC - Omkar Reddy
- [mail # user]
...Dear all, it is my pleasure to announce that Omkar Reddy has joined us as a committer and member of the Nutch PMC. Omkar has worked on upgrading Nutch to use the new MapReduce API as part of hi...
, 2018-06-21, 08:18
[NUTCH-2606] MIME detection is wrong for plain-text documents send as Content-Type "application/msword"
...Plain-text documents sent as Content-Type "application/msword" are parsed as if they were Word documents. The MIME detection should be fixed, so that these are correctly identified as plain-text ...
, 2018-06-20, 16:38
[NUTCH-2578] Avoid lock by MimeUtil in constructor of protocol.Content
...The constructor of the class o.a.n.protocol.Content instantiates a new MimeUtil object. That's not cheap as it always creates a new Tika object and there is a lock on the job/jar file when c...
, 2018-06-18, 16:50
- [mail # user]
...Hi Michael, on the Common Crawl Nutch fork there is a plugin "fast-urlfilter" which does this, see https://github.com/commoncrawl/nutch/blob/cc-fast-url-filter/src/plugin/urlfilter-fast/src/ja...
, 2018-06-17, 20:30
[NUTCH-2598] URLNormalizerChecker fails on invalid URLs in input
...I use the URLNormalizerChecker (urlnormalizer-regex and urlnormalizer-basic) to normalize URLs before further processing them. If one of the used normalizers throws a MalformedURLException w...
, 2018-06-13, 16:18