|
|
-
2 things I noticed that I will file JIRA issues + fixMattmann, Chris A 2011-11-24, 19:31
...after I get back from Thanksgiving dinner :-)
1. In URLFilterChecker, the cmd line tool requires URLs to be fed into it on STDIN, but that isn't documented anywhere, even in the tool help printed to STDOUT. I'll fix that. 2. In ParseOutputFormat, I see a code block: {code} // collect outlinks for subsequent db update Outlink[] links = parseData.getOutlinks(); int outlinksToStore = Math.min(maxOutlinks, links.length); if (ignoreExternalLinks) { try { fromHost = new URL(fromUrl).getHost().toLowerCase(); } catch (MalformedURLException e) { fromHost = null; } } else { fromHost = null; } {code} The if(ignoreExternalLinks) part then gets subsequently set and reset in the ensuing for loop: {code} int validCount = 0; CrawlDatum adjust = null; List<Entry<Text, CrawlDatum>> targets = new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore); List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore); for (int i = 0; i < links.length && validCount < outlinksToStore; i++) { String toUrl = links[i].getToUrl(); // ignore links to self (or anchors within the page) if (fromUrl.equals(toUrl)) { continue; } if (ignoreExternalLinks) { try { toHost = new URL(toUrl).getHost().toLowerCase(); } catch (MalformedURLException e) { toHost = null; } if (toHost == null || !toHost.equals(fromHost)) { // external links continue; // skip it } } {code} So, what's the point of that initial if(...) block outside of the for loop. Isn't it redundant? If so, I'll file an issue and fix that. Cheers, Chris P.S. Happy Thanksgiving to Nutch'ers in the US! ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Chris Mattmann, Ph.D. Senior Computer Scientist NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA Office: 171-266B, Mailstop: 171-246 Email: [EMAIL PROTECTED] WWW: http://sunset.usc.edu/~mattmann/ ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ Adjunct Assistant Professor, Computer Science Department University of Southern California, Los Angeles, CA 90089 USA ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++ |