Nutch, mail # dev - 2 things I noticed that I will file JIRA issues + fix


2 things I noticed that I will file JIRA issues + fix
Mattmann, Chris A 2011-11-24, 19:31
...after I get back from Thanksgiving dinner :-)

1. In URLFilterChecker, the command-line tool requires URLs to be fed into it on STDIN, but
that isn't documented anywhere, not even in the usage help the tool prints to STDOUT. I'll fix that.

2. In ParseOutputFormat, I see a code block:

{code}
          // collect outlinks for subsequent db update
          Outlink[] links = parseData.getOutlinks();
          int outlinksToStore = Math.min(maxOutlinks, links.length);
          if (ignoreExternalLinks) {
            try {
              fromHost = new URL(fromUrl).getHost().toLowerCase();
            } catch (MalformedURLException e) {
              fromHost = null;
            }
          } else {
            fromHost = null;
          }
{code}

The same if (ignoreExternalLinks) check, with a host variable being set and
reset, then shows up again in the ensuing for loop:

{code}
          int validCount = 0;
          CrawlDatum adjust = null;
          List<Entry<Text, CrawlDatum>> targets = new ArrayList<Entry<Text, CrawlDatum>>(outlinksToStore);
          List<Outlink> outlinkList = new ArrayList<Outlink>(outlinksToStore);
          for (int i = 0; i < links.length && validCount < outlinksToStore; i++) {
            String toUrl = links[i].getToUrl();
            // ignore links to self (or anchors within the page)
            if (fromUrl.equals(toUrl)) {
              continue;
            }
            if (ignoreExternalLinks) {
              try {
                toHost = new URL(toUrl).getHost().toLowerCase();
              } catch (MalformedURLException e) {
                toHost = null;
              }
              if (toHost == null || !toHost.equals(fromHost)) { // external links
                continue; // skip it
              }
            }
{code}

So, what's the point of that initial if (...) block outside of the for loop? Isn't it
redundant?

If so, I'll file an issue and fix that.
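
To make the question easier to check, here is a minimal standalone sketch of the two blocks put together, stripped of the Nutch types. The class name OutlinkHostFilterSketch and the helpers hostOf and keptOutlinks are made up for illustration; only the host-comparison logic mirrors the snippets above, and the maxOutlinks cap is left out. In this sketch the block before the loop assigns fromHost once and the block inside the loop assigns toHost per outlink, so whether the outer block is really redundant comes down to whether fromHost gets a value anywhere else.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the logic quoted above; this is not Nutch code.
class OutlinkHostFilterSketch {

  // Lower-cased host of a URL, or null if the URL cannot be parsed.
  static String hostOf(String url) {
    try {
      return new URL(url).getHost().toLowerCase();
    } catch (MalformedURLException e) {
      return null;
    }
  }

  // Returns the outlinks that survive the self-link and external-link checks.
  static List<String> keptOutlinks(String fromUrl, List<String> toUrls,
                                   boolean ignoreExternalLinks) {
    // Block before the loop: the source host is resolved once.
    String fromHost = ignoreExternalLinks ? hostOf(fromUrl) : null;

    List<String> kept = new ArrayList<>();
    for (String toUrl : toUrls) {
      // ignore links to self (or anchors within the page)
      if (fromUrl.equals(toUrl)) {
        continue;
      }
      if (ignoreExternalLinks) {
        // Block inside the loop: the target host is resolved per outlink.
        String toHost = hostOf(toUrl);
        if (toHost == null || !toHost.equals(fromHost)) { // external link
          continue; // skip it
        }
      }
      kept.add(toUrl);
    }
    return kept;
  }

  public static void main(String[] args) {
    List<String> links = Arrays.asList(
        "http://example.com/a",        // link to self
        "http://example.com/b",        // same host
        "http://other.example.org/c",  // different host
        "not a url");                  // unparseable
    System.out.println(keptOutlinks("http://example.com/a", links, true));
    System.out.println(keptOutlinks("http://example.com/a", links, false));
  }
}
```

With ignoreExternalLinks set to true, only the same-host link is kept; with it set to false, everything except the self link passes through, including the unparseable entry, because no host is ever looked up.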

Cheers,
Chris

P.S. Happy Thanksgiving to Nutch'ers in the US!
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [EMAIL PROTECTED]
WWW:   http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++