Re: Parser choking on irregular url
Lewis John Mcgibbney 2012-06-22, 14:50
On Fri, Jun 22, 2012 at 12:17 AM, Markus Jelsma
<[EMAIL PROTECTED]> wrote:
> Hi Lewis,
> You got fooled by the ampersand operator in Unix shells, which sends a command to the background. The integers are the Unix process IDs of the commands you have given.
> $ a&b&c is not one but three commands, with a and b sent to the background. Your shell prints the process ID when a command is sent to the background, and a "Done" line once that command has finished.
> Enclose your URL in quotes and you are safe.
> -----Original message-----
>> From:Lewis John Mcgibbney <[EMAIL PROTECTED]>
>> Sent: Fri 22-Jun-2012 00:36
>> To: [EMAIL PROTECTED]
>> Subject: Parser choking on irregular url
>> Something that turned up on another list (first link below) was a scenario where
>> the following URL (second link below) was being fetched for processing.
>> Having tried fetching and parsing the URL unsuccessfully outside of
>> Nutch I decided to try the parsechecker with the following output.
>> More comments below the output...
>> ./bin/nutch parsechecker
>>  3086
>>  3087
>>  3088
>>  Done ./bin/nutch parsechecker
>> lewismc@lewismc-HP-Mini-110-3100:~/ASF/trunk/runtime/local$ fetching:
>> parsing: http://en.wikipedia.org/w/api.php?action=query
>> contentType: text/html
>> signature: e29908847945e7dc482c2f6d6129a11c
>> Version: 5
>> Status: success(1,0)
>> Title: MediaWiki API Result
>> Outlinks: 2
>> outlink: toUrl: https://www.mediawiki.org/wiki/API anchor: complete
>> outlink: toUrl: http://en.wikipedia.org/w/api.php anchor: API help
>> Content Metadata: Vary=Accept-Encoding,X-Forwarded-Proto Date=Thu, 21
>> Jun 2012 22:14:21 GMT Content-Length=427 Content-Encoding=gzip
>> Connection=close X-Cache-Lookup=MISS from
>> amssq38.esams.wikimedia.org:80 Content-Type=text/html; charset=utf-8
>> X-Cache=MISS from amssq38.esams.wikimedia.org Server=Apache
>> Cache-Control=private X-Content-Type-Options=nosniff
>> Parse Metadata: CharEncodingForConversion=utf-8 OriginalCharEncoding=utf-8
>> 1) What do the integers printed before the parsechecker output represent?
>> 2) After encountering the first ampersand the URL seems to be
>> truncated. Is this normalization or something else? My urlfilter regex
>> is default.
>> 3) The parser chokes and doesn't finish its job.
>> Any ideas about how these urls should be dealt with, or of course what
>> suggestions there may be to prevent the parser from freezing on us?
>> Thanks in advance.
>>  http://mail-archives.apache.org/mod_mbox/incubator-any23-dev/201206.mbox/%3CCAPeLbhNzuepW90V33TLvZ4n-eWRrHUspACbm3qK34wsTY6xTxQ%40mail.gmail.com%3E
>>  http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning
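
The quoting advice above can be checked without Nutch. The following is a minimal sketch, assuming a POSIX sh; show_arg is a hypothetical stand-in for any command that takes a URL as its first argument, and the inner "sh -c" call only serves to reproduce what happens when the URL is typed unquoted on a command line.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): prints the first argument it
# receives, standing in for a command that expects a URL argument.
show_arg() {
    printf '%s\n' "$1"
}

url='http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning'

# Quoted: the whole URL reaches the command as a single argument.
quoted=$(show_arg "$url")
printf 'quoted:   %s\n' "$quoted"

# Unquoted: the shell splits the line at each '&'. The command before the
# first '&' runs in the background with a truncated URL; the remaining
# pieces (list=search, srwhat=text, srsearch=meaning) are parsed as
# separate variable assignments, not as part of the URL.
unquoted=$(sh -c 'printf "%s\n" http://en.wikipedia.org/w/api.php?action=query&list=search&srwhat=text&srsearch=meaning; wait')
printf 'unquoted: %s\n' "$unquoted"
```

The unquoted case passes only the part of the URL before the first ampersand, which matches the truncated "parsing:" line in the parsechecker output quoted above. In an interactive shell the backgrounded pieces also produce the job and process ID lines; wrapping the URL in single or double quotes when calling ./bin/nutch parsechecker avoids both effects.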