Nutch, mail # user - RSS parser


Re: RSS parser
Sebastian Nagel 2012-05-24, 19:28
(it's too late I know)

Have you checked the property http.content.limit?
The default is only 64 kB, and RSS feeds are often larger.
It looks like the content was truncated:

 > Caused by: com.sun.syndication.io.ParsingFeedException:
 > Invalid XML: Error on line 300: XML document
 > structures must start and end within the same entity.
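
If truncation is indeed the cause, raising the limit in nutch-site.xml
should help. A minimal sketch, assuming the feed is fetched over HTTP and
that unlimited content size is acceptable for your crawl (a negative value
disables truncation; a byte count comfortably above the feed size would
work as well):

```xml
<!-- Goes inside the existing <configuration> element of nutch-site.xml. -->
<property>
  <name>http.content.limit</name>
  <!-- Limit in bytes; the default is 65536. A negative value means no truncation. -->
  <value>-1</value>
</property>
```

The segment would have to be fetched again afterwards, since the content
already stored in it is cut off at the old limit.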

On 02/10/2012 01:24 PM, Michael Kazekin wrote:
> On 02/08/2012 06:44 PM, dspathis wrote:
>>> http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Frss.sciam.com%2Fsciam%2Fearth-and-environment
>>>
>> Hmmm. I just tried the URL you provided with my own Nutch 1.4 installation.
>> It gets parsed successfully *both* with the feed and with the tika parser (I
>> modified my config first to use the former, then to use the latter).
>>
>> I think your config might still have issues. Maybe you could turn on TRACE
>> level logging to see if you can get some clues that way?
>
> 1) I installed Nutch 1.4 from scratch,
>
> 2) changed nutch-site.xml from empty to:
>
> <configuration>
>
> <property>
> <name>http.agent.name</name>
> <value>Test Nutch Agent (http://www.nutch.org/docs/en/bot.html)</value>
> </property>
>
> <property>
> <name>http.robots.agents</name>
> <value>Test Nutch Agent (http://www.nutch.org/docs/en/bot.html),*</value>
> </property>
>
> </configuration>
>
> 3) commented out the feed plugin (in parse-plugins.xml)
>
> <mimeType name="application/rss+xml">
> <plugin id="parse-tika" />
> <!--<plugin id="feed" />-->
> </mimeType>
>
> 4) Changed log level in log4j.properties
>
> log4j.logger.org.apache.nutch.fetcher.Fetcher=TRACE,cmdstdout
> log4j.logger.org.apache.nutch.parse.ParseSegment=TRACE,cmdstdout
>
>
> Then I injected the RSS link as the only URL, generated and fetched, and got this exception with Tika:
>
>
> 2012-02-10 15:06:32,782 ERROR tika.TikaParser - Error parsing
> http://rss.sciam.com/sciam/earth-and-environment
> org.apache.tika.exception.TikaException: RSS parse error
> at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:106)
> at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
> at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
> at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
> at java.util.concurrent.FutureTask.run(FutureTask.java:138)
> at java.lang.Thread.run(Thread.java:662)
> Caused by: com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 300: XML document
> structures must start and end within the same entity.
> at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:207)
> at com.sun.syndication.io.SyndFeedInput.build(SyndFeedInput.java:135)
> at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:68)
> ... 6 more
> Caused by: org.jdom.input.JDOMParseException: Error on line 300: XML document structures must start
> and end within the same entity.
> at org.jdom.input.SAXBuilder.build(SAXBuilder.java:468)
> at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:203)
> ... 8 more
> Caused by: org.xml.sax.SAXParseException: XML document structures must start and end within the same
> entity.
> at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
> at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
> at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
> at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.endEntity(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentScannerImpl.endEntity(Unknown Source)
> at org.apache.xerces.impl.XMLEntityManager.endEntity(Unknown Source)
> at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
> at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
> at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)