Re: RSS parser
Sebastian Nagel, 2012-05-24, 19:28
(It's too late, I know.)
Have you checked the property http.content.limit (default is only 64kB, RSS feeds are often larger). Looks like the content is truncated:

> Caused by: com.sun.syndication.io.ParsingFeedException:
> Invalid XML: Error on line 300: XML document
> structures must start and end within the same entity.

On 02/10/2012 01:24 PM, Michael Kazekin wrote:
> On 02/08/2012 06:44 PM, dspathis wrote:
>>> http://validator.w3.org/feed/check.cgi?url=http%3A%2F%2Frss.sciam.com%2Fsciam%2Fearth-and-environment
>>>
>> Hmmm. I just tried the URL you provided with my own Nutch 1.4 installation.
>> It gets parsed successfully *both* with the feed and with the tika parser (I
>> modified my config first to use the former, then to use the latter).
>>
>> I think your config might still have issues. Maybe you could turn on TRACE
>> level logging to see if you can get some clues that way?
>
> 1) I installed Nutch 1.4 from scratch,
>
> 2) changed nutch-site.xml from empty to:
>
> <configuration>
>
> <property>
> <name>http.agent.name</name>
> <value>Test Nutch Agent (http://www.nutch.org/docs/en/bot.html)</value>
> </property>
>
> <property>
> <name>http.robots.agents</name>
> <value>Test Nutch Agent (http://www.nutch.org/docs/en/bot.html),*</value>
> </property>
>
> </configuration>
>
> 3) commented out feed plugin (in parse-plugins.xml)
>
> <mimeType name="application/rss+xml">
> <plugin id="parse-tika" />
> <!--<plugin id="feed" />-->
> </mimeType>
>
> 4) Changed log level in log4j.properties
>
> log4j.logger.org.apache.nutch.fetcher.Fetcher=TRACE,cmdstdout
> log4j.logger.org.apache.nutch.parse.ParseSegment=TRACE,cmdstdout
>
>
> Then I injected, generated and fetched db with the only RSS link, and got this exception with Tika:
>
>
> 2012-02-10 15:06:32,782 ERROR tika.TikaParser - Error parsing
> http://rss.sciam.com/sciam/earth-and-environment
> org.apache.tika.exception.TikaException: RSS parse error
>   at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:106)
>   at org.apache.nutch.parse.tika.TikaParser.getParse(TikaParser.java:95)
>   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:35)
>   at org.apache.nutch.parse.ParseCallable.call(ParseCallable.java:24)
>   at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:303)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:138)
>   at java.lang.Thread.run(Thread.java:662)
> Caused by: com.sun.syndication.io.ParsingFeedException: Invalid XML: Error on line 300: XML document
> structures must start and end within the same entity.
>   at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:207)
>   at com.sun.syndication.io.SyndFeedInput.build(SyndFeedInput.java:135)
>   at org.apache.tika.parser.feed.FeedParser.parse(FeedParser.java:68)
>   ... 6 more
> Caused by: org.jdom.input.JDOMParseException: Error on line 300: XML document structures must start
> and end within the same entity.
>   at org.jdom.input.SAXBuilder.build(SAXBuilder.java:468)
>   at com.sun.syndication.io.WireFeedInput.build(WireFeedInput.java:203)
>   ... 8 more
> Caused by: org.xml.sax.SAXParseException: XML document structures must start and end within the same
> entity.
>   at org.apache.xerces.util.ErrorHandlerWrapper.createSAXParseException(Unknown Source)
>   at org.apache.xerces.util.ErrorHandlerWrapper.fatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLErrorReporter.reportError(Unknown Source)
>   at org.apache.xerces.impl.XMLScanner.reportFatalError(Unknown Source)
>   at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.endEntity(Unknown Source)
>   at org.apache.xerces.impl.XMLDocumentScannerImpl.endEntity(Unknown Source)
>   at org.apache.xerces.impl.XMLEntityManager.endEntity(Unknown Source)
>   at org.apache.xerces.impl.XMLEntityScanner.load(Unknown Source)
>   at org.apache.xerces.impl.XMLEntityScanner.scanContent(Unknown Source)
>   at org.apache.xerces.impl.XMLDocumentFragmentScannerImpl.scanContent(Unknown Source)
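
For reference, a minimal sketch of what raising http.content.limit might look like inside the <configuration> element of nutch-site.xml. The value shown is an illustrative assumption, not a figure from this thread; in Nutch 1.x the stock default is 65536 bytes, and a negative value is documented as disabling truncation entirely.

```xml
<!-- Assumed addition to conf/nutch-site.xml; 1048576 (1 MB) is only an example value. -->
<property>
  <name>http.content.limit</name>
  <value>1048576</value>
</property>
```

If truncation is indeed the cause, the feed would presumably need to be fetched again after the change, since the segment already on disk still holds the cut-off content.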