Pulkit Singhal 2011-09-12, 18:45
Pulkit Singhal 2011-09-13, 15:55
Re: How to combine RSS w/ Tika when using Data Import Handler (DIH)
Chris Hostetter 2011-09-13, 16:09
: I've been investigating and I understand that using the RegexTransformer is
: an option that is open for identifying and extracting data to multiple
: fields from a single rss value source ... But rather than hack together
: something I once again wanted to check with the community: Is there another
: option for navigating the HTML DOM tree using some well-tested transformer
: or Tika or something?
I don't think so ... if it's a *really* well-formed feed, then the
description will actually be XHTML nodes (with the appropriate
namespace) that are already part of the Document's DOM.
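
To illustrate the well-formed case, here is a minimal sketch in plain
Python (standard library only, not DIH configuration). The feed content,
the example.com URLs and the function name xhtml_links are made up for
illustration; the point is only that namespaced XHTML children of the
description can be reached with an ordinary path expression.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed item whose <description> holds real XHTML child nodes
# (not a CDATA blob), so they are part of the parsed document tree.
FEED = """<rss version="2.0">
  <channel>
    <item>
      <title>Example</title>
      <description>
        <div xmlns="http://www.w3.org/1999/xhtml">
          <a href="http://example.com/a?x=1&amp;y=2">first</a>
          <a href="http://example.com/b">second</a>
        </div>
      </description>
    </item>
  </channel>
</rss>"""

# Prefix-to-namespace mapping used in the path below.
XHTML = {"x": "http://www.w3.org/1999/xhtml"}


def xhtml_links(feed_xml):
    """Return the href of every XHTML <a> under an item's description."""
    root = ET.fromstring(feed_xml)
    anchors = root.findall("./channel/item/description//x:a", XHTML)
    return [a.get("href") for a in anchors]


links = xhtml_links(FEED)
print(links)
```

Because the XML parser has already decoded the attribute values, the
"&amp;" in the first URL comes back as a plain "&".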
But if it's just a blob of CDATA that happens to contain well-formed HTML,
then I think a regex is currently your best option -- you'll probably want
something tailor-made for the subtleties of the site whose RSS you're
scraping anyway, since things like "are & chars in the URLs HTML-escaped?"
are going to vary from site to site.
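
A rough sketch of the regex approach, again in plain Python rather than
as a RegexTransformer config. The sample description, the two patterns
and the field names image_url / link_url are assumptions for
illustration; a real feed would need patterns tuned to that site's
markup (quote style, attribute order, whether URLs are escaped at all).

```python
import html
import re

# Hypothetical contents of a CDATA description for one RSS item.
DESCRIPTION = (
    '<p>Photo: <img src="http://example.com/img?id=7&amp;size=large"> '
    '<a href="http://example.com/post/7">Read more</a></p>'
)

# Site-specific patterns: these assume double-quoted attribute values.
IMG_SRC = re.compile(r'<img\b[^>]*?\bsrc="([^"]*)"', re.IGNORECASE)
A_HREF = re.compile(r'<a\b[^>]*?\bhref="([^"]*)"', re.IGNORECASE)


def extract_fields(description):
    """Pull several fields out of a single HTML string."""

    def first(pattern):
        match = pattern.search(description)
        # This sample site HTML-escapes "&" in URLs, so undo that here.
        return html.unescape(match.group(1)) if match else None

    return {"image_url": first(IMG_SRC), "link_url": first(A_HREF)}


fields = extract_fields(DESCRIPTION)
print(fields)
```

If the site does not escape its URLs, the html.unescape step should be
dropped, which is exactly the kind of per-site detail mentioned above.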
It would probably be possible to write a DIH Transformer based on
something like tagsoup to actually produce a DOM from an arbitrary HTML
string in an entity, so you could then treat it as a subentity and use the
XPathEntityProcessor -- but I don't think I've seen anyone talk about
doing anything like that before.
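
To make the idea concrete, here is a sketch of the "lenient parse, then
navigate the tree" step. It is not a DIH Transformer and does not use
tagsoup: Python's standard html.parser stands in for tagsoup, and
ElementTree paths stand in for XPathEntityProcessor. The class name
TreeBuilder, the helper html_to_tree and the sample fragment are all
hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag, so they are not kept open.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class TreeBuilder(HTMLParser):
    """Build an ElementTree from possibly sloppy HTML."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        attrib = {name: value or "" for name, value in attrs}
        node = ET.SubElement(self.stack[-1], tag, attrib)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        node = self.stack[-1]
        if len(node):  # text following a child element belongs in its tail
            node[-1].tail = (node[-1].tail or "") + data
        else:
            node.text = (node.text or "") + data


def html_to_tree(fragment):
    builder = TreeBuilder()
    builder.feed(fragment)
    builder.close()
    return builder.root


# Note the unclosed <b>: the builder still yields a usable tree.
tree = html_to_tree(
    '<p>Price: <b>$10<br><a href="http://example.com/x?a=1&amp;b=2">buy</a></p>'
)
hrefs = [a.get("href") for a in tree.findall(".//a[@href]")]
print(hrefs)
```

A real implementation inside DIH would have to do the equivalent in Java
and hand the resulting document to a sub-entity; this sketch only shows
that the parse-then-query split is workable on messy input.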