Solr, mail # user - How to combine RSS w/ Tika when using Data Import Handler (DIH)


Pulkit Singhal 2011-09-12, 18:45
Pulkit Singhal 2011-09-13, 15:55
Re: How to combine RSS w/ Tika when using Data Import Handler (DIH)
Chris Hostetter 2011-09-13, 16:09

: I've been investigating and I understand that using the RegexTransformer is
: an option that is open for identifying and extracting data to multiple
: fields from a single rss value source ... But rather than hack together
: something I once again wanted to check with the community: Is there another
: option for navigating the HTML DOM tree using some well-tested transformer
: or Tika or something?

I don't think so ... if it's a *really* well-formed feed, then the
description will actually be xhtml nodes (with the appropriate
namespace) that are already part of the Document's DOM.
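
To illustrate that first case, here is a rough sketch in Python (standard
library only, not DIH config). The feed content, the "xhtml" prefix and the
extract_links helper are all made up for the example; the point is just that
when the description holds real namespaced XHTML elements, an ordinary
path expression can reach them with no extra HTML parsing step.

```python
import xml.etree.ElementTree as ET

# Hypothetical feed: the description contains real XHTML child elements
# (in the XHTML namespace), not an escaped or CDATA-wrapped string.
FEED = """<rss version="2.0" xmlns:xhtml="http://www.w3.org/1999/xhtml">
  <channel>
    <item>
      <title>Example item</title>
      <description>
        <xhtml:div>
          <xhtml:a href="http://example.com/a?x=1&amp;y=2">First link</xhtml:a>
          <xhtml:img src="http://example.com/pic.png"/>
        </xhtml:div>
      </description>
    </item>
  </channel>
</rss>"""

# Prefix-to-URI map used when evaluating the path expressions below.
NS = {"xhtml": "http://www.w3.org/1999/xhtml"}


def extract_links(feed_xml):
    """Return (title, [href, ...]) for every item in the feed."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iterfind("./channel/item"):
        title = item.findtext("title")
        # The anchors are ordinary nodes of the feed's own tree.
        hrefs = [a.get("href")
                 for a in item.iterfind("./description//xhtml:a", NS)]
        results.append((title, hrefs))
    return results
```

Note that the XML parser has already turned "&amp;" in the href back into
"&", so no separate unescaping is needed in this case.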

But if it's just a blob of CDATA that happens to contain well-formed HTML,
then I think a regex is currently your best option -- you'll probably want
something tailor-made for the subtleties of the site whose RSS you're
scraping anyway, since details like "are & chars in the URLs HTML-escaped?"
are going to vary from site to site.
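
A minimal sketch of the regex route, again in plain Python rather than
RegexTransformer config. The sample feed, the two patterns and the
extract_fields helper are assumptions for illustration; real patterns would
need tuning for the particular site. It pulls several fields out of one
description value and unescapes "&amp;" in the URLs, which is the kind of
site-specific detail mentioned above.

```python
import html
import re
import xml.etree.ElementTree as ET

# Hypothetical feed: the description is an opaque CDATA string of HTML.
FEED = """<rss version="2.0">
  <channel>
    <item>
      <title>Example item</title>
      <description><![CDATA[<p>Read <a href="http://example.com/a?x=1&amp;y=2">more</a></p>
        <img src="http://example.com/pic.png" alt="pic">]]></description>
    </item>
  </channel>
</rss>"""

# Deliberately narrow patterns: double-quoted attributes only.
HREF_RE = re.compile(r'<a\s[^>]*?href="([^"]*)"', re.IGNORECASE)
IMG_RE = re.compile(r'<img\s[^>]*?src="([^"]*)"', re.IGNORECASE)


def extract_fields(feed_xml):
    """Split each item's description blob into separate link/image fields."""
    root = ET.fromstring(feed_xml)
    rows = []
    for item in root.iterfind("./channel/item"):
        # CDATA text arrives verbatim, so "&amp;" is still escaped here.
        blob = item.findtext("description") or ""
        rows.append({
            "title": item.findtext("title"),
            "links": [html.unescape(u) for u in HREF_RE.findall(blob)],
            "images": [html.unescape(u) for u in IMG_RE.findall(blob)],
        })
    return rows
```

If the site does not escape "&" in its URLs, html.unescape leaves a bare "&"
alone, but a URL containing something that merely looks like an entity would
be altered, so that call is worth checking against real data.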

It would probably be possible to write a DIH Transformer based on
something like TagSoup to actually produce a DOM from an arbitrary HTML
string in an entity, so you could then treat it as a subentity and use the
XPathEntityProcessor -- but I don't think I've seen anyone talk about
doing anything like that before.
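
To make the idea concrete outside of DIH, here is a sketch of the same
two-step shape in Python: a forgiving parser turns a sloppy HTML string into
a tree, and path expressions are then run against that tree. This is not a
DIH Transformer and does not use TagSoup; Python's html.parser stands in for
it, and FragmentTreeBuilder, html_to_tree and the VOID set are hypothetical
names invented for the example. It does no real HTML repair beyond ignoring
stray end tags and tolerating unclosed ones.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

# Elements that never have a closing tag (partial list, for illustration).
VOID = {"br", "img", "hr", "meta", "link", "input"}


class FragmentTreeBuilder(HTMLParser):
    """Build an ElementTree from an HTML string that may not be well formed."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = ET.Element("fragment")
        self.stack = [self.root]

    def _attrs(self, attrs):
        # Valueless attributes come through as None; store them as "".
        return {k: (v or "") for k, v in attrs}

    def handle_starttag(self, tag, attrs):
        elem = ET.SubElement(self.stack[-1], tag, self._attrs(attrs))
        if tag not in VOID:
            self.stack.append(elem)

    def handle_startendtag(self, tag, attrs):
        ET.SubElement(self.stack[-1], tag, self._attrs(attrs))

    def handle_endtag(self, tag):
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        parent = self.stack[-1]
        if len(parent):
            last = parent[-1]
            last.tail = (last.tail or "") + data
        else:
            parent.text = (parent.text or "") + data


def html_to_tree(fragment):
    builder = FragmentTreeBuilder()
    builder.feed(fragment)
    builder.close()
    return builder.root
```

Usage would look something like this, with a fragment that leaves tags
unclosed on purpose:

```python
tree = html_to_tree('<p>Read <a href="http://example.com/a?x=1&amp;y=2">more</a>'
                    '<br><img src="http://example.com/pic.png"><p>unclosed <b>tail')
links = [a.get("href") for a in tree.iterfind(".//a")]
images = [img.get("src") for img in tree.iterfind(".//img")]
```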

-Hoss