-
Support for Open Graph meta tags
Ken Krugler 2011-09-23, 00:23
We were recently using Tika to process HTML pages that might have Open Graph meta tags.

The issue is that these tags get stripped out, and also aren't put into the metadata map.

The reason why is that Open Graph uses RDFa

http://stackoverflow.com/questions/2704942/html-validation-error-for-property-attribute/2705090#2705090

Since <meta property="xxx" content="yyy" /> isn't valid for XHTML 1.0, these tags can't be emitted.

But we could put them into the metadata map, by adding another test in the HtmlHandler code that currently has:

    if ("META".equals(name) && atts.getValue("content") != null) {
        // TIKA-478: For cases where we have either a name or
        // "http-equiv", assume that XHTMLContentHandler will emit
        // these in the <head>, thus passing them through safely.
        if (atts.getValue("http-equiv") != null) {
            addHtmlMetadata(
                    atts.getValue("http-equiv"),
                    atts.getValue("content"));
        } else if (atts.getValue("name") != null) {
            // Record the meta tag in the metadata
            addHtmlMetadata(
                    atts.getValue("name"),
                    atts.getValue("content"));
        }

If we catch the case of having no name=xxx attribute, but there is a property=xxx, then that would take a tag like:

    <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />

and put it into the metadata map as

    "og:url" => "http://www.imdb.com/title/tt0117500/"

Thoughts on this?

Thanks,

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
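For anyone who wants to try the idea outside of Tika, here is a minimal sketch of the proposed extra test. It is not the actual HtmlHandler code: the class MetaPropertySketch, its plain Map standing in for the Metadata object, the simplified addHtmlMetadata, and the attrs helper are all hypothetical names made up for illustration. Only the org.xml.sax classes are real, and they ship with the JDK.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical stand-in for the relevant part of HtmlHandler; not Tika's class.
class MetaPropertySketch {
    // Assumption: a plain ordered map stands in for Tika's Metadata object.
    private final Map<String, String> metadata = new LinkedHashMap<>();

    // Mirrors the quoted check and adds the proposed "property" fallback.
    public void startElement(String name, Attributes atts) {
        String content = atts.getValue("content");
        if ("META".equals(name) && content != null) {
            if (atts.getValue("http-equiv") != null) {
                addHtmlMetadata(atts.getValue("http-equiv"), content);
            } else if (atts.getValue("name") != null) {
                addHtmlMetadata(atts.getValue("name"), content);
            } else if (atts.getValue("property") != null) {
                // Proposed addition: <meta property="..." content="...">
                addHtmlMetadata(atts.getValue("property"), content);
            }
        }
    }

    // Simplified on purpose: the real method may do more than store the pair.
    private void addHtmlMetadata(String key, String value) {
        metadata.put(key, value);
    }

    public Map<String, String> getMetadata() {
        return metadata;
    }

    // Demo helper: builds SAX attributes from alternating name/value strings.
    public static Attributes attrs(String... pairs) {
        AttributesImpl atts = new AttributesImpl();
        for (int i = 0; i + 1 < pairs.length; i += 2) {
            atts.addAttribute("", pairs[i], pairs[i], "CDATA", pairs[i + 1]);
        }
        return atts;
    }

    public static void main(String[] args) {
        MetaPropertySketch handler = new MetaPropertySketch();
        handler.startElement("META", attrs(
                "property", "og:url",
                "content", "http://www.imdb.com/title/tt0117500/"));
        System.out.println(handler.getMetadata());
    }
}
```

Because the property branch comes last, a tag that carries both name and property is still recorded under its name, so existing behaviour for ordinary meta tags should not change.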
-
Re: Support for Open Graph meta tags
Nick Burch 2011-09-23, 10:09
On Thu, 22 Sep 2011, Ken Krugler wrote:
> The reason why is that Open Graph uses RDFa
Is it worth quickly checking what Any23 does for this kind of thing? (They are a hopefully soon-to-be-incubating project that a few people here are helping with, and which has some Tika links.) If they have a good model for handling this sort of RDF data, then it might make sense to do the same.
If not, I'd suggest we follow your suggested example :)
Nick
-
Re: Support for Open Graph meta tags
Jukka Zitting 2011-09-23, 10:24
Hi,
On Fri, Sep 23, 2011 at 2:23 AM, Ken Krugler <[EMAIL PROTECTED]> wrote:
> The reason why is that Open Graph uses RDFa
Instead of mapping the RDFa <meta> tags to Tika's Metadata and then back to normal XHTML <meta> tags, we might want to consider switching from plain XHTML to XHTML-with-RDFa as Tika's output format. That should make it easier to support more descriptive metadata and content annotations down the line.
In any case it would still be good to also map RDFa <meta> tags to the Metadata object. To do that properly (and to open the way to better XMP integration, my favourite TODO item :-), we'll probably need to extend the Metadata class to handle things like namespaces and structured values.
BR,
Jukka Zitting
-
Re: Support for Open Graph meta tags
Mattmann, Chris A 2011-09-23, 16:20
Hey Jukka,

This sounds like a good approach.

Cheers,
Chris

On Sep 23, 2011, at 3:24 AM, Jukka Zitting wrote:

> Hi,
>
> On Fri, Sep 23, 2011 at 2:23 AM, Ken Krugler
> <[EMAIL PROTECTED]> wrote:
>> The reason why is that Open Graph uses RDFa
>
> Instead of mapping the RDFa <meta> tags to Tika's Metadata and then
> back to normal XHTML <meta> tags, we might want to consider switching
> from plain XHTML to XHTML-with-RDFa as Tika's output format. That
> should make it easier to support more descriptive metadata and content
> annotations down the line.
>
> In any case it would still be good to map RDFa <meta> tags also to the
> Metadata object. To do that properly (and to open the way to better
> XMP integration, my favourite TODO item :-), we'll probably need to
> extend the Metadata class to handle things like namespaces and
> structured values.
>
> BR,
>
> Jukka Zitting

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [EMAIL PROTECTED]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
-
Re: Support for Open Graph meta tags
Ken Krugler 2011-09-23, 13:06
On Sep 23, 2011, at 3:24am, Jukka Zitting wrote:

> Hi,
>
> On Fri, Sep 23, 2011 at 2:23 AM, Ken Krugler
> <[EMAIL PROTECTED]> wrote:
>> The reason why is that Open Graph uses RDFa
>
> Instead of mapping the RDFa <meta> tags to Tika's Metadata and then
> back to normal XHTML <meta> tags, we might want to consider switching
> from plain XHTML to XHTML-with-RDFa as Tika's output format. That
> should make it easier to support more descriptive metadata and content
> annotations down the line.
>
> In any case it would still be good to map RDFa <meta> tags also to the
> Metadata object. To do that properly (and to open the way to better
> XMP integration, my favourite TODO item :-), we'll probably need to
> extend the Metadata class to handle things like namespaces and
> structured values.

That's what I was afraid of :)

My head starts to hurt when I have to deal with namespaces and RDF.

So I think I'll just patch my local copy to do the Q&D thing, and wait for someone with more XML/RDF-fu to deal with it properly.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
-
Re: Support for Open Graph meta tags
Jukka Zitting 2011-09-23, 13:12
Hi,
On Fri, Sep 23, 2011 at 3:06 PM, Ken Krugler <[EMAIL PROTECTED]> wrote:
> On Sep 23, 2011, at 3:24am, Jukka Zitting wrote:
>> In any case it would still be good to map RDFa <meta> tags also to the
>> Metadata object. To do that properly (and to open the way to better
>> XMP integration, my favourite TODO item :-), we'll probably need to
>> extend the Metadata class to handle things like namespaces and
>> structured values.
>
> That's what I was afraid of :)
>
> My head starts to hurt when I have to deal with namespaces and RDF.
From the client perspective the Metadata class should still provide a simple key-value interface for basic things, just like the Tika facade hides the more powerful constructs of the Parser and Detector interfaces under a simplified API. Of course the implementation side would be more complex...
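To make the layering concrete, here is one rough way such a facade could look. Everything in it is an assumption for illustration: NamespacedMetadataSketch, declarePrefix, the "{uri}localName" storage key and the prefix handling are invented here and are not part of Tika's Metadata class. The only outside fact used is that the og prefix conventionally maps to the Open Graph namespace http://ogp.me/ns#.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: namespace-aware storage behind a simple key-value view.
class NamespacedMetadataSketch {
    // prefix -> namespace URI, e.g. "og" -> "http://ogp.me/ns#"
    private final Map<String, String> prefixes = new HashMap<>();
    // "{namespaceUri}localName" -> value
    private final Map<String, String> values = new HashMap<>();

    public void declarePrefix(String prefix, String namespaceUri) {
        prefixes.put(prefix, namespaceUri);
    }

    // Precise interface: callers name the namespace explicitly.
    public void set(String namespaceUri, String localName, String value) {
        values.put(qualify(namespaceUri, localName), value);
    }

    public String get(String namespaceUri, String localName) {
        return values.get(qualify(namespaceUri, localName));
    }

    // Simple interface: "og:url" resolves through a declared prefix, and a
    // key without a known prefix is looked up in the empty namespace.
    public String get(String key) {
        int colon = key.indexOf(':');
        if (colon > 0) {
            String uri = prefixes.get(key.substring(0, colon));
            if (uri != null) {
                return get(uri, key.substring(colon + 1));
            }
        }
        return get("", key);
    }

    private static String qualify(String namespaceUri, String localName) {
        return "{" + namespaceUri + "}" + localName;
    }

    public static void main(String[] args) {
        NamespacedMetadataSketch metadata = new NamespacedMetadataSketch();
        metadata.declarePrefix("og", "http://ogp.me/ns#");
        metadata.set("http://ogp.me/ns#", "url",
                "http://www.imdb.com/title/tt0117500/");
        System.out.println(metadata.get("og:url"));
    }
}
```

With something along these lines, a client that only knows the key "og:url" gets the same answer as one that asks by namespace URI, which is the kind of simple view over a richer model described above.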
> So I think I'll just patch my local copy to do the Q&D thing, and wait for
> someone with more XML/RDF-fu to deal with it properly.
Until Someone (TM, :-) does that, I'd be very happy to see the simple property=xxx mapping you described added to HtmlParser. It's obviously an improvement to the way Tika currently works, and I don't see any major backwards compatibility issues caused by starting with a simple solution like that and later on migrating to a more complete RDF-based metadata model.
BR,
Jukka Zitting
-
Re: Support for Open Graph meta tags
Antoni Mylka 2011-09-23, 14:00
On 2011-09-23 15:12, Jukka Zitting wrote:
>> So I think I'll just patch my local copy to do the Q&D thing, and wait for
>> someone with more XML/RDF-fu to deal with it properly.
>
> Until Someone (TM, :-) does that, I'd be very happy to see the simple
> property=xxx mapping you described added to HtmlParser.
There seems to be a long tradition at the ASF of appealing to Someone whenever there is talk about RDF. Chris Mattmann wrote back in November 2007:
"... it's reasonable that someone may need to rewrite the ability to represent metadata in RDF ..."
Whoever that Someone is - he has my support. ;-)
On a more serious note, though, in the four years since that metadata discussion three separate RDF-related projects have appeared in and around the ASF: Clerezza, Jena and Any23. Two are already in incubation, and the third is trying to get there. Jeremias Maerki noticed the lack of coordination in the metadata field four years ago. It's not getting any better.
Antoni Myłka [EMAIL PROTECTED]
-
Re: Support for Open Graph meta tags
Ken Krugler 2011-09-23, 14:19
On Sep 23, 2011, at 7:00am, Antoni Mylka wrote:

> On 2011-09-23 15:12, Jukka Zitting wrote:
>>> So I think I'll just patch my local copy to do the Q&D thing, and wait for
>>> someone with more XML/RDF-fu to deal with it properly.
>>
>> Until Someone (TM, :-) does that, I'd be very happy to see the simple
>> property=xxx mapping you described added to HtmlParser.
>
> There seems to be a long tradition at the ASF of appealing to Someone whenever there is talk about RDF. Chris Mattmann wrote back in November 2007:
>
> "... it's reasonable that someone may need to rewrite the ability to represent metadata in RDF ..."
>
> Whoever that Someone is - he has my support. ;-)
>
> On a more serious note, though, in the four years since that metadata discussion three separate RDF-related projects have appeared in and around the ASF: Clerezza, Jena and Any23. Two are already in incubation, and the third is trying to get there. Jeremias Maerki noticed the lack of coordination in the metadata field four years ago. It's not getting any better.

Agreed.

From my fairly naive perspective, it seems like one of the challenges here is that Tika tries to normalize/simplify interacting with data. E.g. I just want the text from any document I come across. That seems to be the primary use case.

Whereas RDF is more focused on precision, in being explicit about the relationships between data.

So I would expect to see many interesting tradeoffs in figuring out how best to straddle both worlds. Heck, figuring out how best to map fairly simple document elements to XHTML 1.0 has proven challenging.

It would be great to get patches from that Mythical Someone who knows RDF - versus, say, me, where the end result is likely to be horribly wrong. For better or worse, RDF has never been an itch that I've needed to scratch.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
-
Re: Support for Open Graph meta tags
Jukka Zitting 2011-09-23, 16:57
Hi,
On Fri, Sep 23, 2011 at 4:19 PM, Ken Krugler <[EMAIL PROTECTED]> wrote:
> From my fairly naive perspective, it seems like one of the challenges
> here is that Tika tries to normalize/simplify interacting with data.
[...]
> Whereas RDF is more focused on precision, in being explicit about
> the relationships between data.
Yep, as you mention, that's obviously an issue that needs work and sometimes involves tricky tradeoffs.
That said, I'm pretty confident that there is no fundamental disconnect between these two goals, and I think over time (years most likely) we will be able to work out all the details. We're already taking steps along that road with our parsers exposing increasingly more detailed document structure and our metadata model already handling things like dates in a more structured manner.
At least that seems to me like an obvious candidate for inclusion in a future roadmap for post-1.0 Tika.
> It would be great to get patches from that Mythical Someone who knows RDF
Agreed. :-) As Antoni said, this is an area where we could and should be able to do better. There are quite a few RDF experts already at and around Apache, and it shouldn't be too hard to position Tika more prominently on their radars. The Any23 proposal that Chris is championing is one good chance for this.
Also, now that I work at Adobe, my XMP itch has been growing quite a bit, so I wouldn't be surprised if I ended up working on better XMP (and thus RDF) support soon after Tika 1.0 is out.
BR,
Jukka Zitting
-
Re: Support for Open Graph meta tags
Nick Burch 2011-09-23, 19:48
On Fri, 23 Sep 2011, Jukka Zitting wrote:
>> It would be great to get patches from that Mythical Someone who knows
>> RDF
>
> Agreed. :-) As Antoni said, this is an area where we could and should
> be able to do better. There are quite a few RDF experts already at and
> around Apache, and it shouldn't be too hard to position Tika more
> prominently on their radars. The Any23 proposal that Chris is
> championing is one good chance for this.
I suggest a solution involving ApacheCon and some beer :)
Also at ApacheCon on the Tuesday is the BarCamp, so assuming a few of us will be there by then (I think we will be...) we could do a session there and hopefully get some RDF experts in to advise us.
Nick
-
Re: Support for Open Graph meta tags
Mattmann, Chris A 2011-09-23, 00:51
Hey Ken,

Super +1, this sounds like a great idea.

Cheers,
Chris

On Sep 22, 2011, at 6:23 PM, Ken Krugler wrote:

> We were recently using Tika to process HTML pages that might have Open Graph meta tags.
>
> The issue is that these tags get stripped out, and also aren't put into the metadata map.
>
> The reason why is that Open Graph uses RDFa
>
> http://stackoverflow.com/questions/2704942/html-validation-error-for-property-attribute/2705090#2705090
>
> Since <meta property="xxx" content="yyy" /> isn't valid for XHTML 1.0, these tags can't be emitted.
>
> But we could put them into the metadata map, by adding another test in the HtmlHandler code that currently has:
>
>     if ("META".equals(name) && atts.getValue("content") != null) {
>         // TIKA-478: For cases where we have either a name or
>         // "http-equiv", assume that XHTMLContentHandler will emit
>         // these in the <head>, thus passing them through safely.
>         if (atts.getValue("http-equiv") != null) {
>             addHtmlMetadata(
>                     atts.getValue("http-equiv"),
>                     atts.getValue("content"));
>         } else if (atts.getValue("name") != null) {
>             // Record the meta tag in the metadata
>             addHtmlMetadata(
>                     atts.getValue("name"),
>                     atts.getValue("content"));
>         }
>
> If we catch the case of having no name=xxx attribute, but there is a property=xxx, then that would take a tag like:
>
>     <meta property="og:url" content="http://www.imdb.com/title/tt0117500/" />
>
> and put it into the metadata map as "og:url" => "http://www.imdb.com/title/tt0117500/"
>
> Thoughts on this?
>
> Thanks,
>
> -- Ken
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> custom big data solutions & training
> Hadoop, Cascading, Mahout & Solr

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [EMAIL PROTECTED]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++