RE: Metadata situation and XMP support in Tika
Oops, forgot the links...

-----Original Message-----

Hi Nick,

Yes, I agree that Tika should support unified access to common metadata properties like title, description, keywords, creator, rating, etc. So there should be clear semantics for those common properties regardless of the underlying implementation in various metadata containers. And access to these properties can be, or should be, as simple as "Metadata.title".
On the other hand, if you think about Tika being used in business workflows where clients really care about the underlying semantics and file format specific metadata, you might need something more powerful and flexible to access and manage metadata.
And I also agree that the latter should be possible without sacrificing the first.

On a side note:
While the idea of "Someone who understands the format works out how to map the file format's metadata onto a common set" is very compelling and is easy to do, in reality this can get very complicated. And if people have big business depending on such mappings, they tend to have different opinions about what the right way is. That's why we have organizations like the "Metadata Working Group" [1] or the W3C "Media Annotation Working Group" [2] trying to clean up the mess that has evolved over the last decades in this area.
And the moment you start writing metadata back into files, you will also start running into all sorts of complications if you have simplified too much in the read case. But that is not a problem for Tika right now.

I agree with Ray that the current implementation can support both approaches to make metadata accessible.
While the metadata map can be used to offer easy access to the common set of properties, an XMP output could be used to offer a more extensive, flexible and semantically clearer access to a file's metadata.
I agree with Ray that the common set of keys in the Metadata map should inherit/alias from well-known, standard namespaces like Dublin Core. That's why I said the Tika parsers should read metadata using the standard namespaces and properties. This would also make the mapping in the parsers clearer for developers who want to change something. Currently you always have to guess where something is mapped to.
In general, I'd recommend Dublin Core and the semantics of the ISO part of XMP, which builds on top of DC, for common and file format neutral Tika properties that are offered to clients.
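
To make the two access levels a bit more concrete, here is a minimal sketch in Java. It is only an illustration under assumptions: QName, Namespaces, and QualifiedMetadata are hypothetical names, not Tika classes. A parser would store values under namespace-qualified keys (the Dublin Core and XMP basic namespace URIs are the standard ones), and a simple accessor for the common set is layered on top.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;

// Hypothetical namespace-qualified key; not part of Tika's API.
final class QName {
    final String namespaceUri;
    final String localName;

    QName(String namespaceUri, String localName) {
        this.namespaceUri = namespaceUri;
        this.localName = localName;
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof QName)) {
            return false;
        }
        QName other = (QName) o;
        return namespaceUri.equals(other.namespaceUri)
                && localName.equals(other.localName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(namespaceUri, localName);
    }
}

// Standard namespace URIs for Dublin Core elements and the XMP basic schema.
final class Namespaces {
    static final String DC = "http://purl.org/dc/elements/1.1/";
    static final String XMP = "http://ns.adobe.com/xap/1.0/";

    private Namespaces() {}
}

// Hypothetical store that a parser would fill using standard namespaces.
final class QualifiedMetadata {
    private final Map<QName, String> values = new LinkedHashMap<>();

    void put(String namespaceUri, String localName, String value) {
        values.put(new QName(namespaceUri, localName), value);
    }

    // Detailed access: the caller names the namespace and the property.
    String get(String namespaceUri, String localName) {
        return values.get(new QName(namespaceUri, localName));
    }

    // Simple access: the common "title" is defined as dc:title.
    String title() {
        return get(Namespaces.DC, "title");
    }
}
```

A client that only wants the common set calls title(), while a client that cares about the underlying semantics can ask for a specific namespace and property, for example get(Namespaces.XMP, "Rating").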
And I agree with Ray that having all metadata interfaces be part of the Metadata class is more confusing than helpful for clients.

I am about to put an architectural metadata roadmap on the Tika Wiki for further discussion.
There I want to illustrate a couple of ideas I have also been discussing with Jukka so far and the steps we see on a roadmap that should help us to improve the metadata situation for Tika.


[1] http://metadataworkinggroup.com/specs/
[2] http://www.w3.org/TR/2012/REC-mediaont-10-20120209/

-----Original Message-----
From: Ray Gauss II [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 24 April 2012 15:10
Subject: Re: Metadata situation and XMP support in Tika

I think the aliasing approach supports both use cases nicely, i.e.:

   Property TITLE = DublinCore.DC_TITLE; ...

Users then only have to concern themselves with "give me the metadata that best fits the idea of Title, as defined by Tika", and do not even have to know about DublinCore, but can dig into the details of the implementation as needed.

This separation is less of a concern in the particular case of DublinCore since it is such a basic, broad, and widely accepted standard, but for other standards direct inclusion in the Metadata interface makes less sense.  For example, at the moment we're essentially asking users to say "give me the metadata that best fits the idea of Keywords, as defined by MSOffice" which doesn't make a lot of sense when dealing with something like images.  If we aliased:

   Property KEYWORDS = MSOffice.MS_KEYWORDS; ...

we're back to the intended "give me the metadata that best fits the idea of Keywords, as defined by Tika".  In this case, DublinCore.DC_SUBJECT is probably a much better standard to alias keywords from than MSOffice, but I'm just sticking to the current mappings for this example.
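
A rough, self-contained sketch of how such aliasing could behave is below. The classes here are simplified stand-ins written for illustration, not Tika's actual Property or Metadata implementations, and CommonProperties and SimpleMetadata are hypothetical names. KEYWORDS is aliased to DublinCore.DC_SUBJECT, following the remark above that it is probably the better source.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for a metadata property.
final class Property {
    private final String name;

    Property(String name) {
        this.name = name;
    }

    String getName() {
        return name;
    }
}

// Standard-specific definitions (constant names follow the examples above).
final class DublinCore {
    static final Property DC_TITLE = new Property("dc:title");
    static final Property DC_SUBJECT = new Property("dc:subject");

    private DublinCore() {}
}

// Hypothetical format-neutral keys: each one is an alias, i.e. the same
// Property instance as the standard definition it is borrowed from.
final class CommonProperties {
    static final Property TITLE = DublinCore.DC_TITLE;
    static final Property KEYWORDS = DublinCore.DC_SUBJECT;

    private CommonProperties() {}
}

// Minimal metadata map keyed by property name.
final class SimpleMetadata {
    private final Map<String, String> values = new HashMap<>();

    void set(Property property, String value) {
        values.put(property.getName(), value);
    }

    String get(Property property) {
        return values.get(property.getName());
    }
}

class AliasDemo {
    public static void main(String[] args) {
        SimpleMetadata metadata = new SimpleMetadata();
        // A parser writes using the standard-specific property...
        metadata.set(DublinCore.DC_TITLE, "Annual Report");
        // ...and a client reads through the format-neutral alias.
        System.out.println(metadata.get(CommonProperties.TITLE));
    }
}
```

Because the alias is the same object as the standard property, nothing is copied or mapped at read time; changing which standard backs KEYWORDS is a one-line change in the alias class.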

On Apr 24, 2012, at 7:43 AM, Nick Burch wrote: