Tika, mail # dev - Metadata situation and XMP support in Tika


Re: Metadata situation and XMP support in Tika
Ray Gauss II 2012-04-10, 22:04
Hi Jörg,

As you've seen from TIKA-859 and TIKA-842, I've had to deal with similar issues.

Those issues were prerequisites for TIKA-774, which itself contains another mapping that converts the data output by ExifTool to the proper IPTC metadata defined in TIKA-842.

The code for the ExifTool parser is now at https://github.com/Alfresco/tika-exiftool, and that mapping specifically is at:

https://github.com/Alfresco/tika-exiftool/blob/master/src/main/java/org/apache/tika/parser/exiftool/ExiftoolTikaIptcMapper.java

I'm more than happy to coordinate with you on the XMP stuff going forward if you'd like.

Ray Gauss II
DAM Architect, Alfresco

On Apr 5, 2012, at 8:58 AM, Joerg Ehrlich wrote:

> Hi everyone,
>
> I am an engineer in the XMP/Metadata team at Adobe and we would like to leverage Tika in current projects for metadata extraction (and mimetype detection).
> Our current systems primarily use the XMP data model to manage and interact with metadata.
> As far as I can see, the support for the XMP data model and also for standard metadata schemas/namespaces (like IPTC, Exif, etc.) in Tika is pretty suboptimal as of today.
> But instead of wrapping Tika in our own layers of code in our systems, we feel that it would be more useful to contribute to the project going forward.
>
> I have taken a closer look at Tika and at how to improve its metadata/XMP output.
> I saw that you have a bug for XMP already (TIKA-756), which I would probably use to submit any patches related to that.
> But I am currently unsure what the best approach would be to do the mapping to XMP and I would like to hear your opinion on it before starting any work.
>
> Let me quickly summarize, to check whether I have understood the basic metadata concept correctly:
>
> 1. Each parser fills a Metadata map, which is a simple key-value list where a key can also hold multiple values.
>
> 2. Mostly, the keys for the Metadata map are taken from fixed lists which are defined as interfaces in the Metadata class.
>
> 3. Those keys are usually Property objects, where the Property class also serves as a static list which registers every property that is created in the Metadata interfaces. This Property class resembles the XMP data model to some extent but does not store, e.g., any hierarchical information. It also leaves every client the choice of whether to store property names with prefixes.
>
> 4. Any metadata outputter just iterates over the Metadata map and can query the Property list for additional information.
>
> 5. In the case of the XMP outputter (XMPContentHandler), only those properties which are stored with a prefix in the Property list are output.
>
>
> I see two potential ways to improve the situation:
>
>
> 1. Have a fixed mapping table for each mime type, which would be used in XMPContentHandler to map from the Metadata map to the XMP data model. Such mapping tables would be pretty ugly, as each parser produces a different metadata map and there is no consistent way to handle them. This option would be the least invasive for other clients of Tika, but it would also be a real hack and would not really improve the metadata situation in Tika in general.
>
> 2. Try to improve the key interface lists of the Metadata class and adjust all parsers accordingly. This could be done by adding new keys with prefixes and keeping/deprecating the existing ones so as not to disturb existing clients, similar to what is proposed for the DublinCore namespace in TIKA-859 and TIKA-842.
> This would be more invasive but would offer the opportunity to really improve the metadata situation. I already saw a couple of places in the code that clearly break existing standards. But there are also examples where one value might have to be mapped to several properties at the same time: if you look at the mapping of GPS data from Exif, this is currently mapped to the W3C vocabulary in Tika. But in XMP this mapping is defined differently [1] by CIPA (the EXIF standardization committee). So probably both mappings have to be supported.
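
The model summarized in points 1 to 5 above, together with the second option, can be sketched roughly as follows. This is only an illustration under stated assumptions: MetadataSketch and all of its methods are hypothetical stand-ins written for this example and are not Tika's actual Metadata or Property API, and the key names ("dc:title", "title", "dc:subject") are chosen purely for illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a multi-valued metadata map; NOT Tika's Metadata class. */
public class MetadataSketch {
    // Key -> list of values, so one key can carry several values (point 1).
    private final Map<String, List<String>> values = new LinkedHashMap<>();

    /** Appends a value to a key instead of overwriting it. */
    public void add(String key, String value) {
        values.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    /** Returns all values stored for a key, or an empty list if there are none. */
    public List<String> get(String key) {
        return values.getOrDefault(key, Collections.emptyList());
    }

    /**
     * Mimics the behaviour described in point 5: only keys that carry a
     * namespace prefix (assumed here to look like "prefix:name") are kept.
     */
    public Map<String, List<String>> prefixedOnly() {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : values.entrySet()) {
            if (e.getKey().indexOf(':') > 0) {
                out.put(e.getKey(), e.getValue());
            }
        }
        return out;
    }

    /**
     * Option 2 in miniature: store the value under the new prefixed key and
     * keep the legacy unprefixed key so that existing clients still find it.
     */
    public void addWithLegacyAlias(String prefixedKey, String legacyKey, String value) {
        add(prefixedKey, value);
        add(legacyKey, value);
    }

    public static void main(String[] args) {
        MetadataSketch m = new MetadataSketch();
        m.addWithLegacyAlias("dc:title", "title", "Sunset");
        m.add("dc:subject", "beach");
        m.add("dc:subject", "evening");
        // Only the prefixed keys would reach a prefix-filtering outputter.
        System.out.println(m.prefixedOnly());
    }
}
```

Writing each value under both the prefixed and the legacy key is the simplest way to keep old clients working, at the cost of duplicated entries. The same pattern would also cover the GPS case, where one Exif value may need to appear under two different vocabularies.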