Tika, mail # user - Parse metadata only


Re: Parse metadata only
Nick Burch 2012-05-29, 13:50
On Tue, 29 May 2012, Thinus Prinsloo wrote:
> I would like to parse the meta-data of a massive amount of PDF files
> only. I do not want to extract the text, not yet anyway, only get
> meta-data information such as "Creation-Date", etc.  Is it possible for
> Tika to provide the meta-data without doing a parse on the whole
> document (with a content handler, etc.)?

At the moment, that's not possible. Most file formats don't keep all their
metadata in one separate place, so you end up having to process almost all
of the file anyway. (There has been talk of implementing this in the past,
but this problem has largely meant it hasn't been tackled.)

If you don't want the text, you can just pass in a content handler that
ignores everything.
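
As a rough illustration of a handler that ignores everything: the JDK's
org.xml.sax.helpers.DefaultHandler implements every SAX callback as a no-op,
so it is one candidate for this. The sketch below is an assumption-laden
stand-in, not code from this thread: it uses the JDK's own SAX parser in
place of Tika so that it runs without extra dependencies, and the class and
method names (IgnoreContentDemo, CountingHandler, parse) are made up for the
example. With Tika, the same handler object would be passed as the second
argument of Parser.parse, as noted in the comment.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.helpers.DefaultHandler;

public class IgnoreContentDemo {

    // Hypothetical handler that counts text characters. It exists only to
    // contrast with the no-op DefaultHandler.
    static class CountingHandler extends DefaultHandler {
        int chars = 0;

        @Override
        public void characters(char[] ch, int start, int length) {
            chars += length;
        }
    }

    // Feeds SAX events for the given XML string to the handler.
    // Stand-in for Tika: there, the call would look like
    //   parser.parse(inputStream, handler, metadata, new ParseContext());
    // and the metadata object would be read after the call returns.
    static void parse(String xml, DefaultHandler handler) throws Exception {
        byte[] bytes = xml.getBytes(StandardCharsets.UTF_8);
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new ByteArrayInputStream(bytes), handler);
    }

    public static void main(String[] args) throws Exception {
        String doc = "<html><body><p>lots of body text</p></body></html>";

        // Every event is discarded; no text is accumulated anywhere.
        parse(doc, new DefaultHandler());

        // The same events, this time observed by a handler that keeps state.
        CountingHandler counting = new CountingHandler();
        parse(doc, counting);
        System.out.println("characters seen: " + counting.chars);
    }
}
```

Note that this only avoids collecting the text. As explained above, the
parser still has to read through the document to find the metadata.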

Nick