Tika, mail # dev - Pushing parsers upstream


Re: Pushing parsers upstream
Nick Burch 2011-12-13, 11:23
On Tue, 13 Dec 2011, Jukka Zitting wrote:
> To avoid this issue I propose that we start moving some of our parser
> implementations to upstream projects. Now with Tika 1.0 out we have
> stable Parser and Detector interfaces and related APIs that upstream
> libraries could implement directly without us having to worry about
> changing Tika code whenever a new version of a parser library becomes
> available.

A couple of issues do spring to mind with this plan:
* Metadata keys - if a parser enhancement or new feature needs a new
   metadata key, then you end up having to wait for a new Tika release to
   get it (only then can the upstream code that uses it be released)
* Consistency - consistency of both markup and metadata keys will be
   harder to ensure when the parsers aren't in the same codebase

For detectors, there's an extra issue here. At the moment, both the Zip and
OLE2 detectors handle more than just the POI formats, and in the Zip case
rely on code shared between the parsers (poi+keynote) and detector. How
would this work if the container detectors were handed to POI? And whose
job would it be to test it?
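
To make the detector concern concrete, here is a rough sketch of the kind
of logic a Zip container detector ends up with. This is not Tika's actual
detector code: the class and method names are made up for illustration, and
the marker entries ("[Content_Types].xml" for OOXML, "index.apxl" for
Keynote) and returned type names are assumptions about how such a check
might look. The point is only that a single pass over the zip entries has
to know about formats from more than one parser library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.util.HashSet;
import java.util.Set;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical sketch only; not the real Tika Zip container detector.
class ZipContainerSketch {

    // Collect the names of all entries in a zip stream.
    static Set<String> entryNames(InputStream in) throws IOException {
        Set<String> names = new HashSet<>();
        try (ZipInputStream zip = new ZipInputStream(in)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        }
        return names;
    }

    // Pick a container type from the entry names. The marker entries and
    // the returned type strings are illustrative assumptions.
    static String detect(Set<String> names) {
        if (names.contains("[Content_Types].xml")) {
            // OOXML package, which would be parsed via POI
            return "application/x-tika-ooxml";
        }
        if (names.contains("index.apxl")) {
            // Keynote presentation, which is not a POI format
            return "application/vnd.apple.keynote";
        }
        // Nothing recognised: report a plain zip archive
        return "application/zip";
    }
}
```

If a check like this lived in POI, the Keynote branch would either have to
move there too or be split out into a second pass over the same archive.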

That's a general question, actually: how much testing would need to remain
on the Tika side?

Oh, but I guess this counts as your answer on what I should be doing with
my Ogg Vorbis parser :)

Nick