Tika, mail # dev - Pushing parsers upstream


Jukka Zitting 2011-12-13, 09:42
Nick Burch 2011-12-13, 11:23
Antoni Mylka 2011-12-13, 13:44
Jukka Zitting 2011-12-16, 15:12
Antoni Mylka 2011-12-16, 18:45
Jukka Zitting 2011-12-16, 19:32
Antoni Mylka 2011-12-16, 20:27
Antoni Mylka 2011-12-16, 19:04
Jukka Zitting 2011-12-16, 19:36
Nick Burch 2011-12-23, 07:37
Mattmann, Chris A 2011-12-13, 15:16
Re: Pushing parsers upstream
Michael McCandless 2011-12-13, 17:05
+0
I agree, logically, parsers "belong" with their upstream project, since
as that project improves how the document format is cracked, they can
also make the matching fixes to Tika's parser.  As long as there's
enough love / advocate / testing for the Tika parser in that project...

My only concern is the possible added latency in getting parser-only
fixes out to Tika's users.

Ie, once a parser is upstream, if there's a fix that would only require
a change to the parser's source code (say we open up control over
another PDFBox option, or workaround an issue in PDFBox), PDFBox must
fix it, then release, then Tika must upgrade, then Tika must release.

It's true users could directly upgrade their PDFBox w/o waiting for a
Tika release but I suspect most users don't do that...

Vs today, where we just fix & release Tika directly.

Would it somehow be possible for Tika to ship an unreleased PDFBox? Or
does Maven fully tie our hands here?

Mike McCandless

http://blog.mikemccandless.com

On Tue, Dec 13, 2011 at 10:16 AM, Mattmann, Chris A (388J)
<[EMAIL PROTECTED]> wrote:
> Hey Jukka,
>
> For places like POI and PDFBox I think this could definitely work. And then for
> places where we have Parsers, but aren't ready to push upstream yet (I can
> think of two examples of this relevant to me, NetCDF/HDF and GDAL),
> we can just leave the Parser in tika-parsers I think.
>
> In this manner, what you're really suggesting is that it would be great for
> our mature Parsers to be "promoted" upstream to the communities that
> really understand the underlying Parser implementation toolkit. I think
> this makes sense to me, so long as there is a Champion or someone in
> that community willing to spend the small amount of time to learn Tika
> and its interfaces (if they haven't done so already).
>
> The net effect to the casual Tika user is nil, since we have Parser loading via
> service factories, and the only thing that'll change there is the package
> name (and potentially the class name) but it's all behind the scenes.
> The net effect to the Tika developer is that the class and package name
> changes may cause folks to have to recompile code/etc., and the
> code/unit tests/maintenance of some of the parsers would no longer
> be readily available in Tika's tika-parsers artifact, but would live
> in the tika-parser dependency library upstream.
>
> Cheers,
> Chris
>
> On Dec 13, 2011, at 1:42 AM, Jukka Zitting wrote:
>
>> Hi,
>>
>> As you know, we see a lot of questions about version mismatches (which
>> POI or PDFBox version should go with this Tika version) and there's a
>> long queue of patches that are waiting for new official releases of
>> our upstream dependencies to become available.
>>
>> To avoid this issue I propose that we start moving some of our parser
>> implementations to upstream projects. Now with Tika 1.0 out we have
>> stable Parser and Detector interfaces and related APIs that upstream
>> libraries could implement directly without us having to worry about
>> changing Tika code whenever a new version of a parser library becomes
>> available.
>>
>> This would allow our users to for example directly upgrade to a new
>> POI version without waiting for a related Tika release first.
>> Similarly, a new PDF parsing option or improvement could be
>> implemented directly in PDFBox and be usable without any code changes
>> in Tika.
>>
>> The classloading and OSGi service mechanisms we've added should make
>> such upstream Parser implementations trivially easy to use, and we
>> could still keep the dependencies in tika-parsers as a way to pull in
>> the libraries even if the relevant implementation classes would no
>> longer reside in org.apache.tika.parsers.*.
>>
>> In addition to some of the GPL libraries for which we've already done
>> this, I recently took the liberty of trying this out also with PDFBox.
>> See PDFBOX-1132 [1] for the issue where I copied the
>> org.apache.tika.pdf implementation to org.apache.pdfbox.tika. It works
Jukka Zitting 2011-12-16, 15:21
Antoni Mylka 2011-12-13, 17:34