Pushing parsers upstream
Jukka Zitting, 2011-12-13, 09:42

Hi,

As you know, we see a lot of questions about version mismatches (which POI or PDFBox version should go with this Tika version), and there's a long queue of patches waiting for new official releases of our upstream dependencies to become available.

To avoid this issue I propose that we start moving some of our parser implementations to upstream projects. Now with Tika 1.0 out we have stable Parser and Detector interfaces and related APIs that upstream libraries could implement directly, without us having to worry about changing Tika code whenever a new version of a parser library becomes available.

This would allow our users to, for example, directly upgrade to a new POI version without waiting for a related Tika release first. Similarly, a new PDF parsing option or improvement could be implemented directly in PDFBox and be usable without any code changes in Tika.

The classloading and OSGi service mechanisms we've added should make such upstream Parser implementations trivially easy to use, and we could still keep the dependencies in tika-parsers as a way to pull in the libraries even if the relevant implementation classes would no longer reside in org.apache.tika.parser.*.

In addition to some of the GPL libraries for which we've already done this, I recently took the liberty of trying this out also with PDFBox. See PDFBOX-1132 [1] for the issue where I copied the org.apache.tika.parser.pdf implementation to org.apache.pdfbox.tika. It works without problems, so now I'd like to propose that we copy any more recent PDF parser changes to PDFBox and prepare to drop the parser implementation in tika-parsers. Any further PDF parser work should then be done directly in PDFBox. I haven't yet talked about this with the PDFBox PMC (of which I'm a member), but I suppose we should be able to come up with an arrangement where Tika committers can commit directly to the Tika parser implementation in PDFBox.

It would be cool if we could do the same thing also with POI.

WDYT?

[1] https://issues.apache.org/jira/browse/PDFBOX-1132

BR,

Jukka Zitting
Re: Pushing parsers upstream
Nick Burch, 2011-12-13, 11:23

On Tue, 13 Dec 2011, Jukka Zitting wrote:
> To avoid this issue I propose that we start moving some of our parser implementations to upstream projects. Now with Tika 1.0 out we have stable Parser and Detector interfaces and related APIs that upstream libraries could implement directly without us having to worry about changing Tika code whenever a new version of a parser library becomes available.

A couple of issues do spring to mind with this plan:
* Metadata keys - if a parser enhancement or new feature needs a new metadata key, then you end up having to wait for a new Tika release to get it (so that you can release the code that uses it)
* Consistency - both markup and metadata keys will be harder to keep consistent when they aren't in the same codebase

For detectors, there's an extra issue here. At the moment, both the Zip and OLE2 detectors handle more than just the POI formats, and in the Zip case rely on code shared between the parsers (poi+keynote) and the detector. How would this work if the container detectors were handed to POI? And whose job would it be to test it?

That's a general thing actually: how much testing would need to remain on the Tika side?

Oh, but I guess this counts as your answer on what I should be doing with my Ogg Vorbis parser :)

Nick
Re: Pushing parsers upstream
Antoni Mylka, 2011-12-13, 13:44

On 2011-12-13 12:23, Nick Burch wrote:
> On Tue, 13 Dec 2011, Jukka Zitting wrote:
>> To avoid this issue I propose that we start moving some of our parser implementations to upstream projects. Now with Tika 1.0 out we have stable Parser and Detector interfaces and related APIs that upstream libraries could implement directly without us having to worry about changing Tika code whenever a new version of a parser library becomes available.
>
> A couple of issues do spring to mind with this plan:
> * Metadata keys - if a parser enhancement or new feature needs a new metadata key, then you end up having to wait for a new Tika release to get it (so that you can release the code that uses it)

What's wrong with using plain strings in upstream parsers until appropriate constants in TikaMetadataKeys become available?

> * Consistency - both markup and metadata keys will be harder to keep consistent when they aren't in the same codebase

Probably, though the benefits are huge.

> For detectors, there's an extra issue here. At the moment, both the Zip and OLE2 detectors handle more than just the POI formats, and in the Zip case rely on code shared between the parsers (poi+keynote) and the detector. How would this work if the container detectors were handed to POI? And whose job would it be to test it?

The same people who now rely on it: the community, helped by a detailed test suite.

> That's a general thing actually: how much testing would need to remain on the Tika side?

Dunno. There is no official policy in this regard, is there? The ASF guarantees that a release is OK from the legal POV, has reasonable (released, available, properly licensed) dependencies, and that the unit tests pass. Regression testing is done on a "best effort" basis anyway, and from my POV there is no difference in effort whether the detectors are in POI or in Tika. Is there any?

In Aperture we sidestepped this problem by pushing non-released versions of POI, PDFBox and other libraries to our own repository and depending on them. Sometimes these were "vanilla" trunks, sometimes trunks with my patches. See for instance http://aperture.sourceforge.net/maven/org/apache/poi/poi/

This would clearly work for an "internal" project, but didn't work too well for an open source project. It also takes lots of work. With Tika such a solution is impossible for a number of reasons, and pushing parsers upstream sounds like a great alternative:
* a way to allow such cherry-picking of dependency trunks to take place in-house, when the need arises, without having to do it in public.
* a way to ensure "graceful degradation" of Tika functionality when the libraries are missing, without ugly ClassNotFoundException or NoClassDefFoundError failures (probably the only reliable way).

I'm all for it.

Antoni Mylka
[EMAIL PROTECTED]
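
To make the "graceful degradation" point concrete, here is a minimal sketch in plain Java. It is not Tika's actual loading code; the class `OptionalParserLoader`, the method `loadAvailable` and the class name `org.example.MissingPdfParser` are hypothetical, and `java.util.ArrayList` only stands in for a parser whose library happens to be on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: load optional parser classes by name and skip those whose libraries are absent. */
final class OptionalParserLoader {

    /** Returns an instance of every named class that can be loaded and instantiated. */
    static List<Object> loadAvailable(List<String> classNames) {
        List<Object> loaded = new ArrayList<>();
        for (String name : classNames) {
            try {
                Class<?> type = Class.forName(name);
                loaded.add(type.getDeclaredConstructor().newInstance());
            } catch (ReflectiveOperationException | LinkageError e) {
                // The class or one of its dependencies is not available:
                // skip this parser instead of failing the whole application.
            }
        }
        return loaded;
    }

    public static void main(String[] args) {
        List<Object> parsers = loadAvailable(List.of(
                "java.util.ArrayList",             // stand-in for a parser that is present
                "org.example.MissingPdfParser"));  // hypothetical class that is absent
        System.out.println(parsers.size() + " parser(s) available");
    }
}
```

The point of the sketch is only that the caller never sees the loading failure; a real integration layer would presumably also want to log which parsers were skipped.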
Re: Pushing parsers upstream
Jukka Zitting, 2011-12-16, 15:12

Hi,

On Tue, Dec 13, 2011 at 12:23 PM, Nick Burch <[EMAIL PROTECTED]> wrote:
> A couple of issues do spring to mind with this plan:

Good points.

> * Metadata keys - if a parser enhancement or new feature needs a new metadata key, then you end up having to wait for a new Tika release to get it (so that you can release the code that uses it)

As mentioned by Antoni, in the end the metadata keys are just strings, so with a little coordination we don't need to delay the introduction of new keys over multiple releases.

More generally though, I think it would make sense over time to have tika-core maintain a shared set of metadata keys (Dublin Core, xmpDM, etc.) that aren't directly tied to any specific parser or file format. Format-specific keys like the ones we now have in the MSOffice interface would be better kept next to the actual parser implementation. That way, as long as the generic metadata keys in tika-core are more or less complete (i.e. cover all of the key metadata standards), a parser implementation that wants to introduce a new custom metadata key should rarely need changes in the rest of Tika.

> * Consistency - both markup and metadata keys will be harder to keep consistent when they aren't in the same codebase

Yep, that can be a problem. I guess the ultimate solution to this would be to come up with a well-documented definition of what a parser should ideally output for specific kinds of content, but that's quite a bit of work.

A partial solution could be the kind of shared committership model I was proposing. Then a single committer who wants to increase the level of consistency should be able to do so without worrying about karma boundaries.

> For detectors, there's an extra issue here. At the moment, both the Zip and OLE2 detectors handle more than just the POI formats, and in the Zip case rely on code shared between the parsers (poi+keynote) and the detector. How would this work if the container detectors were handed to POI?

I guess this would require some level of code duplication, i.e. having a Zip detector in POI that knows about OOXML types, and another in tika-parsers that knows about other types of Zips.

> And whose job would it be to test it? That's a general thing actually: how much testing would need to remain on the Tika side?

I'd still have the upstream libraries as dependencies of tika-parsers, and we definitely should continue maintaining a good set of integration tests there. On the other hand we already have many tests that actually test against issues in upstream parser libraries instead of any code in Tika, and I think those tests would be better located in the upstream projects. Ultimately test cases should go with the issues where particular problems or wishes were expressed.

> Oh, but I guess this counts as your answer on what I should be doing with my Ogg Vorbis parser :)

:-) Yep, in a way.

From the beginning the idea behind Tika has been that we should focus on being a thin integration layer on top of existing parser libraries. The fact that we're now implementing quite a few parsers ourselves, and the large amount of code we use to wrap especially POI and to a lesser degree PDFBox, is a bit of a concern to me. We could and should be pushing more of this work to places where it would also be useful to people who aren't using Tika.

There are many people who'd likely benefit from, for example, a good RTF or Ogg Vorbis parser but who don't really need Tika. Being able to get such people to use and contribute to the code we've written would indirectly help Tika too. Attracting such users and contributions is hard if the code lives only inside Tika.

Similarly, many bits and pieces in especially our bigger parser classes, like those for POI and PDFBox, would also be useful within the context of the upstream libraries. For example I could easily see the character run handling code in WordExtractor, the sparse sheet capturing and rendering code in ExcelExtractor, or the annotation handling code in PDF2XHTML becoming a more generally applicable part of the upstream libraries.

So while having all this code in Tika makes it easy for us to maintain consistency and rapid evolution in Tika, it introduces a barrier to making the work we do useful to a wider audience, and thus ultimately reduces the rate of useful contributions we can expect. During Tika 0.x I think the tradeoff favored focusing our work on Tika itself, but now with stable 1.0 APIs I think the time may be ripe to start reducing the size of tika-parsers (which has been growing quite a lot, see [1]).

[1] https://www.ohloh.net/p/tika/analyses/latest

BR,

Jukka Zitting
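
The split between shared and format-specific metadata keys discussed in this message could look roughly like the following sketch. It deliberately uses a plain `Map` rather than Tika's own metadata classes; `CoreKeys`, `PdfKeys` and the key name `pdf:version` are hypothetical names chosen for illustration, while `dc:title` and `dc:creator` follow the usual Dublin Core prefix convention.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch: generic keys live in a shared module, format-specific keys live beside the parser. */
final class MetadataKeySketch {

    /** Shared, format-independent keys that a core module could own. */
    static final class CoreKeys {
        static final String TITLE = "dc:title";
        static final String CREATOR = "dc:creator";
    }

    /** Format-specific keys kept next to a (hypothetical) upstream PDF parser. */
    static final class PdfKeys {
        static final String PDF_VERSION = "pdf:version"; // hypothetical key name
    }

    /** Stand-in for a parser filling in metadata; keys are ordinary strings. */
    static Map<String, String> describe(String title, String creator, String pdfVersion) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put(CoreKeys.TITLE, title);
        metadata.put(CoreKeys.CREATOR, creator);
        metadata.put(PdfKeys.PDF_VERSION, pdfVersion);
        return metadata;
    }

    public static void main(String[] args) {
        System.out.println(describe("Annual report", "Jane Doe", "1.4"));
    }
}
```

Because the format-specific constant is defined by the parser itself, adding it would not require a release of the shared module, which is the property the message argues for.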
Re: Pushing parsers upstream
Antoni Mylka, 2011-12-16, 18:45

On 2011-12-16 16:12, Jukka Zitting wrote:
>> And whose job would it be to test it? That's a general thing actually: how much testing would need to remain on the Tika side?
>
> I'd still have the upstream libraries as dependencies of tika-parsers, and we definitely should continue maintaining a good set of integration tests there. On the other hand we already have many tests that actually test against issues in upstream parser libraries instead of any code in Tika, and I think those tests would be better located in the upstream projects. Ultimately test cases should go with the issues where particular problems or wishes were expressed.

The moment upstream libraries start depending on tika-core, they stop being upstream libraries and become "side-stream" libraries. Putting POI between core and parsers in the dependency chain will bring all sorts of issues due to independent release cycles.

Therefore I think we should drop the dependency from tika-parsers to POI, maintain integration tests in some other (new) maven module below parsers, poi and pdfbox, and expose some tika-integration pom, which will depend on tika-core, tika-parsers, and the latest-and-greatest versions of poi and pdfbox compatible with the given core version.

The tika-integration pom could be updated after each release of an external parser. All tutorials could then point out that you need to add one dependency to your pom and that's tika-integration, e.g. with scope "import".

In general, pushing parsers "upstream" brings:
- graceful degradation with missing dependencies
- the ability to use a later pdfbox without updating tika
- the "social" benefits of putting that code closer to the people who'll know most about how to make it work

But:
- the contract between core and parsers will have to be super-rigid. Now we can allow ourselves to say that core and parser jars must be of the same version. With upstream parsers, it will be more difficult. This applies to utils, common abstract classes etc. We'll need to look out for two cases:
  - new pdfbox will not work with old versions of tika
  - when we release a new tika version, old pdfbox may not work with it until the next release (this assumes that tika-parsers doesn't depend on pdfbox, because otherwise we're in trouble)
- we'll have to bring a bit more complexity to the module setup

I still feel it's worth it though. WDYT?

Antoni Mylka
[EMAIL PROTECTED]
Re: Pushing parsers upstream
Jukka Zitting, 2011-12-16, 19:32

Hi,

On Fri, Dec 16, 2011 at 7:45 PM, Antoni Mylka <[EMAIL PROTECTED]> wrote:
> The moment upstream libraries start depending on tika-core, they stop being upstream libraries and become "side-stream" libraries. Putting POI between core and parsers in the dependency chain will bring all sorts of issues due to independent release cycles.

What issues? As long as we maintain proper backwards compatibility in tika-core (we already have a clirr configuration to verify this automatically) there should be no problems with independent release cycles.

> Therefore I think we should drop the dependency from tika-parsers to POI, maintain integration tests in some other (new) maven module below parsers, poi and pdfbox, and expose some tika-integration pom, which will depend on tika-core, tika-parsers, and the latest-and-greatest versions of poi and pdfbox compatible with the given core version.

The tika-parsers component can already be used like this. The setup I'm proposing has upstream parsers depending on tika-core, not tika-parsers.

> - the contract between core and parsers will have to be super-rigid. Now we can allow ourselves to say that core and parser jars must be of the same version. With upstream parsers, it will be more difficult. This applies to utils, common abstract classes etc. We'll need to look out for two cases:
>   - new pdfbox will not work with old versions of tika

It will, as long as it's written against the 1.0 release instead of a more recent 1.x version. If pdfbox explicitly needs a more recent Tika version, then obviously it won't work with an older release, but such cases should be fairly rare and clearly documented in the relevant release notes or POM dependency settings.

>   - when we release a new tika version, old pdfbox may not work with it until the next release

We're explicitly committed to maintaining backwards compatibility (see https://issues.apache.org/jira/browse/TIKA-699) until Tika 2.0, so any case where a new Tika release breaks an existing upstream parser should be treated as a bug and fixed.

BR,

Jukka Zitting
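
The compatibility rule described in this reply (a parser written against 1.0 keeps working on every later 1.x core, but not necessarily on 2.0, and a parser that needs a newer 1.x will not run on an older one) can be written down as a small check. This is only a sketch of the stated policy, not code from Tika; the class `CoreCompatibility` and its methods are hypothetical, and versions are assumed to be purely numeric "major.minor" strings.

```java
/** Sketch of the backwards-compatibility policy within one major version. */
final class CoreCompatibility {

    /**
     * True when a parser compiled against the "required" core version can run on the
     * "runtime" core version: same major version, and a runtime minor that is not older.
     */
    static boolean isCompatible(String required, String runtime) {
        int[] req = parse(required);
        int[] run = parse(runtime);
        return req[0] == run[0] && run[1] >= req[1];
    }

    /** Parses the leading "major.minor" of a numeric dotted version; later parts are ignored. */
    private static int[] parse(String version) {
        String[] parts = version.split("\\.");
        return new int[] { Integer.parseInt(parts[0]), Integer.parseInt(parts[1]) };
    }

    public static void main(String[] args) {
        System.out.println(isCompatible("1.0", "1.3")); // parser built against 1.0, core 1.3
        System.out.println(isCompatible("1.2", "1.0")); // parser needs 1.2, core is older
        System.out.println(isCompatible("1.0", "2.0")); // no guarantee across major versions
    }
}
```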
Re: Pushing parsers upstream
Antoni Mylka, 2011-12-16, 20:27

On 2011-12-16 20:32, Jukka Zitting wrote:
> Hi,
>
> On Fri, Dec 16, 2011 at 7:45 PM, Antoni Mylka <[EMAIL PROTECTED]> wrote:
>> The moment upstream libraries start depending on tika-core, they stop being upstream libraries and become "side-stream" libraries. Putting POI between core and parsers in the dependency chain will bring all sorts of issues due to independent release cycles.
>
> What issues? As long as we maintain proper backwards compatibility in tika-core (we already have a clirr configuration to verify this automatically) there should be no problems with independent release cycles.

Dunno, maybe I'm overreacting. I had two issues in mind.

1. Incompatible changes in core which require adjustment of parsers. This is an API vs. SPI question, where the user-level API is set in stone while the service implementor-level SPI is more flexible. Right now such tricks are possible; with POI outside parsers they would be possible; with POI between core and parsers they would be effectively impossible, as each one would introduce a release deadlock. Since we are committed to compatibility from all sides and make no distinction between API and SPI policies, it's impossible anyway and this is a non-issue.

2. Exposing core-level, parser-related improvements to the general public. Right now each parser may or may not implement support for EmbeddedDocumentExtractor, or for DocumentSelector. I can imagine expanding parsers with support for additional hooks like these, for instance a password list for all parsers to try before giving up on an encrypted document (doc, docx, pdf, zip etc.). With the scenario you're proposing, exposing such functionality to the general public will require two tika releases, not one:
   1. release tika with that hook
   2. release pdfbox with a parser making use of that hook
   3. release tika with the new pdfbox

With pdfbox outside parsers, step 3 wouldn't be necessary, but on second thought the user will still be able to exclude the bundled version of pdfbox and call a new one in their app, with exactly the same effect. Moreover, such cases are likely to be rare.

So I guess I've refuted my own arguments :).

Antoni Mylka
[EMAIL PROTECTED]
Re: Pushing parsers upstream
Antoni Mylka, 2011-12-16, 19:04

On 2011-12-16 16:12, Jukka Zitting wrote:
>> * Consistency - both markup and metadata keys will be harder to keep consistent when they aren't in the same codebase
>
> Yep, that can be a problem. I guess the ultimate solution to this would be to come up with a well-documented definition of what a parser should ideally output for specific kinds of content, but that's quite a bit of work.

There are (at least) two efforts to create a "well-documented definition of what a parser should ideally output for specific kinds of content". One is shared-desktop-ontologies, spearheaded by Sebastian Trueg from KDE (disclaimer: I was involved in the early stages of this in 2007-2008). It lives at oscaf.sf.net. The second is XMP.

I don't want to start new flames, and I understand that the current status quo is probably the best possible given all requirements, yet let's not get carried away creating yet another ultimate solution.

Antoni Mylka
[EMAIL PROTECTED]
Re: Pushing parsers upstream
Jukka Zitting, 2011-12-16, 19:36

Hi,

On Fri, Dec 16, 2011 at 8:04 PM, Antoni Mylka <[EMAIL PROTECTED]> wrote:
> I don't want to start new flames, and I understand that the current status quo is probably the best possible given all requirements, yet let's not get carried away creating yet another ultimate solution.

I was just thinking of guidance such as "a parser should preferably use XMP schemas when exposing metadata", not of inventing our own schemas.

BR,

Jukka Zitting
Re: Pushing parsers upstream
Nick Burch, 2011-12-23, 07:37

On 16/12/11 15:12, Jukka Zitting wrote:
> As mentioned by Antoni, in the end the metadata keys are just strings, so with a little coordination we don't need to delay the introduction of new keys over multiple releases.

Hmm, they're not quite just strings: with the new Property stuff they can also have validation. I think, however, that having a parser temporarily include its own copy of a definition shouldn't be the end of the world.

> More generally though, I think it would make sense over time to have tika-core maintain a shared set of metadata keys (Dublin Core, xmpDM, etc.) that aren't directly tied to any specific parser or file format. Format-specific keys like the ones we now have in the MSOffice interface

Ah, that MSOffice one is now badly named: lots of the other parsers make use of keys that it provides. We should maybe rename it to something more general, to indicate that it relates to most productivity document formats.

In general though, I agree that re-using an existing defined key name (e.g. XMP where it covers it) makes sense. At the very least, it avoids the work of coming up with a name, and you get the documentation for the entry for free :)

> That way, as long as the generic metadata keys in tika-core are more or less complete (i.e. cover all of the key metadata standards), there should be little need for a parser implementation to need changes in the rest of Tika if it wants to introduce a new custom metadata key.

I think we're not quite there yet though, so for at least the next year (at a guess) we're going to need to be adding new keys and rationalising existing ones.

>> * Consistency - both markup and metadata keys will be harder to keep consistent when they aren't in the same codebase
>
> Yep, that can be a problem. I guess the ultimate solution to this would be to come up with a well-documented definition of what a parser should ideally output for specific kinds of content, but that's quite a bit of work.

Possibly we could use some tooling to identify the differences, then have a periodic check to ensure things haven't got worse. My hunch is that this shouldn't be too hard to set up, but I'm not volunteering to do it...!

>> For detectors, there's an extra issue here. At the moment, both the Zip and OLE2 detectors handle more than just the POI formats, and in the Zip case rely on code shared between the parsers (poi+keynote) and the detector. How would this work if the container detectors were handed to POI?
>
> I guess this would require some level of code duplication, i.e. having a Zip detector in POI that knows about OOXML types, and another in tika-parsers that knows about other types of Zips.

Hmm, I'd rather we didn't have too much duplication. I think this might end up with quite a bit, and would need quite a lot of testing to ensure things worked well. Potentially we could end up with something like 5 Zip-based detectors in that model, such as:
* OOXML one, in POI (needs POI bits)
* iWorks one, in a future iWorks library (needs iWorks parser bits)
* ODF one, in ODFToolkit (needs ODF bits)
* Core Tika one (zip, jar, war etc)

At that point maybe we need a zip detector plugin model...

(The OLE2 case is fine: because the detector is powered by POIFS, OLE2 formats not supported by POI are probably best detected by code within POI anyway.)

Nick
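
One possible shape for the "zip detector plugin model" floated above, as a rough sketch only. The interface `ZipContainerPlugin`, the class `ZipDetectorSketch` and the two plugins are hypothetical and are not part of Tika or POI; the plugins look only at entry names (OOXML packages carry a `[Content_Types].xml` entry, jars usually carry `META-INF/MANIFEST.MF`), and the returned media type strings are illustrative.

```java
import java.util.List;
import java.util.Optional;
import java.util.Set;

/** Sketch: one core Zip detector that delegates to plugins shipped by upstream libraries. */
final class ZipDetectorSketch {

    /** Hypothetical plugin contract: inspect the Zip entry names and optionally claim the file. */
    interface ZipContainerPlugin {
        Optional<String> detect(Set<String> entryNames);
    }

    /** A plugin an OOXML library could ship. */
    static final ZipContainerPlugin OOXML = names ->
            names.contains("[Content_Types].xml")
                    ? Optional.of("application/x-tika-ooxml")
                    : Optional.empty();

    /** A plugin the core could keep for plain Java archives. */
    static final ZipContainerPlugin JAR = names ->
            names.contains("META-INF/MANIFEST.MF")
                    ? Optional.of("application/java-archive")
                    : Optional.empty();

    /** Asks each plugin in order; falls back to generic Zip when nobody claims the file. */
    static String detect(Set<String> entryNames, List<ZipContainerPlugin> plugins) {
        for (ZipContainerPlugin plugin : plugins) {
            Optional<String> type = plugin.detect(entryNames);
            if (type.isPresent()) {
                return type.get();
            }
        }
        return "application/zip";
    }

    public static void main(String[] args) {
        List<ZipContainerPlugin> plugins = List.of(OOXML, JAR);
        System.out.println(detect(Set.of("[Content_Types].xml", "word/document.xml"), plugins));
        System.out.println(detect(Set.of("notes.txt"), plugins));
    }
}
```

With this shape the Zip file is opened once by the core detector, and each library contributes only the knowledge about its own formats, which is one way to avoid the duplication worried about above.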
Re: Pushing parsers upstream
Mattmann, Chris A, 2011-12-13, 15:16

Hey Jukka,

For places like POI and PDFBox I think this could definitely work. And then for places where we have Parsers, but aren't ready to push upstream yet (I can think of two examples of this relevant to me, NetCDF/HDF and GDAL), we can just leave the Parser in tika-parsers I think.

In this manner, what you're really suggesting is that it would be great for our mature Parsers to be "promoted" upstream to the communities that really understand the underlying Parser implementation toolkit. I think this makes sense to me, so long as there is a Champion or someone in that community willing to spend the small amount of time to learn Tika and its interfaces (if they haven't done so already).

The net effect to the casual Tika user is nil, since we have Parser loading via service factories, and the only thing that'll change there is the package name (and potentially the class name) but it's all behind the scenes. The net effect to the Tika developer is that the class and package name changes may cause folks to have to recompile code/etc., and the code/unit tests/maintenance of some of the parsers would no longer be readily available in Tika's tika-parsers artifact, but would live in the tika-parser dependency library upstream.

Cheers,
Chris

Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [EMAIL PROTECTED]
WWW: http://sunset.usc.edu/~mattmann/
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
Re: Pushing parsers upstream
Michael McCandless, 2011-12-13, 17:05

+0

I agree, logically, parsers "belong" with their upstream project, since as that project improves how the document format is cracked, they can also make the matching fixes to Tika's parser. As long as there's enough love / advocate / testing for the Tika parser in that project...

My only concern is the possible added latency in getting parser-only fixes out to Tika's users.

Ie, once a parser is upstream, if there's a fix that would only require a change to the parser's source code (say we open up control over another PDFBox option, or workaround an issue in PDFBox), PDFBox must fix it, then release, then Tika must upgrade, then Tika must release. It's true users could directly upgrade their PDFBox w/o waiting for a Tika release but I suspect most users don't do that...

Vs today, where we just fix & release Tika directly.

Would it somehow be possible for Tika to ship an unreleased PDFBox? Or does Maven fully tie our hands here?

Mike McCandless
http://blog.mikemccandless.com
Re: Pushing parsers upstream
Jukka Zitting, 2011-12-16, 15:21

Hi,

On Tue, Dec 13, 2011 at 6:05 PM, Michael McCandless <[EMAIL PROTECTED]> wrote:
> It's true users could directly upgrade their PDFBox w/o waiting for a Tika release but I suspect most users don't do that...

Currently people don't do that because it's so easy to break things by upgrading a parser library out of sync with Tika. We've even been actively discouraging people from selectively upgrading parser libraries to avoid such problems.

With my proposal this problem would no longer apply, and we could actually start proactively telling people that they can and should try upgrading the relevant parser libraries if they face problems with a particular document.

BR,

Jukka Zitting
Re: Pushing parsers upstream
Antoni Mylka, 2011-12-13, 17:34

On 2011-12-13 18:05, Michael McCandless wrote:
> Would it somehow be possible for Tika to ship an unreleased PDFBox? Or does Maven fully tie our hands here?

That's the issue. Would it? AFAIU it's impossible. Tika can only depend on jars in maven central.

Is it possible to push a snapshot jar to maven central (and label it with a version number which includes the date or something)? There are such jars, but how does it look in practice? Who decides whether a jar can or cannot be uploaded?

Antoni Mylka
[EMAIL PROTECTED]