Limitations of iWork parsing
Gabriel Valencia 2012-04-24, 23:52
We have been using Tika to parse iWork files, but have found many
limitations and potential bugs. Here is a sampling:
* Things like header and footer text and embedded text boxes are not
extracted.
* Pages docs created in Layout mode are not parsed at all. Only the
metadata is extracted.
* Text box text in Keynote slides is extracted, but the text of all the
boxes is run together with no whitespace between boxes.
* Password-protected files throw a NullPointerException (NPE).
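
To illustrate the Keynote text box symptom, here is a minimal sketch using
only the JDK's SAX parser. It is not Tika's code, and the <slide> and
<textbox> element names are hypothetical stand-ins, not the real Keynote
schema. It only shows how a handler that appends character data without
emitting anything at element boundaries runs the boxes together, and how
adding a separator at the end of each box avoids that.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextBoxSketch {
    // Hypothetical markup for illustration; not the actual Keynote format.
    static final String SAMPLE =
        "<slide><textbox>Title</textbox><textbox>First point</textbox></slide>";

    /**
     * Collects all character data from the XML. When separateBoxes is true,
     * a newline is appended after each (hypothetical) <textbox> element.
     */
    static String extract(String xml, boolean separateBoxes) throws Exception {
        StringBuilder out = new StringBuilder();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                // Without this separator, adjacent boxes are concatenated.
                if (separateBoxes && "textbox".equals(qName)) {
                    out.append('\n');
                }
            }
        };
        SAXParserFactory.newInstance().newSAXParser()
            .parse(new InputSource(new StringReader(xml)), handler);
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(extract(SAMPLE, false)); // boxes run together
        System.out.println(extract(SAMPLE, true));  // one box per line
    }
}
```

With separateBoxes set to false the two boxes come out as one run of text,
which matches what we see from Keynote files today.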
Is there any work in progress or planned to improve the parsing of iWork
files? Or will improvements be made only as defects are opened?
Software Development for IBM Content Integrator, IBM Content Analytics, and
IBM Content and Predictive Analytics
Tel: 408-463-4133 TL: 543-4133