Limitations of iWork parsing
Gabriel Valencia 2012-04-24, 23:52
Hi all,

We have been using Tika to parse iWork files, but have found many limitations and potential bugs. Here is a sampling:

* Header text, footer text, and embedded text boxes are not parsed.
* Pages docs created in Layout mode are not parsed at all. Only the metadata is extracted.
* Text box text in Keynote slides is extracted, but the text of all the boxes is lumped together without any spaces.
* Password-protected files throw an NPE.

Is there any work in progress or planned to improve the parsing of iWork files? Or are improvements made only as defects are opened?

--
Gabriel Valencia
Software Development for IBM Content Integrator, IBM Content Analytics, and IBM Content and Predictive Analytics
[EMAIL PROTECTED]
Tel: 408-463-4133 TL: 543-4133
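
To make the Keynote text box item concrete, here is a minimal sketch of the behaviour we would expect: a SAX handler that emits a space between the contents of consecutive text boxes instead of concatenating them. This is not Tika's code. The class name, the extractText helper, and the "text-box" element name are all hypothetical and do not reflect the real Keynote schema; only the JDK's built-in SAX classes are used.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical sketch: element names are illustrative, not the Keynote format.
class TextBoxSeparatorSketch {

    // Collects character data inside text boxes and separates boxes with a space.
    static class BoxTextHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        private final String boxElement;
        private boolean inBox = false;

        BoxTextHandler(String boxElement) {
            this.boxElement = boxElement;
        }

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            if (boxElement.equals(qName)) {
                // Insert a separator before every box except the first non-empty output.
                if (out.length() > 0) {
                    out.append(' ');
                }
                inBox = true;
            }
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            if (boxElement.equals(qName)) {
                inBox = false;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Only text inside a box element is kept.
            if (inBox) {
                out.append(ch, start, length);
            }
        }

        String text() {
            return out.toString();
        }
    }

    // Parses the given XML string and returns the box text joined by spaces.
    static String extractText(String xml, String boxElement) throws Exception {
        BoxTextHandler handler = new BoxTextHandler(boxElement);
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        String xml = "<slide><text-box>Title</text-box><text-box>Subtitle</text-box></slide>";
        System.out.println(extractText(xml, "text-box"));
    }
}
```

With the sample input in main, the two boxes come out as separate words rather than as one run of characters, which is the result we are not currently seeing for Keynote slides.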