Re: improving odf / general questions on forms and deleted text
Ken Krugler 2010-09-25, 21:19
I know very little about ODF, so just some general comments below...
On Sep 25, 2010, at 7:56am, Hanssens Bart wrote:
> I'm planning to further improve the ODF support in Tika. A few questions
> though, that might also be useful for other formats:
> Should Tika parse deleted text? XHTML has INS and DEL, but they are to be
> used where the content is removed / inserted, while ODF stores removed
> content at the very beginning of the document (so "fixing" this will hurt
> performance, not sure if that's worth it).
> It can also be very confusing for the end user if one gets a result for
> "removed"; then again, it is somewhere in the document...
If the above is similar to what you get when tracking changes in, say,
Word, then I would argue for not including the text.
My rule of thumb would be that if the text doesn't appear in "normal"
viewing mode (whatever that means) using a typical app, then it's more
confusing to include it.
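
As a rough illustration of leaving that text out, here is a minimal sketch of a SAX handler that drops everything inside the ODF text:tracked-changes element and keeps the rest. It assumes the removed content sits inside that element in content.xml, as the question describes. The class name VisibleTextExtractor and its extract method are made up for the example and are not part of Tika.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a Tika class: collects character data
// while skipping the whole <text:tracked-changes> subtree.
class VisibleTextExtractor extends DefaultHandler {
    // ODF text namespace (assumed to be the one used by content.xml).
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    private final StringBuilder out = new StringBuilder();
    private int skipDepth = 0; // > 0 while inside a tracked-changes subtree

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        // Start skipping at tracked-changes, and keep counting nested elements.
        if (skipDepth > 0 || (TEXT_NS.equals(uri) && "tracked-changes".equals(local))) {
            skipDepth++;
        }
    }

    @Override
    public void endElement(String uri, String local, String qName) {
        if (skipDepth > 0) {
            skipDepth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // Only text outside the skipped subtree is kept.
        if (skipDepth == 0) {
            out.append(ch, start, length);
        }
    }

    // Parses an XML string and returns the text outside tracked changes.
    static String extract(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            VisibleTextExtractor handler = new VisibleTextExtractor();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.out.toString();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Because the skip happens during the normal streaming pass, this approach should not need the extra work of moving the removed content back to where it was deleted.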
> Forms: most form elements in ODF can be mapped to their HTML counterparts,
> although I have to check if the result is always valid HTML (i.e., when
> ODF parent and form element are mapped to HTML, is the HTML form still
> allowed within the mapped parent).
> Should they be mapped to HTML forms in the first place? Or just to
> div / span?
I wouldn't worry about trying to map explicitly to HTML forms -
capturing the text is 99% of the value here, versus trying to maintain
greater logical consistency between ODF and XHTML.
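
To make the "just capture the text" option concrete, here is a small sketch that pulls the human-readable strings out of ODF form controls and ignores the form structure. It assumes the controls carry their text in form:label and form:current-value attributes; other attributes may matter too, so treat the list as illustrative. FormTextCollector is a made-up name, not a Tika class.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not a Tika class: gathers label and value text
// from elements in the ODF form namespace.
class FormTextCollector extends DefaultHandler {
    // ODF form namespace (assumed to be the one used by content.xml).
    static final String FORM_NS = "urn:oasis:names:tc:opendocument:xmlns:form:1.0";
    // Attributes assumed to hold the visible text of a control.
    static final String[] TEXT_ATTRS = {"label", "current-value"};

    private final List<String> texts = new ArrayList<>();

    @Override
    public void startElement(String uri, String local, String qName, Attributes atts) {
        if (!FORM_NS.equals(uri)) {
            return; // not a form element
        }
        for (String name : TEXT_ATTRS) {
            String value = atts.getValue(FORM_NS, name);
            if (value != null && !value.isEmpty()) {
                texts.add(value);
            }
        }
    }

    // Parses an XML string and returns the collected strings in document order.
    static List<String> collect(String xml) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            factory.setNamespaceAware(true);
            FormTextCollector handler = new FormTextCollector();
            factory.newSAXParser().parse(new InputSource(new StringReader(xml)), handler);
            return handler.texts;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}
```

Each collected string could then be emitted as the content of a plain div or span, which sidesteps the question of whether an HTML form is valid inside the mapped parent.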