Tika, mail # dev - improving odf / general questions on forms and deleted text


Re: improving odf / general questions on forms and deleted text
Ken Krugler 2010-09-25, 21:19
Hi Bart,

I know very little about ODF, so just some general comments below...

On Sep 25, 2010, at 7:56am, Hanssens Bart wrote:

> Hi,
>
> I'm planning to further improve the ODF support in Tika. A few questions
> though, that might also be useful for other formats:
>
> Should Tika parse deleted text? XHTML has INS and DEL, but they are to be
> used where the content is removed / inserted, while ODF stores removed
> content at the very beginning of the document (so "fixing" this will hurt
> performance, not sure if that's worth it).
> It can also be very confusing for the end user if one gets a result for
> "removed", then again, it is somewhere in the document...

If the above is similar to what you get when tracking changes in, say,
Word, then I would argue for not including the text.

My rule of thumb would be that if the text doesn't appear in "normal"  
viewing mode (whatever that means) using a typical app, then it's more  
confusing to include it.
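
To make that rule of thumb concrete, here is a minimal sketch of a SAX handler that drops everything under ODF's text:tracked-changes element, which is where the removed content at the start of the document lives. It is not Tika's actual ODF parser: the class and method names (SkipTrackedChanges, VisibleTextHandler, visibleText) are hypothetical, it uses only the JDK's SAX classes, and the sample XML is a simplified fragment rather than a full content.xml.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical illustration; not part of Tika.
public class SkipTrackedChanges {
    // ODF "text" namespace URI.
    static final String TEXT_NS = "urn:oasis:names:tc:opendocument:xmlns:text:1.0";

    // Collects character data, ignoring anything nested inside text:tracked-changes.
    static class VisibleTextHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();
        private int skipDepth = 0; // > 0 while inside a tracked-changes subtree

        @Override
        public void startElement(String uri, String local, String qName, Attributes atts) {
            boolean opensSkip = TEXT_NS.equals(uri) && "tracked-changes".equals(local);
            if (skipDepth > 0 || opensSkip) {
                skipDepth++;
            }
        }

        @Override
        public void endElement(String uri, String local, String qName) {
            if (skipDepth > 0) {
                skipDepth--;
            }
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            if (skipDepth == 0) {
                out.append(ch, start, length);
            }
        }

        String text() {
            return out.toString();
        }
    }

    // Parses the given XML and returns only the text outside tracked changes.
    static String visibleText(String xml) throws Exception {
        SAXParserFactory factory = SAXParserFactory.newInstance();
        factory.setNamespaceAware(true); // needed to match on namespace URI + local name
        SAXParser parser = factory.newSAXParser();
        VisibleTextHandler handler = new VisibleTextHandler();
        parser.parse(new InputSource(new StringReader(xml)), handler);
        return handler.text();
    }

    public static void main(String[] args) throws Exception {
        // Simplified fragment: real documents carry more attributes and metadata.
        String xml =
            "<office:text xmlns:office=\"urn:oasis:names:tc:opendocument:xmlns:office:1.0\""
            + " xmlns:text=\"urn:oasis:names:tc:opendocument:xmlns:text:1.0\">"
            + "<text:tracked-changes><text:changed-region><text:deletion>"
            + "<text:p>removed</text:p>"
            + "</text:deletion></text:changed-region></text:tracked-changes>"
            + "<text:p>kept</text:p>"
            + "</office:text>";
        System.out.println(visibleText(xml)); // prints "kept"
    }
}
```

Since the skipped subtree sits at the start of the document, a filter like this needs no buffering or reordering, so it should sidestep the performance concern about "fixing" the position of the deleted text.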

> Forms: most form elements in ODF can be mapped to their HTML counterparts,
> although I have to check if the result is always valid HTML (i.e., when ODF
> parent and form element are mapped to HTML, is the HTML form still allowed
> within the mapped parent).
> Should they be mapped to HTML forms in the first place? Or just to
> div / span?

I wouldn't worry about trying to map explicitly to HTML forms -  
capturing the text is 99% of the value here, versus trying to maintain  
greater logical consistency between ODF and XHTML.

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g