Tika, mail # dev - Issue in text extraction in Solr / Tika


RE: Issue in text extraction in Solr / Tika
Uwe Schindler 2011-08-20, 16:11
Does it really add this newline? That would be strange. If you look at
XHTMLContentHandler, it does not add one, so the newline must come from
somewhere else.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: [EMAIL PROTECTED]
> -----Original Message-----
> From: Michael McCandless [mailto:[EMAIL PROTECTED]]
> Sent: Saturday, August 20, 2011 5:33 PM
> To: [EMAIL PROTECTED]
> Subject: Re: Issue in text extraction in Solr / Tika
>
> On Sat, Aug 20, 2011 at 10:19 AM, Uwe Schindler <[EMAIL PROTECTED]> wrote:
> >> Hmm, actually: the <p> element allows text, in addition to child
> >> elements? So shouldn't any whitespace within the <p>...</p> be treated
> >> as significant (part of the content)?
> >
> > This is indeed very complicated. For mixed-content elements, the
> > whitespace inside is preserved, but not next to child elements - very
> > stupid rules. If you have ever coded HTML you know this :-)
>
> Hmm... are you sure? :)
>
> Because I've tried Firefox, Chrome and Safari on the XML file, and all
> insert a space in rendering.
>
> Also, I tried Tika itself (feeding back the .xml it had created, to
> produce text) and it also inserts a space.
>
> I also tried JTidy and it inserts the space, though it thinks it's parsing
> HTML, so that may be an invalid test.
>
> Anyway... even if the strict XML whitespace rules state that this newline
> should not be counted as whitespace in the content, so many tools seem not
> to handle it correctly that I think it's worth trying to fix Tika to not
> add this newline.
>
> Mike McCandless
>
> http://blog.mikemccandless.com
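
To make the point under discussion concrete, here is a minimal sketch in Python (not Tika code) of what a generic XML parser does with a newline inside a <p> element. The XHTML fragment and the helper names extract_text and normalize_whitespace are made up for illustration; they are assumptions, not anything taken from Tika or Solr. The sketch only shows that the newline reaches the consumer as character data, and that a renderer-style whitespace collapse turns it into a single space, which would match the browser behaviour described above.

```python
import xml.etree.ElementTree as ET

# Hypothetical XHTML fragment (an assumption for illustration only):
# a newline separates two words inside a single <p> element.
XHTML = "<html><body><p>Hello\nWorld</p></body></html>"


def extract_text(xml_source: str) -> str:
    """Concatenate all character data in document order, unchanged."""
    root = ET.fromstring(xml_source)
    return "".join(root.itertext())


def normalize_whitespace(text: str) -> str:
    """Collapse each run of whitespace to one space, roughly as a browser
    does when rendering ordinary paragraph text."""
    return " ".join(text.split())


raw = extract_text(XHTML)
print(repr(raw))                        # newline is kept as character data
print(repr(normalize_whitespace(raw)))  # newline collapsed to one space
```

If the newline were emitted by the content handler between two words that belong together, any consumer that does this kind of collapsing would see an extra space, which is why tracking down where the newline originates seems worthwhile.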