Nutch, mail # dev - Parse-tika ignores too much data...


Re: Parse-tika ignores too much data...
Andrzej Bialecki 2010-07-08, 07:15
On 2010-07-07 22:32, Ken Krugler wrote:
> Hi Julien,
>
>> See https://issues.apache.org/jira/browse/TIKA-457 for a description
>> of one of the cases found by Andrzej. There seems to be something very
>> wrong with the way <body> is handled, we also saw cases where it was
>> twice in the output.
>
> Don't know about the case of it appearing twice.
>
> But for the above issue, I added a comment. The test HTML is badly
> broken, in that you can either have a <body> OR a <frameset>, but not both.

The HTML was broken on purpose. One of the goals of the original test
was to extract as much content and as many links as possible in the
presence of grave errors. As you know, even major sites often produce
badly broken HTML, but parsers sanitize it and produce a valid DOM. In
this case, Tika produced two nested <body> elements, which is not valid.
I should also mention that NekoHTML handled this test much better, by
removing the <body> and retaining only the <frameset>.
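
A check for this kind of invalid output can be sketched in a few lines.
The Python below is only an illustration and not part of the original
test: the class name BodyNestingChecker, the helper check_output, and the
sample strings are all made up here, and it uses the standard-library
html.parser tokenizer rather than Tika or NekoHTML. It assumes the
parser's output has been serialized back to markup, and simply reports
how deeply <body> elements are nested and whether a <frameset> is present.

```python
from html.parser import HTMLParser


class BodyNestingChecker(HTMLParser):
    """Tracks how deeply <body> elements are nested in serialized markup."""

    def __init__(self):
        super().__init__()
        self.depth = 0             # number of currently open <body> elements
        self.max_depth = 0         # deepest <body> nesting seen so far
        self.has_frameset = False  # True once any <frameset> start tag is seen

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names before calling this hook.
        if tag == "body":
            self.depth += 1
            self.max_depth = max(self.max_depth, self.depth)
        elif tag == "frameset":
            self.has_frameset = True

    def handle_endtag(self, tag):
        if tag == "body" and self.depth > 0:
            self.depth -= 1


def check_output(markup):
    """Return (max <body> nesting depth, whether a <frameset> was seen)."""
    checker = BodyNestingChecker()
    checker.feed(markup)
    checker.close()
    return checker.max_depth, checker.has_frameset


# Hypothetical serialized outputs, invented for illustration
# (not copied from TIKA-457 or from any real parser run).
nested = "<html><body><body><p>text</p></body></body></html>"
frames_only = "<html><frameset><frame src='a.html'></frameset></html>"
```

A nesting depth above 1 would flag the nested <body> case described
above, while output that keeps only the <frameset> would report a depth
of 0 with the frameset flag set.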

--
Best regards,
Andrzej Bialecki     <><
  ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com