Re: Parse-tika ignores too much data...
Andrzej Bialecki 2010-07-08, 07:15
On 2010-07-07 22:32, Ken Krugler wrote:
> Hi Julien,
>> See https://issues.apache.org/jira/browse/TIKA-457 for a description
>> of one of the cases found by Andrzej. There seems to be something very
>> wrong with the way <body> is handled, we also saw cases where it was
>> twice in the output.
> Don't know about the case of it appearing twice.
> But for the above issue, I added a comment. The test HTML is badly
> broken, in that you can either have a <body> OR a <frameset>, but not both.
The HTML was broken on purpose: one of the goals of the original test
was to extract as much content and as many links as possible in the
presence of grave errors. As you know, even major sites often produce
badly broken HTML, but the parser should sanitize it and produce a
valid DOM. In this case, it produced two nested <body> elements, which
is not valid. I should also mention that NekoHTML handled this test much
better, by removing the <body> and retaining only the <frameset>.
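
To make the "valid DOM" expectation concrete, here is a minimal sketch of a structural check that could be run over a parser's output. It assumes the output has been serialized as well-formed XHTML, and the function names (local_name, check_body_structure) are hypothetical helpers for illustration, not part of Tika or NekoHTML:

```python
import xml.etree.ElementTree as ET


def local_name(tag):
    # Drop an XML namespace prefix such as "{http://www.w3.org/1999/xhtml}body".
    return tag.rsplit("}", 1)[-1].lower()


def check_body_structure(xhtml):
    """Return a list of structural problems found in serialized parser output.

    Assumption: `xhtml` is well-formed XML, as a sanitizing parser should emit.
    """
    root = ET.fromstring(xhtml)
    problems = []
    bodies = [el for el in root.iter() if local_name(el.tag) == "body"]
    framesets = [el for el in root.iter() if local_name(el.tag) == "frameset"]

    if len(bodies) > 1:
        problems.append("multiple <body> elements")

    # A <body> that has another <body> among its descendants is nested.
    for body in bodies:
        if any(local_name(d.tag) == "body" for d in body.iter() if d is not body):
            problems.append("nested <body> elements")
            break

    # A document may have a <body> or a <frameset>, but not both.
    if bodies and framesets:
        problems.append("<body> and <frameset> both present")

    return problems
```

An empty list means none of these three problems was found; output that keeps only the <frameset>, as described above for NekoHTML, would pass, while output with two nested <body> elements would not.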
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com