Hmmm…

What are you expecting?  What version of Tika are you using?

With master, the parse works as expected.  I get the links from the LinkContentHandler and the full content from the ToHTMLContentHandler.
It is odd to escape the HTML, though; what is your goal?
If you’re trying to get just the text out, you could use the ToTextContentHandler instead of the ToHTMLContentHandler.
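
To illustrate the difference between a text-only handler and a markup-preserving one, here is a minimal sketch that uses only the JDK's SAX classes. It is not Tika code: `TextVsMarkupSketch`, `TextOnlyHandler`, and `MarkupHandler` are hypothetical names, and the handlers only approximate what ToTextContentHandler and ToHTMLContentHandler do (the markup one drops attributes and does not escape text). Tika's handlers are also SAX ContentHandlers, so the idea carries over: the same parse events yield plain text or markup depending on which handler receives them.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

public class TextVsMarkupSketch {

    /** Hypothetical stand-in for a text handler: keeps character data only. */
    static class TextOnlyHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void characters(char[] ch, int start, int length) {
            out.append(ch, start, length);
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Hypothetical stand-in for an HTML handler: re-emits tags (no attributes) and text. */
    static class MarkupHandler extends DefaultHandler {
        private final StringBuilder out = new StringBuilder();

        @Override
        public void startElement(String uri, String localName, String qName, Attributes atts) {
            out.append('<').append(qName).append('>');
        }

        @Override
        public void endElement(String uri, String localName, String qName) {
            out.append("</").append(qName).append('>');
        }

        @Override
        public void characters(char[] ch, int start, int length) {
            // Simplification: text is appended unescaped.
            out.append(ch, start, length);
        }

        @Override
        public String toString() {
            return out.toString();
        }
    }

    /** Parses well-formed XHTML and returns whatever the given handler collected. */
    static String run(String xhtml, DefaultHandler handler) throws Exception {
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        String xhtml = "<html><body><h1>Title</h1><p>Hello world</p></body></html>";
        System.out.println(run(xhtml, new TextOnlyHandler()));
        System.out.println(run(xhtml, new MarkupHandler()));
    }
}
```

For this sample input the text-only handler yields `TitleHello world`, while the markup handler reproduces the tagged document. In the code from the original question, the corresponding change would be constructing a ToTextContentHandler where the ToHTMLContentHandler is created.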
From: Francesco Viscomi [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, September 13, 2017 12:39 PM
To: [EMAIL PROTECTED]
Subject: Fwd: possible a bug?

---------- Forwarded message ----------
From: Francesco Viscomi <[EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>>
Date: 2017-09-13 18:37 GMT+02:00
Subject: possible a bug?
To: [EMAIL PROTECTED]<mailto:[EMAIL PROTECTED]>

Hi all,
I'm trying to extract the content from a web page, and I found the following example on the internet:
=======START CODE======
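// Imports the snippet relies on (assumption: escapeHtml comes from Commons Lang 2.x)
import java.io.InputStream;
import java.net.URL;
import org.apache.commons.lang.StringEscapeUtils;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.html.HtmlParser;
import org.apache.tika.sax.BodyContentHandler;
import org.apache.tika.sax.LinkContentHandler;
import org.apache.tika.sax.TeeContentHandler;
import org.apache.tika.sax.ToHTMLContentHandler;
import org.xml.sax.ContentHandler;

String url = "http://www.bbc.com/news/uk-england-41255962";
URL _url = new URL(url);
InputStream input = _url.openStream();

// One handler per output: links, body text, and serialized HTML
LinkContentHandler linkHandler = new LinkContentHandler();
ContentHandler textHandler = new BodyContentHandler();
ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();

// Fan the parse events out to all three handlers
TeeContentHandler teeHandler = new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);

Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
HtmlParser parser = new HtmlParser();

parser.parse(input, teeHandler, metadata, parseContext);
// Only the body text is printed, after HTML-escaping it
String content = StringEscapeUtils.escapeHtml(textHandler.toString());
System.out.println("content:   " + content);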
=======END CODE========

But the output is useless, as i
===============START OUTPUT==================
 Accessibility links
         Skip to content
        Accessibility Help
      BBC iD

        Notifications

    BBC navigation
          Home
        Home
        News
        News
        Sport
        Weather
        Shop
==============END PART OF OUTPUT=============

How can I work out why this happens, and how can I fix it? For some other web pages (for example http://www.vogella.com/tutorials/AndroidTestingEspresso/article.html) it works fine.
Can you please help me?
Thanks very much.

--
Ing. Viscomi Francesco
