Hi all,

I'm trying to extract the main content from a web page, and I found the
following example on the internet:

=======START CODE======

String url = "http://www.bbc.com/news/uk-england-41255962";

// Open a stream to the page
URL _url = new URL(url);
InputStream input = _url.openStream();

// One handler each for the links, the body text and the HTML
LinkContentHandler linkHandler = new LinkContentHandler();
ContentHandler textHandler = new BodyContentHandler();
ToHTMLContentHandler toHTMLHandler = new ToHTMLContentHandler();

// Forward the parser events to all three handlers
TeeContentHandler teeHandler =
        new TeeContentHandler(linkHandler, textHandler, toHTMLHandler);

Metadata metadata = new Metadata();
ParseContext parseContext = new ParseContext();
HtmlParser parser = new HtmlParser();

parser.parse(input, teeHandler, metadata, parseContext);

// Body text as collected by BodyContentHandler, HTML-escaped
String content = StringEscapeUtils.escapeHtml(textHandler.toString());

System.out.println("the content   " + content);
=======END CODE========
But the output is useless. This is part of what I get:
===============START OUTPUT==================

 Accessibility links

         Skip to content

        Accessibility Help

      BBC iD

        Notifications

    BBC navigation

          Home

        Home

        News

        News

        Sport

        Weather

        Shop
==============END PART OF OUTPUT=============

How can I find out why this happens, and how can I solve it (also for
other web pages, for example
http://www.vogella.com/tutorials/AndroidTestingEspresso/article.html)?

--
Ing. Viscomi Francesco