Tika, mail # user - HTML not listed as supported type in the AutoDetectParser


Re: HTML not listed as supported type in the AutoDetectParser
William Hays 2012-04-17, 15:21
Nick,

I believe you answered a different question from the one I asked.  My
observation was specifically about the AutoDetectParser listing its
supported media types, not about the HtmlParser.  The code I used is
similar to:

    // autoDetectParser is an AutoDetectParser, pctx is a ParseContext
    for (MediaType mt : autoDetectParser.getSupportedTypes(pctx)) {
        System.out.println(mt);
    }

The mimetypes you listed for the HtmlParser do not show up here.
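
To make that comparison explicit, here is a minimal sketch of the check, written without any Tika dependency so it runs on its own. The class name MissingTypes and the sample "supported" set in main are hypothetical stand-ins; in practice the set would be filled with the strings printed by the loop above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MissingTypes {
    // The four types that the --list-parser-details output in this thread
    // shows for org.apache.tika.parser.html.HtmlParser.
    static final List<String> HTML_TYPES = Arrays.asList(
            "application/x-asp",
            "application/xhtml+xml",
            "application/vnd.wap.xhtml+xml",
            "text/html");

    // Returns the expected type names that are absent from the supported set,
    // preserving the order of the expected list.
    static List<String> missing(Set<String> supported, List<String> expected) {
        List<String> result = new ArrayList<>();
        for (String type : expected) {
            if (!supported.contains(type)) {
                result.add(type);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sample only, not the real list of 92 types.
        Set<String> supported = new HashSet<>(
                Arrays.asList("text/plain", "application/pdf"));
        System.out.println(missing(supported, HTML_TYPES));
    }
}
```

An empty result would mean every HtmlParser type is present in the AutoDetectParser list; anything else names exactly the types that are missing.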

Thanks,
Bill

On 04/16/2012 01:51 PM, Nick Burch wrote:
> On Thu, 12 Apr 2012, William Hays wrote:
>> Using the API, I have extracted the supported media types for the
>> AutoDetectParser in Tika 1.1 and I'm not seeing HTML or XHTML
>> mimetypes in that list of 92 items, though it parses such files fine.
>
> Hmm, HTML is showing up for me:
>
> java -jar tika-app-1.1.jar --list-parser-details | grep -A 4 HtmlParser
>     org.apache.tika.parser.html.HtmlParser
>       application/x-asp
>       application/xhtml+xml
>       application/vnd.wap.xhtml+xml
>       text/html
>
> Nick

--
------------
William Hays
Software Development & Analysis
MIT Libraries E25-131
617.324.5682 (phone)
[EMAIL PROTECTED]