-
HTML not listed as supported type in the AutoDetectParser
William Hays 2012-04-12, 16:06
Using the API, I have extracted the supported media types for the AutoDetectParser in Tika 1.1 and I'm not seeing HTML or XHTML mimetypes in that list of 92 items, though it parses such files fine.
Why would this be the case? Or am I missing something?
Thanks, Bill
-- ------------ William Hays Software Development & Analysis MIT Libraries E25-131 617.324.5682 (phone) [EMAIL PROTECTED]
-
Re: HTML not listed as supported type in the AutoDetectParser
Nick Burch 2012-04-16, 17:51
On Thu, 12 Apr 2012, William Hays wrote:
> Using the API, I have extracted the supported media types for the
> AutoDetectParser in Tika 1.1 and I'm not seeing HTML or XHTML mimetypes
> in that list of 92 items, though it parses such files fine.
Hmm, HTML is showing up for me:
java -jar tika-app-1.1.jar --list-parser-details | grep -A 4 HtmlParser
org.apache.tika.parser.html.HtmlParser
application/x-asp
application/xhtml+xml
application/vnd.wap.xhtml+xml
text/html
Nick
-
Re: HTML not listed as supported type in the AutoDetectParser
William Hays 2012-04-17, 15:21
Nick,
I believe you answered a different question than the one I asked. My observation was specifically about the AutoDetectParser listing its supported media types, not about the HtmlParser. The code I used is similar to:
for (MediaType mt : autoDetectParser.getSupportedTypes(pctx)) {
    System.out.println(mt.toString());
}
The mimetypes you listed for the HtmlParser do not show up here.
Thanks, Bill

On 04/16/2012 01:51 PM, Nick Burch wrote:
> On Thu, 12 Apr 2012, William Hays wrote:
>> Using the API, I have extracted the supported media types for the
>> AutoDetectParser in Tika 1.1 and I'm not seeing HTML or XHTML
>> mimetypes in that list of 92 items, though it parses such files fine.
>
> Hmm, HTML is showing up for me:
>
> java -jar tika-app-1.1.jar --list-parser-details | grep -A 4 HtmlParser
> org.apache.tika.parser.html.HtmlParser
> application/x-asp
> application/xhtml+xml
> application/vnd.wap.xhtml+xml
> text/html
>
> Nick
-
Re: HTML not listed as supported type in the AutoDetectParser
Nick Burch 2012-04-17, 16:50
On Tue, 17 Apr 2012, William Hays wrote:
> I believe you answered a different question than what I asked. My
> observation was specifically about the AutoDetectParser listing its
> supported mediatypes, not about the HTMLParser.
The Tika App uses AutoDetectParser internally, so if it's finding the parser and the mimetypes then they should be correctly defined.
I wonder whether your application is missing some of the classes that the HtmlParser depends on. Parsers will only show up if they can be loaded correctly.
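A quick way to test that theory is to ask the class loader directly whether the relevant classes are visible from inside your application. The sketch below uses only the JDK; the class name ClasspathCheck and its methods are made up for illustration, and the assumption that HtmlParser in Tika 1.1 needs the TagSoup class org.ccil.cowan.tagsoup.Parser should be checked against the jars you actually ship.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Tika: reports whether classes are visible
// to the class loader that loaded this helper.
public class ClasspathCheck {

    /** Returns true if the named class can be found and defined. */
    public static boolean isLoadable(String className) {
        try {
            // initialize=false: locate and load the class without running
            // its static initializers.
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError, which is raised when
            // a superclass or interface of the class is missing.
            return false;
        }
    }

    /** Checks each name in order and records the result. */
    public static Map<String, Boolean> check(String... classNames) {
        Map<String, Boolean> results = new LinkedHashMap<>();
        for (String name : classNames) {
            results.put(name, isLoadable(name));
        }
        return results;
    }

    public static void main(String[] args) {
        // Assumption: these are the classes of interest for HTML parsing in
        // Tika 1.1; adjust the list to match your own setup.
        Map<String, Boolean> results = check(
                "org.apache.tika.parser.html.HtmlParser",
                "org.ccil.cowan.tagsoup.Parser");
        for (Map.Entry<String, Boolean> entry : results.entrySet()) {
            System.out.println(entry.getKey() + " -> "
                    + (entry.getValue() ? "found" : "MISSING"));
        }
    }
}
```

Run it with the same classpath as your application. If either class is reported as MISSING, that would be consistent with the HTML types not appearing in the AutoDetectParser list.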
As a general rule, it's often worth checking something with Tika App when your code misbehaves, as it can help differentiate between Tika bugs and errors in the setup of Tika in your code.
Nick