FYI: text/plain and text/html media types now come with charset info
Jukka Zitting 2012-07-08, 23:17
Hi,
As of revision 1358858 Tika returns the detected character encoding as a part of the content type metadata field. For example, instead of "text/plain" the returned content type will be "text/plain; charset=UTF-8" for a UTF-8 encoded text document.

This is conceptually correct (see TIKA-431), but may confuse some clients that depend on the exact content type string with code like this:

    String type = metadata.get(Metadata.CONTENT_TYPE);
    if ("text/html".equals(type)) {
        ...
    }

To fix such code, use the MediaType class to parse the returned content type string:

    MediaType type = MediaType.parse(metadata.get(Metadata.CONTENT_TYPE));
    if (type != null && "text/html".equals(type.getBaseType())) {
        ...
    }

Or instead of using direct string comparison, an ideal solution would be to leverage the full type inheritance logic available in the media type registry. With the isInstanceOf helper method I just added this becomes:

    String type = metadata.get(Metadata.CONTENT_TYPE);
    MediaTypeRegistry registry = ...;
    if (registry.isInstanceOf(type, MediaType.TEXT_HTML)) {
        ...
    }

BR,

Jukka Zitting
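
The failure mode described in this message can be reproduced without Tika on the classpath. The following is only a minimal sketch using the standard library: `ContentTypeCheck` and `baseType` are hypothetical names, not part of Tika's API, and the parsing is a simplified stand-in that only cuts the string at the first semicolon. In real code, MediaType.parse or the registry's isInstanceOf, as described above, are the better choice.

```java
import java.util.Locale;

public class ContentTypeCheck {

    // Hypothetical helper, NOT part of Tika: returns the "type/subtype" part
    // of a content type string, lower-cased, with any parameters (such as
    // "; charset=UTF-8") removed. Returns null for null or empty input.
    public static String baseType(String contentType) {
        if (contentType == null) {
            return null;
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0
                ? contentType.substring(0, semicolon)
                : contentType;
        base = base.trim().toLowerCase(Locale.ROOT);
        return base.isEmpty() ? null : base;
    }

    public static void main(String[] args) {
        // The kind of value the content type field now carries
        String returned = "text/html; charset=UTF-8";

        // Exact comparison no longer matches once the charset is appended
        System.out.println("text/html".equals(returned));           // false

        // Comparing only the base type ignores the charset parameter
        System.out.println("text/html".equals(baseType(returned))); // true
    }
}
```

Unlike the registry-based check, this sketch knows nothing about type inheritance, so it only matches the exact base type.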