Tika, mail # user - content detection problem using tika-app


Re: content detection problem using tika-app
Nick Burch 2011-11-21, 00:31
On Sun, 20 Nov 2011, John M wrote:
> I apologize; I took a closer look.  I guess it's a matter of
> interpretation as to what the detector should be doing: in your example,
> Tika detected the correct format based off of the file name extensions,
> but, those copies you made weren't really PowerPoint or Excel files.

Ah, oops. More coffee needed! You're right, I wasn't seeing what I was
expecting: the file should come back as a .doc no matter the filename, on
the grounds that the content trumps the name.

If you look at the TestMediaTypes class you'll see what you can get with
just the mime magic and filenames, and then there's
TestContainerAwareDetector, which shows the correct detection happening
when the extra detectors are used.
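
To illustrate the difference, here is a rough, stdlib-only Java sketch. It is
not Tika's code: the class DetectionSketch and its methods are hypothetical,
and the caller is assumed to have already read the top-level entry names out
of the OLE2 container (a real detector would parse the container itself).
The point it tries to show is that legacy .doc, .xls and .ppt files all start
with the same OLE2 magic bytes, so magic plus filename cannot tell a renamed
.doc from a real .ppt, while a look inside the container can.

```java
import java.util.Arrays;
import java.util.Locale;
import java.util.Set;

// Hypothetical illustration only; none of these names come from Tika.
class DetectionSketch {
    // OLE2 compound document signature, shared by legacy .doc, .xls and .ppt.
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Guess the type from the file name alone.
    static String detectByName(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".doc")) return "application/msword";
        if (lower.endsWith(".xls")) return "application/vnd.ms-excel";
        if (lower.endsWith(".ppt")) return "application/vnd.ms-powerpoint";
        return "application/octet-stream";
    }

    // True if the first bytes of the file match the OLE2 signature.
    static boolean hasOle2Magic(byte[] head) {
        return head.length >= OLE2_MAGIC.length
            && Arrays.equals(Arrays.copyOf(head, OLE2_MAGIC.length), OLE2_MAGIC);
    }

    // Content first, name last. entryNames is assumed to hold the top-level
    // stream names of the OLE2 container, supplied by the caller.
    static String detect(byte[] head, Set<String> entryNames, String name) {
        if (!hasOle2Magic(head)) return detectByName(name);
        if (entryNames.contains("WordDocument")) return "application/msword";
        if (entryNames.contains("Workbook")) return "application/vnd.ms-excel";
        if (entryNames.contains("PowerPoint Document")) return "application/vnd.ms-powerpoint";
        // OLE2 container of unknown kind; generic placeholder type in this sketch.
        return "application/x-tika-msoffice";
    }
}
```

With this sketch, a Word file renamed to copy.ppt is reported as PowerPoint by
detectByName, but as application/msword by detect once the "WordDocument"
entry is seen, which is the "content trumps name" behaviour described above.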

Any chance you could open a bug for this? You're correct, and it really is
a bug.

Thanks
Nick