-
content detection problem using tika-app
John M 2011-11-20, 19:27
Hello,
I have a .ppt file that I've renamed to .doc (changing only its extension). If I use the Tika GUI or the command line to extract the file's metadata, Tika correctly identifies the content type as PowerPoint. However, if I use the command-line -d option to detect the content type, the application returns "application/msword", which is of course only superficially correct.
The source code indicates that the correct type comes from a call to a parser's parse method, while the less accurate detection comes from a call to a detector's detect method. I'm not sure whether this is a feature or a bug, and I didn't see anything similar when browsing JIRA, so I thought I'd ask whether the project team is aware of the difference between the detector and the parser in identifying content types before I or someone else creates a bug report or feature request in JIRA.
Thanks, John Mastarone
-
Re: content detection problem using tika-app
Nick Burch 2011-11-20, 19:31
On Sun, 20 Nov 2011, John M wrote:
> I have a .ppt file that I've renamed to be a .doc file (by only changing
> its extension). If I use the Tika GUI, or the command line, to extract
> the file metadata, then Tika correctly identifies the content type as a
> Powerpoint file. However, if I use the command line -d option to detect
> its content type, the application returns "application/msword", which is
> of course only superficially correct.
What version of Tika are you trying with? If it isn't 1.0, I'd suggest you upgrade and re-test. (We made detectors pluggable, like parsers, fairly recently, which changed how the container-aware detectors are made available and used.)
Nick
-
Re: content detection problem using tika-app
John M 2011-11-20, 20:44
I'm using a build from the 1.1 source.
John
-
Re: content detection problem using tika-app
Nick Burch 2011-11-20, 21:14
On Sun, 20 Nov 2011, John M wrote:
> I'm using a build from the 1.1 source.
That's odd - with 1.1 TikaCLI will use DefaultDetector, which loads all available detectors including the container aware ones
However, I'm not able to reproduce your problem:
cd /tmp
cp ~/test.doc C1.doc
cp ~/test.doc C1.xls
cp ~/test.doc C1.ppt
cd ~/java/apache-tika/tika-app/target
for i in /tmp/C1*; do echo ""; echo $i; java -jar tika-app-1.1-SNAPSHOT.jar --detect $i; done

/tmp/C1.doc
application/msword

/tmp/C1.ppt
application/vnd.ms-powerpoint

/tmp/C1.xls
application/vnd.ms-excel

So I do get the container aware detection working properly. Not sure what's not working for you....
Nick
-
Re: content detection problem using tika-app
John M 2011-11-20, 21:28
With genuine .doc, .xls, or .ppt files, I'm not having a problem. I was wondering how good Tika was about being fooled with misnamed files, and so I took a .ppt, and just changed the extension to a .doc to see what would occur. Using the -m option turns out to be better than -d in this case.
John
-
Re: content detection problem using tika-app
Nick Burch 2011-11-20, 21:43
On Sun, 20 Nov 2011, John M wrote:
> With genuine .doc, .xls, or .ppt files, I'm not having a problem. I
> was wondering how good Tika was about being fooled with misnamed
> files, and so I took a .ppt, and just changed the extension to a .doc
> to see what would occur. Using the -m option turns out to be better
> than -d in this case.
Please take another look at my example. I took a .doc, renamed it, and Tika detected it just fine for me, hence my wondering why it is different for you
Nick
-
Re: content detection problem using tika-app
John M 2011-11-20, 22:15
I apologize; I took a closer look. I guess it's a matter of interpretation as to what the detector should be doing: in your example, Tika detected the format based on the file name extensions, but those copies you made weren't really PowerPoint or Excel files. If you run your test again with the -m option, the Content-Type field should display different results than what you see with --detect, and these are arguably better.
I have a particular use case in mind where file names aren't necessarily to be trusted, so maybe it's for the best that the detector can return a different result than the -m option. If this occurs, a user might know that the file extension is suspect, or the software developer using Tika could take steps to rename the file to its correct extension or make a copy with the correct extension.
I can drop the issue at this point; I just wanted to see whether someone thought that the behavior of --detect was obviously incorrect.
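The use case John describes (flagging a suspect extension and proposing a corrected name) could be sketched roughly as follows. This is only an illustration under stated assumptions: `ExtensionCheck`, `extensionOf`, and `suggestRename` are hypothetical names that are not part of Tika, the three-entry table is deliberately tiny, and the detected type is assumed to come from some content-based source such as the Content-Type reported by the -m option.

```java
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

// Hypothetical helper, not part of Tika: compares a file's extension with a
// MIME type that a content-based detector has already reported.
class ExtensionCheck {

    // Small illustrative table; a real application would need a fuller mapping.
    private static final Map<String, String> EXPECTED_EXTENSION = Map.of(
            "application/msword", "doc",
            "application/vnd.ms-powerpoint", "ppt",
            "application/vnd.ms-excel", "xls");

    // Returns the extension of a bare file name in lower case, or "" if none.
    static String extensionOf(String fileName) {
        int dot = fileName.lastIndexOf('.');
        return dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    // Suggests a corrected name when the extension disagrees with the detected
    // type; empty when the name already matches or the type is not in the table.
    static Optional<String> suggestRename(String fileName, String detectedType) {
        String expected = EXPECTED_EXTENSION.get(detectedType);
        if (expected == null || expected.equals(extensionOf(fileName))) {
            return Optional.empty();
        }
        int dot = fileName.lastIndexOf('.');
        String base = dot < 0 ? fileName : fileName.substring(0, dot);
        return Optional.of(base + "." + expected);
    }

    public static void main(String[] args) {
        // A PowerPoint file that was renamed to .doc, as in this thread.
        System.out.println(suggestRename("slides.doc", "application/vnd.ms-powerpoint"));
        // A genuine Word file: nothing to suggest.
        System.out.println(suggestRename("report.doc", "application/msword"));
    }
}
```

The helper takes the detected type as a plain string so that it stays independent of how the type was obtained.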
-
Re: content detection problem using tika-app
Nick Burch 2011-11-21, 00:31
On Sun, 20 Nov 2011, John M wrote:
> I apologize; I took a closer look. I guess it's a matter of
> interpretation as to what the detector should be doing: in your example,
> Tika detected the correct format based off of the file name extensions,
> but, those copies you made weren't really PowerPoint or Excel files.
Ah, oops. More coffee needed! You're right, I wasn't seeing what I was expecting - the file should come back as a .doc no matter the filename, on the grounds of the content trumping the name
If you look at the TestMediaTypes class you'll see what you can get with just the mime magic and filenames, and then there's TestContainerAwareDetector, which shows the correct detection happening by using the extra detectors available.
Any chance you could open a bug for this? You're correct, and it really is a bug
Thanks
Nick
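The gap between mime magic and container-aware detection that Nick mentions can be illustrated with a minimal sketch. It assumes only the well-known 8-byte header signature of OLE2 compound files, the container shared by legacy .doc, .ppt, and .xls files. `Ole2Magic` and `looksLikeOle2` are hypothetical names, and this is not how Tika implements detection.

```java
import java.util.Arrays;

// Hypothetical illustration, not Tika code: checks only the leading magic bytes.
class Ole2Magic {

    // Header signature shared by all OLE2 compound files.
    private static final byte[] SIGNATURE = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True if the data starts with the OLE2 signature. This says nothing about
    // whether the container holds a Word, PowerPoint, or Excel document.
    static boolean looksLikeOle2(byte[] data) {
        return data.length >= SIGNATURE.length
                && Arrays.equals(data, 0, SIGNATURE.length, SIGNATURE, 0, SIGNATURE.length);
    }

    public static void main(String[] args) {
        byte[] officeHeader = {
            (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
            (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1, 0x00, 0x00
        };
        System.out.println(looksLikeOle2(officeHeader));
        System.out.println(looksLikeOle2("%PDF-1.4".getBytes()));
    }
}
```

Because the three legacy Office formats begin with the same bytes, a check like this can only narrow a file down to "some OLE2 document". Telling the formats apart means opening the container and looking at the streams it holds, which is the extra work a container-aware detector does.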
-
Re: content detection problem using tika-app
Nick Burch 2011-11-21, 13:18
On Mon, 21 Nov 2011, Nick Burch wrote:
> Ah, oops. More coffee needed! You're right, I wasn't seeing what I was
> expecting - the file should come back as a .doc no matter the filename,
> on the grounds of the content trumping the name
With the fix now in, I can confirm that my earlier test now behaves as you'd really expect:
cp test.doc /tmp/C1.doc
cp test.doc /tmp/C1.ppt
cp test.doc /tmp/C1.xls
for i in /tmp/C1*; do echo ""; echo $i; java -jar tika-app-1.1-SNAPSHOT.jar --detect $i; done

/tmp/C1.doc
application/msword

/tmp/C1.ppt
application/msword

/tmp/C1.xls
application/msword

Nick