Mark Kerzner 2011-09-07, 01:29
Hi,

As part of testing my FreeEed <http://freeeed.org/> open source eDiscovery engine, I am processing the 153 Enron PSTs found here: <http://www.edrm.net/resources/data-sets/edrm-enron-email-data-set-v2>.

Naturally, I see a lot of errors and warnings. For example, I started with the error described here: <https://issues.apache.org/jira/browse/PDFBOX-1008>. For that, I upgraded PDFBox from 1.5.0 to 1.6.0, since I am building with Maven from the latest svn checkout anyway.

However, for the future, my question is: is there a more systematic way to approach this? Is anybody interested in the results of all the testing that I am doing, and if yes, how should I report my findings?

Thank you,
Mark
Julien Nioche 2011-09-07, 07:36
Hi Mark,

See http://digitalpebble.blogspot.com/2011/05/processing-enron-dataset-using-behemoth.html for comments on processing the Enron corpus with Tika. Some of the errors that you are seeing are probably described there.

Julien

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
Michael McCandless 2011-09-07, 12:29
On Tue, Sep 6, 2011 at 9:29 PM, Mark Kerzner <[EMAIL PROTECTED]> wrote:
> Is anybody interested in the results of all the testing that
> I am doing, and if yes, how should I report my findings?

I'm interested! This sounds great.

Tika should strive to have no errors on any valid documents, so if you (or anyone) are hitting bugs in Tika/POI/PDFBox/etc., let's characterize them, open issues, and get them fixed :)

Mike McCandless
http://blog.mikemccandless.com
Steve Aulenbach 2011-09-07, 17:04
Hi Mike,

Here you go. I ran a quick analysis on revision 1166216 and saw the following:

Analysis Summary:

Files: 510

*** Warning *** File(s) Not Found 5:

/tika-parsers/src/main/java/org/apache/tika/detect/ContainerAwareDetector.java
/tika-parsers/src/main/java/org/apache/tika/detect/POIFSContainerDetector.java
/tika-parsers/src/main/java/org/apache/tika/detect/ZipContainerDetector.java
/tika-parsers/src/test/java/org/apache/tika/parser/chm/TestUtils.java
/tika-parsers/target/surefire-reports/TEST-org.apache.tika.parser.chm.TestUtils.xml

Thanks,
Steve
Michael McCandless 2011-09-07, 17:30
Sorry, I don't understand what this output is telling me. These 5 files are Tika's sources, but what's wrong with them?

I thought we were talking about certain emails from the Enron corpus where Tika hits an exception or fails to extract text.

Mike McCandless
http://blog.mikemccandless.com
Steve Aulenbach 2011-09-07, 18:04
Hi Mike,

My mistake. I thought this discussion was taking place on the dev list, not the user list.

Steve
Mark Kerzner 2011-09-08, 12:43
The processing is complete; the summary can be found here: <http://shmsoft.blogspot.com/2011/09/freeeed-used-to-process-complete-enron.html>.

Mark
Michael McCandless 2011-09-15, 10:26
That summary is nice, but can you provide specifics on which docs caused problems for Tika? I.e., if a certain doc hits an exception, we should open a Jira issue and get it fixed.

Thanks,

Mike McCandless
http://blog.mikemccandless.com
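One way to collect the per-document specifics asked for here is to wrap each extraction call in a try/catch and record the failing document together with the exception type. The following is a minimal sketch under stated assumptions, not FreeEed's or Tika's actual code: `ExtractionAudit`, `Extractor`, and `Failure` are hypothetical names introduced for illustration, and the stub in `main` only simulates a parser failure. In a real run, the `Extractor` would wrap whatever Tika parse call the pipeline already uses.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Records which documents fail extraction, grouped by exception type. */
class ExtractionAudit {

    /** Hypothetical hook: wrap the real extraction call (for example a Tika parse) here. */
    interface Extractor {
        String extract(Path document) throws Exception;
    }

    /** One failed document: its path plus the exception message. */
    static final class Failure {
        final Path document;
        final String message;

        Failure(Path document, String message) {
            this.document = document;
            this.message = message;
        }
    }

    /** Runs the extractor over every document and returns failures keyed by exception class name. */
    static Map<String, List<Failure>> audit(List<Path> documents, Extractor extractor) {
        Map<String, List<Failure>> failures = new TreeMap<>();
        for (Path doc : documents) {
            try {
                extractor.extract(doc);
            } catch (Exception e) {
                // Keep going after a failure so one bad document does not hide the rest.
                failures.computeIfAbsent(e.getClass().getName(), k -> new ArrayList<>())
                        .add(new Failure(doc, String.valueOf(e.getMessage())));
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Stub extractor for demonstration only: fails on names ending in ".pdf".
        Extractor stub = doc -> {
            if (doc.toString().endsWith(".pdf")) {
                throw new IOException("simulated parse failure");
            }
            return "text";
        };
        List<Path> docs = Arrays.asList(Paths.get("a.msg"), Paths.get("b.pdf"));
        audit(docs, stub).forEach((type, list) -> {
            System.out.println(type + ": " + list.size() + " document(s)");
            for (Failure f : list) {
                System.out.println("  " + f.document + " -> " + f.message);
            }
        });
    }
}
```

Grouping by exception class gives one candidate Jira issue per group, with the listed documents as the reproduction set. Only `Exception` is caught in this sketch; errors such as `OutOfMemoryError` would still abort the run.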
Mark Kerzner 2011-09-15, 13:02
Mike,

I certainly will do it. I am refactoring the code before I run those tests again.

Sincerely,
Mark
Albretch Mueller 2011-09-17, 07:08
From a corpus analysis point of view, who owns this data, and how do we know it is the real thing?

I don't see any validation data for the Enron Email Dataset (http://www.cs.cmu.edu/~enron/).

lbrtchx
Mark Kerzner 2011-09-18, 01:45
I get it from this site, http://www.edrm.net/resources/data-sets, where it is much more complete. You can check there.