-
Detecting Encoding with plugins
Lewis John Mcgibbney 2012-02-14, 22:00

Hi,

I can't see anywhere within our parser plugins where we detect the encoding of documents. I've also begun looking through the o.a.n.p package, but again I can't see anything.

Can anyone provide some detail on this, please?

Thank you,
Lewis
-
Re: Detecting Encoding with plugins
Lewis John Mcgibbney 2012-02-14, 22:34

It's in HTMLParser#private static String sniffCharacterEncoding

I'm still wondering where TikaParser gets the character encoding from, though. Additionally, this doesn't look like something we check for in our JUnit classes. If we don't, then I would like to write some tests for it.

I am working on Any23 tests first, which is the motivation behind my question.

Thanks,
Lewis
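For readers who have not seen this kind of sniffing before, here is a minimal sketch of the general idea: decode the start of the raw bytes as ASCII and look for a charset declaration inside a meta tag. This is not the Nutch code behind sniffCharacterEncoding; the class name MetaCharsetSniffer, the 2000-byte window and the regular expression are all assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; not the Nutch implementation. */
class MetaCharsetSniffer {
    // Assumed window: only the first bytes of the document are inspected.
    private static final int CHUNK_SIZE = 2000;

    // Finds charset=<name> inside a <meta ...> tag, ignoring case.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\s[^>]*charset\\s*=\\s*[\"']?([a-z0-9_:.-]+)",
            Pattern.CASE_INSENSITIVE);

    /** Returns the charset name declared in a meta tag, or null if none is found. */
    static String sniff(byte[] content) {
        int length = Math.min(content.length, CHUNK_SIZE);
        // Tag syntax is plain ASCII in ASCII-compatible encodings,
        // so the head of the document is decoded as ASCII.
        String head = new String(content, 0, length, StandardCharsets.US_ASCII);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"utf-8\"></head></html>"
                .getBytes(StandardCharsets.US_ASCII);
        System.out.println(sniff(page));
    }
}
```

A sniffer like this only reports what the page claims about itself, so the returned name still has to be validated before it is used to decode the document.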
-
Re: Detecting Encoding with plugins
Lewis John Mcgibbney 2012-02-14, 22:40

Also, when we can't find anything else, we fall back to the encoding set in the parser.character.encoding.default property, which is windows-1252.

Lewis
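The fallback described above could be sketched roughly as follows. This is an assumption-laden illustration, not Nutch code: java.util.Properties stands in for the real configuration object so the example has no dependencies, and the class and method names are hypothetical. Only the property name and the windows-1252 default come from the thread.

```java
import java.nio.charset.Charset;
import java.util.Properties;

/** Illustrative sketch only; Properties stands in for the real configuration. */
class DefaultEncodingFallback {
    static final String DEFAULT_KEY = "parser.character.encoding.default";

    /** Returns the detected charset if usable, otherwise the configured default. */
    static String resolve(String detected, Properties conf) {
        if (isUsable(detected)) {
            return detected;
        }
        // windows-1252 mirrors the default value mentioned in the thread.
        return conf.getProperty(DEFAULT_KEY, "windows-1252");
    }

    private static boolean isUsable(String name) {
        if (name == null || name.isEmpty()) {
            return false;
        }
        try {
            return Charset.isSupported(name);
        } catch (IllegalArgumentException e) {
            // Thrown for syntactically illegal charset names.
            return false;
        }
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        System.out.println(resolve(null, conf));
        System.out.println(resolve("UTF-8", conf));
    }
}
```

Checking the detected name before using it matters because a page can declare a charset that the JVM does not know, in which case the default is the safer choice.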
-
Re: Detecting Encoding with plugins
Markus Jelsma 2012-02-14, 23:51

Hi,

This was indeed an issue until today. The detected type is in the crawl datum metadata.

https://issues.apache.org/jira/browse/NUTCH-1259
-
Re: Detecting Encoding with plugins
Ken Krugler 2012-02-15, 00:10

On Feb 14, 2012, at 2:34pm, Lewis John Mcgibbney wrote:

> It's in HTMLParser#private static String sniffCharacterEncoding
>
> I'm still wondering where TikaParser gets the character encoding from though?

FYI, the individual Tika parsers have their own detection logic. The HTML parser, for example, uses the response headers and metadata tags in addition to ICU's statistical method.

That's something I'm still working on cleaning up, but I haven't made much progress in the past few months.

-- Ken

--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr
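One way to picture combining several sources of evidence (response header, meta tag, statistical guess) is a chain that returns the first clue that yields an answer. The sketch below is only that: the class name is hypothetical, and the order in which the clues are tried is an assumption, not a description of what Tika actually does.

```java
import java.util.Optional;
import java.util.function.Supplier;

/** Illustrative sketch only; the clue order is an assumption, not Tika's. */
class CharsetClueChain {
    /** Tries each clue in the given order and returns the first one that has a value. */
    @SafeVarargs
    static Optional<String> firstAvailable(Supplier<Optional<String>>... clues) {
        for (Supplier<Optional<String>> clue : clues) {
            Optional<String> candidate = clue.get();
            if (candidate.isPresent()) {
                return candidate;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        Optional<String> charset = firstAvailable(
                () -> Optional.<String>empty(),     // e.g. no charset in the response header
                () -> Optional.of("utf-8"),         // e.g. charset found in a meta tag
                () -> Optional.of("windows-1252")); // e.g. a statistical guess, never evaluated here
        System.out.println(charset.orElse("unknown"));
    }
}
```

Passing suppliers rather than precomputed values means a costly step such as statistical detection only runs when the cheaper clues come up empty.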
-
Re: Detecting Encoding with plugins
Lewis John Mcgibbney 2012-02-15, 10:42

Hi Markus,

I've been vaguely keeping up with your and Julien's work on this. I would really like to get a test case for this, though! I'll try working towards it as a sub-target of another issue.

For reference, there is a Tika mimeType test case here [1] and a Tika document encoding test here [2], which we may or may not be interested in porting over to o.a.n. wdyt?

Thanks,
Lewis

[1] https://svn.apache.org/viewvc/incubator/any23/trunk/core/src/test/java/org/apache/any23/mime/TikaMIMETypeDetectorTest.java?view=markup
[2] https://svn.apache.org/viewvc/incubator/any23/trunk/core/src/test/java/org/apache/any23/encoding/TikaEncodingDetectorTest.java?view=markup
-
Re: Detecting Encoding with plugins
Julien Nioche 2012-02-15, 10:59

The mimetype is not the same thing as the encoding. As Ken pointed out, encoding detection is done at the individual parser level.

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
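To make the distinction concrete: an HTTP Content-Type header value such as "text/html; charset=UTF-8" carries the MIME type and the character encoding as two separate pieces. The sketch below splits them apart. The class name is hypothetical and the parsing is deliberately naive (it does not handle a semicolon inside a quoted parameter value), so treat it as an illustration rather than a full header parser.

```java
import java.util.Locale;

/** Illustrative sketch only; a naive split of a Content-Type header value. */
class ContentTypeParts {
    final String mimeType; // e.g. "text/html"
    final String charset;  // e.g. "UTF-8", or null when none is declared

    private ContentTypeParts(String mimeType, String charset) {
        this.mimeType = mimeType;
        this.charset = charset;
    }

    static ContentTypeParts parse(String headerValue) {
        String[] pieces = headerValue.split(";");
        // The first piece is the media type; everything after it is a parameter.
        String mime = pieces[0].trim().toLowerCase(Locale.ROOT);
        String charset = null;
        for (int i = 1; i < pieces.length; i++) {
            String param = pieces[i].trim();
            int eq = param.indexOf('=');
            if (eq > 0 && param.substring(0, eq).trim().equalsIgnoreCase("charset")) {
                charset = param.substring(eq + 1).trim().replace("\"", "");
            }
        }
        return new ContentTypeParts(mime, charset);
    }

    public static void main(String[] args) {
        ContentTypeParts parts = parse("text/html; charset=UTF-8");
        System.out.println(parts.mimeType + " / " + parts.charset);
    }
}
```

A value with no charset parameter, such as "application/pdf", yields a MIME type but a null charset, which is exactly the case where the parser has to detect the encoding some other way.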
-
Re: Detecting Encoding with plugins
Lewis John Mcgibbney 2012-02-15, 12:17

Yes, this is correct, but we still don't test for either of the two.

Lewis
-
Re: Detecting Encoding with plugins
Julien Nioche 2012-02-15, 12:27

I assume Tika does already, so why should we duplicate the tests in Nutch? We delegate the functionality to Tika; IMHO this means delegating the testing as well. What we could do instead is contribute tests to Tika if it does not have any.

Re Any23: why not handle it as a Tika parser instead of a Nutch one? This could be useful to other Tika users who do not necessarily use Nutch.

Julien
-
Re: Detecting Encoding with plugins
Lewis John Mcgibbney 2012-02-15, 12:48

Hi Julien,

On Wed, Feb 15, 2012 at 12:27 PM, Julien Nioche <[EMAIL PROTECTED]> wrote:

> I assume Tika does already - why should we duplicate the tests in Nutch?

We don't want to, I suppose. The point I was trying to make was that NUTCH-1259 detects the encoding type, yet we don't have an automated test to cover this. I assume the case is somewhat important, or else the ticket for NUTCH-1259 wouldn't have been opened originally?

I agree with you that general cases should be dealt with further upstream, within Tika development itself. However, as the encoding detection is done in Nutch within the cd metadata, we may wish to get some test case to check it... it's not a huge thing, I suppose.

> we delegate the functionality to Tika, IMHO this means delegating the
> testing as well. What we could do to contribute tests to Tika instead if it
> does not have any.

Yeah, this is correct. I'm expecting you guys will know better than me, but I would assume that Tika is mimetype and encoding detection compliant ;0)

> Re-any23 : why not handling it as a Tika parser instead of a Nutch one?
> This could be useful to other Tika users who do not necessarily use Nutch

OK, so I suppose this is completely open for discussion, and I really welcome it as well. On one hand, I see working with Any23 as a parse-any23 plugin within Nutch as the first step on the road to answering this question. Regardless of whether Any23 graduates and is integrated into Tika itself or becomes a TLP, you are completely right that it should be made openly available to as many people as possible. Personally, I agree with you, Julien.

One last thing, I know this is off topic... but with regard to our microformats-reltag plugin, I think the RelTagParser could and should be moved over to Any23. Any23 already supports extraction of a number of microformats. wdyt?

Thanks
-
Re: Detecting Encoding with plugins
Julien Nioche 2012-02-15, 13:07

Hi Lewis

> > I assume Tika does already - why should we duplicate the tests in Nutch?
>
> We don't want to I suppose. However the point I was trying to make was
> that as NUTCH-1259 detects the encoding type,
> however we don't have an automated test to cover this, I assume the case is
> somewhat important or else the ticket for NUTCH-1259 wouldn't have been
> opened originally?

Nope. NUTCH-1259 is about storing the mime-type value detected by Tika. It is not the same as the encoding. This specific JIRA is not about whether or not we get the correct value; it is a purely functional one about where we store it. There is not much to test wrt it.

> I agree with you that general cases should be dealt with further upstream
> within Tika development itself, however as the encoding detection is done
> in Nutch within the cd metadata we may wish to get some test case to
> check... it's not a huge thing I suppose.

We do have tests for the EncodingDetector (TestEncodingDetector), which is used by parse-html already. It is OK to have that, as it is our own parser. As explained earlier, for the Tika parser the detection is delegated to the Tika parser implementations and as such should be tested there.

> > we delegate the functionality to Tika, IMHO this means delegating the
> > testing as well. What we could do to contribute tests to Tika instead if it
> > does not have any.
>
> Yeah this is correct. I'm expecting you guys will know better than me but
> I would assume that Tika is mimetype and encoding detection compliant ;0)

I definitely do not pretend to know more than anyone else BTW :-) I don't understand what you mean by 'compliant'. Perfect? Probably not. There was an interesting experiment made by Ken on measuring the accuracy of the charset detection in the Tika book - which anyone remotely interested in Nutch should get BTW.

There has been an interesting blog entry recently on comparing the language detection in Tika and other libraries (can't find the ref and am in a hurry, sorry).

> > Re-any23 : why not handling it as a Tika parser instead of a Nutch one?
> > This could be useful to other Tika users who do not necessarily use Nutch
>
> OK so I suppose this is completely open for discussion and I really
> welcome it as well. On one hand I see working with Any23 as a parse-any23
> plugin within Nutch as the first step in the road to answering this
> question. Regardless of whether Any23 graduates and is integrated into Tika
> itself or as a TLP you are completely right that it should be made as
> openly available to as many people. Personally I agree with you Julien.
>
> One last thing, I know this if off topic... but with regards to our
> microformats-reltag plugin... I think the RelTagParser could and should be
> move over to Any23. Any23 already supports extraction of an number of
> microformats. wdyt?

It would probably make sense as an initial step if you don't want to venture into trying to wrap it as a Tika parser :-)

Julien
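On the question of how "correct" a detector is, accuracy over a set of labelled samples is one simple measure. The sketch below shows only the bookkeeping: the detector is passed in as a plain function, and every name here is hypothetical. It does not reproduce the experiment mentioned above or any real detector.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Illustrative sketch only; the detector is a placeholder supplied by the caller. */
class DetectionAccuracy {
    /** A byte sample together with the charset it is known to be encoded in. */
    static final class Sample {
        final byte[] bytes;
        final Charset expected;

        Sample(String text, Charset expected) {
            this.bytes = text.getBytes(expected);
            this.expected = expected;
        }
    }

    /** Fraction of samples for which the detector returns the expected charset. */
    static double accuracy(List<Sample> samples, Function<byte[], Charset> detector) {
        if (samples.isEmpty()) {
            return 0.0;
        }
        int correct = 0;
        for (Sample sample : samples) {
            if (sample.expected.equals(detector.apply(sample.bytes))) {
                correct++;
            }
        }
        return (double) correct / samples.size();
    }

    public static void main(String[] args) {
        List<Sample> samples = new ArrayList<>();
        samples.add(new Sample("caf\u00e9", StandardCharsets.UTF_8));
        samples.add(new Sample("caf\u00e9", StandardCharsets.ISO_8859_1));
        // A deliberately naive detector that always answers UTF-8.
        System.out.println(accuracy(samples, bytes -> StandardCharsets.UTF_8));
    }
}
```

A harness of this shape could wrap any detector, which fits the point made above that a delegated detector is best measured where it is implemented.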