|
Chris Hostetter
2009-09-04, 00:24
Yonik Seeley
2009-09-05, 17:44
Chris Hostetter
2009-09-08, 23:46
Yonik Seeley
2009-09-09, 01:10
Chris Hostetter
2009-09-11, 21:06
Robert Muir
2009-09-11, 21:13
Yonik Seeley
2009-09-11, 21:24
Chris Hostetter
2009-09-11, 21:34
Donovan Jimenez
2009-09-11, 22:21
Yonik Seeley
2009-09-11, 22:32
Yonik Seeley
2009-09-11, 22:37
Donovan Jimenez
2009-09-11, 22:49
Chris Hostetter
2009-09-12, 00:37
Yonik Seeley
2009-09-13, 16:07
|
-
Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.javaChris Hostetter 2009-09-04, 00:24
: +61. SOLR-1091: Jetty's use of CESU-8 for code points outside the BMP : + resulted in invalid output from the serialized PHP writer. (yonik) ... : + static boolean modifiedUTF8 = System.getProperty("jetty.home") != null; ...that seems really hackish to me, particularly since for all we know there are other servlet containers that might have the same problem. Wouldn't it make more sense to add an init param for the PHPSerializedResponseWriter that lets people set this deterministicly in solrconfig.xml? (and then we could set it and comment on it in the example configs so things are more transparent) -Hoss
-
Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.javaYonik Seeley 2009-09-05, 17:44
On Thu, Sep 3, 2009 at 8:24 PM, Chris Hostetter<[EMAIL PROTECTED]> wrote:
> > : +61. SOLR-1091: Jetty's use of CESU-8 for code points outside the BMP > : + resulted in invalid output from the serialized PHP writer. (yonik) > > ... > > : + static boolean modifiedUTF8 = System.getProperty("jetty.home") != null; > > ...that seems really hackish to me, particularly since for all we know > there are other servlet containers that might have the same problem. Yeah, it is. But it's not really a valid option, it's a bug/limitation in the servlet container IMO. It would also suck to bloat configuration (and users brains) with options that don't really control anything, except that they must correctly match it up with how their servlet container behaves. And this doesn't actually fix everything - it simply makes it such that encapsulation at the transport layer isn't broken - the end user will still be getting back incorrect UTF8. I guess one better fix is to take the UTF8 encoding out of the servlet containers hands and do it all ourselves. Or just don't support any servlet containers that can't handle code points outside the BMP? Or is there simply a Jetty config option we've been missing. It's hard to believe that such a popular servlet container can't handle this. -Yonik http://www.lucidimagination.com
-
Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.javaChris Hostetter 2009-09-08, 23:46
: > : + static boolean modifiedUTF8 = System.getProperty("jetty.home") != null;
: > : > ...that seems really hackish to me, particularly since for all we know : > there are other servlet containers that might have the same problem. : : Yeah, it is. : But it's not really a valid option, it's a bug/limitation in the : servlet container IMO. It would also suck to bloat configuration (and : users brains) with options that don't really control anything, except : that they must correctly match it up with how their servlet container : behaves. And this doesn't actually fix everything - it simply makes : it such that encapsulation at the transport layer isn't broken - the : end user will still be getting back incorrect UTF8. my suggestion for adding an explicit option for the modifiedUTF8 behavior to the phps response writer was on the (miss)understanding that the value was always correct, it was just the length calculation that was being done wrong by jetty (and correct by other implementations) so if we noticed we were running in jetty we'd replace the length calculation with our own (which might be less efficient). but i guess i don't really understand the patch ... actually, the more i look at it the less i understand it.... The modifiedUTF8 boolean only influence the numeric length returned as the "s" option ... the actaully "val" string is still written "as is" by the servlet container. you've also changed the behavior so that even when false==modifiedUTF8, the length is now computed differently then before the patch (using UnicodeUtil.UTF16toUTF8) ... it's this second change i don't understand based on the context of the bug: why does the length value need to be computed differnetly for non-jetty implemntations? : containers hands and do it all ourselves. Or just don't support any : servlet containers that can't handle code points outside the BMP? Or I'm in favor of this approach .. if the container can't correctly output some characters, i see no reason to hide the bug -- if somebody else wants to code arround the bug by doing it all in solr that's fine, but untill then i don't see an advantage in outputing the correct number of bytes for a garbage string -- anybody who really cares is going to need to switch to a differnet servlet container anyway. : is there simply a Jetty config option we've been missing. It's hard : to believe that such a popular servlet container can't handle this. hey, i don't even understand the bug enough to know what to search for to try and find an option. -Hoss
-
Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.javaYonik Seeley 2009-09-09, 01:10
On Tue, Sep 8, 2009 at 7:46 PM, Chris Hostetter<[EMAIL PROTECTED]> wrote:
> The modifiedUTF8 boolean only influence the numeric length returned as the > "s" option ... the actaully "val" string is still written "as is" by the > servlet container. Yep. A code point (unicode character) outside of the BMP (basic multilingual plane, fits in 16 bits) is represented as two java chars - a surrogate pair. It's a single logical character - see String.codePointAt(). In correct UTF-8 it should be encoded as a single code point... but Jetty is ignoring the fact that it's a surrogate pair and encoding each Java char as it's own code point... this is often called modified-UTF8 or CESU-8. So... say you have this incorrect CESU-8 that is masquerading as UTF-8: all is not lost. - A decoder can unambiguously recognize that the characters decoded actually form a surrogate pair and correctly decode - but I don't know if there are aby requirements to do so (doubt it), and I don't know which do so. - A decoder decoding into UTF-16 (like Java Strings) will correctly decode anyway, even if treating each code point in the pair as separate. PHP5 doesn't even do unicode, so it won't care. PHP6 apparently has unicode support - don't know much about it. Bottom line - if we correctly encapsulate whatever the servlet container is writing, it's certainly possible for clients to use the output correctly. > you've also changed the behavior so that even when > false==modifiedUTF8, the length is now computed differently then before > the patch (using UnicodeUtil.UTF16toUTF8) ... it's this second change i > don't understand based on the context of the bug: why does the length > value need to be computed differnetly for non-jetty implemntations? It will be the same length for non-jetty implementations - I just rewrote the entire method and used the Lucene UTF16toUTF8 for performance reasons. (bad developer, bad!) -Yonik http://www.lucidimagination.com
-
Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.javaChris Hostetter 2009-09-11, 21:06
: A code point (unicode character) outside of the BMP (basic : multilingual plane, fits in 16 bits) is represented as two java chars : - a surrogate pair. It's a single logical character - see : String.codePointAt(). In correct UTF-8 it should be encoded as a : single code point... but Jetty is ignoring the fact that it's a : surrogate pair and encoding each Java char as it's own code point... : this is often called modified-UTF8 or CESU-8. : : So... say you have this incorrect CESU-8 that is masquerading as : UTF-8: all is not lost. ... I must be missunderstanding something still ... based on your description, it sounds like it shouldn't matter if the encoder knows that it's one logical character or not, either way it should wind up outputing the same number of bytes.... Except that if that were the case, why would we have had this bug in the first place? clearly i'm still missunderstanding something. : Bottom line - if we correctly encapsulate whatever the servlet : container is writing, it's certainly possible for clients to use the : output correctly. I still come back to not liking that this is a hardcoded hack just for jetty ... it seems like it would be easy for a future version of jetty to change this behavior in some way that makes solr start breaking -- jetty could fix this bug and break solr's byte counting ... that doesn't seem cool why don't we just output the raw bytes ourselves? the code to generate the byte[] was/is allready there, we're just ignoring it and only using the length. -Hoss
-
Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.javaRobert Muir 2009-09-11, 21:13
On Fri, Sep 11, 2009 at 5:06 PM, Chris Hostetter
<[EMAIL PROTECTED]> wrote: > I must be missunderstanding something still ... based on your description, > it sounds like it shouldn't matter if the encoder knows that it's one > logical character or not, either way it should wind up outputing the same > number of bytes.... yes, the # of bytes is different: 6 bytes versus 4 bytes in UTF-8 -- Robert Muir [EMAIL PROTECTED]
-
Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.javaYonik Seeley 2009-09-11, 21:24
On Fri, Sep 11, 2009 at 5:06 PM, Chris Hostetter
<[EMAIL PROTECTED]> wrote: > why don't we just output the raw bytes ourselves? That would require writing TextResponseWriter and friends as binary writers, right? Or did you have a different way in mind for injecting bytes into the output stream? -Yonik http://www.lucidimagination.com
-
Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.javaChris Hostetter 2009-09-11, 21:34
: > why don't we just output the raw bytes ourselves? : : That would require writing TextResponseWriter and friends as binary : writers, right? Or did you have a different way in mind for injecting : bytes into the output stream? Grr.... you're right. i got so turned arround thinking about counting the bytes and encapsulating it all in the PHPS markup i forgot that we still want/need the *raw* bytes output as part of the character stream ... not some sort of string representation of the bytes ... wow that sounds even more comfusing when i look at it on paper. character data sucks. I still repeat my earlier vote: let's yank this patch and just document that byte counts for strings included in the PHPS format are only accurate for characters outside the BMP if the servlet container being used writes them correctly -- that seems like a totally fair requirement to having in place, and points the figner squarly where in belongs (without putting us at risk for having broken code if/when jetty fixes this problem. Alternately: have an option and/or system property to force/disable this behavior even if jetty.home isn't/is set. Even if we change nothing else: there needs to be a big fat freaking disclaimer in the javadocs for the writer that it's looking at the jetty.home variable, and explaining why. -Hoss
-
Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.javaDonovan Jimenez 2009-09-11, 22:21
Is it possible (and would it even help) to normalize all strings
with regards to surrogate pairs at indexing time instead? or will container still possibly differ in byte for byte output? - Donovan On Sep 11, 2009, at 5:34 PM, Chris Hostetter wrote: > > : > why don't we just output the raw bytes ourselves? > : > : That would require writing TextResponseWriter and friends as binary > : writers, right? Or did you have a different way in mind for > injecting > : bytes into the output stream? > > > Grr.... you're right. i got so turned arround thinking about > counting the bytes and encapsulating it all in the PHPS markup i > forgot > that we still want/need the *raw* bytes output as part of the > character > stream ... not some sort of string representation of the bytes ... wow > that sounds even more comfusing when i look at it on paper. > > character data sucks. > > I still repeat my earlier vote: let's yank this patch and just > document > that byte counts for strings included in the PHPS format are only > accurate > for characters outside the BMP if the servlet container being used > writes > them correctly -- that seems like a totally fair requirement to > having in > place, and points the figner squarly where in belongs (without > putting us > at risk for having broken code if/when jetty fixes this problem. > > Alternately: have an option and/or system property to force/disable > this > behavior even if jetty.home isn't/is set. > > Even if we change nothing else: there needs to be a big fat freaking > disclaimer in the javadocs for the writer that it's looking at the > jetty.home variable, and explaining why. > > > > -Hoss >
-
Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.javaYonik Seeley 2009-09-11, 22:32
On Fri, Sep 11, 2009 at 6:21 PM, Donovan Jimenez
<[EMAIL PROTECTED]> wrote: > Is it possible (and would it even help) to normalize all strings with > regards to surrogate pairs at indexing time instead? Already done, in a way... there's only one way to represent a character outside the BMP in UTF-16 (which is the in-memory encoding used by Java String). Unless I misunderstood what you meant by normalization. -Yonik http://www.lucidimagination.com
-
Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.javaYonik Seeley 2009-09-11, 22:37
On Tue, Sep 8, 2009 at 7:46 PM, Chris Hostetter
<[EMAIL PROTECTED]> wrote: > if the container can't correctly output > some characters, i see no reason to hide the bug Another problem is that it won't reliably break. The bug breaks our encapsulation (before the patch) and thus the client reads the wrong number of chars for the string, and who knows what happens after that. The majority of the time will result in an exception, but it really depends. This is the type of stuff (buffer underflows / overflows) that could be used to mess with security too... a carefully crafted request could inject / change fields in the response and have it look valid. -Yonik http://www.lucidimagination.com
-
Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.javaDonovan Jimenez 2009-09-11, 22:49
you are correct, it was my misunderstanding of the problem - now that
I've read more than I ever wanted to know about UCS-2, UTF-16 and modified UTF-8, I'm more upto speed. Thanks for the patience. On Sep 11, 2009, at 6:32 PM, Yonik Seeley wrote: > On Fri, Sep 11, 2009 at 6:21 PM, Donovan Jimenez > <[EMAIL PROTECTED]> wrote: >> Is it possible (and would it even help) to normalize all strings >> with >> regards to surrogate pairs at indexing time instead? > > Already done, in a way... there's only one way to represent a > character outside the BMP in UTF-16 (which is the in-memory encoding > used by Java String). Unless I misunderstood what you meant by > normalization. > > -Yonik > http://www.lucidimagination.com
-
Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.javaChris Hostetter 2009-09-12, 00:37
: > if the container can't correctly output : > some characters, i see no reason to hide the bug : : Another problem is that it won't reliably break. The bug breaks our : encapsulation (before the patch) and thus the client reads the wrong : number of chars for the string, and who knows what happens after that. : The majority of the time will result in an exception, but it really : depends. This is the type of stuff (buffer underflows / overflows) : that could be used to mess with security too... a carefully crafted : request could inject / change fields in the response and have it look : valid. Isn't that an argument in favor of having an explicit option to control how we do the counting? otherwise we're still at risk of the scenerio i discribed (ie: jetty fixes the byte conversion code, but we're still counting the bytes "wrong" for them) with an explicit option we put control in the hands of the solr admin to decide what's right for them (we can even give them a little shell script to test it with) -Hoss
-
Re: svn commit: r808988 - in /lucene/solr/trunk: CHANGES.txt src/java/org/apache/solr/request/PHPSerializedResponseWriter.javaYonik Seeley 2009-09-13, 16:07
On Fri, Sep 11, 2009 at 8:37 PM, Chris Hostetter
<[EMAIL PROTECTED]> wrote: > Isn't that an argument in favor of having an explicit option to control > how we do the counting? otherwise we're still at risk of the scenerio i > discribed (ie: jetty fixes the byte conversion code, but we're still > counting the bytes "wrong" for them) I tried testing w/ Jetty7 RC5 but didn't get too far (couldn't bring it up). I've attached an additional patch to SOLR-1091 that adds a boolean "CESU8" system property. -Yonik http://www.lucidimagination.com |