Home | About | Sematext search-lucene.com search-hadoop.com
 Search Lucene and all its subprojects:

Switch to Threaded View
Lucene, mail # dev - Re: svn commit: r1141510 - /lucene/dev/trunk/modules/facet/src/java/org/apache/lucene/util/UnsafeByteArrayOutputStream.java


Copy link to this message
-
Re: svn commit: r1141510 - /lucene/dev/trunk/modules/facet/src/java/org/apache/lucene/util/UnsafeByteArrayOutputStream.java
Shai Erera 2011-07-01, 16:07
About the encoders package - there are several encoders there besides
VInt, so I wouldn't dispose of it so quickly. That said, I think we
should definitely explore consolidating VInt with the core classes,
and maybe write an encoder which delegate to them.

Or, come up w/ a different approach for allowing to plug in different
Encoders. I don't rule out anything, as long as we preserve
functionality and capabilities.

Shai

On Friday, July 1, 2011, Michael McCandless <[EMAIL PROTECTED]> wrote:
> On Fri, Jul 1, 2011 at 2:33 AM, Uwe Schindler <[EMAIL PROTECTED]> wrote:
>> Hi,
>>
>> I don't understand the whole discussion here, so please compare these two implementations and tell me which one is faster. Please don't hurt me, if you don't want to see src.jar code from OpenJDK Java6 - just delete this mail if you don’t want to (the code here is licensed under GPL):
>
> This is the source code for a specific version of one specific Java
> impl.  If we knew all Java impls simply implemented the primitive case
> using System.arraycopy (admittedly it's hard to imagine that they
> wouldn't!) then we are fine.
>
>> This is our implementation, simon replaced and Robert reverted (UnsafeByteArrayOutputStream):
>>
>>  private void grow(int newLength) {
>>    // It actually should be: (Java 1.7, when its intrinsic on all machines)
>>    // buffer = Arrays.copyOf(buffer, newLength);
>>    byte[] newBuffer = new byte[newLength];
>>    System.arraycopy(buffer, 0, newBuffer, 0, buffer.length);
>>    buffer = newBuffer;
>>  }
>>
>> So please look at the code, where is a difference that could slow down, except the Math.min() call which is an intrinsic in almost every JDK on earth?
>
> Right, in this case (if you used OpenJDK 6) we are obviously OK.  Not
> sure about other cases...
>
>> The problem we are talking about here is only about the generic Object[] copyOf method and also affects e.g. *all* Collection.toArray() methods - they all use this code, so whenever you use ArrayList.toArray() or similar, the slow code is executed. This is why we replaced Collections.sort() by CollectionUtil.sort, that does no array copy. Simon & me were not willing to replace the reallocations in FST code (Mike you remember, we reverted that on your GIT repo when we did perf tests) and other parts in Lucene (there are only few of them). The idea was only to replace primitive type code to make it easier readable. And with later JDK code it could even get faster (not slower), if Oracle starts to add intrinsics for those new methods (and that’s Dawid and mine reason to change to copyTo for primitive types). In general, if you use Java SDK methods, that are as fast as ours, they always have a chance to get faster in later JDKs. So we should always prefer Java SDK methods, unless they are slower because their default impl is too generic or has too much safety checks or uses reflection.
>
> OK I'm convinced (I think!) that for primitive types only, let's use
> Arrays.copyOf!
>
>> To come back to UnsafeByteArrayOutputStream:
>>
>> I would change the whole code, as I don’t like the allocation strategy in it (it's exponential, on every grow it doubles its size). We should change that to use ArrayUtils.grow() and ArrayUtils.oversize(), to have a similar allocation strategy like in trunk. Then we can discuss about this problem again when Simon & me wants to change ArrayUtils.grow methods to use Arrays.copyOf... *g* [just joking, I will never ask again, because this discussion here is endless and does not bring us forward].
>
> Well, it sounds like for primitive types, we can cutover
> ArrayUtils.grow methods.  Then we can look @ the nightly bench the
> next day ;)
>
> But I agree we should fix UnsafeByteArrayOutputStream... or, isn't it
> (almost) a dup of ByteArrayDataOutput?
>
>> The other thing I don’t like in the new faceting module is duplication of vint code. Why don’t we change it to use DataInput/DataOutput and use Dawid's new In/OutStream wrapper for DataOutput everywhere. This would be much more streamlined with all the code we currently have. Then we can encode the payloads (or later docvalues) using the new UnsafeByteArrayOutputStream, wrapped with a OutputStreamDataOutput wrapper? Or maybe add a ByteArrayDataOutput class.