Home | About | Sematext search-lucene.com search-hadoop.com
 Search Lucene and all its subprojects:

Switch to Plain View
Lucene, mail # dev - need some advice for modifying index format which supporting PForDelta compression/decompression


+
Li Li 2010-11-24, 10:16
+
Michael McCandless 2010-11-24, 16:57
Copy link to this message
-
Re: need some advice for modifying index format which supporting PForDelta compression/decompression
Li Li 2010-11-25, 09:07
Because we did some modification to index files, migrating to trunk
4.0 is not an easy task.

I think Partial decode is possible. To speed up decode, PForDelta
first decode all the arrays including exceptions without judge whether
it's exception or not. There can make cpu predict next instruction
correctly. Then use the linkedlist to decode exceptions. Even in
decode non-exceptions, it can be stoped when we get our needed.

  e.g. we consider implementation in
http://code.google.com/p/integer-array-compress-kit/
  for decode 5 bit frames, it decodes 5 integers for a loop. which
contains 5*32/5=32 encoded integers
  it first decodes 6 integer from val1
  then 6 integer from val2(and partial bits from val1)
  ...
  So we can stop if we only need the value of the 5th integer and
store the status of the decoder
  and if next needed is the 12th integer, we can continue from current status

  exception is a linked list, it's also easy to stop and resume it.

  But I don't know the efficiency and need doing some experiments
which comparing it with vint decoder.

static int fastDeCompressFor5Bit( int offSet, int[] encodedValue, int
dataNum, int decodeOffset, int[] decode ){
int maxBlocks =  dataNum >> 5;
int rest =  dataNum % 32;
// block process
for( int  block = 0 ; block < maxBlocks ; block++ ){
int val1 = encodedValue[ offSet ++ ];
int val2 = encodedValue[ offSet ++ ];
int val3 = encodedValue[ offSet ++ ];
int val4 = encodedValue[ offSet ++ ];
int val5 = encodedValue[ offSet ++ ];

decode[ decodeOffset ++ ]  =  ( val1 << 27 ) >>>  27 ;
decode[ decodeOffset ++ ]  =  ( val1 << 22 ) >>>  27 ;
decode[ decodeOffset ++ ]  =  ( val1 << 17 ) >>>  27 ;
decode[ decodeOffset ++ ]  =  ( val1 << 12 ) >>>  27 ;
decode[ decodeOffset ++ ]  =  ( val1 << 7  ) >>>  27 ;
decode[ decodeOffset ++ ]  =  ( val1 << 2  ) >>>  27 ;

decode[ decodeOffset ++ ]  =  ( ( val2 << 29 ) >>> 27 ) | ( val1 >>> 30 ) ;
decode[ decodeOffset ++ ]  =  ( val2 << 24 ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val2 << 19 ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val2 << 14 ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val2 << 9  ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val2 << 4  ) >>> 27 ;

decode[ decodeOffset ++ ]  =  ( ( val3 << 31 ) >>> 27 ) | ( val2 >>> 28 ) ;
decode[ decodeOffset ++ ]  =  ( val3 << 26 ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val3 << 21 ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val3 << 16 ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val3 << 11 ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val3 << 6  ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val3 << 1  ) >>> 27 ;

decode[ decodeOffset ++ ]  =  ( ( val4 << 28 ) >>> 27 ) | ( val3 >>> 31 ) ;
decode[ decodeOffset ++ ]  =  ( val4 << 23 ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val4 << 18 ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val4 << 13 ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val4 << 8  ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val4 << 3  ) >>> 27 ;

decode[ decodeOffset ++ ]  =  ( ( val5 << 30 ) >>> 27 ) | ( val4 >>> 29 ) ;
decode[ decodeOffset ++ ]  =  ( val5 << 25 ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val5 << 20 ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val5 << 15 ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val5 << 10 ) >>> 27 ;
decode[ decodeOffset ++ ]  =  ( val5 << 5  ) >>> 27 ;
decode[ decodeOffset ++ ]  =  val5 >>> 27 ;

}

2010/11/25 Michael McCandless <[EMAIL PROTECTED]>:
> Are you sure you can't use trunk?  Swapping in PForDelta as a codec is
> much easier and we are wanting to do that, anyway (as a hybrid codec
> w/ Pulsing as well), before releasing 4.0 :)
>
> Partial decode of PForDelta is challenging I think... because the
> exceptions are a linked list, you have to walk the exceptions up
> front?
>
> That's a nice idea to interleave the encoding of 128 docDeltas then
> 128 docFreqs... in trunk's Sep codec we currently put them in separate
> files.
>
> Mike