-
Big Longs in RecommenderJob
Matthew Bryan 2010-06-07, 21:42
I'm trying to use some real big longs in the RecommenderJob and I ran into the following problem:
java.lang.IllegalArgumentException: Can't encode value as signed: -9223224018927274648
    at org.apache.mahout.math.Varint.writeSignedVarLong(Varint.java:59)
    at org.apache.mahout.math.VarLongWritable.write(VarLongWritable.java:77)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
    at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
    at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:909)
    at org.apache.hadoop.mapred.MapTask$NewOutputCollector.write(MapTask.java:549)
    at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
    at org.apache.mahout.cf.taste.hadoop.ToEntityPrefsMapper.map(ToEntityPrefsMapper.java:68)
    at org.apache.mahout.cf.taste.hadoop.ToEntityPrefsMapper.map(ToEntityPrefsMapper.java:30)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:144)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:629)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:310)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)
It looks like the code in Varint.java wasn't accepting the full range of signed longs (MIN_SIGNED_VAR_LONG = -(1L << 62) instead of MIN_SIGNED_VAR_LONG = -(1L << 63)).
When I increase the max/min value, it breaks on the negative check in writeUnsignedVarLong, which seems like it shouldn't be there: the number can be "interpreted" as negative because we're storing an unsigned long in a signed long (Java has no unsigned type). When I remove that check, it breaks with "Variable length quantity is too long" in readUnsignedVarLong, and this is where things get fuzzy for me. I basically replaced the code in both readUnsigned blocks with the Google code that is referenced. I also replaced the DecodeZigZag blocks with the Google code. I then commented out the exception tests and everything ran fine for me. With the other tests passing I felt like the end-to-end conversion was working, but I'm not in a position to really validate my recommendations yet.
So if my technique sounds reasonable, maybe someone could apply the patch I've attached and check the results on a known sample?
Loving Mahout so far, thanks all!
Matt
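To illustrate the unsigned side of the problem described above, here is a minimal sketch of a variable-length writer and reader that treat the long as 64 raw bits, so a value with the top bit set needs no negative check. This is not the Mahout Varint code or the referenced Google code; the class name UnsignedVarLongSketch and the roundTrip helper are hypothetical, and only the exception message is taken from the thread.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical sketch, not Mahout's Varint class.
final class UnsignedVarLongSketch {

  // Writes 7 bits per byte, low-order group first; the high bit of each
  // byte marks "more bytes follow". The unsigned shift (>>>) makes a
  // value with the top bit set terminate after at most 10 bytes.
  static void writeUnsignedVarLong(long value, DataOutput out) throws IOException {
    while ((value & ~0x7FL) != 0L) {
      out.writeByte((int) ((value & 0x7F) | 0x80));
      value >>>= 7;
    }
    out.writeByte((int) value);
  }

  // Reads the groups back. Ten bytes (shifts 0..63) cover all 64 bits,
  // so only an eleventh byte is rejected as too long.
  static long readUnsignedVarLong(DataInput in) throws IOException {
    long value = 0L;
    int shift = 0;
    while (true) {
      byte b = in.readByte();
      value |= (long) (b & 0x7F) << shift;
      if ((b & 0x80) == 0) {
        return value;
      }
      shift += 7;
      if (shift > 63) {
        throw new IOException("Variable length quantity is too long");
      }
    }
  }

  // Encodes to an in-memory buffer and decodes again.
  static long roundTrip(long value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      writeUnsignedVarLong(value, new DataOutputStream(bytes));
      return readUnsignedVarLong(
          new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    long allBitsSet = -1L; // 0xFFFFFFFFFFFFFFFF when read as unsigned
    System.out.println(roundTrip(allBitsSet) == allBitsSet);
  }
}
```

In this sketch the length limit lives in the reader rather than in a sign check in the writer, which is one way to reconcile the two checks mentioned above.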
-
Re: Big Longs in RecommenderJob
Sean Owen 2010-06-07, 22:39
Yeah the problem is that signed values are zig-zag encoded into an unsigned value, which loses 1 bit, in addition to losing another bit by mapping to unsigned values.
Still there is definitely a way to make it work; the encoding is certainly defined for larger values and there is a need for it. I can work on the right fix.
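As a rough illustration of where the bits go, the following sketch shows the usual zig-zag mapping and applies it to the value from the stack trace. The class name ZigZagSketch and its method names are hypothetical, and this is not the Mahout code; it only assumes the standard zig-zag formulas.

```java
// Hypothetical sketch of zig-zag encoding on 64-bit values.
final class ZigZagSketch {

  // Maps 0, -1, 1, -2, ... to 0, 1, 2, 3, ... so small magnitudes
  // become small unsigned values.
  static long encode(long n) {
    return (n << 1) ^ (n >> 63);
  }

  // Inverse mapping; the unsigned shift keeps the top bit of raw usable.
  static long decode(long raw) {
    return (raw >>> 1) ^ -(raw & 1);
  }

  public static void main(String[] args) {
    long fromStackTrace = -9223224018927274648L;
    long raw = encode(fromStackTrace);
    // raw has its top bit set, so a writer that rejects negative longs
    // refuses it even though the encoding itself is well defined.
    System.out.println(raw < 0);
    System.out.println(decode(raw) == fromStackTrace);
  }
}
```

In this sketch, -(1L << 62) is the most negative input whose encoding still fits in 63 bits, which matches the 62-bit limit discussed in this thread.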
-
Re: Big Longs in RecommenderJob
Ted Dunning 2010-06-07, 23:36
The other solution would be to be satisfied with 62 bits of id space and only generate "small" longs.
On Mon, Jun 7, 2010 at 3:39 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
> Yeah the problem is that signed values are zig-zag encoded into an
> unsigned value, which loses 1 bit, in addition to losing another bit
> by mapping to unsigned values.
>
> Still there is definitely a way to make it work; the encoding is
> certainly defined for larger values and there is a need for it. I can
> work on the right fix.
-
Re: Big Longs in RecommenderJob
Sean Owen 2010-06-07, 23:46
Really, the mistake here (which is mine) is writing these IDs as signed values. As used in the recommender bit, the IDs are already nonnegative longs and so can be written with the current implementation just fine, if encoded as unsigned.
That is part 2 of what I should change here since it will increase encoding efficiency a little.
On Tue, Jun 8, 2010 at 12:36 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> The other solution would be to be satisfied with 62 bits of id space and
> only generate "small" longs.
>
> On Mon, Jun 7, 2010 at 3:39 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
>
>> Yeah the problem is that signed values are zig-zag encoded into an
>> unsigned value, which loses 1 bit, in addition to losing another bit
>> by mapping to unsigned values.
>>
>> Still there is definitely a way to make it work; the encoding is
>> certainly defined for larger values and there is a need for it. I can
>> work on the right fix.
-
Re: Big Longs in RecommenderJob
Sean Owen 2010-06-08, 11:42
I committed a change to just Varint. It is clever enough that I award myself a pat on the back:
public static long readSignedVarLong(DataInput in) throws IOException {
  long raw = readUnsignedVarLong(in);
  return (((raw << 63) >> 63) ^ raw) >> 1;
}
becomes
public static long readSignedVarLong(DataInput in) throws IOException {
  long raw = readUnsignedVarLong(in);
  long temp = (((raw << 63) >> 63) ^ raw) >> 1;
  return temp ^ ((raw >> 63) << 63);
}
and likewise for writing. It basically treats negative values as unsigned when asked to write unsigned and all is well.
Obvious right?
On Tue, Jun 8, 2010 at 1:46 AM, Sean Owen <[EMAIL PROTECTED]> wrote:
> Really, the mistake here (is mine and) is writing these IDs as signed
> values. As used in the recommender bit, the IDs are already
> nonnegative longs and so can be written with the current
> implementation just fine, if encoded as unsigned.
>
> That is part 2 of what I should change here since it will increase
> encoding efficiency a little.
>
> On Tue, Jun 8, 2010 at 12:36 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
>> The other solution would be to be satisfied with 62 bits of id space and
>> only generate "small" longs.
>>
>> On Mon, Jun 7, 2010 at 3:39 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
>>
>>> Yeah the problem is that signed values are zig-zag encoded into an
>>> unsigned value, which loses 1 bit, in addition to losing another bit
>>> by mapping to unsigned values.
>>>
>>> Still there is definitely a way to make it work; the encoding is
>>> certainly defined for larger values and there is a need for it. I can
>>> work on the right fix.
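For anyone wanting to check the two decode expressions quoted in this message, the sketch below puts them side by side with the conventional zig-zag decode. Only the bit expressions come from the message; the class name SignedDecodeSketch, the method names, and the choice of (raw >>> 1) ^ -(raw & 1) as the reference are assumptions made for illustration.

```java
// Hypothetical harness comparing the two decode expressions.
final class SignedDecodeSketch {

  // Expression before the change: the arithmetic shift (>> 1) copies
  // the top bit of its operand instead of shifting in a zero.
  static long decodeBefore(long raw) {
    return (((raw << 63) >> 63) ^ raw) >> 1;
  }

  // Expression after the change: flips the sign bit of the result
  // whenever the top bit of raw is set.
  static long decodeAfter(long raw) {
    long temp = (((raw << 63) >> 63) ^ raw) >> 1;
    return temp ^ ((raw >> 63) << 63);
  }

  // Conventional zig-zag decode using an unsigned shift (assumed reference).
  static long decodeReference(long raw) {
    return (raw >>> 1) ^ -(raw & 1);
  }

  public static void main(String[] args) {
    long[] samples = {0L, 1L, 2L, Long.MAX_VALUE, Long.MIN_VALUE, -1L, -2L};
    for (long raw : samples) {
      System.out.println(Long.toHexString(raw)
          + " before=" + (decodeBefore(raw) == decodeReference(raw))
          + " after=" + (decodeAfter(raw) == decodeReference(raw)));
    }
  }
}
```

The two expressions only differ from each other when the top bit of raw is set, which is exactly the range that the 62-bit limit used to exclude.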
-
Re: Big Longs in RecommenderJob
Ted Dunning 2010-06-08, 15:32
I confess I don't quite understand the issue. A comment or three might help.
Does this line:
return temp ^ ((raw >> 63) << 63);
Invert the sign bit if present in raw?
If so, would this be any different?
return temp ^ (raw & (1L << 63));
On Tue, Jun 8, 2010 at 4:42 AM, Sean Owen <[EMAIL PROTECTED]> wrote:
> public static long readSignedVarLong(DataInput in) throws IOException {
>   long raw = readUnsignedVarLong(in);
>   return (((raw << 63) >> 63) ^ raw) >> 1;
> }
>
> becomes
>
> public static long readSignedVarLong(DataInput in) throws IOException {
>   long raw = readUnsignedVarLong(in);
>   long temp = (((raw << 63) >> 63) ^ raw) >> 1;
>   return temp ^ ((raw >> 63) << 63);
> }
>
> and likewise for writing. It basically treats negative values as
> unsigned when asked to write unsigned and all is well.
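A quick way to check the question above is to compare the two sign-bit expressions directly. The sketch below is only an illustration with hypothetical names (SignBitMaskSketch and its methods). It also shows a Java detail worth keeping in mind: the mask has to be built from a long literal, because an int shift uses only the low five bits of the shift distance.

```java
// Hypothetical sketch comparing two ways to isolate the sign bit.
final class SignBitMaskSketch {

  // Arithmetic shift right fills with the sign bit (0 or -1), and
  // shifting that back left leaves only bit 63.
  static long viaShifts(long raw) {
    return (raw >> 63) << 63;
  }

  // Masks bit 63 directly; 1L makes this a 64-bit shift.
  static long viaMask(long raw) {
    return raw & (1L << 63);
  }

  // With an int literal the shift distance is reduced modulo 32, so
  // this is 1 << 31, widened to long with sign extension.
  static long intLiteralMask() {
    return 1 << 63;
  }

  public static void main(String[] args) {
    long[] samples = {0L, 1L, -1L, Long.MIN_VALUE, Long.MAX_VALUE, -5L};
    for (long raw : samples) {
      System.out.println(raw + ": " + (viaShifts(raw) == viaMask(raw)));
    }
    System.out.println(Long.toHexString(intLiteralMask()));
  }
}
```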
-
Re: Big Longs in RecommenderJob
Sean Owen 2010-06-08, 17:19
Yep nice one Ted, that's equivalent and faster. Will do.
On Jun 8, 2010 5:33 PM, "Ted Dunning" <[EMAIL PROTECTED]> wrote:
I confess I don't quite understand the issue. A comment or three might help.
Does this line:
return temp ^ ((raw >> 63) << 63);
Invert the sign bit if present in raw?
If so, would this be any different?
return temp ^ (raw & (1L << 63));
On Tue, Jun 8, 2010 at 4:42 AM, Sean Owen <[EMAIL PROTECTED]> wrote:
> public static long readSig...