Jeff Eastman 2012-02-12, 01:18
I'm wondering how to tease the elephant into accepting any concrete instance of the interface o.a.m.clustering.Cluster when writing trained clusters in the cleanup() method of CIMapper. I've gotten the MR version of the ClusterIterator to that point in testing, but it blows chunks with an IOException when I try to pass an o.a.m.clustering.kmeans.Cluster (I will rename the latter for 0.7). It seems MapTask.collect() wants == and not instanceof.
I've talked with Ted about passing Clusters rather than the current ClusterObservations but don't see how at this point. Any ideas?
Paritosh Ranjan 2012-02-12, 15:00
Can something like this help?
public class CIMapper<T extends Cluster> extends Mapper<WritableComparable<?>,VectorWritable,IntWritable,T> { ... }
Sean Owen 2012-02-12, 15:27
The problem really arises when you have to tell the Job what the class of the Mapper key/value is. It needs something concrete. The issue is not here in the Mapper declaration.
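The exact-class behaviour described here can be reproduced without Hadoop. The following is a minimal sketch, assuming a collector that compares the runtime class of each value against the class declared to the job; StrictCollector and KMeansCluster are hypothetical names, and the error text is only loosely modeled on the kind of mismatch message Jeff reports, not copied from Hadoop's source.

```java
import java.io.IOException;

// Hypothetical stand-ins for the types named in the thread.
interface Cluster {}

class KMeansCluster implements Cluster {}

// Simplified collector that compares classes exactly, as Jeff observed.
class StrictCollector {
  private final Class<?> declaredValueClass;

  StrictCollector(Class<?> declaredValueClass) {
    this.declaredValueClass = declaredValueClass;
  }

  void collect(Object value) throws IOException {
    // Identity comparison: a subtype of the declared class is rejected.
    if (value.getClass() != declaredValueClass) {
      throw new IOException("Type mismatch in value: expected "
          + declaredValueClass.getName() + ", received " + value.getClass().getName());
    }
  }

  // Convenience wrapper for demos: reports acceptance as a boolean.
  boolean accepts(Object value) {
    try {
      collect(value);
      return true;
    } catch (IOException e) {
      return false;
    }
  }
}

class StrictCollectorDemo {
  public static void main(String[] args) {
    KMeansCluster cluster = new KMeansCluster();
    StrictCollector byInterface = new StrictCollector(Cluster.class);
    StrictCollector byConcrete = new StrictCollector(KMeansCluster.class);
    System.out.println("instanceof Cluster: " + (cluster instanceof Cluster));
    System.out.println("accepted, interface declared: " + byInterface.accepts(cluster));
    System.out.println("accepted, concrete class declared: " + byConcrete.accepts(cluster));
  }
}
```

Under this model a generic parameter on the Mapper cannot help: generics are erased at runtime, and the declared class is whatever the job configuration was told.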
The general answer is no: it has to somehow know what it's reading before it reads it. You can accomplish this by, say, writing the class name in the output. This is how Java serialization works by default. It is unsuitable for many purposes here because the class name adds so much overhead to every record.
VectorWritable does something that splits the two -- has a tiny header where a few bits indicate "sparse" or "named", etc. and this is enough to know what representation was written and so how to read it.
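A compact header of this kind might look like the sketch below. It is not VectorWritable's actual wire format: TaggedVectorCodec, the FLAG_SPARSE bit, and both layouts are assumptions made up for illustration. The point is only that one leading byte is enough for the reader to pick the right parsing path.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical codec: a single header byte records which layout follows.
class TaggedVectorCodec {
  static final int FLAG_SPARSE = 0x01; // assumed bit assignment, for illustration

  // Dense layout: header, length, then every value.
  static byte[] writeDense(double[] values) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeByte(0);
      out.writeInt(values.length);
      for (double v : values) {
        out.writeDouble(v);
      }
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Sparse layout: header, cardinality, non-zero count, then (index, value) pairs.
  static byte[] writeSparse(int cardinality, Map<Integer, Double> nonZeros) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      out.writeByte(FLAG_SPARSE);
      out.writeInt(cardinality);
      out.writeInt(nonZeros.size());
      for (Map.Entry<Integer, Double> entry : nonZeros.entrySet()) {
        out.writeInt(entry.getKey());
        out.writeDouble(entry.getValue());
      }
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // The reader inspects the header first, so it knows how to parse the rest.
  static double[] read(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      int flags = in.readByte();
      double[] values = new double[in.readInt()];
      if ((flags & FLAG_SPARSE) != 0) {
        int count = in.readInt();
        for (int i = 0; i < count; i++) {
          int index = in.readInt();
          values[index] = in.readDouble();
        }
      } else {
        for (int i = 0; i < values.length; i++) {
          values[i] = in.readDouble();
        }
      }
      return values;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}

class TaggedVectorDemo {
  public static void main(String[] args) {
    Map<Integer, Double> nonZeros = new TreeMap<>();
    nonZeros.put(3, 2.5);
    byte[] sparse = TaggedVectorCodec.writeSparse(1000, nonZeros);
    byte[] dense = TaggedVectorCodec.writeDense(TaggedVectorCodec.read(sparse));
    System.out.println("sparse bytes: " + sparse.length + ", dense bytes: " + dense.length);
  }
}
```

The trade-off against writing a class name is that the set of representations must be known in advance and fits in a few bits.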
Ted Dunning 2012-02-12, 16:01
But this sounds like a runtime problem, not a type checking problem.
Polymorphism is generally a problem in the Hadoop API. That is why we have VectorWritable and why I added PolymorphicWritable.
Jeff,
Two questions:
1) would PolymorphicWritable<Cluster> help?
2) can you say more about what the IOException is? Does it give any hints?
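The idea behind a polymorphic writer, as the thread later describes it (the class name is sent ahead of the data), can be sketched as follows. This is an illustration of the technique, not Mahout's PolymorphicWritable source: SimpleWritable is a stand-in for the Writable contract declared here so the example runs without Hadoop, and PointCluster, RadiusCluster and PolymorphicIO are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for the Writable contract (assumed shape, declared locally).
interface SimpleWritable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Two hypothetical cluster implementations with different fields.
class PointCluster implements SimpleWritable {
  double center;
  PointCluster() {}
  PointCluster(double center) { this.center = center; }
  public void write(DataOutput out) throws IOException { out.writeDouble(center); }
  public void readFields(DataInput in) throws IOException { center = in.readDouble(); }
}

class RadiusCluster implements SimpleWritable {
  double center;
  double radius;
  RadiusCluster() {}
  RadiusCluster(double center, double radius) { this.center = center; this.radius = radius; }
  public void write(DataOutput out) throws IOException {
    out.writeDouble(center);
    out.writeDouble(radius);
  }
  public void readFields(DataInput in) throws IOException {
    center = in.readDouble();
    radius = in.readDouble();
  }
}

class PolymorphicIO {
  // Write the concrete class name, then let the instance write its own fields.
  static void write(DataOutput out, SimpleWritable value) throws IOException {
    out.writeUTF(value.getClass().getName());
    value.write(out);
  }

  // Read the class name, instantiate it reflectively, then read its fields.
  static <T extends SimpleWritable> T read(DataInput in, Class<T> base) throws IOException {
    String className = in.readUTF();
    try {
      T value = Class.forName(className).asSubclass(base).getDeclaredConstructor().newInstance();
      value.readFields(in);
      return value;
    } catch (ReflectiveOperationException e) {
      throw new IOException("Cannot instantiate " + className, e);
    }
  }

  static byte[] toBytes(SimpleWritable value) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      write(new DataOutputStream(bytes), value);
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  static SimpleWritable fromBytes(byte[] data) {
    try {
      return read(new DataInputStream(new ByteArrayInputStream(data)), SimpleWritable.class);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}

class PolymorphicIODemo {
  public static void main(String[] args) {
    SimpleWritable restored = PolymorphicIO.fromBytes(PolymorphicIO.toBytes(new RadiusCluster(2.0, 0.5)));
    System.out.println("restored as " + restored.getClass().getName());
  }
}
```

The reader needs no prior knowledge of the concrete type, at the cost of a class name per record and a no-argument constructor on every implementation.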
Sean Owen 2012-02-12, 16:27
Exactly right, and that's exactly the answer in some form. PolymorphicWritable isn't suitable if you're writing a lot of records as the overhead of writing a 40-byte string is too much at scale.
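The size of that overhead is easy to measure. The sketch below assumes the class name is written with DataOutput.writeUTF, which emits a two-byte length prefix followed by the encoded characters; ClassNameOverhead and the record counts in main are illustrative assumptions, not figures from the thread.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ClassNameOverhead {
  // Bytes that writeUTF emits for a name: 2-byte length prefix plus the characters.
  static int utfBytes(String name) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      new DataOutputStream(bytes).writeUTF(name);
      return bytes.size();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Total extra bytes when every record carries the class name.
  static long totalOverhead(String name, long records) {
    return (long) utfBytes(name) * records;
  }

  public static void main(String[] args) {
    String name = "org.apache.mahout.clustering.kmeans.Cluster";
    long clusters = 20L;            // assumed: a handful of clusters per mapper
    long points = 1_000_000_000L;   // assumed: one record per input point
    System.out.println("per record: " + utfBytes(name) + " bytes");
    System.out.println("for clusters: " + totalOverhead(name, clusters) + " bytes");
    System.out.println("for points: " + totalOverhead(name, points) + " bytes");
  }
}
```

For a few records per task the cost is negligible; paid once per input point it becomes a significant share of the data written.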
Jeff Eastman 2012-02-12, 16:35
Thanks Sean & Ted. That is what I've observed experimentally. I was going to pursue a ClusterWritable along the lines of VectorWritable but will try PolymorphicWritable<Cluster> first. Looking at it, I see it does send the class name, which might be onerous as Sean observed, except that I am only sending (k) clusters between each mapper and the reducer. I will report on this in an hour or so.
Raphael Cendrillon 2012-02-12, 16:43
Hi Jeff,
It's great to see some discussion on this. I ran into a similar problem when trying to make the SplitInput job work for arbitrary key and value classes. In the end I was able to sidestep the issue by just reading the key and value classes from the SequenceFileInput, but I never found a way to deal with this head on.
Jeff Eastman 2012-02-12, 17:22
This approach worked out, though not exactly as suggested: I was able to create a ClusterWritable which uses PolymorphicWritable to read and write its Cluster value field. This makes it through the mapper and reducer, but I'm still working on getting it all to fly in the ClusterIterator.
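The thread does not show the ClusterWritable itself, so the following is only a guess at its general shape: one concrete wrapper class that the job can declare, whose single field is written polymorphically. ClusterWritableSketch, SimpleWritable, KMeansLikeCluster and FuzzyLikeCluster are all hypothetical names, and the class-name encoding is an assumption based on the description above.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Stand-in for the Writable contract (assumed shape, declared locally).
interface SimpleWritable {
  void write(DataOutput out) throws IOException;
  void readFields(DataInput in) throws IOException;
}

// Hypothetical cluster implementations.
class KMeansLikeCluster implements SimpleWritable {
  double center;
  KMeansLikeCluster() {}
  KMeansLikeCluster(double center) { this.center = center; }
  public void write(DataOutput out) throws IOException { out.writeDouble(center); }
  public void readFields(DataInput in) throws IOException { center = in.readDouble(); }
}

class FuzzyLikeCluster implements SimpleWritable {
  double center;
  double fuzziness;
  FuzzyLikeCluster() {}
  FuzzyLikeCluster(double center, double fuzziness) { this.center = center; this.fuzziness = fuzziness; }
  public void write(DataOutput out) throws IOException {
    out.writeDouble(center);
    out.writeDouble(fuzziness);
  }
  public void readFields(DataInput in) throws IOException {
    center = in.readDouble();
    fuzziness = in.readDouble();
  }
}

// The wrapper's own class never varies; only the wrapped value does.
class ClusterWritableSketch implements SimpleWritable {
  private SimpleWritable value;

  ClusterWritableSketch() {}
  ClusterWritableSketch(SimpleWritable value) { this.value = value; }

  SimpleWritable get() { return value; }

  public void write(DataOutput out) throws IOException {
    out.writeUTF(value.getClass().getName()); // record the concrete type
    value.write(out);
  }

  public void readFields(DataInput in) throws IOException {
    String className = in.readUTF();
    try {
      value = Class.forName(className).asSubclass(SimpleWritable.class)
          .getDeclaredConstructor().newInstance();
    } catch (ReflectiveOperationException e) {
      throw new IOException("Cannot instantiate " + className, e);
    }
    value.readFields(in);
  }

  // Serialize and deserialize in memory, as a shuffle would between tasks.
  static ClusterWritableSketch roundTrip(ClusterWritableSketch source) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      source.write(new DataOutputStream(bytes));
      ClusterWritableSketch copy = new ClusterWritableSketch();
      copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
      return copy;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}

class ClusterWritableSketchDemo {
  public static void main(String[] args) {
    ClusterWritableSketch copy =
        ClusterWritableSketch.roundTrip(new ClusterWritableSketch(new FuzzyLikeCluster(2.0, 1.5)));
    System.out.println("wrapper: " + copy.getClass().getName());
    System.out.println("wrapped: " + copy.get().getClass().getName());
  }
}
```

Because every emitted value has the wrapper's class, an exact-class comparison against the declared output class would succeed regardless of which cluster implementation is inside.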
Lance Norskog 2012-02-13, 04:57
Another option is TupleWritable. But pull the source and make sure it works; I had problems.
-- Lance Norskog [EMAIL PROTECTED]
Jeff Eastman 2012-02-13, 05:07
PolymorphicWritable actually works great in the two applications of it I committed today. They are low-volume of course so the overhead of writing the class name is not onerous.