|
Jake Mannix
2011-09-05, 04:42
Sebastian Schelter
2011-09-05, 05:36
Ted Dunning
2011-09-05, 07:07
Tommaso Teofili
2011-09-05, 07:44
Sean Owen
2011-09-05, 09:13
Lance Norskog
2011-09-05, 10:47
Sean Owen
2011-09-05, 10:52
Jake Mannix
2011-09-05, 13:02
Jake Mannix
2011-09-05, 13:10
Sean Owen
2011-09-05, 15:56
Ted Dunning
2011-09-05, 17:26
Ted Dunning
2011-09-05, 17:31
Ted Dunning
2011-09-05, 17:33
Dhruv Kumar
2011-09-05, 18:08
Jake Mannix
2011-09-05, 19:14
Jake Mannix
2011-09-05, 19:16
Benson Margulies
2011-09-06, 20:52
Jake Mannix
2011-09-06, 21:09
Sebastian Schelter
2011-09-09, 08:12
Benson Margulies
2011-09-09, 14:01
Jake Mannix
2011-09-09, 14:34
Jake Mannix
2011-09-09, 14:36
Grant Ingersoll
2011-09-15, 21:22
Jake Mannix
2011-09-16, 00:14
Lance Norskog
2011-09-16, 03:25
Jake Mannix
2011-09-16, 04:19
Ted Dunning
2011-09-16, 04:27
Ted Dunning
2011-09-16, 04:28
Ted Dunning
2011-09-16, 04:31
Grant Ingersoll
2011-09-16, 10:45
Ted Dunning
2011-09-16, 20:24
Jake Mannix
2011-09-16, 20:31
Ted Dunning
2011-09-16, 22:36
Jake Mannix
2011-09-16, 22:44
|
-
Apache Giraph?Jake Mannix 2011-09-05, 04:42
Hey gang,
Has anyone here played much with Giraph<http://incubator.apache.org/giraph/>(currently now in the Apache Incubator)? One of my co-workers ran it on our corporate Hadoop cluster this past weekend, and found it did a very fast PageRank computation (far faster than even well-tuned M/R code on the same data), and it worked pretty close to out-of-the box. Seems like that style of computation (in-memory distributed datasets), as used by Giraph (and the recently-discussed-on-this-list GraphLab <http://graphlab.org/>, and Spark<http://www.spark-project.org/>, and Twister <http://www.iterativemapreduce.org/>, and Vowpal Wabbit<http://hunch.net/~vw/>, and probably a few others) is more and more the way to go for a lot of the things we want to do - scalable machine learning. "RAM is the new Disk, and Disk is the new Tape" after all... Giraph in particular seems nice, in that it runs on top of "old fashioned" Hadoop - it takes up (long-lived) Mapper slots on your regular cluster, spins up a ZK cluster if you don't supply the location of one, and is all in java (which may be a minus, for some people, I guess, but having to run some big exec'ed out C++ code (GraphLab, VW), or run on-top of (admittedly awesome) Mesos (Spark [which while running on the JVM, is also in Scala]), or run its own totally custom inter-server communication and data structures (Twister and many of the others)). Seems we should be not just supportive of this kind of thing, but try and find some common ground and integration points. -jake
-
Re: Apache Giraph?Sebastian Schelter 2011-09-05, 05:36
I didn't have the time to try it yet, but I'm really interested in it.
As far as I know it's an implementation of Google's Pregel paper. One of my current research goals is to implement a bunch of graph algorithms on top of M/R to get a feeling which properties of the underlying system would need to be changed to make the algorithm faster and easier implementable. Maybe Pregel/Giraph is already the answer to that. I'd be very open towards playing with such a system but as far as integration in Mahout goes, it's a very tough question what other systems should be supported and how we would proceed to integrate them. From my experience its already hard enough for a lot of users to get our hadoop code running... --sebastian On 05.09.2011 06:42, Jake Mannix wrote: > Hey gang, > > Has anyone here played much with > Giraph<http://incubator.apache.org/giraph/>(currently now in the > Apache Incubator)? One of my co-workers ran it on our > corporate Hadoop cluster this past weekend, and found it did a very fast > PageRank computation (far faster than even well-tuned M/R code on the same > data), and it worked pretty close to out-of-the box. Seems like that style > of computation (in-memory distributed datasets), as used by Giraph (and the > recently-discussed-on-this-list GraphLab<http://graphlab.org/>, and > Spark<http://www.spark-project.org/>, and > Twister<http://www.iterativemapreduce.org/>, and Vowpal > Wabbit<http://hunch.net/~vw/>, > and probably a few others) is more and more the way to go for a lot of the > things we want to do - scalable machine learning. "RAM is the new Disk, and > Disk is the new Tape" after all... > > Giraph in particular seems nice, in that it runs on top of "old fashioned" > Hadoop - it takes up (long-lived) Mapper slots on your regular cluster, > spins up a ZK cluster if you don't supply the location of one, and is all in > java (which may be a minus, for some people, I guess, but having to run some > big exec'ed out C++ code (GraphLab, VW), or run on-top of (admittedly > awesome) Mesos (Spark [which while running on the JVM, is also in Scala]), > or run its own totally custom inter-server communication and data structures > (Twister and many of the others)). > > Seems we should be not just supportive of this kind of thing, but try and > find some common ground and integration points. > > -jake >
-
Re: Apache Giraph?Ted Dunning 2011-09-05, 07:07
I think that alternatives to traditional map-reduce are really important for
ML. If you think about it, it is now common for 30-40 machine clusters to have a terabyte of RAM and it is actually slightly unusual to have ML datasets that large. As such, BSP-like compute models have a lot to recommend them. Also, there are implementations of BSP or pregel that allow the computation to exceed memory by requiring all nodes to be serializable. Only the highly active ones are kept in memory. This allows a graceful transition from complete memory-residency to something more like traditional hadoop-style map-reduce. Another interesting model is that of Spark. I can well imagine that much of what we do could be replaced with very small Spark programs which could be composed much more easily than our current map-reduce command-line stuff could be glued together. Lots of codes should experience two orders of magnitude speedup from the use of these alternative systems. The arrival of map-reduce 2.0 (or the equivalent use of Mesos) should actually liberate us from the tyranny of a single compute model, but it does bring us to the point of having to decide how much of this sort of thing we want to depend on from another project or how much we want to take on ourselves. For instance, it might well be possible to co-opt the Spark community as part of our own ... their purpose was to support machine learning and joining forces might help achieve that on a broader scale than previously envisioned. Graphlab is an interesting case as well. On Sun, Sep 4, 2011 at 11:42 PM, Jake Mannix <[EMAIL PROTECTED]> wrote: > Hey gang, > > Has anyone here played much with > Giraph<http://incubator.apache.org/giraph/>(currently now in the > Apache Incubator)? One of my co-workers ran it on our > corporate Hadoop cluster this past weekend, and found it did a very fast > PageRank computation (far faster than even well-tuned M/R code on the same > data), and it worked pretty close to out-of-the box. Seems like that style > of computation (in-memory distributed datasets), as used by Giraph (and the > recently-discussed-on-this-list GraphLab <http://graphlab.org/>, and > Spark<http://www.spark-project.org/>, and > Twister <http://www.iterativemapreduce.org/>, and Vowpal > Wabbit<http://hunch.net/~vw/>, > and probably a few others) is more and more the way to go for a lot of the > things we want to do - scalable machine learning. "RAM is the new Disk, > and > Disk is the new Tape" after all... > > Giraph in particular seems nice, in that it runs on top of "old fashioned" > Hadoop - it takes up (long-lived) Mapper slots on your regular cluster, > spins up a ZK cluster if you don't supply the location of one, and is all > in > java (which may be a minus, for some people, I guess, but having to run > some > big exec'ed out C++ code (GraphLab, VW), or run on-top of (admittedly > awesome) Mesos (Spark [which while running on the JVM, is also in Scala]), > or run its own totally custom inter-server communication and data > structures > (Twister and many of the others)). > > Seems we should be not just supportive of this kind of thing, but try and > find some common ground and integration points. > > -jake >
-
Re: Apache Giraph?Tommaso Teofili 2011-09-05, 07:44
2011/9/5 Ted Dunning <[EMAIL PROTECTED]>
> I think that alternatives to traditional map-reduce are really important > for > ML. If you think about it, it is now common for 30-40 machine clusters to > have a terabyte of RAM and it is actually slightly unusual to have ML > datasets that large. As such, BSP-like compute models have a lot to > recommend them. > > Also, there are implementations of BSP or pregel that allow the computation > to exceed memory by requiring all nodes to be serializable. Only the > highly > active ones are kept in memory. This allows a graceful transition from > complete memory-residency to something more like traditional hadoop-style > map-reduce. > in regards of that, you may have a look at Apache Hama which is based on BSP model but also offers a Pregel-like API: http://incubator.apache.org/hama/ My 2 cents, Tommaso > > Another interesting model is that of Spark. I can well imagine that much > of > what we do could be replaced with very small Spark programs which could be > composed much more easily than our current map-reduce command-line stuff > could be glued together. Lots of codes should experience two orders of > magnitude speedup from the use of these alternative systems. > > The arrival of map-reduce 2.0 (or the equivalent use of Mesos) should > actually liberate us from the tyranny of a single compute model, but it > does > bring us to the point of having to decide how much of this sort of thing we > want to depend on from another project or how much we want to take on > ourselves. For instance, it might well be possible to co-opt the Spark > community as part of our own ... their purpose was to support machine > learning and joining forces might help achieve that on a broader scale than > previously envisioned. Graphlab is an interesting case as well. > > On Sun, Sep 4, 2011 at 11:42 PM, Jake Mannix <[EMAIL PROTECTED]> > wrote: > > > Hey gang, > > > > Has anyone here played much with > > Giraph<http://incubator.apache.org/giraph/>(currently now in the > > Apache Incubator)? One of my co-workers ran it on our > > corporate Hadoop cluster this past weekend, and found it did a very fast > > PageRank computation (far faster than even well-tuned M/R code on the > same > > data), and it worked pretty close to out-of-the box. Seems like that > style > > of computation (in-memory distributed datasets), as used by Giraph (and > the > > recently-discussed-on-this-list GraphLab <http://graphlab.org/>, and > > Spark<http://www.spark-project.org/>, and > > Twister <http://www.iterativemapreduce.org/>, and Vowpal > > Wabbit<http://hunch.net/~vw/>, > > and probably a few others) is more and more the way to go for a lot of > the > > things we want to do - scalable machine learning. "RAM is the new Disk, > > and > > Disk is the new Tape" after all... > > > > Giraph in particular seems nice, in that it runs on top of "old > fashioned" > > Hadoop - it takes up (long-lived) Mapper slots on your regular cluster, > > spins up a ZK cluster if you don't supply the location of one, and is all > > in > > java (which may be a minus, for some people, I guess, but having to run > > some > > big exec'ed out C++ code (GraphLab, VW), or run on-top of (admittedly > > awesome) Mesos (Spark [which while running on the JVM, is also in > Scala]), > > or run its own totally custom inter-server communication and data > > structures > > (Twister and many of the others)). > > > > Seems we should be not just supportive of this kind of thing, but try > and > > find some common ground and integration points. > > > > -jake > > >
-
Re: Apache Giraph?Sean Owen 2011-09-05, 09:13
My high-level view is that Hadoop was very excellent for its intended use
case, and that because of this, people have abused it to do things quite unlike what it was designed for. It's amazing that a glorified logs processing framework could do anything like machine learning well. Mahout embodies that interesting struggle. I can only believe that most any of the "next gen" frameworks discussed here, which are necessarily more general-purpose, will be better for things like machine learning. I am not so interesting in MR 2.0 -- nothing wrong with it just not something better conceptually for machine learning. I like projects like Ciel from MS Research -- simply more general purpose graph- and data-flow-oriented frameworks. I personally believe that while Mahout *could* be anything, that it's reached about the level of scope it can possibly sustain given the amount of effort coming in, in trying to do something interesting on top of MapReduce. This will be useful for a couple years to come yet. That is to say: I think it will be interesting to explore another machine-learning-at-scale project in 2 years or so on top of one of these next-gen frameworks. (Was that the question?)
-
Re: Apache Giraph?Lance Norskog 2011-09-05, 10:47
Haha. Part of the abuse is "Google envy". (Google "Sigmund Freud" to fully
understand this.) I'm finding inherent difficulty in documenting map/reduce code, and assimilating an existing job. Haven't seen a "UML for Map/Reduce" yet; Hamake is the cleanest "everything in one file" description, and it only stores half of what's going on. Mahout's "in-memory" code is all single-threaded, and is bifurcated from the map/reduce versions. A few places have custom multi-threading shoehorned in. You can't buy a stationary single-processor computer. We bought an 8-core server 1.5 years ago for under 5 grand. We can't easily write multi-processor java for it. If Mahout wants to stay M/R focused it could use an in-memory M/R executor as the "in-memory" option. Several systems (including the QT graphics framework!) include such a beast. It's not very hard. The big overhead is sorting, and you often don't care. https://github.com/LanceNorskog/parallel/tree/master/project/src/java/parallel/littlemr https://github.com/LanceNorskog/parallel/blob/master/project/test/java/parallel/littlemr/TestFullPass.java About m/r's future: Riak supports doing a map/reduce job during a query. That is, m/r is a distributed version of the classic DB stored procedure; the query happens between the DB and the (multiple, parallel) clients. This is a natural place for m/r, and it may live on in that context after all Google envy fades away. On Mon, Sep 5, 2011 at 2:13 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > My high-level view is that Hadoop was very excellent for its intended use > case, and that because of this, people have abused it to do things quite > unlike what it was designed for. It's amazing that a glorified logs > processing framework could do anything like machine learning well. Mahout > embodies that interesting struggle. > > I can only believe that most any of the "next gen" frameworks discussed > here, which are necessarily more general-purpose, will be better for things > like machine learning. I am not so interesting in MR 2.0 -- nothing wrong > with it just not something better conceptually for machine learning. I like > projects like Ciel from MS Research -- simply more general purpose graph- > and data-flow-oriented frameworks. > > I personally believe that while Mahout *could* be anything, that it's > reached about the level of scope it can possibly sustain given the amount > of > effort coming in, in trying to do something interesting on top of > MapReduce. > This will be useful for a couple years to come yet. > > That is to say: I think it will be interesting to explore another > machine-learning-at-scale project in 2 years or so on top of one of these > next-gen frameworks. > > (Was that the question?) > -- Lance Norskog [EMAIL PROTECTED]
-
Re: Apache Giraph?Sean Owen 2011-09-05, 10:52
On Mon, Sep 5, 2011 at 11:47 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:
> Haha. Part of the abuse is "Google envy". (Google "Sigmund Freud" to fully > understand this.) > > I'm finding inherent difficulty in documenting map/reduce code, and > assimilating an existing job. Haven't seen a "UML for Map/Reduce" yet; > Hamake is the cleanest "everything in one file" description, and it only > stores half of what's going on. > I think Pig deserves a lot of credit for being the closest to a real high-level language for M/R. It's still funky to write these things in M/R, even with that level of help. > > Mahout's "in-memory" code is all single-threaded, and is bifurcated from > the > map/reduce versions. A few places have custom multi-threading shoehorned > in. > You can't buy a stationary single-processor computer. We bought an 8-core > server 1.5 years ago for under 5 grand. We can't easily write > multi-processor java for it. If Mahout wants to stay M/R focused it could > Well, the idea is that the parallelism comes one level above. You can handle N simultaneous requests from callers at once on N cores. I think this is a pretty good theory.
-
Re: Apache Giraph?Jake Mannix 2011-09-05, 13:02
Ted,
As usual, you're thinking the same way I am, it seems. On Mon, Sep 5, 2011 at 12:07 AM, Ted Dunning <[EMAIL PROTECTED]> wrote: > I think that alternatives to traditional map-reduce are really important > for > ML. If you think about it, it is now common for 30-40 machine clusters to > have a terabyte of RAM and it is actually slightly unusual to have ML > datasets that large. As such, BSP-like compute models have a lot to > recommend them. > Yep - Big Data seems to me to fall into a hierarchy these days: Sensor data >> Log data >> User-Generated data, Graph data >> User data, Text corpora Where the first one lives in the mother of all Big Data domains, hard science (think: CERN, and time-series spatial weather data), and already most compute clusters have more RAM than anything except the first two. Sure, web logs don't fit into distributed RAM, but even FB and Twitter's entire social graph will fit into main memory of even a *very small* cluster. I don't know of many text corpora which don't fit into a TB or two of RAM (unless you're talking web-wide, or "all Tweets across time"). Also, there are implementations of BSP or pregel that allow the computation > to exceed memory by requiring all nodes to be serializable. Only the > highly > active ones are kept in memory. This allows a graceful transition from > complete memory-residency to something more like traditional hadoop-style > map-reduce. > Yeah, when this is done right (HARD!), it basically makes for the universally scalable kind of computation we want. Truly in-memory when possible, but doesn't require it. > Another interesting model is that of Spark. I can well imagine that much > of > what we do could be replaced with very small Spark programs which could be > composed much more easily than our current map-reduce command-line stuff > could be glued together. Lots of codes should experience two orders of > magnitude speedup from the use of these alternative systems. > This is my impression too. The more I play with Spark, the more it looks like "the Right Paradigm" for this kind of computation: how many years has I been complaining that all I've ever wanted from Hadoop (and/or Mahout) is to be able to say something like: vectors = load("hdfs://mydataFile"); vectors.map(new Function<Vector, Vector>() { Vector apply(Vector in) { return in.normailze(1); }) .filter(new Predicate<Vector>() { boolean apply(Vector in) { return in.numNonDefaultValues() < 1000; }) .reduce(new Function<Pair<Vector, Vector>, Vector>() { Vector apply(Pair<Vector, Vector> pair) { return pair.getFirst().plus(pair.getSecond()); }); Spark lets you do exactly this kind of strongly typed mixed OO+functional thinking (except without the verbosity of Java, and with proper closures). > The arrival of map-reduce 2.0 (or the equivalent use of Mesos) should > actually liberate us from the tyranny of a single compute model, but it > does > bring us to the point of having to decide how much of this sort of thing we > want to depend on from another project or how much we want to take on > ourselves. Yes, this is a tricky part: Spark is awesome, but a) scala (good thing?), and b) dependent on Mesos (awesome thing, but not widely available/adopted yet) Giraph is closer to our dependency-set: a) runs on raw Hadoop 0.20.3 and 0.20.203 b) in java c) and is very similar to the usual programming model of M/R Point c) is pretty important: while you do get everything in-memory, you still program it in the way that is familiar to Hadoop-people: you have Job objects you can configure to fit your cluster's special shape and characteristics, etc. Anything YARN-based kinda worries me. I am not sure how soon people will really see production-grade next-gen M/R environments available to them. For instance, it might well be possible to co-opt the Spark So there could be several steps in here. Some people are never ever going to move to Scala (I might not be one of those people, but in another flip of the quantum bit that triggered some neural process, I could have been), and may never get access to Mesos (so sad for them - it really is pretty magical), so while I fully think finding a way to work with the Spark folk / adopt them as our own / start dating them is a good idea, I think that Giraph is of more immediate help: Giraph is a) an already mavenized b) Apache Incubator java project We could modify a lot of our current algorithms to run on the exact same clusters they're currently running on, with a few more jars (which have already been checked for ASF compliance), but be tons faster, and let us start "thinking like a vertex". (although since everything for me is really a matrix, I'll be "thinking like a vector which is a row of a matrix", which is exactly the same thing). So maybe it's a multi-step process: start integrating, depending on, or absorbing wholesale Giraph *soon*, help people write their adapters for things like GraphLab in parallel, and learn from Spark and see how much of our work could be replaced with that in the longer term (most likely Spark and YARN will not stay so far apart in the long term either). -jake
-
Re: Apache Giraph?Jake Mannix 2011-09-05, 13:10
On Mon, Sep 5, 2011 at 2:13 AM, Sean Owen <[EMAIL PROTECTED]> wrote:
> My high-level view is that Hadoop was very excellent for its intended use > case, and that because of this, people have abused it to do things quite > unlike what it was designed for. It's amazing that a glorified logs > processing framework could do anything like machine learning well. Mahout > embodies that interesting struggle. > Yeah, and the more I try to do straightforward things on raw M/R, the more often I get warning looks from my cluster admins who ask me if I really need to run iterative algorithms with 10 TB of intermediate data (between each step...). > I can only believe that most any of the "next gen" frameworks discussed > here, which are necessarily more general-purpose, will be better for things > like machine learning. More general purpose? Or less? > I am not so interesting in MR 2.0 -- nothing wrong > with it just not something better conceptually for machine learning. I like > projects like Ciel from MS Research -- simply more general purpose graph- > and data-flow-oriented frameworks. > Yeah, I'll believe the hype on MR 2.0 when I see it (er, the "it" being "something more than the hype"). > I personally believe that while Mahout *could* be anything, that it's > reached about the level of scope it can possibly sustain given the amount > of > effort coming in, in trying to do something interesting on top of > MapReduce. > This will be useful for a couple years to come yet. > Definitely useful, but also reaching a point of limitedness, given the ratio of available RAM to useful data size (as Ted mentioned). > That is to say: I think it will be interesting to explore another > machine-learning-at-scale project in 2 years or so on top of one of these > next-gen frameworks. > (Was that the question?) > Well, I think it's actually the time to start working with some of the more promising ones *now* before they become their own fully fledged communities and have their own technical debt which is hard to interoperate with. As I see it, things like GraphLab and VW, being off the JVM, require much more work from the other community, and as such the best we can do is help test out the integration layers, for the time being. NextGen MR is something to think about in the future, as you say, and maybe Spark is closer, but also requires mesos, so is also more "future", but Giraph is very new, very small community, and very easily adoptable, I think. -jake
-
Re: Apache Giraph?Sean Owen 2011-09-05, 15:56
On Mon, Sep 5, 2011 at 2:10 PM, Jake Mannix <[EMAIL PROTECTED]> wrote:
> > More general purpose? Or less? > > Hmm interesting question. I suppose I thought of these sorts of things as more general purpose than Hadoop, which I still can't but think of as basically an analytics tool. But a graph-oriented framework is arguably fairly special-purpose too. > >
-
Re: Apache Giraph?Ted Dunning 2011-09-05, 17:26
Actually, the virtue here is that it allows alternative frameworks. Some of
those alternatives are things we could use. So MR2.0 isn't interesting per se in a machine learning context, but the flexibility it offers is very interesting. On Mon, Sep 5, 2011 at 4:13 AM, Sean Owen <[EMAIL PROTECTED]> wrote: > I am not so interesting in MR 2.0 -- nothing wrong > with it just not something better conceptually for machine learning. >
-
Re: Apache Giraph?Ted Dunning 2011-09-05, 17:31
Of course, FlumeJava (or the open source version Plume) will let this same
kind of code be written for map-reduce and plausible could generate code for spark-like execution or Giraph execution. That would let us stay with java at the cost of a few extra lines of code when expressing a closure. On Mon, Sep 5, 2011 at 8:02 AM, Jake Mannix <[EMAIL PROTECTED]> wrote: > > Another interesting model is that of Spark. I can well imagine that much > > of > > what we do could be replaced with very small Spark programs which could > be > > composed much more easily than our current map-reduce command-line stuff > > could be glued together. Lots of codes should experience two orders of > > magnitude speedup from the use of these alternative systems. > > > > This is my impression too. The more I play with Spark, the more it looks > like > "the Right Paradigm" for this kind of computation: how many years has I > been > complaining that all I've ever wanted from Hadoop (and/or Mahout) is to be > able > to say something like: > > vectors = load("hdfs://mydataFile"); > vectors.map(new Function<Vector, Vector>() { > Vector apply(Vector in) { return in.normailze(1); }) > .filter(new Predicate<Vector>() { > boolean apply(Vector in) { return > in.numNonDefaultValues() < 1000; }) > .reduce(new Function<Pair<Vector, Vector>, Vector>() { > Vector apply(Pair<Vector, Vector> pair) { return > pair.getFirst().plus(pair.getSecond()); }); > > Spark lets you do exactly this kind of strongly typed mixed OO+functional > thinking > (except without the verbosity of Java, and with proper closures).
-
Re: Apache Giraph?Ted Dunning 2011-09-05, 17:33
Giraph really is fairly close to what we need and it is conceivable that we
could do something FlumeJava-ish on top of that. YARN is supposed to be ready for prime time in the Q1-Q2 time frame next year. When it is ready, it will be available from essentially all the Hadoop-compatible vendors. On Mon, Sep 5, 2011 at 8:02 AM, Jake Mannix <[EMAIL PROTECTED]> wrote: > Giraph is closer to our dependency-set: > a) runs on raw Hadoop 0.20.3 and 0.20.203 > b) in java > c) and is very similar to the usual programming model of M/R > > Point c) is pretty important: while you do get everything in-memory, you > still > program it in the way that is familiar to Hadoop-people: you have Job > objects you > can configure to fit your cluster's special shape and characteristics, etc. > > Anything YARN-based kinda worries me. I am not sure how soon people > will really see production-grade next-gen M/R environments available to > them. >
-
Re: Apache Giraph?Dhruv Kumar 2011-09-05, 18:08
On Mon, Sep 5, 2011 at 9:02 AM, Jake Mannix <[EMAIL PROTECTED]> wrote:
> > This is my impression too. The more I play with Spark, the more it looks > like > "the Right Paradigm" for this kind of computation: how many years has I > been > complaining that all I've ever wanted from Hadoop (and/or Mahout) is to be > able > to say something like: > > vectors = load("hdfs://mydataFile"); > vectors.map(new Function<Vector, Vector>() { > Vector apply(Vector in) { return in.normailze(1); }) > .filter(new Predicate<Vector>() { > boolean apply(Vector in) { return > in.numNonDefaultValues() < 1000; }) > .reduce(new Function<Pair<Vector, Vector>, Vector>() { > Vector apply(Pair<Vector, Vector> pair) { return > pair.getFirst().plus(pair.getSecond()); }); > +1 for advocating side effect free programming! Twister is pretty interesting too and can model Hadoop jobs in a functional style: http://www.iterativemapreduce.org/
-
Re: Apache Giraph?Jake Mannix 2011-09-05, 19:14
On Mon, Sep 5, 2011 at 10:33 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Giraph really is fairly close to what we need and it is conceivable that we > could do something FlumeJava-ish on top of that. > Yeah, I'll play with it a bit more, and see how easy it is to work with, and report back here! > YARN is supposed to be ready for prime time in the Q1-Q2 time frame next > year. When it is ready, it will be available from essentially all the > Hadoop-compatible vendors. > "ready for prime-time" and "my cluster admin is willing to install it" are two very different timeframes for most people. For example, LinkedIn only moved to Java 1.6 *2 years ago*, and Twitter's still on back-patched Hadoop 0.20.3 in production (although we may soon have hdfs+append branch-compatible). -jake
-
Re: Apache Giraph?Jake Mannix 2011-09-05, 19:16
On Mon, Sep 5, 2011 at 11:08 AM, Dhruv Kumar <[EMAIL PROTECTED]> wrote:
> > > Twister is pretty interesting too and can model Hadoop jobs in a functional > style: > http://www.iterativemapreduce.org/ > Yeah. Twister looks cool, but seems to do all it's own stuff, and does not look like the community around it has made the "academia -> industry" jump at all yet. -jake
-
Re: Apache Giraph?Benson Margulies 2011-09-06, 20:52
A quick look at Spark seems to suggest that the data it's working on
is read-only. If you wanted to have an evolving data set, would any of these others be more useful? On Mon, Sep 5, 2011 at 3:16 PM, Jake Mannix <[EMAIL PROTECTED]> wrote: > On Mon, Sep 5, 2011 at 11:08 AM, Dhruv Kumar <[EMAIL PROTECTED]> wrote: >> >> >> Twister is pretty interesting too and can model Hadoop jobs in a functional >> style: >> http://www.iterativemapreduce.org/ >> > > Yeah. Twister looks cool, but seems to do all it's own stuff, and does not > look > like the community around it has made the "academia -> industry" jump at > all yet. > > -jake >
-
Re: Apache Giraph?Jake Mannix 2011-09-06, 21:09
On Tue, Sep 6, 2011 at 1:52 PM, Benson Margulies <[EMAIL PROTECTED]>wrote:
> A quick look at Spark seems to suggest that the data it's working on > is read-only. If you wanted to have an evolving data set, would any of > these others be more useful? > The original data set is read-only, but the same could be said for HDFS, right? Or are you saying that you don't see a lot of support for outputting back to HDFS the results of a computation? -jake
-
Re: Apache Giraph?Sebastian Schelter 2011-09-09, 08:12
On 05.09.2011 06:42, Jake Mannix wrote:
> One of my co-workers ran it on our > corporate Hadoop cluster this past weekend, and found it did a very fast > PageRank computation (far faster than even well-tuned M/R code on the same > data), and it worked pretty close to out-of-the box. Could you share some more details? Which kind of implementations did he use? Power-iterations over the adjacency matrix should be the fastest way to in M/R. I'm beginning to look at Giraph (and also taking a deeper look at the Pregel paper). I think the "Vertex"-paradigm is much more intuitive and easier to use for implementing graph algorithms than plain MapReduce. So if Giraph is so much faster and at the same time easier to use, we should think about basing our graph algorithms (that are still very much in their infancy) on it, given a stable release exists. As far as I understand it, Giraph runs as a standard M/R job in Hadoop right? So there is no installation necessary in the cluster. --sebastian
-
Re: Apache Giraph?Benson Margulies 2011-09-09, 14:01
I've since reached the conclusion that the thing I'm trying to compare
it to is a 'data grid', e.g. gigaspaces. We want a large, evolving, data structure, which is essentially cached in memory split over nodes. On Tue, Sep 6, 2011 at 5:09 PM, Jake Mannix <[EMAIL PROTECTED]> wrote: > On Tue, Sep 6, 2011 at 1:52 PM, Benson Margulies <[EMAIL PROTECTED]>wrote: > >> A quick look at Spark seems to suggest that the data it's working on >> is read-only. If you wanted to have an evolving data set, would any of >> these others be more useful? >> > > The original data set is read-only, but the same could be said for HDFS, > right? > Or are you saying that you don't see a lot of support for outputting back to > HDFS the results of a computation? > > -jake >
-
Re: Apache Giraph?Jake Mannix 2011-09-09, 14:34
On Fri, Sep 9, 2011 at 1:12 AM, Sebastian Schelter <[EMAIL PROTECTED]> wrote:
> On 05.09.2011 06:42, Jake Mannix wrote: > > > One of my co-workers ran it on our > > corporate Hadoop cluster this past weekend, and found it did a very fast > > PageRank computation (far faster than even well-tuned M/R code on the > same > > data), and it worked pretty close to out-of-the box. > > Could you share some more details? Which kind of implementations did he > use? Power-iterations over the adjacency matrix should be the fastest > way to in M/R. > When you're in "pregel-land", you don't think about "matrices" anymore, you think in graph terms (which is hard for me, because for me, everything is a matrix), but basic implementation in Giraph is power iterations, yes. But the point is that it's all in memory, all the time, there's no "shuffle", and you're talking shard-to-shard via direct (Hadoop RPC, currently) remote network connection. > I'm beginning to look at Giraph (and also taking a deeper look at the > Pregel paper). I think the "Vertex"-paradigm is much more intuitive and > easier to use for implementing graph algorithms than plain MapReduce. > For graph algorithms, yes. But that's not all. Anything which fits in the land of "BSP" computations can be done this way, and I've been exploring relaxing that a bit as well (if shards can talk to each other in bulk, once per "superstep", why not also allow vertices communicate asynchronously *during* a superstep?), and seeing what further iterative algorithms are possible. > So if Giraph is so much faster and at the same time easier to use, we > should think about basing our graph algorithms (that are still very much > in their infancy) on it, given a stable release exists. > Well, the constraint with Giraph is that everything must fit in (distributed) memory. But it's not much of a scalability bottleneck, as many (non-logfile) datasets on reasonably sized clusters can indeed fit in memory. But yes, I think Giraph is the place to develop any graph algorithms we have. Not sure of the right integration points, however. We could depend on Giraph, but why not just implement graph-specific algorithms *in Giraph*? If it needs some of our math, why not just have Giraph depend on *us*? > As far as I understand it, Giraph runs as a standard M/R job in Hadoop > right? So there is no installation necessary in the cluster. > That's correct. Something schematically like this runs it: "hadoop jar giraph-with-dependencies-0.70.jar \ org.apache.giraph.examples.SimplePageRankVertex \ -i hdfs://mygraphinput -o hdfs://mypagerankoutput -maxIter 30" And it runs on a vanilla Hadoop install, yes. It's really young, yet, with not too many developers, but it's actually a lot of code already (find . -name "*.java" | xargs wc -l => 17462 in my current dev branch I'm monkeying with). -jake
-
Re: Apache Giraph?Jake Mannix 2011-09-09, 14:36
On Fri, Sep 9, 2011 at 7:01 AM, Benson Margulies <[EMAIL PROTECTED]>wrote:
> I've since reached the conclusion that the thing I'm trying to compare > it to is a 'data grid', e.g. gigaspaces. > > We want a large, evolving, data structure, which is essentially cached > in memory split over nodes. > I should mention that Giraph certainly allows for the graph to change (both in edge values, and in actual graph structure). But it's currently a very BSP-specific paradigm: run _this_ algorithm, via BSP, over _this_ initial data set, until _this_ many iterations have run, then exit. You could hack it to do other things, but it wasn't the original intent, from what I can tell. -jake
-
Re: Apache Giraph?Grant Ingersoll 2011-09-15, 21:22
Seems like the bigger thing I see us discussing/needing is a distributed memory layer. Do each of these tools invent their own or is there a good, open (ASL compatible) implementation out there somewhere that we could use? Given such a layer, wouldn't it be fairly straightforward to implement both graph based and matrix based approaches? Thinking aloud (and perhaps a bit crazy), I wonder if one could simply implement a Hadoop filesystem that was based on distributed memory (and persistable to disk, perhaps) thereby allowing existing code to simply work.
--Grant On Sep 9, 2011, at 10:36 AM, Jake Mannix wrote: > On Fri, Sep 9, 2011 at 7:01 AM, Benson Margulies <[EMAIL PROTECTED]>wrote: > >> I've since reached the conclusion that the thing I'm trying to compare >> it to is a 'data grid', e.g. gigaspaces. >> >> We want a large, evolving, data structure, which is essentially cached >> in memory split over nodes. >> > > I should mention that Giraph certainly allows for the graph to change (both > in > edge values, and in actual graph structure). But it's currently a very > BSP-specific > paradigm: run _this_ algorithm, via BSP, over _this_ initial data set, until > _this_ many iterations have run, then exit. You could hack it to do other > things, > but it wasn't the original intent, from what I can tell. > > -jake
-
Re: Apache Giraph?Jake Mannix 2011-09-16, 00:14
On Thu, Sep 15, 2011 at 2:22 PM, Grant Ingersoll <[EMAIL PROTECTED]>wrote:
> Seems like the bigger thing I see us discussing/needing is a distributed > memory layer. Do each of these tools invent their own or is there a good, > open (ASL compatible) implementation out there somewhere that we could use? > Given such a layer, wouldn't it be fairly straightforward to implement both > graph based and matrix based approaches? Thinking aloud (and perhaps a bit > crazy), I wonder if one could simply implement a Hadoop filesystem that was > based on distributed memory (and persistable to disk, perhaps) thereby > allowing existing code to simply work. > The problem with raw Hadoop jobs which are iterative is that they launch multiple jobs, which can get executed on whatever machines the JobTracker sends them to, with open mapper slots. An in-memory HDFS would still have files living at various locations, not necessarily the same as where all of the mappers go, which means the chunks need to get moved over to local disk of the mapper nodes. Now if the entire HDFS-accessible-filesystem is on a memory-mapped filesystem, it would still go to memory, I guess, but this doesn't like a very efficient process: Hadoop is optimized for streaming over big files, and the map/reduce shuffle requires a lot of disk (in this case, memory!) to do what it does as well. As for "matrix-based" vs. "graph based", since every graph has an adjacency matrix which describes it, and every matrix can describe a (possibly bipartite) graph, there's an isomorphism hiding here, and while I've always thought of "everything as being a matrix", calling everything a graph probably works just as well, and the translation shouldn't be too terribly hard (famous last words). A big "distributed memory layer" does indeed sound great, however. Spark and Giraph both provide their own, although the former seems to lean more toward "read-only, with allowed side-effects", and very general purpose, while the latter is couched in the language of graphs, and computation is specifically BSP (currently), but allows for fairly arbitrary mutation (and persisting final results back to HDFS). -jake --Grant > > On Sep 9, 2011, at 10:36 AM, Jake Mannix wrote: > > > On Fri, Sep 9, 2011 at 7:01 AM, Benson Margulies <[EMAIL PROTECTED] > >wrote: > > > >> I've since reached the conclusion that the thing I'm trying to compare > >> it to is a 'data grid', e.g. gigaspaces. > >> > >> We want a large, evolving, data structure, which is essentially cached > >> in memory split over nodes. > >> > > > > I should mention that Giraph certainly allows for the graph to change > (both > > in > > edge values, and in actual graph structure). But it's currently a very > > BSP-specific > > paradigm: run _this_ algorithm, via BSP, over _this_ initial data set, > until > > _this_ many iterations have run, then exit. You could hack it to do > other > > things, > > but it wasn't the original intent, from what I can tell. > > > > -jake > > >
-
Re: Apache Giraph?Lance Norskog 2011-09-16, 03:25
The distinction is in uses: matrix algorithms are generally macroscopic over
the entire graph, while graph algorithms are generally microscopic within the entire matrix. (Now that is a Big-Endian/Little-Endian dispute I had not considered.) http://www.ietf.org/rfc/ien/ien137.txt On Thu, Sep 15, 2011 at 5:14 PM, Jake Mannix <[EMAIL PROTECTED]> wrote: > On Thu, Sep 15, 2011 at 2:22 PM, Grant Ingersoll <[EMAIL PROTECTED] > >wrote: > > > Seems like the bigger thing I see us discussing/needing is a distributed > > memory layer. Do each of these tools invent their own or is there a > good, > > open (ASL compatible) implementation out there somewhere that we could > use? > > Given such a layer, wouldn't it be fairly straightforward to implement > both > > graph based and matrix based approaches? Thinking aloud (and perhaps a > bit > > crazy), I wonder if one could simply implement a Hadoop filesystem that > was > > based on distributed memory (and persistable to disk, perhaps) thereby > > allowing existing code to simply work. > > > > The problem with raw Hadoop jobs which are iterative is that they launch > multiple jobs, which can get executed on whatever machines the JobTracker > sends them to, with open mapper slots. An in-memory HDFS would still have > files living at various locations, not necessarily the same as where all of > the mappers go, which means the chunks need to get moved over to local disk > of the mapper nodes. Now if the entire HDFS-accessible-filesystem is on a > memory-mapped filesystem, it would still go to memory, I guess, but this > doesn't like a very efficient process: Hadoop is optimized for streaming > over big files, and the map/reduce shuffle requires a lot of disk (in this > case, memory!) to do what it does as well. > > As for "matrix-based" vs. "graph based", since every graph has an adjacency > matrix which describes it, and every matrix can describe a (possibly > bipartite) graph, there's an isomorphism hiding here, and while I've always > thought of "everything as being a matrix", calling everything a graph > probably works just as well, and the translation shouldn't be too terribly > hard (famous last words). > > A big "distributed memory layer" does indeed sound great, however. Spark > and Giraph both provide their own, although the former seems to lean more > toward "read-only, with allowed side-effects", and very general purpose, > while the latter is couched in the language of graphs, and computation is > specifically BSP (currently), but allows for fairly arbitrary mutation (and > persisting final results back to HDFS). > > -jake > > --Grant > > > > On Sep 9, 2011, at 10:36 AM, Jake Mannix wrote: > > > > > On Fri, Sep 9, 2011 at 7:01 AM, Benson Margulies < > [EMAIL PROTECTED] > > >wrote: > > > > > >> I've since reached the conclusion that the thing I'm trying to compare > > >> it to is a 'data grid', e.g. gigaspaces. > > >> > > >> We want a large, evolving, data structure, which is essentially cached > > >> in memory split over nodes. > > >> > > > > > > I should mention that Giraph certainly allows for the graph to change > > (both > > > in > > > edge values, and in actual graph structure). But it's currently a very > > > BSP-specific > > > paradigm: run _this_ algorithm, via BSP, over _this_ initial data set, > > until > > > _this_ many iterations have run, then exit. You could hack it to do > > other > > > things, > > > but it wasn't the original intent, from what I can tell. > > > > > > -jake > > > > > > > -- Lance Norskog [EMAIL PROTECTED]
-
Re: Apache Giraph?Jake Mannix 2011-09-16, 04:19
On Thu, Sep 15, 2011 at 8:25 PM, Lance Norskog <[EMAIL PROTECTED]> wrote:
> The distinction is in uses: matrix algorithms are generally macroscopic > over > the entire graph, while graph algorithms are generally microscopic within > the entire matrix. > I'm not sure I've seen that distinction. All the work I've done on Mahout's DistributedRowMatrix is based around the idea that the rows of the matrix do no direct communication (as they will be dealt with by different Mapper instances possibly on different machines). Similarly, BSP calculations on a graph assume that each vertex with its attendant edges (read: matrix row or column) only communicates with its direct neighbors. PageRank is a good example of a global calculation, typically done iteratively over the entire graph / matrix, by a series of operations which may be looked at as aggregating lots of local ("microscopic") graph / matrix sub-parts. -jake > > (Now that is a Big-Endian/Little-Endian dispute I had not considered.) > > http://www.ietf.org/rfc/ien/ien137.txt > > On Thu, Sep 15, 2011 at 5:14 PM, Jake Mannix <[EMAIL PROTECTED]> > wrote: > > > On Thu, Sep 15, 2011 at 2:22 PM, Grant Ingersoll <[EMAIL PROTECTED] > > >wrote: > > > > > Seems like the bigger thing I see us discussing/needing is a > distributed > > > memory layer. Do each of these tools invent their own or is there a > > good, > > > open (ASL compatible) implementation out there somewhere that we could > > use? > > > Given such a layer, wouldn't it be fairly straightforward to implement > > both > > > graph based and matrix based approaches? Thinking aloud (and perhaps a > > bit > > > crazy), I wonder if one could simply implement a Hadoop filesystem that > > was > > > based on distributed memory (and persistable to disk, perhaps) thereby > > > allowing existing code to simply work. > > > > > > > The problem with raw Hadoop jobs which are iterative is that they launch > > multiple jobs, which can get executed on whatever machines the JobTracker > > sends them to, with open mapper slots. An in-memory HDFS would still > have > > files living at various locations, not necessarily the same as where all > of > > the mappers go, which means the chunks need to get moved over to local > disk > > of the mapper nodes. Now if the entire HDFS-accessible-filesystem is on > a > > memory-mapped filesystem, it would still go to memory, I guess, but this > > doesn't like a very efficient process: Hadoop is optimized for streaming > > over big files, and the map/reduce shuffle requires a lot of disk (in > this > > case, memory!) to do what it does as well. > > > > As for "matrix-based" vs. "graph based", since every graph has an > adjacency > > matrix which describes it, and every matrix can describe a (possibly > > bipartite) graph, there's an isomorphism hiding here, and while I've > always > > thought of "everything as being a matrix", calling everything a graph > > probably works just as well, and the translation shouldn't be too > terribly > > hard (famous last words). > > > > A big "distributed memory layer" does indeed sound great, however. Spark > > and Giraph both provide their own, although the former seems to lean more > > toward "read-only, with allowed side-effects", and very general purpose, > > while the latter is couched in the language of graphs, and computation is > > specifically BSP (currently), but allows for fairly arbitrary mutation > (and > > persisting final results back to HDFS). > > > > -jake > > > > --Grant > > > > > > On Sep 9, 2011, at 10:36 AM, Jake Mannix wrote: > > > > > > > On Fri, Sep 9, 2011 at 7:01 AM, Benson Margulies < > > [EMAIL PROTECTED] > > > >wrote: > > > > > > > >> I've since reached the conclusion that the thing I'm trying to > compare > > > >> it to is a 'data grid', e.g. gigaspaces. > > > >> > > > >> We want a large, evolving, data structure, which is essentially > cached > > > >> in memory split over nodes. > > > >> > > > > > > > > I should mention that Giraph certainly allows for the graph to change
-
Re: Apache Giraph?Ted Dunning 2011-09-16, 04:27
Actually, I don't think that these really provide a distributed memory
layer. What they is multiple iterations without having to renegotiate JVM launches, local memory that persists across iterations and decent message passing. (and of course some level of synchronization). And that is plenty for us. On Fri, Sep 16, 2011 at 12:14 AM, Jake Mannix <[EMAIL PROTECTED]> wrote: > A big "distributed memory layer" does indeed sound great, however. Spark > and Giraph both provide their own, although the former seems to lean more > toward "read-only, with allowed side-effects", and very general purpose, > while the latter is couched in the language of graphs, and computation is > specifically BSP (currently), but allows for fairly arbitrary mutation (and > persisting final results back to HDFS). >
-
Re: Apache Giraph?Ted Dunning 2011-09-16, 04:28
If we stay with Hadoop, then just nuking task failure tolerance and assuming
that all reducers are live would allow a very fast and simply push model for memory resident algorithms. On Fri, Sep 16, 2011 at 12:14 AM, Jake Mannix <[EMAIL PROTECTED]> wrote: > The problem with raw Hadoop jobs which are iterative is that they launch > multiple jobs, which can get executed on whatever machines the JobTracker > sends them to, with open mapper slots. An in-memory HDFS would still have > files living at various locations, not necessarily the same as where all of > the mappers go, which means the chunks need to get moved over to local disk > of the mapper nodes. Now if the entire HDFS-accessible-filesystem is on a > memory-mapped filesystem, it would still go to memory, I guess, but this > doesn't like a very efficient process: Hadoop is optimized for streaming > over big files, and the map/reduce shuffle requires a lot of disk (in this > case, memory!) to do what it does as well. >
-
Re: Apache Giraph?Ted Dunning 2011-09-16, 04:31
Actually, I would say that basis changing computations are global and many
others are local. Most people don't think of a basis for a graph so it doesn't make a lot of sense to them to think this way, but the algorithms are the same. Basis changing algorithms include PageRank and other eigenvector oriented algorithms such as SVD, stochastic projections and harmonic analyses such as FFT's. When applied to graphs, these algorithms have the virtue of being able to preserve invariants while transforming them into a convenient form of some kind. On Fri, Sep 16, 2011 at 3:25 AM, Lance Norskog <[EMAIL PROTECTED]> wrote: > The distinction is in uses: matrix algorithms are generally macroscopic > over > the entire graph, while graph algorithms are generally microscopic within > the entire matrix. > > (Now that is a Big-Endian/Little-Endian dispute I had not considered.) > > http://www.ietf.org/rfc/ien/ien137.txt > > On Thu, Sep 15, 2011 at 5:14 PM, Jake Mannix <[EMAIL PROTECTED]> > wrote: > > > On Thu, Sep 15, 2011 at 2:22 PM, Grant Ingersoll <[EMAIL PROTECTED] > > >wrote: > > > > > Seems like the bigger thing I see us discussing/needing is a > distributed > > > memory layer. Do each of these tools invent their own or is there a > > good, > > > open (ASL compatible) implementation out there somewhere that we could > > use? > > > Given such a layer, wouldn't it be fairly straightforward to implement > > both > > > graph based and matrix based approaches? Thinking aloud (and perhaps a > > bit > > > crazy), I wonder if one could simply implement a Hadoop filesystem that > > was > > > based on distributed memory (and persistable to disk, perhaps) thereby > > > allowing existing code to simply work. > > > > > > > The problem with raw Hadoop jobs which are iterative is that they launch > > multiple jobs, which can get executed on whatever machines the JobTracker > > sends them to, with open mapper slots. An in-memory HDFS would still > have > > files living at various locations, not necessarily the same as where all > of > > the mappers go, which means the chunks need to get moved over to local > disk > > of the mapper nodes. Now if the entire HDFS-accessible-filesystem is on > a > > memory-mapped filesystem, it would still go to memory, I guess, but this > > doesn't like a very efficient process: Hadoop is optimized for streaming > > over big files, and the map/reduce shuffle requires a lot of disk (in > this > > case, memory!) to do what it does as well. > > > > As for "matrix-based" vs. "graph based", since every graph has an > adjacency > > matrix which describes it, and every matrix can describe a (possibly > > bipartite) graph, there's an isomorphism hiding here, and while I've > always > > thought of "everything as being a matrix", calling everything a graph > > probably works just as well, and the translation shouldn't be too > terribly > > hard (famous last words). > > > > A big "distributed memory layer" does indeed sound great, however. Spark > > and Giraph both provide their own, although the former seems to lean more > > toward "read-only, with allowed side-effects", and very general purpose, > > while the latter is couched in the language of graphs, and computation is > > specifically BSP (currently), but allows for fairly arbitrary mutation > (and > > persisting final results back to HDFS). > > > > -jake > > > > --Grant > > > > > > On Sep 9, 2011, at 10:36 AM, Jake Mannix wrote: > > > > > > > On Fri, Sep 9, 2011 at 7:01 AM, Benson Margulies < > > [EMAIL PROTECTED] > > > >wrote: > > > > > > > >> I've since reached the conclusion that the thing I'm trying to > compare > > > >> it to is a 'data grid', e.g. gigaspaces. > > > >> > > > >> We want a large, evolving, data structure, which is essentially > cached > > > >> in memory split over nodes. > > > >> > > > > > > > > I should mention that Giraph certainly allows for the graph to change > > > (both > > > > in > > > > edge values, and in actual graph structure). But it's currently a
-
Re: Apache Giraph?Grant Ingersoll 2011-09-16, 10:45
On Sep 16, 2011, at 12:27 AM, Ted Dunning wrote: > Actually, I don't think that these really provide a distributed memory > layer. > > What they is multiple iterations without having to renegotiate JVM launches, > local memory that persists across iterations and decent message passing. > (and of course some level of synchronization). > > And that is plenty for us. > That sounds a lot like a distributed memory layer (i.e. the JVM stays up w/ it's memory) and then a msg passing layer on top of it. It smells like to me that it does for memory what the map-reduce + DFS abstraction did for that space, i.e. it gave a base platform + API that made it easy for people to build large scale distributed, disk-based, batch oriented systems. We need a base platform for large-scale, distributed memory-based systems so that it is easy to write implementations on top of it. > On Fri, Sep 16, 2011 at 12:14 AM, Jake Mannix <[EMAIL PROTECTED]> wrote: > >> A big "distributed memory layer" does indeed sound great, however. Spark >> and Giraph both provide their own, although the former seems to lean more >> toward "read-only, with allowed side-effects", and very general purpose, >> while the latter is couched in the language of graphs, and computation is >> specifically BSP (currently), but allows for fairly arbitrary mutation (and >> persisting final results back to HDFS). >>
-
Re: Apache Giraph?Ted Dunning 2011-09-16, 20:24
Well, distributed memory to me would have fetch and store operations. Here
we can send a message, but we can't actually fetch or store data without cooperation. On Fri, Sep 16, 2011 at 4:45 AM, Grant Ingersoll <[EMAIL PROTECTED]>wrote: > > On Sep 16, 2011, at 12:27 AM, Ted Dunning wrote: > > > Actually, I don't think that these really provide a distributed memory > > layer. > > > > What they is multiple iterations without having to renegotiate JVM > launches, > > local memory that persists across iterations and decent message passing. > > (and of course some level of synchronization). > > > > And that is plenty for us. > > > > That sounds a lot like a distributed memory layer (i.e. the JVM stays up w/ > it's memory) and then a msg passing layer on top of it. It smells like to > me that it does for memory what the map-reduce + DFS abstraction did for > that space, i.e. it gave a base platform + API that made it easy for people > to build large scale distributed, disk-based, batch oriented systems. We > need a base platform for large-scale, distributed memory-based systems so > that it is easy to write implementations on top of it. > > > > On Fri, Sep 16, 2011 at 12:14 AM, Jake Mannix <[EMAIL PROTECTED]> > wrote: > > > >> A big "distributed memory layer" does indeed sound great, however. > Spark > >> and Giraph both provide their own, although the former seems to lean > more > >> toward "read-only, with allowed side-effects", and very general purpose, > >> while the latter is couched in the language of graphs, and computation > is > >> specifically BSP (currently), but allows for fairly arbitrary mutation > (and > >> persisting final results back to HDFS). > >> > > >
-
Re: Apache Giraph?Jake Mannix 2011-09-16, 20:31
On Fri, Sep 16, 2011 at 1:24 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Well, distributed memory to me would have fetch and store operations. Here > we can send a message, but we can't actually fetch or store data without > cooperation. > Funny you mention that - I've been considering suggesting that Giraph modify the "sendMsg()" method contract to not be void, but return something too... -jake > On Fri, Sep 16, 2011 at 4:45 AM, Grant Ingersoll <[EMAIL PROTECTED] > >wrote: > > > > > On Sep 16, 2011, at 12:27 AM, Ted Dunning wrote: > > > > > Actually, I don't think that these really provide a distributed memory > > > layer. > > > > > > What they is multiple iterations without having to renegotiate JVM > > launches, > > > local memory that persists across iterations and decent message > passing. > > > (and of course some level of synchronization). > > > > > > And that is plenty for us. > > > > > > > That sounds a lot like a distributed memory layer (i.e. the JVM stays up > w/ > > it's memory) and then a msg passing layer on top of it. It smells like > to > > me that it does for memory what the map-reduce + DFS abstraction did for > > that space, i.e. it gave a base platform + API that made it easy for > people > > to build large scale distributed, disk-based, batch oriented systems. We > > need a base platform for large-scale, distributed memory-based systems so > > that it is easy to write implementations on top of it. > > > > > > > On Fri, Sep 16, 2011 at 12:14 AM, Jake Mannix <[EMAIL PROTECTED]> > > wrote: > > > > > >> A big "distributed memory layer" does indeed sound great, however. > > Spark > > >> and Giraph both provide their own, although the former seems to lean > > more > > >> toward "read-only, with allowed side-effects", and very general > purpose, > > >> while the latter is couched in the language of graphs, and computation > > is > > >> specifically BSP (currently), but allows for fairly arbitrary mutation > > (and > > >> persisting final results back to HDFS). > > >> > > > > > > >
-
Re: Apache Giraph?Ted Dunning 2011-09-16, 22:36
Returning something halves performance or worse since you can't fire and
forget. IN Pregel style, you should expect the message to be processed in the next super step and a value returned in the super step after that. On Fri, Sep 16, 2011 at 2:31 PM, Jake Mannix <[EMAIL PROTECTED]> wrote: > On Fri, Sep 16, 2011 at 1:24 PM, Ted Dunning <[EMAIL PROTECTED]> > wrote: > > > Well, distributed memory to me would have fetch and store operations. > Here > > we can send a message, but we can't actually fetch or store data without > > cooperation. > > > > Funny you mention that - I've been considering suggesting that Giraph > modify > the "sendMsg()" method contract to not be void, but return something too... > > -jake > > > > On Fri, Sep 16, 2011 at 4:45 AM, Grant Ingersoll <[EMAIL PROTECTED] > > >wrote: > > > > > > > > On Sep 16, 2011, at 12:27 AM, Ted Dunning wrote: > > > > > > > Actually, I don't think that these really provide a distributed > memory > > > > layer. > > > > > > > > What they is multiple iterations without having to renegotiate JVM > > > launches, > > > > local memory that persists across iterations and decent message > > passing. > > > > (and of course some level of synchronization). > > > > > > > > And that is plenty for us. > > > > > > > > > > That sounds a lot like a distributed memory layer (i.e. the JVM stays > up > > w/ > > > it's memory) and then a msg passing layer on top of it. It smells like > > to > > > me that it does for memory what the map-reduce + DFS abstraction did > for > > > that space, i.e. it gave a base platform + API that made it easy for > > people > > > to build large scale distributed, disk-based, batch oriented systems. > We > > > need a base platform for large-scale, distributed memory-based systems > so > > > that it is easy to write implementations on top of it. > > > > > > > > > > On Fri, Sep 16, 2011 at 12:14 AM, Jake Mannix <[EMAIL PROTECTED] > > > > > wrote: > > > > > > > >> A big "distributed memory layer" does indeed sound great, however. > > > Spark > > > >> and Giraph both provide their own, although the former seems to lean > > > more > > > >> toward "read-only, with allowed side-effects", and very general > > purpose, > > > >> while the latter is couched in the language of graphs, and > computation > > > is > > > >> specifically BSP (currently), but allows for fairly arbitrary > mutation > > > (and > > > >> persisting final results back to HDFS). > > > >> > > > > > > > > > > > >
-
Re: Apache Giraph?Jake Mannix 2011-09-16, 22:44
On Fri, Sep 16, 2011 at 3:36 PM, Ted Dunning <[EMAIL PROTECTED]> wrote:
> Returning something halves performance or worse since you can't fire and > forget. IN Pregel style, you should expect the message to be processed in > the next super step and a value returned in the super step after that. > I guess it depends on what you're doing. Sometimes you may want something returned which doesn't depend on what you're sending over (ie. it was computed in the _previous_ superstep), cutting 3 supersteps at least down to 2. But of course, you're right, the right way to do this is to just have the response from the previous step "sent back" at the same time as you're sending out our current message. Then it's never 2 steps - you're sending out your message now, the other side processes it during the next superstep, and it often can send the response as soon as it has done so. Async is definitely right here. > > On Fri, Sep 16, 2011 at 2:31 PM, Jake Mannix <[EMAIL PROTECTED]> > wrote: > > > On Fri, Sep 16, 2011 at 1:24 PM, Ted Dunning <[EMAIL PROTECTED]> > > wrote: > > > > > Well, distributed memory to me would have fetch and store operations. > > Here > > > we can send a message, but we can't actually fetch or store data > without > > > cooperation. > > > > > > > Funny you mention that - I've been considering suggesting that Giraph > > modify > > the "sendMsg()" method contract to not be void, but return something > too... > > > > -jake > > > > > > > On Fri, Sep 16, 2011 at 4:45 AM, Grant Ingersoll <[EMAIL PROTECTED] > > > >wrote: > > > > > > > > > > > On Sep 16, 2011, at 12:27 AM, Ted Dunning wrote: > > > > > > > > > Actually, I don't think that these really provide a distributed > > memory > > > > > layer. > > > > > > > > > > What they is multiple iterations without having to renegotiate JVM > > > > launches, > > > > > local memory that persists across iterations and decent message > > > passing. > > > > > (and of course some level of synchronization). > > > > > > > > > > And that is plenty for us. > > > > > > > > > > > > > That sounds a lot like a distributed memory layer (i.e. the JVM stays > > up > > > w/ > > > > it's memory) and then a msg passing layer on top of it. It smells > like > > > to > > > > me that it does for memory what the map-reduce + DFS abstraction did > > for > > > > that space, i.e. it gave a base platform + API that made it easy for > > > people > > > > to build large scale distributed, disk-based, batch oriented systems. > > We > > > > need a base platform for large-scale, distributed memory-based > systems > > so > > > > that it is easy to write implementations on top of it. > > > > > > > > > > > > > On Fri, Sep 16, 2011 at 12:14 AM, Jake Mannix < > [EMAIL PROTECTED] > > > > > > > wrote: > > > > > > > > > >> A big "distributed memory layer" does indeed sound great, however. > > > > Spark > > > > >> and Giraph both provide their own, although the former seems to > lean > > > > more > > > > >> toward "read-only, with allowed side-effects", and very general > > > purpose, > > > > >> while the latter is couched in the language of graphs, and > > computation > > > > is > > > > >> specifically BSP (currently), but allows for fairly arbitrary > > mutation > > > > (and > > > > >> persisting final results back to HDFS). > > > > >> > > > > > > > > > > > > > > > > > > |