-
Good starting instance for AMI
Grant Ingersoll 2010-01-10, 23:16

Anyone have recs on a good AMI to start with on EC2 to load with Mahout? Preferably Linux with Java 1.6 already installed.

Thanks,
Grant
-
Re : Good starting instance for AMI
deneche abdelhakim 2010-01-11, 04:03

I use the Cloudera distribution and it works just fine. It already includes Java and Hadoop.

http://archive.cloudera.com/docs/ec2.html

The default AMI uses Hadoop 0.18.3, but you can launch a special AMI with Hadoop 0.20 using the following command:

% hadoop-ec2 launch-cluster --env REPO=testing --env HADOOP_VERSION=0.20 \
    my-hadoop-cluster 10
-
Re: Re : Good starting instance for AMI
Ted Dunning 2010-01-11, 04:06

This seems the easiest answer so far!

On Sun, Jan 10, 2010 at 8:03 PM, deneche abdelhakim <[EMAIL PROTECTED]> wrote:
> % hadoop-ec2 launch-cluster --env REPO=testing --env HADOOP_VERSION=0.20 \
>     my-hadoop-cluster 10

--
Ted Dunning, CTO
DeepDyve
-
Re: Re : Good starting instance for AMI
Liang Chenmin 2010-01-11, 06:44

I used EMR for our project, and it works. It took some time to set up, though. EMR requires an S3 bucket, but S3 limits the size of a single file to 5 GB, so this needs some extra care. Has anyone else run into the file size problem on S3? I think it's unreasonable to have a 5 GB size limit when we want to use the system to deal with large data sets.

--
Chenmin Liang
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
-
Re: Re : Good starting instance for AMI
Ted Dunning 2010-01-11, 07:56

Just use several of these files.

On Sun, Jan 10, 2010 at 10:44 PM, Liang Chenmin <[EMAIL PROTECTED]> wrote:
> EMR requires an S3 bucket, but S3 limits the size of a single file to
> 5 GB, so this needs some extra care. Has anyone else run into the file
> size problem on S3? I think it's unreasonable to have a 5 GB size limit
> when we want to use the system to deal with large data sets.

--
Ted Dunning, CTO
DeepDyve
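
As an illustration of the "several files" suggestion, here is a minimal Python sketch (not from the thread) that splits one large local file into numbered parts, each no larger than a given cap. The 5 GB default comes from the limit mentioned above; the function name `split_file` and the `.partNNNN` naming are assumptions made for this example.

```python
# Assumed cap: the 5 GB per-file limit mentioned in the thread.
S3_MAX_PART_BYTES = 5 * 1024 ** 3


def split_file(path, max_part_bytes=S3_MAX_PART_BYTES, buffer_bytes=1024 * 1024):
    """Split `path` into numbered parts of at most max_part_bytes each.

    Returns the list of part paths, in order. An empty input yields no parts.
    """
    part_paths = []
    with open(path, "rb") as src:
        index = 0
        while True:
            # Read the first chunk before creating the part file, so that
            # no empty trailing part is created at end of input.
            chunk = src.read(min(buffer_bytes, max_part_bytes))
            if not chunk:
                break
            part_path = "%s.part%04d" % (path, index)
            written = 0
            with open(part_path, "wb") as dst:
                while chunk:
                    dst.write(chunk)
                    written += len(chunk)
                    remaining = max_part_bytes - written
                    if remaining <= 0:
                        break
                    chunk = src.read(min(buffer_bytes, remaining))
            part_paths.append(part_path)
            index += 1
    return part_paths
```

Note that this cuts at byte boundaries. For line-oriented input, a record can end up split across two parts, so a real job would more likely cut at line or record boundaries, or write several output files in the first place.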
-
Re: Re : Good starting instance for AMI
zaki rahaman 2010-01-11, 15:43

Some comments on Cloudera's Hadoop (CDH) and Elastic MapReduce (EMR).

I have used both to get Hadoop jobs up and running (although my EMR use has mostly been limited to running batch Pig scripts weekly). Deciding on which one to use really depends on what kind of job/data you're working with.

EMR is most useful if you're already storing the dataset you're using on S3 and plan on running a one-off job. My understanding is that it's configured to use jets3t to stream data from S3 rather than copying it to the cluster, which is fine for a single pass over a small to medium-sized dataset, but obviously slower for multiple passes or larger datasets. The API is also useful if you have a set workflow that you plan to run on a regular basis, and I often prototype quick and dirty jobs on very small EMR clusters to test how some things run in the wild (obviously not the most cost-effective solution, but I've found pseudo-distributed mode doesn't catch everything).

CDH gives you greater control over the initial setup and configuration of your cluster. From my understanding, it's not really an AMI. Rather, it's a set of Python scripts modified from the EC2 scripts in hadoop/contrib, with some nifty additions like being able to specify and set up EBS volumes, a proxy on the cluster, and some others. The scripts use the boto Python module (a very useful Python module for working with EC2) to ask EC2 to set up a cluster of the specified size with whatever vanilla AMI is specified. They set up the security groups, open the relevant ports, and then pass the init script to each of the instances once they've booted (the same user-data file setup, which is limited to 16K I believe). The init script tells each node to download Hadoop (from Cloudera's OS-specific repos) and any other user-specified packages and set them up. The Hadoop config XML is hardcoded into the init script (although you can pass a modified config beforehand). The master is started first, and then the slaves are started so that the slaves can be given info about which NN and JT to connect to (the config uses the public DNS, I believe, to make things easier to set up). You can use either 0.18.3 (CDH) or 0.20 (CDH2) when it comes to Hadoop versions, although I've had mixed results with the latter.

Personally, I'd still like some kind of facade or something similar to further abstract things and make it easier for others to quickly set up ad-hoc clusters for 'quick n dirty' jobs. I know other libraries like Crane have been released recently, but given the language of choice (Clojure), I haven't yet had a chance to really investigate.

--
Zaki Rahaman
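
The user-data size constraint described above can be sketched in a few lines of Python. This is only an illustration, not the Cloudera scripts: the `%MASTER_HOST%` placeholder, the function names, and the example hostname are hypothetical, and the 16 KB figure is the one the message itself hedges with "I believe".

```python
# Assumed limit: the message says user-data is "limited to 16K I believe".
USER_DATA_LIMIT_BYTES = 16 * 1024


def render_init_script(template, master_host):
    """Fill the master's hostname into a (hypothetical) init-script template.

    The master is started first, so its hostname is known by the time the
    slaves' init script is rendered.
    """
    return template.replace("%MASTER_HOST%", master_host)


def fits_user_data(script, limit=USER_DATA_LIMIT_BYTES):
    """Return True if the script, encoded as UTF-8, fits within the limit."""
    return len(script.encode("utf-8")) <= limit


if __name__ == "__main__":
    template = "#!/bin/sh\nMASTER=%MASTER_HOST%\n"
    script = render_init_script(template, "ec2-master.example.com")
    print(fits_user_data(script))
```

Checking the size before launching is cheap and catches the case where extra packages or an inlined Hadoop config push the script over the limit.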
-
Re: Re : Good starting instance for AMI
Grant Ingersoll 2010-01-11, 19:51

One quick question for all who responded:

How many have tried Mahout with the setup they recommended?

-Grant

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
-
Re: Re : Good starting instance for AMI
Ted Dunning 2010-01-11, 22:10

I have only run Mahout on a single node or a fixed cluster in our data center.

On Mon, Jan 11, 2010 at 11:51 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> How many have tried Mahout with the setup they recommended?

--
Ted Dunning, CTO
DeepDyve
-
Re: Re : Good starting instance for AMI
deneche abdelhakim 2010-01-12, 02:44

I used Cloudera's distribution with Mahout to test the Decision Forest implementation.
-
Re: Re : Good starting instance for AMI
Liang Chenmin 2010-01-12, 03:25

I first followed the tutorial on running Mahout on EMR; the command line needed some revision, though.

--
Chenmin Liang
Language Technologies Institute, School of Computer Science
Carnegie Mellon University
-
Re: Re : Good starting instance for AMI
Robin Anil 2010-01-12, 03:33

Since I don't have a personal Linux box these days, I code in Eclipse on Windows, then fire up an instance, attach the EBS volume, and patch and test my code. Yes, I have only tried a single node so far.
-
Re: Re : Good starting instance for AMI
deneche abdelhakim 2010-01-12, 03:43

I'm using Cloudera's distribution with a 5-node cluster (+ 1 master node) that runs Hadoop 0.20+. Hadoop is pre-installed and configured; all I have to do is wget Mahout's job files and the data from S3, and launch my job.
-
Re: Re : Good starting instance for AMI
Grant Ingersoll 2010-01-18, 14:54

OK, thanks for all the advice. I'm wondering if this makes sense:

Create an AMI with:
1. Java 1.6
2. Maven
3. svn
4. Mahout's exact Hadoop version
5. A checkout of Mahout

I want to be able to run the trunk version of Mahout with little upgrade pain, both on an individual node and in a cluster.

Is this the shortest path? I don't have much experience w/ creating AMIs, but I want my work to be reusable by the community (remember, committers can get credits from Amazon for testing Mahout).

After that, I want to convert some of the public datasets to vector format and run some performance benchmarks.

Thoughts?
-
Re: Re : Good starting instance for AMI
Olivier Grisel 2010-01-18, 15:38
2010/1/18 Grant Ingersoll <[EMAIL PROTECTED]>:
> OK, thanks for all the advice. I'm wondering if this makes sense:
>
> Create an AMI with:
> 1. Java 1.6
> 2. Maven
> 3. svn
> 4. Mahout's exact Hadoop version
> 5. A checkout of Mahout

I am running CDH2, with Hadoop currently at version 0.20.1+152-1~j (using Cloudera's intrepid-testing apt repo on a regular Ubuntu Karmic distro), on my 2 dev boxes (one 32-bit dual-core and one 64-bit quad-core) in conf-pseudo mode (single-node cluster). I could successfully run mahout-0.3-SNAPSHOT jobs (including the hadoop-0.20.2-SNAPSHOT). I guess this would run exactly the same on a real EC2 cluster set up with http://archive.cloudera.com/docs/ec2.html .

> I want to be able to run the trunk version of Mahout with little upgrade pain, both on an individual node and in a cluster.
>
> Is this the shortest path? I don't have much experience w/ creating AMIs, but I want my work to be reusable by the community (remember, committers can get credits from Amazon for testing Mahout)
>
> After that, I want to convert some of the public datasets to vector format and run some performance benchmarks.

I think we should host sample datasets that are known to be vectorizable using Mahout utilities, either on S3 (using s3:// and not s3n:// when individual files are larger than 5GB) or on a dedicated EBS volume with a public snapshot.

--
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
Olivier Grisel 2010-01-18, 15:38
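The s3:// versus s3n:// rule of thumb in the message above can be sketched as a tiny helper. This is only an illustration, not anything from Mahout or Hadoop: the function names, bucket and key are hypothetical, and the 5 GB threshold (taken here as binary gigabytes, which is an assumption) comes straight from the message.

```python
# Hypothetical helper: pick the Hadoop S3 URI scheme from the largest file size.
# Threshold from the message above; treating "5GB" as 5 * 1024**3 is an assumption.
FIVE_GB = 5 * 1024 ** 3


def choose_s3_scheme(largest_file_bytes: int) -> str:
    """Return 's3' (block filesystem) for files over 5 GB, else 's3n' (native)."""
    return "s3" if largest_file_bytes > FIVE_GB else "s3n"


def dataset_uri(bucket: str, key: str, largest_file_bytes: int) -> str:
    """Build a dataset URI using the scheme chosen for the given file size."""
    return f"{choose_s3_scheme(largest_file_bytes)}://{bucket}/{key}"


if __name__ == "__main__":
    # Bucket and key are made-up names for illustration only.
    print(dataset_uri("example-bucket", "vectors/part-00000", 6 * 1024 ** 3))
```

Keeping the choice in one place means a dataset README only has to state file sizes; the URI follows from them.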
-
Re: Re : Good starting instance for AMI
Drew Farris 2010-01-18, 15:07
Sounds great.

It might be handy to include with the AMI a local Maven repo pre-populated with build dependencies, to shorten the build time as well.

I wonder if the CDH2 AMIs could be used as a starting point? Not sure if you're allowed to unbundle and modify public AMIs. It would certainly be more difficult to start from scratch.

Amazon hosts some public datasets for free: http://aws.amazon.com/publicdatasets/
Perhaps the Mahout test data in vector form could be bundled up into a snapshot that could be reused by anyone.
Drew Farris 2010-01-18, 15:07
-
Re: Re : Good starting instance for AMI
Grant Ingersoll 2010-01-18, 15:20
On Jan 18, 2010, at 10:07 AM, Drew Farris wrote:
> Sounds great.
>
> It might be handy to include with the AMI a local maven repo
> pre-populated with build dependencies to shorten the build time as
> well.

Running as I type...

> I wonder if the CDH2 ami's could be used as a starting point? Not sure
> if you're allowed to unbundle and modify public AMI's. It would
> certainly be more difficult to start from scratch.

I'd prefer to be dependent on the official Apache distro that we use.

> Amazon hosts some public datasets for free:
> http://aws.amazon.com/publicdatasets/
> Perhaps the mahout test data in vector form could be bundled up into a
> snapshot that could be re-used by anyone.

Yes! I would welcome help on this. I also wonder if we can talk to Amazon about hosting that data publicly so that we don't have to pay for it. Either that or maybe we could ask the ASF for some small budget to do so. Any insight from those w/ more experience would be greatly appreciated. I can talk to the Amazon contact who runs the Apache donation project.

-Grant
Grant Ingersoll 2010-01-18, 15:20
-
Re: Re : Good starting instance for AMI
Drew Farris 2010-01-18, 15:59
On Mon, Jan 18, 2010 at 10:20 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>> I wonder if the CDH2 ami's could be used as a starting point? Not sure
>> if you're allowed to unbundle and modify public AMI's. It would
>> certainly be more difficult to start from scratch.
>
> I'd prefer to be dependent on the official Apache distro that we use.

Do you mean the distro of Hadoop, or something else? From what I understand, the convenience that CDH2 provides is largely based on the launch/management scripts. I agree that it would make sense to replace the actual Hadoop distro with something that we use.

It is pretty simple to create AMIs from scratch, but I was wondering about getting things set up to auto-launch the various parts of Hadoop at boot time and getting the configuration right so that they are bound into a single cluster, etc. If those sorts of things are trivial or otherwise covered, there is no need to start from CDH2.

Drew
Drew Farris 2010-01-18, 15:59
-
Re: Re : Good starting instance for AMI
Grant Ingersoll 2010-01-18, 20:07
On Jan 18, 2010, at 10:59 AM, Drew Farris wrote:
> On Mon, Jan 18, 2010 at 10:20 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>>> I wonder if the CDH2 ami's could be used as a starting point? Not sure
>>> if you're allowed to unbundle and modify public AMI's. It would
>>> certainly be more difficult to start from scratch.
>>
>> I'd prefer to be dependent on the official Apache distro that we use.
>
> Do you mean the distro of Hadoop, or something else? From what I
> understand the convenience that CDH2 provides is largely based on the
> launch/management scripts, I agree that it would make sense to replace
> the actual hadoop distro with something that we use.

I just want the exact version that is in our Maven POM.

> It is pretty simple to create AMI's from scratch, but I was wondering
> about getting things set up to auto-launch the various parts of hadoop
> at boot time and get the configuration right so that they are bound
> into a single cluster etc. If those sorts of things are trivial or
> otherwise covered, no need to start from CDH2.

Yes, I'd like that too.
Grant Ingersoll 2010-01-18, 20:07
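One way to keep an image pinned to "the exact version that is in our Maven POM" is to read the version out of the POM rather than hard-coding it. The sketch below is an assumption-laden illustration: the sample POM, its coordinates and the version number are hypothetical and are not copied from Mahout's real pom.xml, and the helper does not resolve `${property}` placeholders or versions inherited from a parent POM.

```python
import xml.etree.ElementTree as ET

# Namespace used by Maven POM files.
POM_NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Hypothetical POM fragment for illustration; not Mahout's actual pom.xml.
SAMPLE_POM = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <dependencies>
    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-core</artifactId>
      <version>0.20.1</version>
    </dependency>
  </dependencies>
</project>
"""


def dependency_version(pom_xml, artifact_id):
    """Return the declared version of the first dependency with this artifactId, or None."""
    root = ET.fromstring(pom_xml)
    for dep in root.findall(".//m:dependency", POM_NS):
        if dep.findtext("m:artifactId", namespaces=POM_NS) == artifact_id:
            return dep.findtext("m:version", namespaces=POM_NS)
    return None


if __name__ == "__main__":
    print(dependency_version(SAMPLE_POM, "hadoop-core"))
```

A setup script could call something like this on the checked-out pom.xml and then fetch the matching Hadoop tarball, so trunk and release images differ only in the checkout they point at.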
-
Re: Re : Good starting instance for AMI
Robin Anil 2010-01-18, 15:20
Perfect! We can have two AMIs: Mahout trunk and the Mahout release version.
Robin Anil 2010-01-18, 15:20
-
Re: Re : Good starting instance for AMI
Robin Anil 2010-01-18, 15:23
It would be great if we could bundle the LZO codec too.

We need to include a script that adds nodes to the Hadoop slaves list, so that a cluster can be run easily (it needn't be an optimized configuration).

One problem I see is that we may have to build for both 32-bit (386) and 64-bit (x64) kernels, or we won't be able to run small and large instances respectively.

Robin
Robin Anil 2010-01-18, 15:23
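The two points above (a script that fills in the Hadoop slaves list, and separate 32-bit and 64-bit images) can be sketched roughly as follows. Everything here is illustrative: the helper names and hostnames are hypothetical, and the instance-type table is an assumption reflecting EC2 as it stood around the time of this thread, so it should be checked against current Amazon documentation before use.

```python
# Assumed mapping of EC2 instance types to the AMI architecture they required
# at the time of this thread; verify against Amazon's documentation.
ARCH_BY_INSTANCE_TYPE = {
    "m1.small": "i386",
    "c1.medium": "i386",
    "m1.large": "x86_64",
    "m1.xlarge": "x86_64",
    "c1.xlarge": "x86_64",
}


def ami_arch_for(instance_type):
    """Return the architecture an AMI must have for this instance type."""
    return ARCH_BY_INSTANCE_TYPE[instance_type]


def render_slaves_file(hostnames):
    """Render the contents of Hadoop's conf/slaves: one worker hostname per line."""
    unique = sorted({h.strip() for h in hostnames if h.strip()})
    return "\n".join(unique) + "\n"


if __name__ == "__main__":
    # Hostnames are made up for illustration.
    print(ami_arch_for("m1.large"))
    print(render_slaves_file(["worker-2.internal", "worker-1.internal"]), end="")
```

The master would write the rendered text to its conf/slaves before starting the daemons; nothing here tunes the cluster, in line with the "needn't be optimized" remark.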
-
Re: Re : Good starting instance for AMI
Grant Ingersoll 2010-01-18, 15:26
On Jan 18, 2010, at 10:20 AM, Robin Anil wrote:
> Perfect!. We can have two ami's. Mahout trunk and mahout release version.

Cool. I'll get my base AMI up (just as soon as I figure out the security stuff) and then we can coordinate. Is it possible to have multiple people "manage" an AMI so that the Mahout committers can reasonably take on keeping them up to date?

-Grant
Grant Ingersoll 2010-01-18, 15:26
-
Re: Re : Good starting instance for AMI
Sean Owen 2010-01-18, 15:31
AFAIK AMIs are fixed. You make your instance as you like it, then run some special voodoo to save it off as an AMI. Later you can run the AMI, change it, and build a new one, but that's a new one. Yeah, anyone can do it.

I think this came up before, and my only question is: what's the use case we're trying to answer? So far it sounds like a regular instance with a copy of a Mahout .jar. Is this meaningfully more useful for someone than simply providing the .jar? I can't exactly migrate from one Mahout AMI to another in any sense when upgrades are provided; AMIs aren't a mechanism for distributing a library.

We're also not talking about providing a ready-to-go Hadoop cluster. And shouldn't. This is something Elastic MapReduce is already great for.

Once upon a time I wrote an AMI that would fire up, automatically download data from a location, run recommendations, upload them, and quit. Pretty simple, pretty nice. *That* kind of thing I think is really useful. The AMI is like one big remote method invocation.
Sean Owen 2010-01-18, 15:31
-
Re: Re : Good starting instance for AMI
Grant Ingersoll 2010-01-18, 20:14
On Jan 18, 2010, at 10:31 AM, Sean Owen wrote:
> AFAIK AMIs are fixed. You make your instance as you like it, then run
> some special voodoo to save it off as an AMI. Later you can run the
> AMI, change it, build a new one, but that's a new one. Yeah anyone can
> do it.

Right, I just mostly want a way for others, presumably committers, to be able to edit the same image, so that we aren't duplicating efforts or spinning off a bunch of different AMIs that confuse people.

> I think this came up before and my only question is, what's the use
> case for this we're trying to answer? So far it sounds like a regular
> instance with a copy of a Mahout .jar. Is this meaningfully more
> useful for someone than simply providing the .jar? I can't exactly
> migrate from one Mahout AMI to another in any sense, when upgrades are
> provided -- AMIs aren't a mechanism for distributing a library.
>
> We're also not talking about providing a ready-to-go Hadoop cluster.
> And shouldn't. This is something Elastic Mapreduce is already great
> for.

Except EMR is on 0.18.3. So, yes, I am interested in a ready-to-go Hadoop cluster along w/ a suite of data sets that we can use to benchmark Mahout trunk and make it easier for people to try out Mahout or even run in production. So while I would agree they aren't a mechanism for distributing a library, they are very useful for getting people up and running very quickly.

At any rate, I think the bigger takeaway from your point is that this doesn't have to be some officially supported thing and it isn't required of releases. I mostly, right now, have a need to benchmark Mahout's clustering capabilities and thus need a Hadoop cluster. Rather than do a one-off like many others have done, I'd like to share my efforts w/ others so that we all, hopefully, benefit. I can definitely say that if there was an AMI that was already preconfigured for me w/ Mahout trunk and Hadoop ready to go, I'd use it, and I bet others would too.

So far, I have everything on an instance (mvn, svn, java, Mahout, etc.) except the Hadoop cluster stuff. I've already run mvn install on Mahout. In other words, it's pretty ready to go.

> Once upon a time I wrote an AMI that would fire up, automatically
> download data from a location, run recommendations, upload them, and
> quit. Pretty simple, pretty nice. *That* kind of thing I think is
> really useful. The AMI is like one big remote method invocation.

+1.

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem using Solr/Lucene: http://www.lucidimagination.com/search
Grant Ingersoll 2010-01-18, 20:14
-
Re: Re : Good starting instance for AMI
Ted Dunning 2010-01-18, 20:15
Is there an important difference between creating a custom AMI and using an existing AMI with a startup script that populates everything from S3?

Building an AMI takes a few hours of time and is a total pain in the butt. My eventual result was that I didn't need to do it at all.

I found that I had roughly three levels of variation in my production systems:

- the OS
- the infrastructural components like Java, Hadoop and ZooKeeper
- the application that I wanted to run

My initial thought was that the AMI should cover the first two aspects of variability. But I also found that I wanted to change the version of the infrastructure stuff fairly often in development of the AMI and not infrequently in production.

For Mahout customers, I would imagine that there is a reasonable amount of variability in desired OS (Ubuntu versus Red Hat versus CentOS at least), JDK and Hadoop versions. We definitely can't afford the time to build AMIs for all options.

My final answer for DeepDyve was to use a standard alestic.com AMI. That let me change the OS whenever I needed to and would let Mahout customers pick their preference. These AMIs allow a 16K startup script, which I used to handle infrastructure variation. That worked very well for me and could be used for Mahout.

The cost was a few tens of seconds at boot time. The benefit was a vastly better debug and development cycle. Somebody else handled the OS and I could test many variations of the setup script very quickly. This practice is very much in line with what RightScale does.

Generally, I would avoid the full-custom AMI in favor of a few S3-hosted tarballs rooted at / that anybody can rain down on any Linux version they want.

On Mon, Jan 18, 2010 at 6:54 AM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Create an AMI with:
> 1. Java 1.6
> 2. Maven
> 3. svn
> 4. Mahout's exact Hadoop version
> 5. A checkout of Mahout

--
Ted Dunning, CTO
DeepDyve
Ted Dunning 2010-01-18, 20:15
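The approach described above (a stock AMI plus a startup script of at most 16K that unpacks S3-hosted tarballs rooted at /) can be sketched as a small generator. This is a minimal sketch under stated assumptions, not the script used at DeepDyve: the function name and tarball URL are hypothetical, the tarballs are assumed to be gzip-compressed and publicly readable over HTTP, and the 16K limit is taken to mean 16 * 1024 bytes.

```python
import shlex

# Size limit for the user-data startup script, as described in the thread;
# interpreting "16K" as 16 * 1024 bytes is an assumption.
USER_DATA_LIMIT = 16 * 1024


def render_user_data(tarball_urls):
    """Build a bash startup script that unpacks each tarball at the filesystem root."""
    lines = ["#!/bin/bash", "set -eo pipefail"]
    for url in tarball_urls:
        # Stream each archive straight into tar so nothing is staged on disk.
        lines.append(f"curl -sSf {shlex.quote(url)} | tar -xzf - -C /")
    script = "\n".join(lines) + "\n"
    if len(script.encode("utf-8")) > USER_DATA_LIMIT:
        raise ValueError("startup script exceeds the 16K user-data limit")
    return script


if __name__ == "__main__":
    # The URL below is a made-up example, not a real published tarball.
    print(render_user_data(["https://example-bucket.s3.amazonaws.com/hadoop.tar.gz"]), end="")
```

Because the script only lists tarballs, changing a JDK or Hadoop version means publishing a new tarball and editing one URL, which is the quick debug cycle the message describes.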
-
Re: Re : Good starting instance for AMI
Ken Krugler 2010-01-18, 20:31
On Jan 18, 2010, at 12:15pm, Ted Dunning wrote:
> Is there an important difference between creating an existing AMI or using
> an existing AMI with a startup script that populates everything from S3?
>
> Building an AMI takes a few hours of time and is a total pain in the butt.
> My eventual result was that I didn't need to do it at all.

[snip]

Leaving aside the pros/cons of having a pre-installed Hadoop, there were two things that I found non-trivial to handle via the init script:

1. Get LZO support installed. Though I didn't dig into the various ways to do a scripted install.

2. Turn off atime (mount with noatime). You can do it via the script, but it feels kind of odd to have to remount disks, and either know about the set of volumes or do fancy sed-fu to dynamically generate the list.

Maybe there's an easy way that I missed? Input welcome...

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c w e b m i n i n g
Ken Krugler 2010-01-18, 20:31
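For the second item, one alternative to sed-fu is to read the mount table and generate the remount commands from it. The sketch below assumes the goal is to remount local data volumes with the noatime option, and that the input is text in the /proc/mounts format (device, mount point, filesystem type, options). The function name, the set of filesystem types and the sample mount table are all hypothetical, and mount points containing escaped spaces are not handled.

```python
import shlex

# Filesystem types treated as local data volumes; this set is an assumption.
LOCAL_FS_TYPES = {"ext2", "ext3", "ext4", "xfs"}


def noatime_remount_commands(proc_mounts_text):
    """Return one 'mount -o remount,noatime' command per local volume lacking noatime."""
    commands = []
    for line in proc_mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or malformed lines
        _device, mount_point, fs_type, options = fields[:4]
        if fs_type in LOCAL_FS_TYPES and "noatime" not in options.split(","):
            commands.append(f"mount -o remount,noatime {shlex.quote(mount_point)}")
    return commands


if __name__ == "__main__":
    # Made-up mount table for illustration.
    sample = "/dev/sda1 / ext3 rw 0 0\nproc /proc proc rw 0 0\n/dev/sdb /mnt ext3 rw 0 0\n"
    for command in noatime_remount_commands(sample):
        print(command)
```

An init script could feed the contents of /proc/mounts to this and run the resulting commands, so it never needs to know the set of volumes in advance.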
-
Re: Re : Good starting instance for AMI
Sean Owen 2010-01-18, 20:33
+1, this is a smarter version of what I tried to put together too. A semi-custom AMI would download components and configure via an /etc/rc script. Quite nice.

Point taken about Hadoop and the usefulness amongst ourselves of such a thing. Based on incomplete experience with running AMIs, and a Hadoop cluster, it's going to be no small feat to craft a series of AMIs (or one configurable one) that will reliably come up, find its workers, accept jobs, etc. It's not terrible, but it's the work of a week, I'm guessing.

That would be pretty great, for the whole community, should you succeed. You could probably make a nice paid AMI out of it!
Sean Owen 2010-01-18, 20:33
-
Re: Re : Good starting instance for AMI
Grant Ingersoll 2010-01-19, 00:42
On Jan 18, 2010, at 3:15 PM, Ted Dunning wrote:
> Is there an important difference between creating an existing AMI or using
> an existing AMI with a startup script that populates everything from S3?
>
> Building an AMI takes a few hours of time and is a total pain in the butt.
> My eventual result was that I didn't need to do it at all.
>
> I found that I had roughly three levels of variation in my production
> systems:
>
> - the OS
> - the infrastructural components like java, hadoop and zookeeeper
> - the application that I wanted to run
>
> My initial thought was that the AMI should cover the first two aspects of
> variability. But I also found that I wanted to change the version of the
> infrastructure stuff fairly often in development of the AMI and not
> infrequently in production.
>
> For Mahout customers, I would imagine that there is a reasonable amount of
> variability in desired OS (Ubuntu versus Redhat versus Centos at least), JDK
> and Hadoop versions.

I only see a need for two: the version in trunk and the one in the latest release.

This is all well and good, but I have yet to see anyone say: here's the AMI, the download script and the instructions. So I'm just going to go ahead with what I think is useful for my needs, document it, and put it up there for people to use or not. If anything, it will be useful for me to do it, since I've never set up a Hadoop cluster on EC2 before.

-Grant
Grant Ingersoll 2010-01-19, 00:42
-
Re: Good starting instance for AMI
Sean Owen 2010-01-10, 23:18
I like the Alestic instances, though they don't have Java (IIRC).
http://alestic.com/

On Sun, Jan 10, 2010 at 11:16 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> Anyone have recs on a good AMI to start with on EC2 to load with Mahout? Preferably Linux and already has Java 1.6 installed.
>
> Thanks,
> Grant
Sean Owen 2010-01-10, 23:18
-
Re: Good starting instance for AMITed Dunning 2010-01-11, 01:47
If you aren't going to use EMR, possibly because of hadoop version issues,
then I strongly second the recommendation of the alestic instances.

All of these include a start script that is downloaded from what is called a "user-data file". This can be up to 16K in length. I used that script to customize my instances with additional software like hadoop, java, our own software as well as reconfiguring the instance as necessary, mounting elastic block volumes and tweaking the DHCP configuration to add an over-ride to avoid a few gotchas. Total boot time was still typically < 40 s and I hear that it has gotten faster since then.

On Sun, Jan 10, 2010 at 3:18 PM, Sean Owen <[EMAIL PROTECTED]> wrote:

> I like the Alestic instances, though they don't have Java (IIRC).
> http://alestic.com/
>
> On Sun, Jan 10, 2010 at 11:16 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
> > Anyone have recs on a good AMI to start with on EC2 to load with Mahout?
> > Preferably Linux and already has Java 1.6 installed.
> >
> > Thanks,
> > Grant
>

--
Ted Dunning, CTO
DeepDyve
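For anyone following along, a start script of the kind described here might look roughly like the sketch below. It is not Ted's script. It assumes the extra software is published as gzipped tarballs rooted at / and reachable over HTTP(S); the function names and the example URL are hypothetical placeholders.

```shell
#!/bin/sh
# Sketch of a user-data style boot helper (hypothetical names throughout).
# Assumption: each tarball is gzipped and its paths are rooted at /.

# Unpack one gzipped tarball into the given root directory (default /).
unpack_rooted() {
    tarball=$1
    root=${2:-/}
    mkdir -p "$root" && tar -xzf "$tarball" -C "$root"
}

# Download each URL in turn and unpack it; stop at the first failure.
install_from_urls() {
    root=$1
    shift
    for url in "$@"; do
        tmp=$(mktemp) || return 1
        curl -fsSL -o "$tmp" "$url" || { rm -f "$tmp"; return 1; }
        unpack_rooted "$tmp" "$root" || { rm -f "$tmp"; return 1; }
        rm -f "$tmp"
    done
}

# Example call with a placeholder URL (not a real bucket):
# install_from_urls / "https://example-bucket.s3.amazonaws.com/hadoop.tar.gz"
```

Keeping the download and unpack steps in small functions makes it cheap to try a different Java or Hadoop version by changing only the list of URLs, which seems to be the point of avoiding a custom AMI.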
Ted Dunning 2010-01-11, 01:47
-
Re: Good starting instance for AMIGrant Ingersoll 2010-01-11, 02:08
On Jan 10, 2010, at 8:47 PM, Ted Dunning wrote:

> If you aren't going to use EMR, possibly because of hadoop version issues,
> then I strongly second the recommendation of the alestic instances.
>

Right, I want to run trunk.

> All of these include a start script that is downloaded from what is called a
> "user-data file". This can be up to 16K in length. I used that script to
> customize my instances with additional software like hadoop, java, our own
> software as well as reconfiguring the instance as necessary, mounting
> elastic block volumes and tweaking the DHCP configuration to add an
> over-ride to avoid a few gotchas. Total boot time was still typically < 40
> s and I hear that it has gotten faster since then.

Can you share the script, obviously removing the part for your prop. software?

-Grant
Grant Ingersoll 2010-01-11, 02:08
-
Re: Good starting instance for AMIKen Krugler 2010-01-11, 02:19
Hi Grant,
[snip]

>> All of these include a start script that is downloaded from what is
>> called a "user-data file". This can be up to 16K in length. I used that
>> script to customize my instances with additional software like hadoop,
>> java, our own software as well as reconfiguring the instance as
>> necessary, mounting elastic block volumes and tweaking the DHCP
>> configuration to add an over-ride to avoid a few gotchas. Total boot
>> time was still typically < 40 s and I hear that it has gotten faster
>> since then.
>
> Can you share the script, obviously removing the part for your prop.
> software?

FWIW, you can look at what we use for Bixo, with our EC2 AMI - it's at:

http://github.com/bixo/bixo/blob/master/bin/ec2/hadoop-aws/etc/hadoop-ec2-init-remote.sh

Though the version currently in GitHub is missing one important correction - you want to call ulimit -n 20000 right before running the hadoop-daemon script to start the tasktracker on the slave, as in:

ulimit -n 20000
"$HADOOP_HOME"/bin/hadoop-daemon.sh start tasktracker

-- Ken

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
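A small guard along these lines could confirm that the raised limit actually took effect before the tasktracker starts. This is only a sketch: fd_limit_ok is a hypothetical helper name, and 20000 is simply the value quoted above.

```shell
#!/bin/sh
# Sketch: check that the soft open-file limit is at least the wanted value.
# fd_limit_ok is a hypothetical helper, not part of the Bixo script.

fd_limit_ok() {
    want=$1
    have=$(ulimit -n)
    # ulimit -n prints either a number or the word "unlimited".
    [ "$have" = "unlimited" ] || [ "$have" -ge "$want" ]
}

# Possible use in an init script (commented out so the sketch has no side effects):
# ulimit -n 20000
# fd_limit_ok 20000 || echo "warning: open-file limit is only $(ulimit -n)" >&2
```

If the hard limit on the instance is lower than the requested value, the ulimit call fails, so a check like this turns a silent misconfiguration into a visible warning.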
Ken Krugler 2010-01-11, 02:19
-
Re: Good starting instance for AMITed Dunning 2010-01-11, 02:41
On Sun, Jan 10, 2010 at 6:08 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:

> Can you share the script, obviously removing the part for your prop.
> software?

Sure. Apologies for *really* ugly code. Below is the launch script, with some boring and *secret* bits expunged. The only real downfall of this sort of approach is that the startup script needs to have stuff injected into it for different kinds of servers. Doing it over, I would make a completely static boot script that looks to zookeeper to find out what tasks need doing. Basically, by making the client-boot scripts dynamic, I was trying to inject configuration management via an inappropriate mechanism. Since I had ZK running in the cloud already, I should have just used a real configuration management system instead of gross scripting.

You need to make sure you have a command line program on the client to receive any secret keys because user-data is not considered secure. We do this with something like this:

# Send in secrets via stdin instead of command line to avoid snoopers.
echo
echo "Running remote script..."
echo "Process $$ : $CLOUD_SERVER:$ZK_PORT $VOLUME $ZK_INTERNAL $INT_HOST_NAME $VOLUME_ZONE"

ssh -n -o StrictHostKeyChecking=false -i $ADMIN_KEY_RSA root@$i "echo $AWS_API_KEY $AWS_API_SECRET | /home/client/$script $ENV $REV "$CLOUD_SERVER":"$ZK_PORT" $VOLUME $ZK_INTERNAL $INT_HOST_NAME $VOLUME_ZONE"

Here is a less than complete excerpt of the launch script. It should have most of the bits you need. It won't run as it stands because a fair bit of stuff has been expunged.
#!/bin/sh

#########
# ASSUMPTIONS:
# ---- keys are available and referenced
#   cert and secret key are available and have correct perms
#   client-boot.sh has been constructed to download and install all necessary software
#   file named cloud-key in the current directory contains the key
#   obtained using
#
#     ec2-add-keypair cloud-admin-key
# ---- environment variables
#   EC2_PRIVATE_KEY=~/.ec2/pk-xx.pem
#   EC2_CERT=~/.ec2/cert-xx.pem
#   EC2_HOME points to EC2 distro directory
#   AWS_API_KEY=xx
#   AWS_API_SECRET=xx/+yy/zz
#   path includes $EC2_HOME/bin

#########
# DEFINITIONS:
ami=ami-1c5db975

. ./.cloud_client_env_settings

if [ $# -eq 5 -a "$5" = "-large" ]
then
  ami="ami-b1fe19d8 -t m1.large"
fi

#########
# This script will accept two arguments:
# Usage: client_cloud_launch.sh 3 namenode.sh
# the first parameter is the number of instances to launch
# the second parameter is the script to start on each instance
#
# it will create a node start script and then launch a bunch of instances

#START

echo $ADMIN_KEY
echo $ADMIN_KEY_RSA
echo $ZK_ADMIN_KEY
echo $ENV
echo $REV

# please pay attention to this key-pair, used for creating instances,
# you should have the correct $ADMIN_KEY_RSA to go with this key pair
KEY_PAIR=cloud-admin-key

ZK_GROUP_NAME=zk_cluster
CLIENT_GROUP_NAME=zk_client
cluster_size=$1
ZK_PORT=4099
VOLUME_ZONE=us-east-1a
VOLUME=...
TIMEOUT=600

start=$(date +%s)
echo started at $(date)

# try to do ec2-describe-instances to get information on what ZK servers are available for use
# assumptions are that ZK_GROUP_NAME is the group with which one ZK cluster will be started,
# otherwise we would not know which one is which
ec2-describe-instances > zk_instances.tmp.$$

... really silly code to do what should just be grep deleted here. all it does is hack zk_instances.tmp.$$ into better form in zk_instances.$$ ...
if [ "$ALREADYREAD" = 1 ]
then
  sed -n "$NEXTLINE","$lineno"p zk_instances.tmp.$$ >> zk_instances.$$
fi

ZK_EXTERNAL=$(grep INSTANCE zk_instances.$$ | grep running | grep $ZK_ADMIN_KEY | cut -f4 | tr '\n' '~' | sed -e 's/~/:2181,/g' -e 's/,$//')
ZK_INTERNAL=$(grep INSTANCE zk_instances.$$ | grep running | grep $ZK_ADMIN_KEY | cut -f5 | tr '\n' '~' | sed -e 's/~/:2181,/g' -e 's/,$//')
CLOUD_SERVER=$(grep INSTANCE zk_instances.$$ | grep running | grep $ZK_ADMIN_KEY | cut -f5 | head -1)
echo $CLOUD_SERVER

# launch client nodes. This also causes client-boot.sh to be run on each node.
# Somebody else should have built client-boot.sh for us
echo starting $cluster_size instances now...
ins_start_time=$(date +%s)
ec2-run-instances $ami -g $CLIENT_GROUP_NAME -k $ADMIN_KEY -f client-boot.sh -z $VOLUME_ZONE -n $cluster_size > client_instances.tmp.$$
cat client_instances.tmp.$$ | grep INSTANCE | cut -f2 > client_instances.$$

T1=0
# this factor is 90% of total cluster we want to start,
# once the number of running instance reaches this factor,
# we will continue our job, killing rather than waiting for the last 10%
factor=`awk -v x=$cluster_size BEGIN'{printf "%d\n",x*0.9+0.5 }'`
while [ "$T1" != $cluster_size ]
do
  rm -f client_instances.tmp.$$
  ec2-describe-instances | grep INSTANCE | grep running > current_running.$$
  all_instance=`cat client_instances.$$`
  T1=0
  for inst in $all_instance; do
    if [ -z "$inst" ]; then
      continue;
    fi
    ok=`cat current_running.$$ | grep $inst`
    if [ -z "$ok" ]; then
      echo Wait a moment, $inst is not ready yet.
    else
      T1=`expr $T1 + 1`
      echo $ok >> client_instances.tmp.$$
    fi
  done

  # check timeout or not
  ins_curr_time=$(date +%s)
  elapse=`expr $ins_curr_time - $ins_start_time`
  if [ $elapse -gt $TIMEOUT ]; then
    # if we have had 90% instances started, we can stop waiting and continue the following process
    if [ ! $T1 -lt $factor ]; then
      echo We have had $T1 instances started, kill the unstarted ones...
      # should KILL the unstarted instances here
      for everyinst in $all_instance; do
        isrunning=`cat current_running.$$ | grep $everyinst`
        if [ -z "$isrunning" ]; then
          ec2-terminate-instances $everyinst
        fi
      done
      break
    fi
    echo We have waited for $elapse seconds, but only $T1 started, will not wait any more. Program will exit now!
    # before exit we need to stop all instances we planned to start
    for
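The "90% of the cluster" cutoff in the waiting loop can also be computed with plain shell integer arithmetic instead of awk. The following is a sketch under that assumption, not part of the original script; min_ready and enough_running are hypothetical names.

```shell
#!/bin/sh
# Sketch: decide when enough instances are running to stop waiting.
# Both function names are hypothetical.

# Round 90% of the requested cluster size to the nearest integer.
# (size * 9 + 5) / 10 equals floor(size * 0.9 + 0.5) for non-negative sizes.
min_ready() {
    size=$1
    echo $(( (size * 9 + 5) / 10 ))
}

# Succeeds when the running count has reached the 90% threshold.
enough_running() {
    running=$1
    size=$2
    [ "$running" -ge "$(min_ready "$size")" ]
}
```

With helpers like these, the timeout branch of a loop could test `enough_running "$T1" "$cluster_size"` rather than comparing against a precomputed factor. The integer form also sidesteps any floating-point rounding in the awk expression.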
Ted Dunning 2010-01-11, 02:41
-
Re: Good starting instance for AMIRobin Anil 2010-01-11, 03:29
I have a 20GB EBS volume with the code and hadoop already checked out. I just start any instance, install svn, mvn and java (2-3 mins), and start hadoop.

Robin

On Mon, Jan 11, 2010 at 8:11 AM, Ted Dunning <[EMAIL PROTECTED]> wrote:

> On Sun, Jan 10, 2010 at 6:08 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>
> > Can you share the script, obviously removing the part for your prop.
> > software?
>
> Sure. Apologies for *really* ugly code. Below is the launch script, with
> some boring and *secret* bits expunged.
>
> [snip]
Robin Anil 2010-01-11, 03:29
-
Re: Good starting instance for AMIBenson Margulies 2010-01-10, 23:21
Stupid question: I thought there was a way to use the cloud as a
hadoop farm directly without having to configure instances.

On Sun, Jan 10, 2010 at 6:18 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
> I like the Alestic instances, though they don't have Java (IIRC).
> http://alestic.com/
>
> On Sun, Jan 10, 2010 at 11:16 PM, Grant Ingersoll <[EMAIL PROTECTED]> wrote:
>> Anyone have recs on a good AMI to start with on EC2 to load with Mahout? Preferably Linux and already has Java 1.6 installed.
>>
>> Thanks,
>> Grant
>
Benson Margulies 2010-01-10, 23:21
-
Re: Good starting instance for AMIJake Mannix 2010-01-10, 23:27
You mean Elastic MapReduce (EMR)? Has anyone here had any luck with that
for this or other projects?

-jake

On Jan 10, 2010 3:21 PM, "Benson Margulies" <[EMAIL PROTECTED]> wrote:

Stupid question: I thought there was a way to use the cloud as a
hadoop farm directly without having to configure instances.

On Sun, Jan 10, 2010 at 6:18 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
> I like the Alestic instances...
Jake Mannix 2010-01-10, 23:27
-
Re: Good starting instance for AMIBenson Margulies 2010-01-11, 00:21
That's what I meant. I haven't tried it yet, so I've got the same
question Jake has.

On Sun, Jan 10, 2010 at 6:27 PM, Jake Mannix <[EMAIL PROTECTED]> wrote:
> You mean Elastic MapReduce (EMR)? Has anyone here had any luck with that
> for this or other projects?
>
> -jake
>
> On Jan 10, 2010 3:21 PM, "Benson Margulies" <[EMAIL PROTECTED]> wrote:
>
> Stupid question: I thought there was a way to use the cloud as a
> hadoop farm directly without having to configure instances.
>
> On Sun, Jan 10, 2010 at 6:18 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
> > I like the Alestic instances...
>
Benson Margulies 2010-01-11, 00:21
-
Re: Good starting instance for AMIKen Krugler 2010-01-11, 01:32
BTW, I assume everybody knows about http://cwiki.apache.org/MAHOUT/mahout-on-elastic-mapreduce.html
-- Ken

On Jan 10, 2010, at 4:21pm, Benson Margulies wrote:

> That's what I meant. I haven't tried it yet, so I've got the same
> question Jake has.
>
> On Sun, Jan 10, 2010 at 6:27 PM, Jake Mannix <[EMAIL PROTECTED]> wrote:
>> You mean Elastic MapReduce (EMR)? Has anyone here had any luck with that
>> for this or other projects?
>>
>> -jake
>>
>> On Jan 10, 2010 3:21 PM, "Benson Margulies" <[EMAIL PROTECTED]> wrote:
>>
>> Stupid question: I thought there was a way to use the cloud as a
>> hadoop farm directly without having to configure instances.
>>
>> On Sun, Jan 10, 2010 at 6:18 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
>> > I like the Alestic instances...
>>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Ken Krugler 2010-01-11, 01:32
-
Re: Good starting instance for AMIKen Krugler 2010-01-11, 01:03
I've been using EMR for the public terabyte dataset project.
In general it's worked for me, with the following caveats:

1. Hadoop 0.18.3, which meant I had to re-work some of my code that depended on newer (Hadoop 0.19.x) support.
2. It was kind of painful to get it running initially (setting up the right credentials.json file, etc).
3. You'll need S3 access, of course, which is another series of hoops to jump through.
4. You really want to run in the mode where you create an EMR job with no steps, then add steps to run - otherwise you can waste a lot of time firing up EMR jobs that fail immediately.
5. For bigger clusters, some of the Hadoop configuration parameters aren't set very well.

-- Ken

On Jan 10, 2010, at 4:21pm, Benson Margulies wrote:

> That's what I meant. I haven't tried it yet, so I've got the same
> question Jake has.
>
> On Sun, Jan 10, 2010 at 6:27 PM, Jake Mannix <[EMAIL PROTECTED]> wrote:
>> You mean Elastic MapReduce (EMR)? Has anyone here had any luck with that
>> for this or other projects?
>>
>> -jake
>>
>> On Jan 10, 2010 3:21 PM, "Benson Margulies" <[EMAIL PROTECTED]> wrote:
>>
>> Stupid question: I thought there was a way to use the cloud as a
>> hadoop farm directly without having to configure instances.
>>
>> On Sun, Jan 10, 2010 at 6:18 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
>> > I like the Alestic instances...
>>

--------------------------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
Ken Krugler 2010-01-11, 01:03
|