Re: recommendations with Hadoop and RecommenderJob in Amazon EC2, suggestions for performance?
Ted Dunning 2011-01-06, 17:24
My experience is that until you get to 5-10 nodes, using Hadoop will be
slower than a sequential implementation.
You can certainly continue with 3 nodes for testing, as Sean suggests, but I
would not expect that to be a performant solution.
On Thu, Jan 6, 2011 at 9:00 AM, Stefano Bellasio wrote:
> OK, so can I continue with just 3 nodes? I'm a bit confused right now. By
> computation time I mean that I need to know how long each test takes... as
> I said, I can see nothing from my JobTracker: it shows the number of nodes
> but no active jobs or map/reduce operations, and I don't know why :/
> On Jan 6, 2011, at 5:52 PM, Sean Owen wrote:
> > Those numbers seem "reasonable" to a first approximation, maybe a
> > little higher than I would have expected given past experience.
> > You should be able to increase speed with more nodes, sure, but I use
> > 3 for testing too.
> > The jobs are I/O bound for sure. I don't think you will see an
> > appreciable difference with different algorithms.
> > Yes, the amount of data used in the similarity computation is the big
> > factor in running time. You probably need to tell it to keep fewer
> > item-item pairs via the "max" parameters you mentioned earlier.
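> > A run with tighter caps might look like this (a sketch only: the cap
> > values and output path are illustrative, and it assumes the 0.5-era
> > RecommenderJob class and the flags you already used):
> >
> >   # keep fewer item-item pairs so the similarity phase touches less data
> >   hadoop jar mahout-core-0.5-SNAPSHOT-job.jar \
> >     org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
> >     -Dmapred.input.dir=input -Dmapred.output.dir=output \
> >     --maxSimilaritiesPerItem 50 --maxPrefsPerUser 20 \
> >     --maxCooccurrencesPerItem 50 \
> >     -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt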
> > mapred.map.tasks controls the number of mappers -- or at least
> > suggests it to Hadoop.
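> > For example, passed as -D options on the same command (a sketch; the
> > mapper count is only a hint to Hadoop, since input split sizes decide
> > the real number, while mapred.reduce.tasks is honored directly):
> >
> >   # ask for ~12 map tasks (a hint) and 6 reduce tasks
> >   hadoop jar mahout-core-0.5-SNAPSHOT-job.jar \
> >     org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
> >     -Dmapred.map.tasks=12 -Dmapred.reduce.tasks=6 \
> >     -Dmapred.input.dir=input -Dmapred.output.dir=output \
> >     -s SIMILARITY_COOCCURRENCE -n 10 -u users.txt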
> > What do you mean by the time of computation? The job tracker shows
> > you when the individual tasks start and finish.
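> > If the JobTracker web UI shows you nothing, you can also query job
> > status from the command line (a sketch; assumes the Hadoop 0.20-era
> > "hadoop job" tool, and the job ID below is hypothetical):
> >
> >   hadoop job -list all                       # job IDs and start times
> >   hadoop job -status job_201101061200_0001   # completion %, counters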
> > On Thu, Jan 6, 2011 at 1:31 PM, Stefano Bellasio
> > <[EMAIL PROTECTED]> wrote:
> >> Hi guys, I'm doing some tests these days and I have some questions.
> >> Here is my environment and basic configuration:
> >> 1) Amazon EC2 cluster launched with the Cloudera script and Apache
> >> Whirr; I'm using 3 worker nodes with large instances + one master node
> >> to control the cluster.
> >> 2) MovieLens data set, in the 100K, 1M, and 10M versions... my tests
> >> right now are on the 10M version.
> >> This is the command that I'm using to run the job:
> >> hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar \
> >>   org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
> >>   -Dmapred.input.dir=input -Dmapred.output.dir=data/movielens_2gennaio \
> >>   --maxSimilaritiesPerItem 150 --maxPrefsPerUser 30 \
> >>   --maxCooccurrencesPerItem 100 -s SIMILARITY_COOCCURRENCE -n 10 \
> >>   -u users.txt
> >> I'm trying different values for:
> >> maxSimilaritiesPerItem
> >> maxPrefsPerUser
> >> maxCooccurrencesPerItem
> >> and using about 10 users each time. With this command on the 10M data
> >> set, my cluster took more than 4 hours (with 3 nodes) to produce
> >> recommendations. Is that a good time?
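> >> A sweep over one of those parameters could be scripted like this (a
> >> sketch of what such a loop might look like; the values and output
> >> paths are illustrative):
> >>
> >>   # time one run per maxSimilaritiesPerItem value
> >>   for n in 50 100 150; do
> >>     start=$(date +%s)
> >>     hadoop jar /home/ste/Desktop/mahout-core-0.5-SNAPSHOT-job.jar \
> >>       org.apache.mahout.cf.taste.hadoop.item.RecommenderJob \
> >>       -Dmapred.input.dir=input -Dmapred.output.dir=out_$n \
> >>       --maxSimilaritiesPerItem $n --maxPrefsPerUser 30 \
> >>       --maxCooccurrencesPerItem 100 -s SIMILARITY_COOCCURRENCE \
> >>       -n 10 -u users.txt
> >>     echo "maxSimilaritiesPerItem=$n took $(( $(date +%s) - start ))s"
> >>   done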
> >> Well, right now I have 2 goals, and I'm posting here to ask for your
> >> help with some problems :) My primary goal is to run item-based
> >> recommendations and see how changing the parameters affects the running
> >> time and performance of my cluster. Also, I need to look at the
> >> similarities; I will be testing three of them: cosine, Pearson, and
> >> co-occurrence. Good choices? I also noted that all the similarity
> >> computation is done in RAM (right?), so my matrix is built and stored
> >> in RAM. Is there another way to do that?
> >> - I need to understand what kind of scalability I get with more nodes
> >> (3 for now, I can go up to 5); I think the similarity calculation takes
> >> most of the time, am I right?
> >> - I know there is something like mapred.task to define how many
> >> instances a task can use... do I need that? How can I specify it?
> >> - I need to see the exact time of each computation; I'm looking at the
> >> JobTracker, but it seems nothing ever shows up in my cluster even when
> >> a job (with map and reduce phases) is running. Is there another way to
> >> know the exact time of each computation?
> >> - Finally, I will take all the data and try to plot it to figure out