Nutch, mail # user - Finally got hadoop + nutch 1.3 + cygwin cluster working! ? now


webdev1977 2011-09-29, 18:50
I finally got a three-machine cluster working with Nutch 1.3, Hadoop 0.20.0,
and Cygwin!  I have a few questions about configuration.

I am only going to be crawling a few domains, and I need this cluster to be
very fast.  Right now the crawl is slower with Hadoop in distributed mode than
with a plain local crawl.  I am *guessing* that is due to network
overhead.  It is very, very slow.

Which settings in mapred-site.xml and hdfs-site.xml might make my crawl
faster?  The crawldb update seems to take the longest.  I was digging
around in the Hadoop documentation, and the following seemed like good
settings:

mapred.reduce.tasks = <2 x slave processors>
mapred.map.tasks = <10 x the number of slave processors>

increase the heap size in mapred.child.java.opts
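
As a rough sketch, this is how those settings might look in mapred-site.xml. The numbers are illustrative assumptions only: they assume a hypothetical total of 4 slave processors (the real count is not given here), and the 512 MB heap is an arbitrary example value, not a recommendation.

```xml
<?xml version="1.0"?>
<!-- mapred-site.xml sketch. All values are assumptions for illustration. -->
<configuration>
  <!-- 2 x slave processors, assuming 4 slave processors in total -->
  <property>
    <name>mapred.reduce.tasks</name>
    <value>8</value>
  </property>
  <!-- 10 x slave processors, assuming 4 slave processors in total -->
  <property>
    <name>mapred.map.tasks</name>
    <value>40</value>
  </property>
  <!-- JVM options for each child task; the heap size here is an example value -->
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx512m</value>
  </property>
</configuration>
```

Note that Hadoop treats mapred.map.tasks as a hint, since the actual number of map tasks depends on how the input is split, so the value above may not be what a given job ends up using.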

Is there anything else I am missing?  What about starting another crawl cycle
immediately after the first generate is complete?  Would that cause concurrency
problems when updating files/dbs?

--
View this message in context: http://lucene.472066.n3.nabble.com/Finally-got-hadoop-nutch-1-3-cygwin-cluster-working-now-tp3380170p3380170.html
Sent from the Nutch - User mailing list archive at Nabble.com.
Markus Jelsma 2011-09-29, 18:55