Re: Going Beyond the Prototype
webdev1977 2011-05-16, 10:41
Julien Nioche-4 wrote:
>> I was saying that based on what the previous poster stated. Also the fact
>> that I have read through quite a bit of posts stating that the problem with
>> crawling in a vertical environment has to do with the way fetcher2 was
>> built. The fetches are grouped by domain name and if you have a lot of URLs
>> from the same domain then you are not able to do quick mapreduce jobs.
> Nutch's default behaviour is to be polite to the hosts it visits. If you own
> the hosts (or have an agreement with the owner) you can of course hit them
> as hard as you want and set a higher number of threads per host or time
> between hits. If you don't own the hosts then you simply should not do that
> and use the defaults used in Nutch as a matter of courtesy. (moreover if you
> are too aggressive in your choice of parameters then you'll probably be
> blacklisted by the target servers and won't be allowed to fetch any
> Let's be completely clear once and for all: there is no particular issue
> with using Nutch for vertical crawls - loads of people have done and still
> do that.
We do indeed own the hosts, and I have been experimenting with the number of
threads I can use without crashing our web server/database. This
has led me to refactor some of our code to improve connection
pooling and resource allocation.
What I don't know how to speed up is the MapReduce jobs. It takes
approximately 12 hours JUST to do the fetching of 250,000 or so URLs. The
MapReduce part takes about 36 hours.
Is this normal? Is there any way to speed this up?
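To put those figures in perspective, here is a rough back-of-the-envelope sketch. The URL count and the two durations come from this post. The helper `min_fetch_hours` and its inputs (number of hosts, threads per host, effective seconds per request) are hypothetical, and it assumes URLs are spread evenly across hosts. It is not how Nutch computes anything, just arithmetic on the politeness model described above.

```python
# Figures quoted in the post.
URLS = 250_000
FETCH_HOURS = 12
MAPREDUCE_HOURS = 36


def urls_per_second(n_urls, hours):
    """Average throughput over the whole run."""
    return n_urls / (hours * 3600)


def min_fetch_hours(n_urls, n_hosts, threads_per_host, seconds_per_request):
    """Rough lower bound on fetch time when each host is hit by a fixed
    number of threads, each completing one request every
    `seconds_per_request` seconds (hypothetical model, even spread of
    URLs across hosts assumed)."""
    urls_per_host = n_urls / n_hosts
    return urls_per_host * seconds_per_request / threads_per_host / 3600


if __name__ == "__main__":
    print(f"fetch rate:      {urls_per_second(URLS, FETCH_HOURS):.2f} URLs/s")
    print(f"end-to-end rate: "
          f"{urls_per_second(URLS, FETCH_HOURS + MAPREDUCE_HOURS):.2f} URLs/s")
    # Hypothetical: everything on one host, 10 threads, 1 s per request.
    print(f"single-host lower bound: {min_fetch_hours(URLS, 1, 10, 1.0):.1f} h")
```

If the measured fetch time is far above such a bound, the bottleneck is more likely the server or the crawler setup than the per-host limits themselves.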
I have seen talk of the generate.max.per.host setting, but I don't want
to limit the number of URLs I fetch. And to me, this is what this setting
seems to do.
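A toy simulation may help with the concern about losing URLs. It assumes the per-host cap applies to each generated fetch list separately and that URLs skipped in one round stay eligible for later rounds. That assumption is worth checking against the documentation for the Nutch version in use. The function names and host names are hypothetical and this is not Nutch's generator code.

```python
from collections import defaultdict


def generate_segment(pending, max_per_host):
    """Select at most `max_per_host` URLs per host from `pending`.

    Returns (selected, deferred). In this sketch a negative cap means
    "no limit".
    """
    if max_per_host == 0:
        raise ValueError("a cap of 0 would never select anything")
    counts = defaultdict(int)
    selected, deferred = [], []
    for host, url in pending:
        if max_per_host < 0 or counts[host] < max_per_host:
            counts[host] += 1
            selected.append((host, url))
        else:
            deferred.append((host, url))
    return selected, deferred


def crawl_rounds(pending, max_per_host):
    """Repeat generate rounds until nothing is left to fetch."""
    rounds = []
    while pending:
        selected, pending = generate_segment(pending, max_per_host)
        rounds.append(selected)
    return rounds


if __name__ == "__main__":
    urls = [("a.example", f"/page{i}") for i in range(5)]
    urls += [("b.example", f"/page{i}") for i in range(2)]
    for n, segment in enumerate(crawl_rounds(urls, max_per_host=2), start=1):
        print(f"round {n}: {len(segment)} URLs")
```

Under that assumption the cap changes how many rounds are needed and how evenly each round is spread across hosts, not how many URLs are fetched in total.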
View this message in context: http://lucene.472066.n3.nabble.com/Going-Beyond-the-Prototype-tp2923289p2947297.html
Sent from the Nutch - User mailing list archive at Nabble.com.