Hi Michael,

> reducer spills a lot of records

The job counter "Spilled Records" is not for the reducers alone, it also
includes records spilled to disk by the map tasks.

> 255K input records

Does your CrawlDb only contain 250,000 entries?

Also, how many hosts (or domains/ips, depending on partition.url.mode)
are in the CrawlDb? Note: the counts per host/domain/ip are kept in a
HashMap, which does not scale up to 100 million hosts.
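As a very rough back-of-envelope (the real per-entry cost depends on host name
length and JVM internals, so take this only as an order-of-magnitude sketch):
  100,000,000 hosts x ~100-150 bytes per HashMap entry (key string, counter, entry overhead)
  = roughly 10-15 GB of reducer heap just for the per-host counts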

> 100 million spilled records
> 13G file bytes written

With these numbers, my estimate would be tens or hundreds of millions of CrawlDb
items (keep in mind that a record can be spilled, and hence counted, more than
once while the map and reduce outputs are sorted and merged).
Something is wrong if the CrawlDb is really that small.

> -D generate.update.crawldb=true

That's expensive if your CrawlDb is large, because the entire CrawlDb is read
and rewritten to mark the URLs selected for fetching.

> Do I have to increase  mapreduce.reduce.java.opts and mapreduce.reduce.memory.mb?

If the problem is caused by a large number of hosts, increasing these settings might help.
But you could also try to make sure that all data (HDFS and temporary) is on SSDs,
and try different compression settings (CrawlDb and temporary data), see
  mapreduce.output.fileoutputformat.compress.codec
  mapreduce.map.output.compress
  mapreduce.map.output.compress.codec
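For illustration only (not tested for your setup, Snappy must be available in
your Hadoop installation, and the numbers are just an example, not a
recommendation), the additional -D options could look like this, keeping the
JVM heap at roughly 75-80% of the container size:

  bin/nutch generate \
    -D mapreduce.reduce.memory.mb=16000 -D mapreduce.reduce.java.opts=-Xmx12000m \
    -D mapreduce.map.output.compress=true \
    -D mapreduce.map.output.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    -D mapreduce.output.fileoutputformat.compress=true \
    -D mapreduce.output.fileoutputformat.compress.codec=org.apache.hadoop.io.compress.SnappyCodec \
    ... (rest of your existing arguments)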

Best,
Sebastian
On 04/13/2018 02:52 AM, Michael Coffey wrote:
> Greetings Nutchlings,
> I would like to make my generate jobs go faster, and I see that the reducer spills a lot of records.
> Here are the numbers for a typical long-running reduce task of the generate-select job: 100 million spilled records, 255K input records, 90k output records, 13G file bytes written, only 3G committed heap usage. mapreduce.reduce.java.opts is 8000M, mapreduce.reduce.memory.mb is 12000.
> Do I have to increase  mapreduce.reduce.java.opts and mapreduce.reduce.memory.mb? If so, how can I compute how big they should be? Also, are there other settings changes needed?
> My actual command line is apache-nutch-1.12/runtime/deploy/bin/nutch generate  -D mapreduce.job.reduces=16 -D mapreduce.input.fileinputformat.split.minsize=536870912 -D mapreduce.reduce.memory.mb=12000 -D mapreduce.reduce.java.opts=-Xmx8000m  -D db.fetch.interval.default=5184000 -D db.fetch.schedule.adaptive.min_interval=3888000 -D generate.update.crawldb=true  -D generate.max.count=25 /crawls/popular/data/crawldb /crawls/popular/data/segments/ -topN 60000 -numFetchers 2 -noFilter -maxNumSegments 24