-
Nutch on a shared filesystem
rishi pathak 2011-01-17, 08:21
Hi,
Our setup has 2 data nodes with 16 cores each. We are trying to set up Nutch to use a shared local filesystem instead of HDFS. With a single tasktracker it works fine, but with more than one tasktracker it fails with an error and exits. The error is related to the temp data dir for map/reduce tasks.
# mapred conf:
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>yc1.cn:9001</value>
  </property>
  <property>
    <name>mapred.system.dir</name>
    <value>/home/internal/sysadmin/nazgul/hadoop/dfs/local/mapredSystemDir/</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>/home/internal/sysadmin/nazgul/hadoop/dfs/local/mapredLocalDir/</value>
    <!--<value>/tmp/</value> -->
  </property>
  <property>
    <name>mapred.tasktracker.map.task.maximum</name>
    <value>16</value>
  </property>
  <property>
    <name>mapred.tasktracker.map.task.maximum</name>
    <value>16</value>
  </property>
  <property>
    <name>mapreduce.cluster.local.dir</name>
    <value>/home/internal/sysadmin/nazgul/hadoop/dfs/local/mapredClusterLocalDir/</value>
  </property>
</configuration>
# Error ########
java.io.IOException: The temporary job-output directory file:/tmp/hadoop-nazgul/mapred/temp/inject-temp-1557478199/_temporary doesn't exist!
    at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:204)
    at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:234)
    at org.apache.hadoop.mapred.SequenceFileOutputFormat.getRecordWriter(SequenceFileOutputFormat.java:48)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:433)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:411)
    at org.apache.hadoop.mapred.Child.main(Child.java:170)

Injector: Merging injected urls into crawl db.
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/tmp/hadoop-nazgul/mapred/temp/inject-temp-1557478199
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:226)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:124)
-- --- Rishi Pathak National PARAM Supercomputing Facility C-DAC, Pune, India
-
Re: Nutch on a shared filesystem
Alex McLintock 2011-01-17, 08:29
I'm not sure if you can do this (I would recommend HDFS instead of a shared area), but can you insert the hostname of the node into the temp dir? That might stop separate nodes from messing up each other's temp areas.
(However, I am guessing here.)
On 17 January 2011 08:21, rishi pathak <[EMAIL PROTECTED]> wrote:
> # Error ########
>
> java.io.IOException: The temporary job-output directory
> file:/tmp/hadoop-nazgul/mapred/temp/inject-temp-1557478199/_temporary
> doesn't exist!
-
Re: Nutch on a shared filesystem
rishi pathak 2011-01-17, 09:59
Hello Alex,
We have tried the setup with HDFS and it worked fine. The shared filesystem discussed here is a Lustre parallel filesystem, mounted on all the compute nodes (tasktrackers). The problem, as it seems to me, is not different nodes messing up each other's temp areas, but temp data written by one tasktracker on one node being accessed by another. The dir /tmp/hadoop-nazgul/mapred/temp/inject-temp-1557478199/_temporary does exist on the second node.
On Mon, Jan 17, 2011 at 1:59 PM, Alex McLintock <[EMAIL PROTECTED]> wrote:
> I'm not sure if you can do this (I would recommend HDFS instead of a shared
> area) but can you insert the hostname of the node into the temp dir? That
> might stop separate nodes from messing up each others temp areas.
>
> (However I am guessing here)
>
> On 17 January 2011 08:21, rishi pathak <[EMAIL PROTECTED]> wrote:
>
> > # Error ########
> >
> > java.io.IOException: The temporary job-output directory
> > file:/tmp/hadoop-nazgul/mapred/temp/inject-temp-1557478199/_temporary
> > doesn't exist!
-- --- Rishi Pathak National PARAM Supercomputing Facility C-DAC, Pune, India
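The path in the error, /tmp/hadoop-nazgul/mapred/temp, matches the Hadoop 0.20 defaults mapred.temp.dir = ${hadoop.tmp.dir}/mapred/temp and hadoop.tmp.dir = /tmp/hadoop-${user.name}, so the temporary job output is probably being written to each node's local /tmp rather than to Lustre. A minimal sketch of one thing to try, assuming Hadoop 0.20-style property names and assuming the /home/internal/... tree from the posted config is on the shared mount. The mapredTempDir directory name is hypothetical and was not part of the original setup:

```xml
<!-- Sketch only: add inside the existing <configuration> element. -->
<!-- Assumption: this path is on the shared Lustre mount visible to every tasktracker. -->
<!-- "mapredTempDir" is a hypothetical directory name chosen for illustration. -->
<property>
  <name>mapred.temp.dir</name>
  <value>/home/internal/sysadmin/nazgul/hadoop/dfs/local/mapredTempDir/</value>
</property>
```

If this is the cause, the inject-temp-* directory and its _temporary subdirectory would then be created once on the shared filesystem and be visible to reduce tasks on both nodes. Whether mapred.local.dir should also stay on the shared mount is a separate question that the thread does not settle.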