-
problem running Nutch 1.5.1 in distributed mode - simple crawl
Casey McTaggart 2012-09-15, 23:22
Hi everyone,

I'm using Hadoop as installed by Cloudera (CDH4)... I think it's version 1.0.1. I can run a local filesystem crawl with Nutch, and it returns what I'd expect. However, I need to take advantage of the mapreduce functionality, since I want to crawl a local filesystem with many GB of files. I'm going to put all of these files on an apache server so they can be crawled. First, though, I want to just crawl a simple website, and I can't make it work.

My urls/seed.txt is on hdfs and is this:
http://lucene.apache.org

I run this command:
sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job org.apache.nutch.crawl.Crawl urls/seed.txt -dir crawl

Sometimes, it fetches the URL, but does not go beyond depth 1... and when I examine the CrawlDatum that's in /user/hdfs/crawl/crawldb/current/part-00000/data, it has one entry: the seed url as the key, and the value of the CrawlDatum is
_pst_=exception(16), lastModified=0: java.lang.NoClassDefFoundError: org/apache/tika/mime/MimeTypeException

Okay, so I tried running the command again with -libjars nutch1.5.1.jar, and it fails with an ArrayIndexOutOfBoundsException. I tried running it with -libjars /user/hdfs/lib/tika-core-1.1.jar, and that fails with:

12/09/15 17:09:55 WARN crawl.Generator: Generator: 0 records selected for fetching, exiting ...
12/09/15 17:09:55 INFO crawl.Crawl: Stopping at depth=0 - no more URLs to fetch.
12/09/15 17:09:55 WARN crawl.Crawl: No URLs to fetch - check your seed list and URL filters.
12/09/15 17:09:55 INFO crawl.Crawl: crawl finished: crawl

I tried copying lib/tika-core-1.1.jar to /usr/local/hadoop-1.0.1/lib, and still 0 URLs are fetched.

I'm totally at a loss. can someone help?

Here's my regex-urlfilter:

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
# for a more extensive coverage use the urlfilter-suffix plugin
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# accept anything else
+.

here's my nutch-site.xml:

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>nutchtest</value>
  </property>
  <property>
    <name>plugin.folders</name>
    <value>/projects/nutch/apache-nutch-1.5.1/build/plugins,/projects/nutch/apache-nutch-1.5.1/lib</value>
  </property>
</configuration>

which also does not work if I include this part:

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlnormalizer-(basic|pass|regex)|urlfilter-regex|parse-(xml|text|html|tika)|index-(basic|anchor)|query-(basic|site|url)|response-(json|xml)|addhdfskey</value>
</property>
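Since the crawl log suggests checking the seed list and URL filters, one quick way to rule the filter out is to run the seed URL through the rules quoted above. The following is a minimal sketch, not Nutch's own RegexURLFilter: the class name UrlFilterSketch and its methods are made up for illustration, and it assumes that rules are tried in order, that the first rule whose pattern is found anywhere in the URL decides, and that a URL matching no rule is rejected.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: mimics first-match-wins evaluation of the rules above.
// UrlFilterSketch is a hypothetical name, not a Nutch class.
class UrlFilterSketch {
    // One rule: '+' means accept, '-' means reject, followed by a regex.
    static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    // The rules from the regex-urlfilter quoted in the message, in order.
    static final List<Rule> RULES = Arrays.asList(
        rule("-^(file|ftp|mailto):"),
        rule("-\\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|CSS|sit|SIT|eps|EPS|wmf|WMF"
            + "|zip|ZIP|ppt|PPT|mpg|MPG|xls|XLS|gz|GZ|rpm|RPM|tgz|TGZ|mov|MOV"
            + "|exe|EXE|jpeg|JPEG|bmp|BMP|js|JS)$"),
        rule("-[?*!@=]"),
        rule("-.*(/[^/]+)/[^/]+\\1/[^/]+\\1/"),
        rule("+."));

    // Splits a filter line into its sign and its regular expression.
    static Rule rule(String line) {
        return new Rule(line.charAt(0) == '+', Pattern.compile(line.substring(1)));
    }

    // True if the first rule found in the URL is an accept rule.
    static boolean accepts(String url) {
        for (Rule r : RULES) {
            if (r.pattern.matcher(url).find()) {
                return r.accept;
            }
        }
        return false; // assumption: no matching rule means "reject"
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://lucene.apache.org",
            "http://example.com/logo.png",
            "http://example.com/search?q=nutch"
        };
        for (String url : urls) {
            System.out.println(url + " accepted: " + accepts(url));
        }
    }
}
```

Under these assumptions the seed URL is accepted, which would point away from the filter file and towards the job's classpath or plugin loading as the cause.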
-
Re: problem running Nutch 1.5.1 in distributed mode - simple crawl
Lewis John Mcgibbney 2012-09-15, 23:49
Hi Casey,

On Sun, Sep 16, 2012 at 12:22 AM, Casey McTaggart <[EMAIL PROTECTED]> wrote:
> I run this command:
> sudo -u hdfs hadoop jar build/apache-nutch-1.5.1.job
> org.apache.nutch.crawl.Crawl urls/seed.txt -dir crawl

I don't think you should do this.

Please see a similar post a couple days back [0] and Julien's [1] answer.

Get back to us if you have probs. I hope this works for you.

Lewis

[0] http://www.mail-archive.com/user%40nutch.apache.org/msg07565.html
[1] http://www.mail-archive.com/user%40nutch.apache.org/msg07564.html
-
Re: problem running Nutch 1.5.1 in distributed mode - simple crawl
jiuling 2012-09-16, 10:59
Dear Lewis:

I have run into the same problem. I compiled Nutch the same way you did, but the problem persists. The seed and filter configuration works for a local crawl but fails in deploy mode. Please help me, thanks a lot.

The procedure is as follows:

[Jiuling@crawler-3 deploy]$ bin/nutch crawl urls -dir crawls -depth 20
(I have also run it with "bin/hadoop jar apache-nutch-1.6-SNAPSHOT.job org.apache.nutch.crawl.Crawl urls -dir crawls -depth 20")

Warning: $HADOOP_HOME is deprecated.
12/09/16 18:40:16 WARN crawl.Crawl: solrUrl is not set, indexing will be skipped...
12/09/16 18:40:16 INFO crawl.Crawl: crawl started in: crawls
12/09/16 18:40:16 INFO crawl.Crawl: rootUrlDir = urls
12/09/16 18:40:16 INFO crawl.Crawl: threads = 10
12/09/16 18:40:16 INFO crawl.Crawl: depth = 20
12/09/16 18:40:16 INFO crawl.Crawl: solrUrl=null
12/09/16 18:40:16 INFO crawl.Injector: Injector: starting at 2012-09-16 18:40:16
12/09/16 18:40:16 INFO crawl.Injector: Injector: crawlDb: crawls/crawldb
12/09/16 18:40:16 INFO crawl.Injector: Injector: urlDir: urls
12/09/16 18:40:16 INFO crawl.Injector: Injector: Converting injected urls to crawl db entries.
12/09/16 18:40:23 INFO util.NativeCodeLoader: Loaded the native-hadoop library
12/09/16 18:40:23 WARN snappy.LoadSnappy: Snappy native library not loaded
12/09/16 18:40:23 INFO mapred.FileInputFormat: Total input paths to process : 1
12/09/16 18:40:23 INFO mapred.JobClient: Running job: job_201209161612_0047
12/09/16 18:40:24 INFO mapred.JobClient: map 0% reduce 0%
12/09/16 18:40:39 INFO mapred.JobClient: map 100% reduce 0%
12/09/16 18:40:51 INFO mapred.JobClient: map 100% reduce 50%
12/09/16 18:40:54 INFO mapred.JobClient: map 100% reduce 100%
12/09/16 18:40:59 INFO mapred.JobClient: Job complete: job_201209161612_0047
12/09/16 18:40:59 INFO mapred.JobClient: Counters: 30
12/09/16 18:40:59 INFO mapred.JobClient: Job Counters
12/09/16 18:40:59 INFO mapred.JobClient: Launched reduce tasks=2
12/09/16 18:40:59 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=16534
12/09/16 18:40:59 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/09/16 18:40:59 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/09/16 18:40:59 INFO mapred.JobClient: Launched map tasks=2
12/09/16 18:40:59 INFO mapred.JobClient: Data-local map tasks=2
12/09/16 18:40:59 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=20086
12/09/16 18:40:59 INFO mapred.JobClient: File Input Format Counters
12/09/16 18:40:59 INFO mapred.JobClient: Bytes Read=321
12/09/16 18:40:59 INFO mapred.JobClient: File Output Format Counters
12/09/16 18:40:59 INFO mapred.JobClient: Bytes Written=716
12/09/16 18:40:59 INFO mapred.JobClient: FileSystemCounters
12/09/16 18:40:59 INFO mapred.JobClient: FILE_BYTES_READ=502
12/09/16 18:40:59 INFO mapred.JobClient: HDFS_BYTES_READ=517
12/09/16 18:40:59 INFO mapred.JobClient: FILE_BYTES_WRITTEN=132358
12/09/16 18:40:59 INFO mapred.JobClient: HDFS_BYTES_WRITTEN=716
12/09/16 18:40:59 INFO mapred.JobClient: Map-Reduce Framework
12/09/16 18:40:59 INFO mapred.JobClient: Map output materialized bytes=514
12/09/16 18:40:59 INFO mapred.JobClient: Map input records=11
12/09/16 18:40:59 INFO mapred.JobClient: Reduce shuffle bytes=231
12/09/16 18:40:59 INFO mapred.JobClient: Spilled Records=18
12/09/16 18:40:59 INFO mapred.JobClient: Map output bytes=472
12/09/16 18:40:59 INFO mapred.JobClient: Total committed heap usage (bytes)=358285312
12/09/16 18:40:59 INFO mapred.JobClient: CPU time spent (ms)=3070
12/09/16 18:40:59 INFO mapred.JobClient: Map input bytes=213
12/09/16 18:40:59 INFO mapred.JobClient: SPLIT_RAW_BYTES=196
12/09/16 18:40:59 INFO mapred.JobClient: Combine input records=0
12/09/16 18:40:59 INFO mapred.JobClient: Reduce input records=9
12/09/16 18:40:59 INFO mapred.JobClient: Reduce input groups=9
12/09/16 18:40:59 INFO mapred.JobClient: Combine output records=0
12/09/16 18:40:59 INFO mapred.JobClient: Physical memory (bytes) snapshot=580689920
12/09/16 18:40:59 INFO mapred.JobClient: Reduce output records=9
12/09/16 18:40:59 INFO mapred.JobClient: Virtual memory (bytes) snapshot=8829870080
12/09/16 18:40:59 INFO mapred.JobClient: Map output records=9
12/09/16 18:40:59 INFO crawl.Injector: Injector: Merging injected urls into crawl db.
12/09/16 18:41:05 INFO mapred.FileInputFormat: Total input paths to process
12/09/16 18:41:06 INFO mapred.JobClient: Running job: job_201209161612_0048
12/09/16 18:41:07 INFO mapred.JobClient: map 0% reduce 0%
12/09/16 18:41:22 INFO mapred.JobClient: map 50% reduce 0%
12/09/16 18:41:28 INFO mapred.JobClient: map 100% reduce 0%
12/09/16 18:41:31 INFO mapred.JobClient: map 100% reduce 8%
12/09/16 18:41:37 INFO mapred.JobClient: map 100% reduce 58%
12/09/16 18:41:40 INFO mapred.JobClient: map 100% reduce 100%
12/09/16 18:41:45 INFO mapred.JobClient: Job complete: job_201209161612_0048
12/09/16 18:41:45 INFO mapred.JobClient: Counters: 30
12/09/16 18:41:45 INFO mapred.JobClient: Job Counters
12/09/16 18:41:45 INFO mapred.JobClient: Launched reduce tasks=2
12/09/16 18:41:45 INFO mapred.JobClient: SLOTS_MILLIS_MAPS=26468
12/09/16 18:41:45 INFO mapred.JobClient: Total time spent by all reduces waiting after reserving slots (ms)=0
12/09/16 18:41:45 INFO mapred.JobClient: Total time spent by all maps waiting after reserving slots (ms)=0
12/09/16 18:41:45 INFO mapred.JobClient: Launched map tasks=4
12/09/16 18:41:45 INFO mapred.JobClient: Data-local map tasks=4
12/09/16 18:41:45 INFO mapred.JobClient: SLOTS_MILLIS_REDUCES=26867
12/09/16 18:41:45 INFO mapred.JobClient: File Input Format Counters
12/09/16 18:41:45 INFO mapred.JobClient: Bytes Read=51222
12/09/16 18:41:45 INFO mapred.JobClient: File Output Format Counters
12/09/16 18:41:45 INFO mapred.JobClient: Bytes Written=51056
12/09/1
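When comparing a local and a deployed run, it can help to pull the counters out of a JobClient log like the one above instead of reading them by eye. The sketch below is only an illustration: CounterLogSketch is a made-up name, and the regular expression assumes the "INFO mapred.JobClient: name=value" line shape seen in this log. For the first injector job the log reports Map input records=11 and Map output records=9, which may mean that two seed lines were skipped or filtered during injection.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: extracts "name=value" counters from JobClient log lines.
// CounterLogSketch is a hypothetical helper, not part of Hadoop or Nutch.
class CounterLogSketch {
    // Assumed line shape: "... INFO mapred.JobClient: Map input records=11"
    private static final Pattern COUNTER =
        Pattern.compile("INFO mapred\\.JobClient:\\s+(.+?)=(\\d+)\\s*$");

    // Returns counters in log order; a later job overwrites same-named counters,
    // so pass in the lines of one job at a time.
    static Map<String, Long> parse(String log) {
        Map<String, Long> counters = new LinkedHashMap<>();
        for (String line : log.split("\\R")) {
            Matcher m = COUNTER.matcher(line);
            if (m.find()) {
                counters.put(m.group(1).trim(), Long.parseLong(m.group(2)));
            }
        }
        return counters;
    }

    public static void main(String[] args) {
        String log =
            "12/09/16 18:40:59 INFO mapred.JobClient: Map input records=11\n"
          + "12/09/16 18:40:59 INFO mapred.JobClient: Map output records=9\n";
        Map<String, Long> c = parse(log);
        long dropped = c.get("Map input records") - c.get("Map output records");
        System.out.println("Seed lines not emitted by the injector map: " + dropped);
    }
}
```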
-
Re: problem running Nutch 1.5.1 in distributed mode - simple crawl
Casey McTaggart 2012-09-16, 15:58
Hi Lewis,

I get the exact same results when I run the bin/nutch script from runtime/deploy... any other help? Sorry, thanks!

I run it like this:
sudo -u hdfs bin/nutch crawl urls/seed.txt -dir crawl
-
Re: problem running Nutch 1.5.1 in distributed mode - simple crawl
Casey McTaggart 2012-09-17, 16:31
I would also like to add that I can run the same crawl locally and it's successful. So, it's just the distributed mode that's not working. Can anyone offer any advice? Do you think it might be something with CDH4?
-
Re: problem running Nutch 1.5.1 in distributed mode - simple crawl
Walter Tietze 2012-09-17, 17:30
Hi,

I had the same problems and couldn't find a proper, satisfying way around them.

I also tried nutch-2.0 with CDH4 and Yarn / MR_v2, without MR_v1, and couldn't get it to work either.

But I found a workaround to make nutch 1.5.1 work on CDH4.

With MR_v2 it no longer seems possible to pack a project into a single *nutch*.job, and since the former JobTracker / TaskTracker pair has been replaced by the ResourceManager and the NodeManager, the NodeManager seems unable to handle the packed nutch project.

(see also: http://www.cloudera.com/blog/2011/01/how-to-include-third-party-libraries-in-your-map-reduce-job/ )

One thing you can do is unpack the job in the NodeManager manually and load the classes from within the code into the current classloader.

I modified org/apache/nutch/plugin/PluginManifestParser.java slightly and everything works fine, at least for the moment.

I attached the modified file.

Please note that I don't yet know whether CDH4 properly removes the application directories and the unpacked files. You should check whether those directories are still needed after the crawl has succeeded.

Hope this helps, cheers, Walter

Walter Tietze
Senior Softwareengineer Research
Neofonie GmbH
Robert-Koch-Platz 4
10115 Berlin
T +49.30 24627 318
F +49.30 24627 120
[EMAIL PROTECTED]
http://www.neofonie.de
Commercial register Berlin-Charlottenburg: HRB 67460
Management: Thomas Kitlitschko
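The attached PluginManifestParser.java is not part of this archive, so the following is not Walter's change. It is only a rough sketch of the general idea of making the jars of an already unpacked job visible to running code, using nothing but the JDK. The class UnpackedJobLoaderSketch, its methods, and the directory passed on the command line are all hypothetical, and the sketch builds a child classloader rather than modifying the current one.

```java
import java.io.IOException;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Illustration only: NOT the modification described in the message above.
// Collects every .jar under an unpacked job directory and exposes them
// through a child classloader of the current context classloader.
class UnpackedJobLoaderSketch {
    // Recursively finds all regular files ending in ".jar" below unpackedDir.
    static List<URL> jarUrls(Path unpackedDir) throws IOException {
        List<Path> jars;
        try (Stream<Path> paths = Files.walk(unpackedDir)) {
            jars = paths
                .filter(Files::isRegularFile)
                .filter(p -> p.getFileName().toString().endsWith(".jar"))
                .collect(Collectors.toList());
        }
        List<URL> urls = new ArrayList<>();
        for (Path jar : jars) {
            urls.add(jar.toUri().toURL());
        }
        return urls;
    }

    // Builds a classloader that can see the jars; the parent stays untouched.
    static ClassLoader loaderFor(Path unpackedDir) throws IOException {
        List<URL> urls = jarUrls(unpackedDir);
        return new URLClassLoader(urls.toArray(new URL[0]),
            Thread.currentThread().getContextClassLoader());
    }

    public static void main(String[] args) throws IOException {
        // args[0]: hypothetical path of a directory the job was unpacked into
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println("Jars found: " + jarUrls(dir).size());
    }
}
```

Whether something like this is needed at all depends on the cluster; later in the thread, adjusting plugin.folders turned out to be sufficient for Casey.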
-
Re: problem running Nutch 1.5.1 in distributed mode - simple crawl
jiuling 2012-09-18, 02:39
Dear Walter:

I am sorry to ask for more of your help.

I have updated the corresponding Java file and recompiled. At first, I did not unpack the job and directly executed hadoop jar *.job ..., and it still did not work. Finally, I unpacked the job, but I don't know how to put the command together. Can you give me more information about "Something one can do, is to unpack the job in the Nodemanager manually and to load the classes from within the code into the current classloader."?

Thanks a lot.

--
View this message in context: http://lucene.472066.n3.nabble.com/problem-running-Nutch-1-5-1-in-distributed-mode-simple-crawl-tp4008073p4008512.html
Sent from the Nutch - User mailing list archive at Nabble.com.
-
Re: problem running Nutch 1.5.1 in distributed mode - simple crawl
Walter Tietze 2012-09-18, 10:37
Hi Jiuling,

It should suffice to recompile! You don't have to unpack your job.

I start the job with the command 'runtime/deploy/bin/nutch crawl your_seeds_dir -depth 1', which does nothing other than call 'hadoop jar apache-nutch-1.5.1.job ....'! That should suffice.

For accessing the plugins from the job, the parameter

<property>
  <name>plugin.folders</name>
  <!-- value>plugins</value -->
  <value>classes/plugins</value>
</property>

might have to be adjusted as in the example above. Please check the structure of the plugins directory in your job.

I made one further modification, which came from the need to be able to set hadoop parameters for the jobs. I modified the class ./src/java/org/apache/nutch/util/NutchJob.java to

package org.apache.nutch.util;

import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;

public class NutchJob extends JobConf {

  public NutchJob(Configuration conf) {
    super(conf, NutchJob.class);
    checkMyOpts();
  }

  // Copies space-separated key=value pairs from the environment
  // variable MY_CRAWLER_OPTS into this job configuration.
  public void checkMyOpts() {
    Map<String, String> env = System.getenv();
    String myOpts = env.get("MY_CRAWLER_OPTS");
    if (null != myOpts) {
      String[] myOptsArray = myOpts.split(" ");
      for (int i = 0; i < myOptsArray.length; i++) {
        String[] keyval = myOptsArray[i].split("=");
        // Entries that are not exactly key=value are ignored.
        if (null != keyval && keyval.length == 2) {
          set(keyval[0], keyval[1]);
        }
      }
    }
  }
}

to be able to set hadoop parameters for the jobs from the command line, because I had problems with the default settings for the hadoop child processes.

If you add the code above, you can set an environment variable, for example

export MY_CRAWLER_OPTS="mapreduce.map.java.opts=-Xmx4096m mapreduce.reduce.java.opts=-Xmx4096m mapreduce.map.memory.mb=4096 mapreduce.reduce.memory.mb=4096 mapreduce.job.maps=21 mapreduce.job.reduces=21"

which sets the java parameter -Xmx to 4GB for the YarnChild processes and requests 21 maps and 21 reduces for the crawl.

These variables become important when, for example, you want to generate a nutch webgraph and the hadoop default settings are chosen for 'normally' sized jobs.
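The option handling in checkMyOpts() can be tried out without Hadoop. The sketch below is a standalone variant, not the code from the message: CrawlerOptsSketch is a made-up class that returns a map instead of calling set(), splits on any run of whitespace, and splits each entry at the first '=' only. The last point is a deliberate change: with split("=") and the length == 2 check as posted, a value that itself contains '=' would be dropped silently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustration only: a standalone variant of the MY_CRAWLER_OPTS parsing.
// CrawlerOptsSketch is a hypothetical name; it does not touch Hadoop.
class CrawlerOptsSketch {
    // Parses "k1=v1 k2=v2 ..." into an ordered map.
    static Map<String, String> parse(String opts) {
        Map<String, String> result = new LinkedHashMap<>();
        if (opts == null || opts.trim().isEmpty()) {
            return result;
        }
        for (String token : opts.trim().split("\\s+")) {
            // Limit 2: everything after the first '=' belongs to the value.
            String[] keyval = token.split("=", 2);
            if (keyval.length == 2 && !keyval[0].isEmpty()) {
                result.put(keyval[0], keyval[1]);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> opts = parse(System.getenv("MY_CRAWLER_OPTS"));
        for (Map.Entry<String, String> e : opts.entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

Neither version supports values that contain spaces, so several JVM flags in one mapreduce.map.java.opts entry would still need a different separator.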
Please note that if hadoop unpacks the job, the container must have at least enough disk space for the unpacked files and enough memory to load the jars into the JVMs of the child processes.

Hope this helps!

Cheers, Walter
-
Re: problem running Nutch 1.5.1 in distributed mode - simple crawl
Casey McTaggart 2012-09-18, 16:46
Thanks Walter, I am still unable to get anything to run. I think it's because Hadoop is for some reason not finding the tika jar. I tried running Hadoop with -libjars and including both the Nutch jar and the Tika jar, and when I do this it gives me 0 URLs - it doesn't even fetch the seed list! When I don't run it with -libjars, it fetches the seed list, then stops with the ClassNotFound exception in the CrawlDatum.

I'll try the solution that you just posted. But, any idea why this is happening?

thanks!
Casey
-
Re: problem running Nutch 1.5.1 in distributed mode - simple crawl
Walter Tietze 2012-09-18, 16:58
Hi Casey,

Sorry, but I think the changes I mentioned really were all the changes I made.

I'll check my code again to see whether I forgot to post something.

Note: I also tried to apply the workaround to the nutch-2.0 code base, but was unable to make it work, because nutch-2.0 already uses the new Mapreduce classes and does not seem to implement the same loading mechanism for the plugin repository.

Any other ideas?

Cheers, Walter
-
Re: problem running Nutch 1.5.1 in distributed mode - simple crawl
jiuling 2012-09-19, 08:44
Thanks a lot, Walter. It works after following your advice. Thank you again.
-
Re: problem running Nutch 1.5.1 in distributed mode - simple crawl
Casey McTaggart 2012-09-19, 17:37
Including /plugins/classes in plugin.folders made it work. Thank you!!!
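Because the fix in this thread hinged on plugin.folders pointing at the directory where the plugins actually sit inside the .job file, it can be worth checking the archive layout before editing nutch-site.xml. The following sketch does that with the JDK's zip support. JobLayoutSketch and the default file name in main are assumptions for illustration; the only Nutch-specific assumption is that each plugin directory contains a plugin.xml manifest.

```java
import java.io.IOException;
import java.util.Enumeration;
import java.util.Set;
import java.util.TreeSet;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Illustration only: lists the directories inside a .job (zip) archive that
// contain plugin folders. JobLayoutSketch is a hypothetical helper.
class JobLayoutSketch {
    private static final String MANIFEST = "/plugin.xml";

    // For an entry "classes/plugins/protocol-http/plugin.xml" this records
    // "classes/plugins", a candidate value for plugin.folders.
    static Set<String> pluginRoots(String jobFile) throws IOException {
        Set<String> roots = new TreeSet<>();
        try (ZipFile zip = new ZipFile(jobFile)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                String name = entries.nextElement().getName();
                if (name.endsWith(MANIFEST)) {
                    String pluginDir =
                        name.substring(0, name.length() - MANIFEST.length());
                    int slash = pluginDir.lastIndexOf('/');
                    roots.add(slash < 0 ? "" : pluginDir.substring(0, slash));
                }
            }
        }
        return roots;
    }

    public static void main(String[] args) throws IOException {
        // args[0]: path to the job file; the default name is only an example
        String job = args.length > 0 ? args[0] : "apache-nutch-1.5.1.job";
        System.out.println("Plugin directories in job: " + pluginRoots(job));
    }
}
```

If the output shows classes/plugins rather than plugins, that matches the plugin.folders value suggested earlier in the thread.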