-
Problems with tutorial (Paul van Hoven, 2011-07-10, 14:42)
I'm completely new to Nutch, so I downloaded version 1.3 and worked through the beginner's tutorial at http://wiki.apache.org/nutch/NutchTutorial. The first problem was that I could not find the file "conf/crawl-urlfilter.txt", so I skipped that step and continued with launching Nutch. I created a plain text file called "urls.txt" in "/Users/toom/Downloads/nutch-1.3/crawled", which contains the following text:

    tom:crawled toom$ cat urls.txt
    http://nutch.apache.org/

After that I invoked Nutch by calling:

    tom:bin toom$ ./nutch crawl /Users/toom/Downloads/nutch-1.3/crawled -dir /Users/toom/Downloads/nutch-1.3/sites -depth 3 -topN 50
    solrUrl is not set, indexing will be skipped...
    crawl started in: /Users/toom/Downloads/nutch-1.3/sites
    rootUrlDir = /Users/toom/Downloads/nutch-1.3/crawled
    threads = 10
    depth = 3
    solrUrl=null
    topN = 50
    Injector: starting at 2011-07-07 14:02:31
    Injector: crawlDb: /Users/toom/Downloads/nutch-1.3/sites/crawldb
    Injector: urlDir: /Users/toom/Downloads/nutch-1.3/crawled
    Injector: Converting injected urls to crawl db entries.
    Injector: Merging injected urls into crawl db.
    Injector: finished at 2011-07-07 14:02:35, elapsed: 00:00:03
    Generator: starting at 2011-07-07 14:02:35
    Generator: Selecting best-scoring urls due for fetch.
    Generator: filtering: true
    Generator: normalizing: true
    Generator: topN: 50
    Generator: jobtracker is 'local', generating exactly one partition.
    Generator: Partitioning selected urls for politeness.
    Generator: segment: /Users/toom/Downloads/nutch-1.3/sites/segments/20110707140238
    Generator: finished at 2011-07-07 14:02:39, elapsed: 00:00:04
    Fetcher: No agents listed in 'http.agent.name' property.
    Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
        at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1166)
        at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1068)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:135)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:54)

I do not understand what happened here. Can one of you help me?
-
Re: Problems with tutorial (Markus Jelsma, 2011-07-10, 14:47)
Hi,

There are a lot of questions about that error:
http://www.google.nl/#hl=nl&source=hp&q=No+agents+listed+in+%27http.agent.name%27+property.&oq=No+agents+listed+in+%27http.agent.name%27+property.&aq=f&aqi=&aql=undefined&gs_sm=e&gs_upl=972l972l0l1l1l0l0l0l0l38l38l1l1&bav=on.2,or.r_gc.r_pw.&fp=62113c346707e160&biw=790&bih=328

Add the agent property to your configuration as described in the tutorial:
http://wiki.apache.org/nutch/NutchTutorial

Cheers,
-
Re: Problems with tutorial (Ing. Yusniel Hidalgo Delg..., 2011-07-10, 14:52)
Paul, I think your problem is related to the 'http.agent.name' property. Please set this property in your configuration file, as the tutorial describes:

    Good! You are almost ready to crawl. You need to give your crawler a name. This is required.

    1. Open up $NUTCH_HOME/conf/nutch-default.xml file
    2. Search for http.agent.name , and give it value 'YOURNAME Spider'
    3. Optionally you may also set http.agent.url and http.agent.email properties.

and try again.

Greetings

--
Ing. Yusniel Hidalgo Delgado
Take part in COMPUMAT 2011: http://www.mfc.uclv.edu.cu/scmc
Take part in INFO 2012: http://www.congreso-info.cu
Universidad de las Ciencias Informáticas
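
The three tutorial steps quoted above amount to putting a non-empty http.agent.name into the configuration. Below is a minimal Python sketch that generates such a property block instead of editing by hand. It assumes the usual Nutch layering in which a nutch-site.xml file overrides nutch-default.xml; the helper name build_site_config and the agent name are illustrative and not part of Nutch.

```python
import xml.etree.ElementTree as ET


def build_site_config(properties):
    """Return a Hadoop-style <configuration> document as a string.

    `properties` maps property names to values. This is a hypothetical
    helper for illustration, not something shipped with Nutch.
    """
    root = ET.Element("configuration")
    for name, value in properties.items():
        prop = ET.SubElement(root, "property")
        ET.SubElement(prop, "name").text = name
        ET.SubElement(prop, "value").text = value
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    # The agent name is an example value; pick one related to your organization.
    print(build_site_config({"http.agent.name": "MyFirstNutchCrawler"}))
```

The printed XML could be saved as nutch-site.xml in whichever conf directory the bin/nutch launcher actually reads; which directory that is depends on the Nutch version and how it was installed.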
-
Re: Problems with tutorial (lewis john mcgibbney, 2011-07-10, 17:13)
Hi,

For a 1.3 tutorial please see here [1]. I am in the process of overhauling the Nutch site to accommodate the changes in the 1.3 release.

Thank you

--
Lewis
-
Re: Problems with tutorial (Cupbearer, 2011-07-10, 22:48)
I had this problem too, and then I saw this part of the tutorial, which answered a TON of questions for me:

    "or runtime/local/bin/nutch (version >= 1.3)"

If you downloaded the tar.gz file like I did, you need to look for everything in the runtime folder. Then EVERYTHING else will make sense whenever they say "bin/nutch".

-----
Cupbearer
Jerry E. Craig, Jr.
--
View this message in context: http://lucene.472066.n3.nabble.com/Problems-with-tutorial-tp3156809p3157625.html
Sent from the Nutch - User mailing list archive at Nabble.com.
-
Re: Problems with tutorial (Paul van Hoven, 2011-07-12, 09:43)
Thanks for the answers. I'm not sure that 'http.agent.name' is the problem, since I have set it. This is the configuration I'm using, from nutch-1.3/conf/nutch-default.xml:

    <!-- HTTP properties -->

    <property>
      <name>http.agent.name</name>
      <value>MyFirstNutchCrawler</value>
      <description>HTTP 'User-Agent' request header. MUST NOT be empty -
      please set this to a single word uniquely related to your organization.

      NOTE: You should also check other related properties:

        http.robots.agents
        http.agent.description
        http.agent.url
        http.agent.email
        http.agent.version

      and set their values appropriately.

      </description>
    </property>

As I understand the tutorial, this should be correct. The tutorial says: "Search for http.agent.name , and give it value 'YOURNAME Spider'"

I already had it set this way when I sent my first email.
-
Re: Problems with tutorial (Julien Nioche, 2011-07-12, 09:51)
I have just updated the tutorial. As of 1.3, the files should be changed in $NUTCH_HOME/runtime/local/conf/ unless you rebuild with Ant.

Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
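
Julien's point is that in 1.3 there are two conf directories, and editing the wrong one leaves the fetcher with an empty agent name. A small Python sketch for checking what each directory would yield follows. It assumes that nutch-site.xml overrides nutch-default.xml and that NUTCH_HOME points at the unpacked distribution; read_property is a hypothetical helper, and it ignores finer points such as final properties.

```python
import os
import xml.etree.ElementTree as ET


def read_property(conf_dir, name, files=("nutch-default.xml", "nutch-site.xml")):
    """Return the value of `name` as layered across `files`, or None.

    Files are read in order and later files override earlier ones
    (assumed default/site layering). Missing files are skipped.
    """
    value = None
    for filename in files:
        path = os.path.join(conf_dir, filename)
        if not os.path.isfile(path):
            continue
        for prop in ET.parse(path).getroot().iter("property"):
            if prop.findtext("name", "").strip() == name:
                value = prop.findtext("value", "").strip()
    return value


if __name__ == "__main__":
    home = os.environ.get("NUTCH_HOME", ".")
    # Compare the source conf directory with the one used at runtime in 1.3.
    for sub in ("conf", os.path.join("runtime", "local", "conf")):
        conf_dir = os.path.join(home, sub)
        print(conf_dir, "->", repr(read_property(conf_dir, "http.agent.name")))
```

If the first directory reports a name and the second reports an empty string or None, the edit was made in the copy that runtime/local/bin/nutch does not read.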
-
Re: Problems with tutorial (soberchallen, 2012-06-17, 14:46)
Hello, I have the same problem. Have you solved it already? The details are as follows:

    bin/nutch crawl urls -dir crawl -depth 2 -topN 100 -threads 2
    solrUrl is not set, indexing will be skipped...
    crawl started in: crawl
    rootUrlDir = urls
    threads = 2
    depth = 2
    solrUrl=null
    topN = 100
    Injector: starting at 2012-06-17 22:27:39
    Injector: crawlDb: crawl/crawldb
    Injector: urlDir: urls
    Injector: Converting injected urls to crawl db entries.
    Injector: Merging injected urls into crawl db.
    Injector: finished at 2012-06-17 22:27:41, elapsed: 00:00:02
    Generator: starting at 2012-06-17 22:27:41
    Generator: Selecting best-scoring urls due for fetch.
    Generator: filtering: true
    Generator: normalizing: true
    Generator: topN: 100
    Generator: jobtracker is 'local', generating exactly one partition.
    Generator: 0 records selected for fetching, exiting ...
    Stopping at depth=0 - no more URLs to fetch.
    No URLs to fetch - check your seed list and URL filters.
    crawl finished: crawl

--
View this message in context: http://lucene.472066.n3.nabble.com/Problems-with-tutorial-tp3156809p3990019.html
-
Re: Problems with tutorial (Emre Çelikten, 2012-06-17, 17:32)
Hello,

Check your urls and regex-urlfilter files. You probably have a problem there, assuming you are using your own links.
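
One way to act on this advice is to test each seed URL against the filter rules outside of Nutch. The Python sketch below approximates how a regex-urlfilter file is read, under these assumptions: lines starting with '#' are comments, each rule is '+' or '-' followed by a regular expression, the first matching rule decides, and a URL that matches no rule is rejected. The rules shown are hypothetical examples, and Java and Python regex dialects differ in places, so treat the result as a hint rather than a guarantee.

```python
import re

# Hypothetical rules for illustration; use the contents of your own filter file.
RULES = r"""
# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept anything under nutch.apache.org
+^http://([a-z0-9]*\.)*nutch\.apache\.org/
"""


def parse_rules(text):
    """Parse filter text into a list of (include, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first rule matching `url` is an include rule."""
    for include, pattern in rules:
        if pattern.search(url):
            return include
    return False  # no rule matched: the URL is dropped


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://nutch.apache.org/", "http://example.com/"):
        print(url, "accepted" if accepts(rules, url) else "rejected")
```

If every seed URL comes back rejected, the Generator has nothing to select, which would be consistent with the "0 records selected for fetching" message in the log above.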