|
Stephan Kristyn
2012-05-09, 10:11
Stephan Kristyn
2012-05-09, 10:17
Lewis John Mcgibbney
2012-05-09, 10:32
Stephan Kristyn
2012-05-09, 11:21
Lewis John Mcgibbney
2012-05-09, 11:33
Stephan Kristyn
2012-05-09, 12:26
Stephan Kristyn
2012-05-09, 12:28
Stephan Kristyn
2012-05-09, 14:04
Stephan Kristyn
2012-05-09, 14:33
Stephan Kristyn
2012-05-09, 15:11
Lewis John Mcgibbney
2012-05-09, 16:05
Stephan Kristyn
2012-05-09, 16:25
Markus Jelsma
2012-05-09, 16:36
Tolga
2012-05-09, 19:28
Markus Jelsma
2012-05-09, 19:34
Tolga
2012-05-10, 06:10
Markus Jelsma
2012-05-10, 06:42
Tolga
2012-05-10, 07:07
Stephan Kristyn
2012-05-10, 10:22
Michael Erickson
2012-05-10, 12:56
Lewis John Mcgibbney
2012-05-10, 13:35
Markus Jelsma
2012-05-10, 13:45
Ferdy Galema
2012-05-10, 14:03
Tolga
2012-05-10, 19:54
Markus Jelsma
2012-05-10, 20:38
Tolga
2012-05-11, 04:39
Markus Jelsma
2012-05-11, 06:40
Tolga
2012-05-15, 10:40
Markus Jelsma
2012-05-15, 11:05
Tolga
2012-05-15, 12:01
Tolga
2012-05-15, 12:49
Tolga
2012-05-17, 10:07
Jean-François Gingras
2012-05-19, 01:43
m2000hsf
2012-05-19, 06:43
keesp
2012-05-24, 08:29
|
-
HTTP ERROR 400 Stephan Kristyn 2012-05-09, 10:11
Hi,

after installing Nutch and Solr I get:

    HTTP ERROR 400
    Problem accessing /solr/select/. Reason:
        undefined field text
    Powered by Jetty://

Any ideas how to fix this?

Thanks,
Stephan
-
Re: HTTP ERROR 400 Stephan Kristyn 2012-05-09, 10:17
Also, when I run

    java -jar post.jar *.xml

on RHEL6, I get:

    INFO: [] webapp=/solr path=/update params={} status=400 QTime=42
    SimplePostTool: FATAL: Solr returned an error #400 ERROR: [doc=GB18030TEST] unknown field 'name'

Thanks,
Stephan
-
Re: HTTP ERROR 400 Lewis John Mcgibbney 2012-05-09, 10:32
Which schema are you using with your Solr server?

Lewis
-
Re: HTTP ERROR 400 Stephan Kristyn 2012-05-09, 11:21
I copied over the schema and everything else in conf from Nutch:

    $ cp apache-nutch-1.4-bin/runtime/local/conf/* apache-solr-3.6.0/example/solr/conf/

--
stephan kristyn
partner operations manager
"The Internet? Is that thing still around?" - Homer Simpson
[EMAIL PROTECTED]
direct +49 (0)89 231 97 207 mobile +49 (0) 162 28899 02
yahoo! deutschland gmbh theresienhoehe 12, munich, 80339, germany
phone (408) 349 3300 fax (408) 349 3301
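Editor's sketch: as far as the Nutch tutorial goes, only schema.xml needs to end up in Solr's conf directory, so copying just that one file (and keeping a backup of the one it replaces) is a narrower alternative to copying everything. The helper name install_nutch_schema and the .bak suffix are made up for illustration; the paths in the commented-out call are the ones quoted in this message. This is a minimal sketch under those assumptions, not a confirmed fix for the error in this thread.

```shell
# Hypothetical helper: copy only Nutch's schema.xml into a Solr conf directory,
# keeping a backup of whatever schema.xml was there before.
install_nutch_schema() {
  nutch_conf="$1"   # directory containing Nutch's schema.xml
  solr_conf="$2"    # Solr conf directory to copy into

  if [ ! -f "$nutch_conf/schema.xml" ]; then
    echo "no schema.xml in $nutch_conf" >&2
    return 1
  fi

  # Preserve the previous Solr schema so the change can be undone.
  if [ -f "$solr_conf/schema.xml" ]; then
    cp "$solr_conf/schema.xml" "$solr_conf/schema.xml.bak"
  fi

  cp "$nutch_conf/schema.xml" "$solr_conf/schema.xml"
}

# Example call with the paths from this thread (commented out so the sketch
# has no side effects when pasted into a shell):
# install_nutch_schema apache-nutch-1.4-bin/runtime/local/conf apache-solr-3.6.0/example/solr/conf
```

Solr has to be restarted after the schema changes before the new fields are visible.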
-
Re: HTTP ERROR 400Lewis John Mcgibbney 2012-05-09, 11:33
are you attempting to index to Solr or is this simply when you start you
solr server? On Wed, May 9, 2012 at 12:21 PM, Stephan Kristyn <[EMAIL PROTECTED]>wrote: > I copied over the schema and everything else in conf from nutch. > > $cp apache-nutch-1.4-bin/runtime/local/conf/* > apache-solr-3.6.0/example/solr/conf/ > > > > > Am 09.05.2012 12:32, schrieb Lewis John Mcgibbney: > > Which schema are you using with your SOlr server? > > On Wed, May 9, 2012 at 11:17 AM, Stephan Kristyn <[EMAIL PROTECTED]> <[EMAIL PROTECTED]> wrote: > > Also.. entering > > java -jar post.jar *.xml on RHEL6 I get a > > INFO: [] webapp=/solr path=/update params={} status=400 QTime=42 > SimplePostTool: FATAL: Solr returned an error #400 ERROR: > [doc=GB18030TEST] unknown field 'name' > > Thanks, > Stephan > > > Am 09.05.2012 12:11, schrieb Stephan Kristyn: > > Hi, > > after installing Nutch and Solr I get a > > > HTTP ERROR 400 > > Problem accessing /solr/select/. Reason: > > undefined field text > > ------------------------------------------------------------------------ > /Powered by Jetty:// > > > > /Any ideas how to fix this? > > Thanks, > Stephan > > > -- > **** > > ** ** > > *stephan* > *kristyn* > partner operations manager > > "The Internet? Is that thing still around?" - Homer Simpson > > [EMAIL PROTECTED] > direct +49 (0)89 231 97 207 mobile +49 (0) 162 28899 02 > > yahoo! deutschland gmbh theresienhoehe 12, munich, 80339, germany > phone (408) 349 3300 fax (408) 349 3301 > > [image: > http://us.i1.yimg.com/us.yimg.com/i/pt/i/buzzmktg/brand/logos/yahoo_email_sig_generic_v2.gif] > **** > > ** ** > -- *Lewis*
-
RE: HTTP ERROR 400 Stephan Kristyn 2012-05-09, 12:26
This is when hitting the Search button in the web interface:

http://myDomain.com:8983/solr/admin/
-
RE: HTTP ERROR 400 Stephan Kristyn 2012-05-09, 12:28
This is the query that the Solr interface generates when I enter "test" and hit the search button:

http://myDomain:8983/solr/select/?q=test&version=2.2&start=0&rows=10&indent=on

Maybe this is a question better suited for the Solr mailing list?
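Editor's sketch: the generated query names no field, so Solr falls back to its configured default field, and "undefined field text" suggests that default is a field called text that the schema in use does not define. One way to test that reading is to send the same query with Solr's df (default field) parameter set explicitly. The helper name solr_select_url is made up, and the field name content is an assumption about the Nutch schema; check schema.xml for the field names that actually exist.

```shell
# Hypothetical helper: build the same select URL as the admin page, but with
# an explicit default field (df) instead of relying on the configured one.
# Note: no URL encoding is done, so this only suits simple one-word queries.
solr_select_url() {
  base="$1"    # e.g. http://myDomain:8983/solr
  query="$2"   # search term
  field="$3"   # field to search when the query names none (assumed to exist)
  printf '%s/select/?q=%s&df=%s&version=2.2&start=0&rows=10&indent=on\n' \
    "$base" "$query" "$field"
}

# Print the URL for the query from this message, searching an assumed "content" field.
solr_select_url "http://myDomain:8983/solr" "test" "content"
```

If the explicit-field URL returns results while the original one still gives HTTP 400, the default field setting is the likely culprit rather than the index itself.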
-
Re: HTTP ERROR 400 Stephan Kristyn 2012-05-09, 14:04
Hi, it seems I forgot to fetch the crawled URLs, as described in the tutorial:

http://wiki.apache.org/nutch/NutchTutorial

I'll let you know if and how that works out for me.
-
Re: HTTP ERROR 400 Stephan Kristyn 2012-05-09, 14:33
OK, now at the heading "Step-by-Step: Fetching" I get:

    -bash-4.1$ bin/nutch generate crawldb crawldb/segments
    Generator: starting at 2012-05-09 14:32:44
    Generator: Selecting best-scoring urls due for fetch.
    Generator: filtering: true
    Generator: normalizing: true
    Generator: jobtracker is 'local', generating exactly one partition.
    Generator: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/kristyns/apache-nutch-1.4-bin/runtime/local/crawldb/current
        at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
        at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
        at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
        at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
        at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
        at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:538)
        at org.apache.nutch.crawl.Generator.run(Generator.java:704)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Generator.main(Generator.java:660)

Strange...
-
Re: HTTP ERROR 400 Stephan Kristyn 2012-05-09, 15:11
OK,

http://myDomain:8983/solr/admin/ works now when I enter *:* as the search. XML is displayed.

Now how can I search the crawl for
- the HTML content
- the HTML source
?

Thanks,
Stephan
-
Re: HTTP ERROR 400 Lewis John Mcgibbney 2012-05-09, 16:05
Which segments are you trying to generate from? Do you maybe need to include them individually, or use a wildcard?

    bin/nutch generate crawldb crawldb/segments/*
    bin/nutch generate crawldb crawldb/segments/segmentNo

?

Lewis
-
Re: HTTP ERROR 400 Stephan Kristyn 2012-05-09, 16:25
I guess I have to read up on segments. I don't know what they are yet.

Looking at the mail archive of this list (http://www.mail-archive.com/[EMAIL PROTECTED]/msg01607.html), I found:

    ./bin/nutch readseg -dump crawl/segments/XXX/ dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext

That command will dump all the HTML source code into one file. That's a good start. However, the desired result in my case is a new field in the schema containing the source code, like:

    <htmlsource></htmlsource>

Is that even possible?

Best,
Stephan
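Editor's sketch: on segments, the Nutch tutorial treats each one as a timestamp-named directory under the segments directory and picks the newest with a sorted listing. The helper below does the same and then prints, rather than runs, the readseg command quoted above for that segment. The name latest_segment and the crawl/segments location are assumptions for illustration; the readseg flags are the ones from the quoted command.

```shell
# Hypothetical helper: print the newest segment directory under a segments dir.
# Segment directories are named by timestamp, so the last name in sorted
# order is the most recent one. Prints nothing if there are no segments.
latest_segment() {
  segments_dir="$1"
  ls -d "$segments_dir"/*/ 2>/dev/null | sort | tail -n 1 | sed 's:/*$::'
}

# Assumed layout from the tutorial: segments live under crawl/segments.
seg=$(latest_segment crawl/segments)
if [ -n "$seg" ]; then
  # Only print the command; run it by hand from the Nutch runtime directory.
  echo "bin/nutch readseg -dump $seg dump_folder -nofetch -nogenerate -noparse -noparsedata -noparsetext"
else
  echo "no segments found under crawl/segments"
fi
```

With every -no... flag except -nocontent set, the dump is limited to the raw fetched content, which matches the "HTML source code in one file" result described in the message.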
-
Re: HTTP ERROR 400Markus Jelsma 2012-05-09, 16:36
If you follow the tutorial then the command should be:
$ bin/nutch generate crawl/crawldb crawldb/segments
On Wed, 9 May 2012 17:05:51 +0100, Lewis John Mcgibbney <[EMAIL PROTECTED]> wrote:
> Which segments are you trying to generate from? Do you maybe need to include them individually, or use a wildcard?
>
> bin/nutch generate crawldb crawldb/segments/*
> bin/nutch generate crawldb crawldb/segments/segmentNo
>
> On Wed, May 9, 2012 at 3:33 PM, Stephan Kristyn wrote:
> > Ok now at the heading "Step-by-Step: Fetching" I get
> > -bash-4.1$ bin/nutch generate crawldb crawldb/segments
> > Generator: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/kristyns/apache-nutch-1.4-bin/runtime/local/crawldb/current
> > Strange...
Markus Jelsma - CTO - Openindex
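The InvalidInputException reported in this thread names `<crawldb>/current` as the missing input path, so a quick check before running `bin/nutch generate` can show whether the crawldb argument points at the right directory. This is only a minimal sketch: `crawldb_ready` is a hypothetical helper, not a Nutch command, and the paths in the usage comment are simply the ones from the command above.

```shell
# Hypothetical helper (not part of Nutch): succeeds only if the given
# crawldb directory contains the "current" subdirectory that the
# generator tried to read in the reported stack trace.
crawldb_ready() {
  [ -d "$1/current" ]
}

# Example usage, assuming you are in runtime/local:
#   crawldb_ready crawl/crawldb \
#     && bin/nutch generate crawl/crawldb crawldb/segments \
#     || echo "crawl/crawldb/current is missing; inject or crawl first"
```

If the check fails for every candidate path, the crawldb has probably not been created yet (for example, the inject step was skipped).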
-
Re: HTTP ERROR 400Tolga 2012-05-09, 19:28
Hi,
It seems you have the same error as me. Did you solve it? If yes, how?
Regards,
On 05/09/2012 05:04 PM, Stephan Kristyn wrote:
> Hi, it seems like I forgot to fetch the crawled URLs, as mentioned in the tutorial:
> http://wiki.apache.org/nutch/NutchTutorial
> I'll let you know if and how that worked out for me.
-
Re: HTTP ERROR 400Markus Jelsma 2012-05-09, 19:34
I see this:
SimplePostTool: FATAL: Solr returned an error #400 ERROR: [doc=GB18030TEST] unknown field 'name'
somewhere in the mail thread. It sounds like someone is using Solr's post.jar to post XML data to a schema without the field `name`. This doesn't sound like a Nutch issue: the field `name` doesn't appear in Nutch, nor is the post.jar tool used.
On Wed, 09 May 2012 22:28:46 +0300, Tolga <[EMAIL PROTECTED]> wrote:
> It seems you have the same error as me. Did you solve it? If yes, how?
Markus Jelsma - CTO - Openindex
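Whether a field such as `name` or `text` is declared can be checked directly against the schema file that Solr is loading. The following is a minimal sketch, assuming the schema lives at apache-solr-3.6.0/example/solr/conf/schema.xml (the directory used earlier in this thread); `schema_has_field` is a hypothetical helper, not a Solr or Nutch tool.

```shell
# Hypothetical helper: succeeds if schema.xml declares a <field> element
# with the given name. <fieldType> and <dynamicField> entries are not
# matched, because the pattern requires whitespace right after "<field".
schema_has_field() {
  local schema_file="$1"
  local field_name="$2"
  grep -Eq "<field[[:space:]][^>]*name=\"${field_name}\"" "$schema_file"
}

# Example usage (path is an assumption):
#   schema=apache-solr-3.6.0/example/solr/conf/schema.xml
#   schema_has_field "$schema" name || echo "field 'name' is not defined"
#   schema_has_field "$schema" text || echo "field 'text' is not defined"
```

The match is a plain text search, so it is only a rough check; it does not validate the XML or account for commented-out fields.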
-
HTTP error 400Tolga 2012-05-10, 06:10
Hi,
This will sound like a duplicate, but actually it differs from the other one. Please bear with me. Following http://wiki.apache.org/nutch/NutchTutorial, I first issued the command

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5

Then, when I got the message

Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

I issued the commands

bin/nutch crawl urls -dir crawl -depth 3 -topN 5

and

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/*

separately, after which I got no errors. When I browsed to http://localhost:8983/solr/admin and attempted a search, I got the error

HTTP ERROR 400
Problem accessing /solr/select. Reason:
undefined field text
------------------------------------------------------------------------
Powered by Jetty://

What am I doing wrong?
Regards,
-
Re: HTTP error 400Markus Jelsma 2012-05-10, 06:42
Hi,
On Thu, 10 May 2012 09:10:04 +0300, Tolga <[EMAIL PROTECTED]> wrote:
> Hi,
>
> This will sound like a duplicate, but actually it differs from the other one. Please bear with me. Following http://wiki.apache.org/nutch/NutchTutorial, I first issued the command
>
> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
>
> Then when I got the message
>
> Exception in thread "main" java.io.IOException: Job failed!
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
> at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
> at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Please include the relevant part of the log. This can be a known issue.

> I issued the commands
>
> bin/nutch crawl urls -dir crawl -depth 3 -topN 5
>
> and
>
> bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/*
>
> separately, after which I got no errors. When I browsed to http://localhost:8983/solr/admin and attempted a search, I got the error
>
> HTTP ERROR 400
> Problem accessing /solr/select. Reason:
> undefined field text

But this is a Solr thing: you have no field named text. Resolve this in Solr or on the Solr mailing list.

> What am I doing wrong?
>
> Regards,
Markus Jelsma - CTO - Openindex
-
Re: HTTP error 400Tolga 2012-05-10, 07:07
Thanks. *heads to the Solr list*
-
Re: HTTP error 400Stephan Kristyn 2012-05-10, 10:22
Regarding the issue, I found out I was not using the right search syntax.
Make sure you specify a field and a search term separated by a colon, like this:
XMLfieldname:SearchTerm
On 10.05.2012 at 09:07, Tolga wrote:
> Thanks. *heads to the Solr list*
--
stephan kristyn
partner operations manager
"The Internet? Is that thing still around?" - Homer Simpson
[EMAIL PROTECTED]
direct +49 (0)89 231 97 207 mobile +49 (0) 162 28899 02
yahoo! deutschland gmbh theresienhoehe 12, munich, 80339, germany
phone (408) 349 3300 fax (408) 349 3301
-
Re: HTTP error 400Michael Erickson 2012-05-10, 12:56
On May 10, 2012, at 1:42 AM, Markus Jelsma wrote:
> But this is a Solr thing, you have no field named text. Resolve this in Solr or on the Solr mailing list.

I will say that I had similar issues last week when I tried the Nutch tutorial. I went to the #Solr IRC channel and got no response. The quick answer was that I had to go back to Solr version 3.1.0 for the instructions in the Nutch tutorial to work.

The longer answer is that following the existing Nutch tutorial gave me two errors.

1) SolrDeleteDuplicates exception as mentioned by Tolga above.

To fix this I did the following:
1.a) Stop Solr.
1.b) Delete Solr index.
1.c) Copy the Nutch-provided schema.xml into the proper Solr directory (example/solr/conf/).
1.d) Replace Nutch's solr-solrj-xxx.jar with the appropriate version from Solr:
( solr/dist/apache-solr-solrj-xxx.jar --> nutch/runtime/local/lib/solr-solrj-xxx.jar )
1.e) Restart Solr.

The first two steps may only be necessary if you already had Solr running with the default schema it ships with, as I did because I had done the Solr tutorial first.

2) The HTTP 400 Error "undefined field text" issue.

This appears to be the same as: https://issues.apache.org/jira/browse/SOLR-3416. Log output from Solr is here: http://pastebin.com/YWdPnXpv and the Nutch-provided schema is here: http://pastebin.com/LQDDKC5B

The only way I got this working was to move Solr from version 3.6.0 back to version 3.1.0.

I'm *totally* new to Solr/Nutch, but I might suggest a versioning mismatch?

Regards,
--mike
Michael Erickson [EMAIL PROTECTED]
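The file-level part of steps 1.a to 1.e above (1.b, 1.c and 1.d) could be scripted roughly as follows. This is a sketch under assumptions, not a tested procedure from the thread: `sync_nutch_with_solr` is a hypothetical function, the index location example/solr/data/index is assumed rather than stated by Michael, and the jar version is left as a glob because the thread only gives it as "xxx". Stop Solr before running it (1.a) and restart Solr afterwards (1.e).

```shell
# Hypothetical helper covering steps 1.b-1.d.
# $1 = Nutch install dir, $2 = Solr install dir (both assumptions).
sync_nutch_with_solr() {
  local nutch_home="$1"
  local solr_home="$2"
  # Refuse to run with empty arguments so no unexpected path is touched.
  [ -n "$nutch_home" ] && [ -n "$solr_home" ] || return 2

  local solr_conf="$solr_home/example/solr/conf"
  local solr_index="$solr_home/example/solr/data/index"   # assumed location
  local nutch_lib="$nutch_home/runtime/local/lib"

  # 1.b) Delete the existing Solr index, if any.
  rm -rf "$solr_index"

  # 1.c) Copy the Nutch-provided schema.xml into Solr's conf directory.
  cp "$nutch_home/runtime/local/conf/schema.xml" "$solr_conf/schema.xml" || return 1

  # 1.d) Swap Nutch's SolrJ jar for the one shipped in solr/dist.
  rm -f "$nutch_lib"/solr-solrj-*.jar
  cp "$solr_home"/dist/apache-solr-solrj-*.jar "$nutch_lib"/ || return 1
}

# Example usage (directory names are assumptions):
#   sync_nutch_with_solr apache-nutch-1.4-bin apache-solr-3.6.0
```

Note that this copies the Solr jar under its original `apache-solr-solrj-…` name instead of renaming it as in step 1.d; that is a simplification of the sketch.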
-
Re: HTTP error 400Lewis John Mcgibbney 2012-05-10, 13:35
Hi Michael,
As I'm also not using the most recent stable Solr distribution (3.6.0), I can only comment (maybe unwisely) that the most recent version of Solr that Nutch supports is maybe 3.4.0, as this is the dependency we pull with ivy. It also looks like Solr and Solrj are released in parallel, so maybe try upgrading your solrj dependency if you wish to use Solr 3.6.0...

If the above is correct, then this is why 3.1.0 works fine when you roll back, as I would imagine backwards compatibility is always of key importance.

I would be pleased to know that the above is not correct and that Nutch is able to index to Solr 3.6.0; however, if not, then maybe we should upgrade accordingly in trunk.

Thanks
Lewis
-
Re: HTTP error 400Markus Jelsma 2012-05-10, 13:45
On Thursday 10 May 2012 14:35:03 Lewis John Mcgibbney wrote:
> Hi Michael,
>
> As I'm also not using most recent stable Solr distribution (3.6.0), I can only comment (maybe unwisely) that the most recent version of Solr that Nutch supports is maybe 3.4.0 as this is the dependency we pull with ivy. It also looks like Solr and Solrj are released in parallel so maybe try upgrading your solrj dependency if you wish to use Solr 3.6.0...

This should not be a version issue. We happily index from trunk or 1.4 to Solr versions later than 3.0. There must be some schema thing or a bad Solr request handler defined.

Markus Jelsma - CTO - Openindex
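One way to look for the "bad request handler" case: in solrconfig.xml a request handler can set a default query field with `<str name="df">…</str>`, and a query without a field prefix is run against that field. If that value names a field the active schema.xml does not declare, an "undefined field" error is the expected result. The sketch below only lists the configured values; `list_default_fields` is a hypothetical helper, the config path is an assumption, and the thread does not confirm that this is the cause here.

```shell
# Hypothetical helper: print every default query field configured as
# <str name="df">VALUE</str> in the given solrconfig.xml, one per line.
list_default_fields() {
  sed -n 's/.*<str name="df">\([^<]*\)<\/str>.*/\1/p' "$1"
}

# Example usage (path is an assumption):
#   list_default_fields apache-solr-3.6.0/example/solr/conf/solrconfig.xml
# Each value printed should also appear as a <field name="..."> in schema.xml.
```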
-
Re: HTTP error 400Ferdy Galema 2012-05-10, 14:03
It indeed seems to be caused by some sort of (schema) configuration issue.
We are currently trying to resolve this issue. Although the browse page shows an error, it is still possible to search. Use:
/solr/select/?q=content%3Athe
On Thu, May 10, 2012 at 3:45 PM, Markus Jelsma <[EMAIL PROTECTED]> wrote:
> This should not be a version issue. We happily index from trunk or 1.4 to Solr versions > 3.0. There must be some schema thing or bad Solr request handler defined.
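In the URL Ferdy gives, `%3A` is the percent-encoded colon, so the query is `content:the`: the term `the` searched in the field `content`. A minimal sketch for building such URLs follows; `solr_field_query_url` is a hypothetical helper, the base URL is the one used elsewhere in this thread, and the field and term are assumed to contain only URL-safe characters (nothing else is encoded).

```shell
# Hypothetical helper: build a Solr select URL that names the field
# explicitly instead of relying on a default search field.
# $1 = Solr base URL, $2 = field name, $3 = search term (URL-safe only).
solr_field_query_url() {
  local base="${1%/}"   # drop one trailing slash, if present
  local field="$2"
  local term="$3"
  printf '%s/select/?q=%s%%3A%s\n' "$base" "$field" "$term"
}

# Example usage:
#   curl "$(solr_field_query_url http://localhost:8983/solr/ content the)"
```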
-
Re: HTTP error 400Tolga 2012-05-10, 19:54
Hi Markus,
On 05/10/2012 09:42 AM, Markus Jelsma wrote:
> Please include the relevant part of the log. This can be a known issue.

This is an excerpt from hadoop.log:

2012-05-10 22:26:30,349 INFO crawl.Crawl - crawl started in: crawl-20120510222629
2012-05-10 22:26:30,350 INFO crawl.Crawl - rootUrlDir = urls
2012-05-10 22:26:30,351 INFO crawl.Crawl - threads = 10
2012-05-10 22:26:30,351 INFO crawl.Crawl - depth = 3
2012-05-10 22:26:30,351 INFO crawl.Crawl - solrUrl=http://localhost:8983/solr/
2012-05-10 22:26:30,351 INFO crawl.Crawl - topN = 100
2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: starting at 2012-05-10 22:26:30
2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: crawlDb: crawl-20120510222629/crawldb
2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: urlDir: urls
2012-05-10 22:26:30,809 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
2012-05-10 22:26:34,173 INFO plugin.PluginRepository - Plugins: looking in: /root/apache-nutch-1.4-bin/runtime/local/plugins
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Plugins:
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Tika Parser Plug-in (parse-tika)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Anchor Indexing Filter (index-anchor)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Extension-Points:
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Protocol (org.apache.nutch.protocol.Protocol)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Segment Merge Filter (org.apache.nutch.segment.SegmentMergeFilter)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch URL Filter (org.apache.nutch.net.URLFilter)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - HTML Parse Filter (org.apache.nutch.parse.HtmlParseFilter)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Content Parser (org.apache.nutch.parse.Parser)
2012-05-10 22:26:34,963 INFO plugin.PluginRepository - Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
2012-05-10 22:26:35,439 INFO regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
2012-05-10 22:26:36,434 INFO crawl.Injector - Injector: Merging injected urls into crawl db.
2012-05-10 22:26:36,710 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-05-10 22:26:37,542 INFO crawl.Injector - Injector: finished at 2012-05-10 22:26:37, elapsed: 00:00:06
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: starting at 2012-05-10 22:26:37
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: Selecting best-scoring urls due for fetch.
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: filtering: true
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: normalizing: true
2012-05-10 22:26:37,551 INFO crawl.Generator - Generator: topN: 100
2012-05-10 22:26:37,552 INFO crawl.Generator - Generator: jobtracker is 'local', generating exactly one partition.
2012-05-10 22:26:37,820 INFO crawl.FetchScheduleFactory - Using FetchSchedule impl: org.apache.nutch.crawl.DefaultFetchSchedule
2012-05-10 22:26:37,820 INFO crawl.AbstractFetchSchedule - defaultInterval=2592000
2012-05-10 22:26:37,820 INFO crawl.AbstractFetchSchedule - maxInterval=7776000
2012-05-10 22:26:37,856 INFO regex.RegexURLNormalizer - can't find rules for scope 'partition', using default
-
Re: HTTP error 400Markus Jelsma 2012-05-10, 20:38
thanks
This is a known issue:
https://issues.apache.org/jira/browse/NUTCH-1100

I have not been able to find the bug, nor do I know how to reproduce it
from scratch. If you have a public site with which we can reproduce it,
please comment on the Jira ticket. Make sure you use either the default
config or a minimal one, and include a seed URL and the exact crawl and
dedup steps to reproduce.

If you find it, we might fix it. In any case, we need to replace the
dedup command with a more scalable tool, which it currently is not.

In the meantime you can omit solrdedup and use Solr's internal
deduplication instead; it works similarly and uses the same signature
algorithm as Nutch. Please consult the Solr wiki page on deduplication.

Good luck

Markus Jelsma - CTO - Openindex
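For readers unsure what "Solr's internal deduplication" looks like in practice: it is set up as an update processor chain in solrconfig.xml. The sketch below only prints such a fragment so it can be reviewed and merged by hand. The chain name "dedupe", the "signature" field (which would have to exist in schema.xml as an indexed field), and the choice of "content" as the field to hash are all assumptions for illustration, not settings taken from this thread. Check them against your own schema and the Solr wiki page on deduplication.

```shell
#!/bin/sh
# Sketch only: prints an update processor chain fragment for solrconfig.xml.
# Assumptions (not from the thread): chain name "dedupe", a "signature" field
# defined in schema.xml, and the signature computed from the "content" field.
print_dedupe_chain() {
  cat <<'EOF'
<updateRequestProcessorChain name="dedupe">
  <processor class="solr.processor.SignatureUpdateProcessorFactory">
    <bool name="enabled">true</bool>
    <str name="signatureField">signature</str>
    <bool name="overwriteDupes">true</bool>
    <str name="fields">content</str>
    <str name="signatureClass">solr.processor.TextProfileSignature</str>
  </processor>
  <processor class="solr.LogUpdateProcessorFactory" />
  <processor class="solr.RunUpdateProcessorFactory" />
</updateRequestProcessorChain>
EOF
}
```

Running print_dedupe_chain > dedupe-chain.xml writes the fragment to a file for review. The chain still has to be referenced from the update request handler in solrconfig.xml; the parameter name for that depends on the Solr version, so consult the Solr wiki rather than this sketch.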
-
Re: HTTP error 400Tolga 2012-05-11, 04:39
Hi,
How exactly do I "omit solrdedup and use Solr's internal
deduplication" instead? I don't even know what any of that means :D
I've just used

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 100

to get the error. Do I have to use all the steps?

Regards,
-
Re: HTTP error 400Markus Jelsma 2012-05-11, 06:40
Ah, that means don't use the crawl command; instead, do a little shell
scripting to execute the separate crawl cycle commands. See the Nutch
wiki for examples. And don't run solrdedup. Search the Solr wiki for
deduplication.

cheers
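The "little shell scripting" mentioned above could look roughly like the following. This is a minimal sketch modelled on the individual commands in the Nutch 1.x tutorial, not a script from this thread: the crawl/ working directory, the run helper, the DRY_RUN switch, and the crawl_cycle function name are made up for illustration. It ends with solrindex and leaves out solrdedup.

```shell
#!/bin/sh
# Sketch: run the individual Nutch 1.x crawl commands instead of the
# all-in-one "crawl" command. solrdedup is deliberately left out.
# Assumptions: the nutch launcher is at bin/nutch and crawl/ is a
# hypothetical working directory; override via the variables below.
NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"

# Run a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Usage: crawl_cycle SEED_DIR DEPTH TOPN
crawl_cycle() {
  seed_dir="$1"; depth="$2"; topn="$3"

  run "$NUTCH" inject "$CRAWL_DIR/crawldb" "$seed_dir" || return 1

  i=0
  while [ "$i" -lt "$depth" ]; do
    run "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN "$topn" || return 1
    # Segment names are timestamps, so the last one listed is the newest.
    segment=$(ls -d "$CRAWL_DIR"/segments/2* | tail -1)
    run "$NUTCH" fetch "$segment" || return 1
    run "$NUTCH" parse "$segment" || return 1
    run "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment" || return 1
    i=$((i + 1))
  done

  run "$NUTCH" invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments" || return 1
  # Index to Solr; no solrdedup step follows.
  run "$NUTCH" solrindex "$SOLR_URL" "$CRAWL_DIR/crawldb" \
    -linkdb "$CRAWL_DIR/linkdb" "$CRAWL_DIR"/segments/* || return 1
}
```

After sourcing the file, something like crawl_cycle urls 3 100 would mirror the -depth 3 -topN 100 options of the crawl command. Setting DRY_RUN=1 first prints the commands instead of running them, which is a cheap way to check the paths before a real crawl.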
-
Re: HTTP error 400Tolga 2012-05-15, 10:40
I'm a little confused. How can I not use the crawl command and execute
the separate crawl cycle commands at the same time?

Regards,
-
Re: HTTP error 400Markus Jelsma 2012-05-15, 11:05
Please follow the step-by-step tutorial, it's explained there:
http://wiki.apache.org/nutch/NutchTutorial
-
Re: HTTP error 400Tolga 2012-05-15, 12:01
Hi,
I would like to report that the directory layout used in the command

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawldb -linkdb crawldb/linkdb crawldb/segments/*

in the Nutch FAQ doesn't match the previous examples. That said, I'm totally
confused. How can I index to Solr if I don't crawl?
-
Re: HTTP error 400Tolga 2012-05-15, 12:49
bin/nutch solrindex http://localhost:8983/solr/ crawldb -linkdb crawl/linkdb crawl/segments/*
SolrIndexer: starting at 2012-05-15 15:34:36
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/root/apache-nutch-1.4-bin/runtime/local/crawl/current
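The InvalidInputException above complains about a missing "current" subdirectory. Both the crawldb and the linkdb keep their data in a subdirectory of that name, and the command above mixes a bare crawldb path with crawl/... paths, so one plausible cause (not confirmed anywhere in this thread) is that an argument does not point at an existing crawldb or linkdb. A minimal pre-flight check, assuming a hypothetical crawl/ working directory laid out as crawl/crawldb, crawl/linkdb and crawl/segments, might be:

```shell
#!/bin/sh
# Sketch: verify the directories solrindex reads before running it.
# Assumption: the crawl directory (hypothetical default "crawl") contains
# crawldb/, linkdb/ and segments/ as produced by the individual commands.
# Usage: check_index_inputs [CRAWL_DIR]
check_index_inputs() {
  crawl_dir="${1:-crawl}"
  status=0
  # crawldb and linkdb each store their data under "current".
  for path in "$crawl_dir/crawldb/current" "$crawl_dir/linkdb/current"; do
    if [ ! -d "$path" ]; then
      echo "missing: $path" >&2
      status=1
    fi
  done
  # At least one segment directory must exist.
  if ! ls -d "$crawl_dir"/segments/*/ >/dev/null 2>&1; then
    echo "missing: $crawl_dir/segments/*" >&2
    status=1
  fi
  return "$status"
}
```

If check_index_inputs crawl reports nothing and returns 0, the same paths (crawl/crawldb, crawl/linkdb, crawl/segments/*) can be passed to solrindex. If it reports a missing path, the earlier inject, updatedb or invertlinks steps probably wrote somewhere else or have not run yet.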
-
Re: HTTP error 400Tolga 2012-05-17, 10:07
I'm still confused. You mean to use
http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling ?
-
Re: HTTP error 400 Jean-François Gingras 2012-05-19, 01:43
Yes. Also take a look at this page [1] for script examples.

[1] http://wiki.apache.org/nutch/Whole-Web%20Crawling%20incremental%20script

On Thu, May 17, 2012 at 6:07 AM, Tolga <[EMAIL PROTECTED]> wrote:
> I'm still confused. You mean to use
> http://wiki.apache.org/nutch/NutchTutorial#A3.2_Using_Individual_Commands_for_Whole-Web_Crawling ?
>
> On 5/15/12 2:05 PM, Markus Jelsma wrote:
>> Please follow the step-by-step tutorial, it's explained there:
>> http://wiki.apache.org/nutch/NutchTutorial
>>
>> On Tuesday 15 May 2012 13:40:26 Tolga wrote:
>>> I'm a little confused. How can I not use the crawl command and execute
>>> the separate crawl cycle commands at the same time?
>>>
>>> Regards,
>>>
>>> On 5/11/12 9:40 AM, Markus Jelsma wrote:
>>>> Ah, that means don't use the crawl command and do a little shell
>>>> scripting to execute the separate crawl cycle commands, see the nutch
>>>> wiki for examples. And don't do solrdedup. Search the Solr wiki for
>>>> deduplication.
>>>>
>>>> cheers
>>>>
>>>> On Fri, 11 May 2012 07:39:36 +0300, Tolga <[EMAIL PROTECTED]> wrote:
>>>>> Hi,
>>>>>
>>>>> How do I exactly "omit solrdedup and use Solr's internal
>>>>> deduplication" instead? I don't even know what any of that means :D
>>>>> I've just used bin/nutch crawl urls -solr http://localhost:8983/solr/
>>>>> -depth 3 -topN 100 to get the error. I have to use all the steps?
>>>>>
>>>>> Regards,
>>>>>
>>>>> On 05/10/2012 11:38 PM, Markus Jelsma wrote:
>>>>>> thanks
>>>>>>
>>>>>> This is a known issue:
>>>>>> https://issues.apache.org/jira/browse/NUTCH-1100
>>>>>>
>>>>>> I have not been able to find the bug nor do I know how to reproduce it
>>>>>> from scratch. If you have a public site with which we can reproduce
>>>>>> it please comment to the Jira ticket. Make sure you use either
>>>>>> default config or little, a seed URL and the exact crawl & dedup
>>>>>> steps to reproduce.
>>>>>>
>>>>>> If you find it we might fix it. In any case we need to replace the
>>>>>> dedup command with a more scalable tool which it currently is not.
>>>>>>
>>>>>> In the mean time you can omit solrdedup and use Solr's internal
>>>>>> deduplication instead, it works similarly and uses the same signature
>>>>>> algorithm as Nutch has. Please consult the Solr wiki page on
>>>>>> deduplication.
>>>>>>
>>>>>> Good luck
>>>>>>
>>>>>> On Thu, 10 May 2012 22:54:37 +0300, Tolga <[EMAIL PROTECTED]> wrote:
>>>>>>> Hi Markus,
>>>>>>>
>>>>>>> On 05/10/2012 09:42 AM, Markus Jelsma wrote:
>>>>>>>> Hi,
>>>>>>>>
>>>>>>>> On Thu, 10 May 2012 09:10:04 +0300, Tolga <[EMAIL PROTECTED]> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> This will sound like a duplicate, but actually it differs from the
>>>>>>>>> other one. Please bear with me. Following
>>>>>>>>> http://wiki.apache.org/nutch/NutchTutorial, I first issued the
>>>>>>>>> command
>>>>>>>>>
>>>>>>>>> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
>>>>>>>>>
>>>>>>>>> Then when I got the message
>>>>>>>>>
>>>>>>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>>>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>>>>>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>>>>>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>>>>>>>>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
>>>>>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.

Jean-François Gingras
-
Re: HTTP error 400 m2000hsf 2012-05-19, 06:43
Ah, that means don't use the crawl command and do a little shell
scripting to execute the separate crawl cycle commands, see the nutch
wiki for examples. And don't do solrdedup. Search the Solr wiki for
deduplication.

cheers

On Fri, 11 May 2012 07:39:36 +0300, Tolga <[hidden email]> wrote:
> Hi,
>
> How do I exactly "omit solrdedup and use Solr's internal
> deduplication" instead? I don't even know what any of that means :D
> I've just used bin/nutch crawl urls -solr http://localhost:8983/solr/
> -depth 3 -topN 100 to get the error. I have to use all the steps?
>
> Regards,
>
> On 05/10/2012 11:38 PM, Markus Jelsma wrote:
>> thanks
>>
>> This is a known issue:
>> https://issues.apache.org/jira/browse/NUTCH-1100
>>
>> I have not been able to find the bug nor do I know how to reproduce it
>> from scratch. If you have a public site with which we can reproduce it
>> please comment to the Jira ticket. Make sure you use either default
>> config or little, a seed URL and the exact crawl & dedup steps to
>> reproduce.
>>
>> If you find it we might fix it. In any case we need to replace the
>> dedup command with a more scalable tool which it currently is not.
>>
>> In the mean time you can omit solrdedup and use Solr's internal
>> deduplication instead, it works similarly and uses the same signature
>> algorithm as Nutch has. Please consult the Solr wiki page on
>> deduplication.
>>
>> Good luck
>>
>> On Thu, 10 May 2012 22:54:37 +0300, Tolga <[hidden email]> wrote:
>>> Hi Markus,
>>>
>>> On 05/10/2012 09:42 AM, Markus Jelsma wrote:
>>>> Hi,
>>>>
>>>> On Thu, 10 May 2012 09:10:04 +0300, Tolga <[hidden email]> wrote:
>>>>> Hi,
>>>>>
>>>>> This will sound like a duplicate, but actually it differs from the
>>>>> other one. Please bear with me. Following
>>>>> http://wiki.apache.org/nutch/NutchTutorial, I first issued the
>>>>> command
>>>>>
>>>>> bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 5
>>>>>
>>>>> Then when I got the message
>>>>>
>>>>> Exception in thread "main" java.io.IOException: Job failed!
>>>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1252)
>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:373)
>>>>>     at org.apache.nutch.indexer.solr.SolrDeleteDuplicates.dedup(SolrDeleteDuplicates.java:353)
>>>>>     at org.apache.nutch.crawl.Crawl.run(Crawl.java:153)
>>>>>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
>>>>>     at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
>>>>
>>>> Please include the relevant part of the log. This can be a known
>>>> issue.
>>>
>>> This is an excerpt from hadoop.log:
>>>
>>> 2012-05-10 22:26:30,349 INFO crawl.Crawl - crawl started in: crawl-20120510222629
>>> 2012-05-10 22:26:30,350 INFO crawl.Crawl - rootUrlDir = urls
>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - threads = 10
>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - depth = 3
>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - solrUrl=http://localhost:8983/solr/
>>> 2012-05-10 22:26:30,351 INFO crawl.Crawl - topN = 100
>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: starting at 2012-05-10 22:26:30
>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: crawlDb: crawl-20120510222629/crawldb
>>> 2012-05-10 22:26:30,750 INFO crawl.Injector - Injector: urlDir: urls
>>> 2012-05-10 22:26:30,809 INFO crawl.Injector - Injector: Converting injected urls to crawl db entries.
>>> 2012-05-10 22:26:34,173 INFO plugin.PluginRepository - Plugins: looking in: /root/apache-nutch-1.4-bin/runtime/local/plugins
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Plugin Auto-activation mode: [true]
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - Registered Plugins:
>>> 2012-05-10 22:26:34,962 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
>>> ...
Markus Jelsma - CTO - Openindex

View this message in context: http://lucene.472066.n3.nabble.com/HTTP-error-400-tp3976225p3984830.html
Sent from the Nutch - User mailing list archive at Nabble.com.
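
The advice in this message (skip the all-in-one crawl command, script the individual steps, and leave out solrdedup) might look roughly like the sketch below. It is not a tested recipe: the sub-commands follow the NutchTutorial for Nutch 1.4, but the directory names, the DRY_RUN switch and the run/latest_segment helpers are assumptions made for illustration, and the solrindex argument order differs between Nutch releases, so check the usage message of your version.

```shell
#!/bin/sh
# Sketch of a step-by-step Nutch 1.x crawl that never calls solrdedup.
# Paths, the Solr URL and the DRY_RUN switch are illustrative assumptions.
set -eu

NUTCH="${NUTCH:-bin/nutch}"
CRAWL_DIR="${CRAWL_DIR:-crawl}"
SEED_DIR="${SEED_DIR:-urls}"
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr/}"
DEPTH="${DEPTH:-3}"      # number of generate/fetch/parse/updatedb rounds
TOPN="${TOPN:-100}"      # maximum URLs selected per round
DRY_RUN="${DRY_RUN:-1}"  # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, execute it otherwise.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    echo "$@"
  else
    "$@"
  fi
}

# Newest segment directory; segment names are timestamps, so ls sorts them by age.
latest_segment() {
  ls -d "$CRAWL_DIR"/segments/2* 2>/dev/null | tail -1
}

main() {
  # Seed the crawl database from the URL list.
  run "$NUTCH" inject "$CRAWL_DIR/crawldb" "$SEED_DIR"

  round=1
  while [ "$round" -le "$DEPTH" ]; do
    run "$NUTCH" generate "$CRAWL_DIR/crawldb" "$CRAWL_DIR/segments" -topN "$TOPN"
    segment="$(latest_segment)"
    if [ -z "$segment" ]; then
      if [ "$DRY_RUN" = "1" ]; then
        segment="$CRAWL_DIR/segments/SEGMENT"  # placeholder name for the printout
      else
        echo "no segment generated, stopping" >&2
        break
      fi
    fi
    run "$NUTCH" fetch "$segment"
    run "$NUTCH" parse "$segment"
    run "$NUTCH" updatedb "$CRAWL_DIR/crawldb" "$segment"
    round=$((round + 1))
  done

  # Build the link database and index into Solr; no solrdedup step afterwards.
  run "$NUTCH" invertlinks "$CRAWL_DIR/linkdb" -dir "$CRAWL_DIR/segments"
  run "$NUTCH" solrindex "$SOLR_URL" "$CRAWL_DIR/crawldb" -linkdb "$CRAWL_DIR/linkdb" "$CRAWL_DIR"/segments/*
}

main
```

With the default DRY_RUN=1 the script only prints the commands it would run. Saved under a name of your choice (say crawl.sh, a hypothetical file name) in the Nutch runtime directory, DRY_RUN=0 sh crawl.sh would execute them.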
-
Re: HTTP error 400 keesp 2012-05-24, 08:29
Stephan Kristyn wrote
> Regarding the issue, I found out I was not using the right search syntax.
>
> Make sure you specify a field and search term with a colon, like this:
>
> XMLfieldname:SearchTerm

I was having the same problem (HTTP error 400: undefined field text), using the most recent versions of Solr and Nutch (and having used the Nutch configuration files). A search using content:searchterm works fine.

Is there any way to tell Solr that text:searchterm is equivalent to content:searchterm?

Thanks

--
View this message in context: http://lucene.472066.n3.nabble.com/HTTP-error-400-tp3976225p3985866.html
Sent from the Nutch - User mailing list archive at Nabble.com.
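
On keesp's question, one option that may help, offered as a sketch rather than a confirmed fix, is to name the default search field per request with Solr's df parameter, so that a bare term is searched in content instead of the missing text field. An explicit text:searchterm would presumably still fail unless the schema defines a text field (for example one filled by a copyField from content). The base URL and the two helper function names below are hypothetical, and the search term is not URL-encoded, so this only suits simple terms.

```shell
#!/bin/sh
# Hypothetical helpers that build Solr select URLs; base URL is an assumption.
SOLR_URL="${SOLR_URL:-http://localhost:8983/solr}"

# Query with an explicit field prefix, e.g. content:searchterm.
fielded_query_url() {   # $1 = field name, $2 = search term
  printf '%s/select?q=%s:%s\n' "$SOLR_URL" "$1" "$2"
}

# Bare term; df names the field searched when the query has no field prefix.
default_field_query_url() {   # $1 = default field, $2 = search term
  printf '%s/select?q=%s&df=%s\n' "$SOLR_URL" "$2" "$1"
}

fielded_query_url content searchterm
default_field_query_url content searchterm

# To actually send a query (requires a running Solr):
#   curl "$(default_field_query_url content searchterm)"
```

Both printed URLs search the content field; the second does so without a field prefix in the query itself.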