Muhammad Rizwan
2011-12-09, 07:40
remi tassing
2011-12-09, 08:26
Lewis John Mcgibbney
2011-12-09, 12:08
Lewis John Mcgibbney
2011-12-09, 12:11
remi tassing
2011-12-09, 12:13
M.Rizwan
2011-12-10, 13:54
remi tassing
2011-12-23, 13:45
Markus Jelsma
2011-12-23, 13:49
remi tassing
2011-12-23, 13:52
Markus Jelsma
2011-12-23, 13:59
remi tassing
2011-12-23, 14:13
-
Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
Muhammad Rizwan 2011-12-09, 07:40
Hi,

I am new to Nutch and configured Nutch 1.4 on my Linux machine using the tutorial here <http://wiki.apache.org/nutch/NutchTutorial#A1_Setup_Nutch_from_binary_distribution>.

Now when I run this command to crawl my first website

# bin/nutch crawl urls -dir crawl -depth 3 -topN 5

it starts working, and after a few seconds I get the following error:

Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209175156/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Any idea what is going wrong here?

- Riz
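The stack trace above shows LinkDb.invert failing because two segment directories have no parse_data subdirectory. A minimal sketch for finding segments in that state follows. It assumes the usual Nutch 1.x segment layout (content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text); that list and the function name missing_subdirs are illustrative assumptions, not part of Nutch itself.

```python
from pathlib import Path

# Assumed Nutch 1.x segment layout; adjust if your installation differs.
EXPECTED_SUBDIRS = (
    "content", "crawl_fetch", "crawl_generate",
    "crawl_parse", "parse_data", "parse_text",
)

def missing_subdirs(segments_dir):
    """Map each incomplete segment name to the expected subdirs it lacks."""
    report = {}
    for segment in sorted(Path(segments_dir).iterdir()):
        if not segment.is_dir():
            continue  # ignore stray files next to the segment directories
        missing = [name for name in EXPECTED_SUBDIRS
                   if not (segment / name).is_dir()]
        if missing:
            report[segment.name] = missing
    return report
```

Calling missing_subdirs("crawl/segments") from the runtime/local directory would return one entry per incomplete segment; complete segments do not appear in the result.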
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
remi tassing 2011-12-09, 08:26
Hello guys,

How do you use "org.apache.nutch.net.URLFilterChecker"? It's not documented, and it always shows me "Checking combination of all URLFilters available" and then gets stuck.

Remi
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
Lewis John Mcgibbney 2011-12-09, 12:08
Hi Remi,

Please don't hijack someone's thread, start your own.

Thank you

Lewis
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
Lewis John Mcgibbney 2011-12-09, 12:11
Hi Riz,

Did you verify that Nutch is installed correctly?

http://wiki.apache.org/nutch/NutchTutorial#A2._Verify_your_Nutch_installation

If you have Nutch installed and correctly configured, there should be no problems running it in local mode as you are doing.

--
*Lewis*
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
remi tassing 2011-12-09, 12:13
Sorry, I forgot to change the title...

However, I had the same error "Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/..." this morning.

I believe it's because I stopped Nutch while it was crawling and the data were not saved properly.

I couldn't find an alternative and just had to delete my "crawl" folder, then it worked... Not a good solution!

--
Remi Tassing
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
M.Rizwan 2011-12-10, 13:54
Thanks Remi. Yes, not a good solution, but this worked for me too.

Thanks for sharing.
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
remi tassing 2011-12-23, 13:45
My computer shut down yesterday and I'm having the same problem. The problem this time is that I can't just delete everything and start again. I've been crawling for days!

Are there any other ways to handle this? Remove segments? Sanitize the database?

--
Remi Tassing
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
Markus Jelsma 2011-12-23, 13:49
You have to get rid of the bad segments; they cannot be recovered. With Nutch 1.x it is never a good idea to use extremely large segments that take days to run.

--
Markus Jelsma - CTO - Openindex
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
remi tassing 2011-12-23, 13:52
Just deleting the folders?

--
Remi Tassing
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
Markus Jelsma 2011-12-23, 13:59
Yes, all segments/* that show errors. They are useless; only the crawl_generate subdir can be used again to restart the crawl from scratch.

--
Markus Jelsma - CTO - Openindex
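Markus's advice amounts to removing every segment that can no longer be processed. A cautious sketch of that cleanup is shown here: it moves suspect segments aside instead of deleting them, so they can still be inspected or removed by hand afterwards. It treats "no parse_data subdirectory" as the marker of a bad segment, which matches the error in this thread but is an assumption and may not catch every kind of damage; the function name quarantine_bad_segments and the quarantine directory are hypothetical.

```python
import shutil
from pathlib import Path

def quarantine_bad_segments(segments_dir, quarantine_dir, required="parse_data"):
    """Move segments lacking the `required` subdir into quarantine_dir.

    quarantine_dir should live outside segments_dir. Returns the names of
    the segments that were moved, in sorted order.
    """
    segments = Path(segments_dir)
    quarantine = Path(quarantine_dir)
    moved = []
    for segment in sorted(segments.iterdir()):
        if segment.is_dir() and not (segment / required).is_dir():
            # Create the quarantine directory only when it is needed.
            quarantine.mkdir(parents=True, exist_ok=True)
            shutil.move(str(segment), str(quarantine / segment.name))
            moved.append(segment.name)
    return moved
```

For the layout in this thread, a call such as quarantine_bad_segments("crawl/segments", "crawl/bad_segments") would leave only the segments that still have parse_data in place.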
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
remi tassing 2011-12-23, 14:13
It looks like it's working now, muchas gracias Markus!!!

--
Remi Tassing