-
Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
Muhammad Rizwan 2011-12-09, 07:40
Hi,
I am new to Nutch and configured Nutch 1.4 using the tutorial here <http://wiki.apache.org/nutch/NutchTutorial#A1_Setup_Nutch_from_binary_distribution> on my Linux machine.
Now when I run this command to crawl my first website
# bin/nutch crawl urls -dir crawl -depth 3 -topN 5
it starts working, and after a few seconds I get the following error:
Exception in thread "main" org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209175156/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:149)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Any idea what is going wrong here?
- Riz
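Editor's note: the exception names each segment whose parse_data directory is missing, so the affected segments can be read straight from the log. The following is a minimal sketch in Python, not part of Nutch; the function name missing_segment_dirs and the regular expression are illustrative assumptions, and it only handles messages of the form shown in the error above.

```python
import re

# Matches the path that follows Hadoop's "Input path does not exist:" message.
# The optional "file:" scheme prefix is dropped so a plain filesystem path remains.
_MISSING = re.compile(r"Input path does not exist:\s*(?:file:)?(\S+)")


def missing_segment_dirs(log_text):
    """Return the segment directories whose inputs were reported as missing.

    Hypothetical helper: it strips the last path component (for example
    parse_data) and keeps each segment once, in order of first appearance.
    """
    segments = []
    for match in _MISSING.finditer(log_text):
        path = match.group(1).rstrip("/")
        segment = path.rsplit("/", 1)[0]  # drop the trailing subdirectory name
        if segment not in segments:
            segments.append(segment)
    return segments
```

Passing the error text from the message above to missing_segment_dirs would list the two segment directories 20111209174842 and 20111209175156 under crawl/segments.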
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
remi tassing 2011-12-09, 08:26
Hello guys,
how do you use "org.apache.nutch.net.URLFilterChecker"? It's not documented and it always shows me this "Checking combination of all URLFilters available" and then gets stuck.
Remi
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
Lewis John Mcgibbney 2011-12-09, 12:08
Hi Remi,
Please don't hijack someone's thread, start your own.
Thank you
Lewis
-- *Lewis*
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
Lewis John Mcgibbney 2011-12-09, 12:11
Hi Riz,
Did you verify that Nutch is installed correctly? http://wiki.apache.org/nutch/NutchTutorial#A2._Verify_your_Nutch_installation
If you have Nutch installed and correctly configured, there should be no problems running it in local mode as you are doing.
-- *Lewis*
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
remi tassing 2011-12-09, 12:13
Sorry, I forgot to change the title...
However I had the same error "Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/..." this morning.
I believe it's because I stopped Nutch while it was crawling and data were not saved properly.
I couldn't find an alternative and just had to delete my "crawl" folder, then it worked...Not a good solution!
-- Remi Tassing
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
M.Rizwan 2011-12-10, 13:54
Thanks Remi. Yes, not a good solution, but this worked for me too.
Thanks for sharing.
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
remi tassing 2011-12-23, 13:45
My computer shut down yesterday and I'm having the same problem. The problem this time is that I can't just delete everything and start again. I've been crawling for days!
Any other ways to handle this? Remove segments? Sanitize the database?
-- Remi Tassing
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
Markus Jelsma 2011-12-23, 13:49
You have to get rid of the bad segments; they cannot be recovered. With Nutch 1.x it is never a good idea to use extremely large segments that take days to run.
-- Markus Jelsma - CTO - Openindex
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
remi tassing 2011-12-23, 13:52
Just deleting the folders?
-- Remi Tassing
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
Markus Jelsma 2011-12-23, 13:59
Yes, all segments/* that show errors. They are useless; only the crawl_generate subdir can be used again to restart the crawl from scratch.
-- Markus Jelsma - CTO - Openindex
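Editor's note: the advice above (delete every segment that shows errors) can be preceded by a check for which segments are actually incomplete. Below is a minimal sketch in Python, not a Nutch tool. It assumes that a fully fetched and parsed Nutch 1.x segment contains the six subdirectories listed in EXPECTED_SUBDIRS; the function name find_incomplete_segments is hypothetical. It only reports and never deletes anything, and a segment that is still being fetched or parsed will also show up as incomplete, so it should be run only while no crawl is in progress.

```python
import os

# Subdirectories that a fully fetched and parsed Nutch 1.x segment normally has.
# This list is an assumption of the sketch; adjust it if your setup differs.
EXPECTED_SUBDIRS = (
    "content",
    "crawl_fetch",
    "crawl_generate",
    "crawl_parse",
    "parse_data",
    "parse_text",
)


def find_incomplete_segments(segments_dir):
    """Map each incomplete segment path to the subdirectories it lacks."""
    incomplete = {}
    for name in sorted(os.listdir(segments_dir)):
        segment = os.path.join(segments_dir, name)
        if not os.path.isdir(segment):
            continue  # ignore stray files next to the segment directories
        missing = [
            sub for sub in EXPECTED_SUBDIRS
            if not os.path.isdir(os.path.join(segment, sub))
        ]
        if missing:
            incomplete[segment] = missing
    return incomplete
```

Calling find_incomplete_segments("crawl/segments") from the runtime/local directory used in this thread would return the segments that lack parse_data or any other expected subdirectory. Those are the candidates to remove by hand once you have confirmed they are the ones named in the exception.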
-
Re: Exception org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/nutch/1.4/runtime/local/crawl/segments/20111209174842/parse_data
remi tassing 2011-12-23, 14:13
It looks like it's working now, muchas gracias Markus!!!
-- Remi Tassing