-
parse data directory not found after merge
Dean Pullen 2012-01-05, 17:28
Hi all,

I'm upgrading from nutch 1 to 1.4 and am having problems running invertlinks.

Error:

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch/data/crawl/segments/20120105172548/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)

I notice that the parse_data directories are produced after a fetch (with fetcher.parse set to true), but after the merge the parse_data directory doesn't exist.

What behaviour has changed since 1.0 and does anyone have a solution for the above?

Thanks in advance,

Dean.
-
Re: parse data directory not found after merge
Lewis John Mcgibbney 2012-01-05, 17:39
Hi Dean,

Depending on the size of the segments you're fetching, in most cases I would advise you to separate fetching and parsing into individual steps. The reason becomes clear as your segments grow: the larger the segment, the more likely it is that something goes wrong when fetching and parsing are done together. This looks like a segment that hit problems during parsing while it was being fetched, so no parse_data was produced.

Can you please try a test fetch (with the parsing boolean set to false) on a sample segment, then a separate parse, and report back to us?

Thanks

-- Lewis
-
Re: parse data directory not found after merge
Dean Pullen 2012-01-06, 10:04
Lewis,

Many thanks for your reply.

I've separated the parsing from the fetching. Each segment (we run the crawl 5 times) has the parse_data directory after parsing (observed by pausing the process), but the mergesegs command does not reproduce the parse_data directory, so invertlinks fails with the same "parse_data not found" error.

The merged segments directory has only the crawl_generate and crawl_fetch directories, and none of the others you can see in the other segment directories.

Regards,

Dean.
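One way to confirm the observation above is to list which of the expected subdirectories a segment lacks. The sketch below is not from the thread: missing_parts is a hypothetical helper, and its default list of parts is an assumption (the three names Dean mentions plus crawl_parse and parse_text, which a parsed Nutch 1.x segment normally also contains).

```shell
# Print the name of every expected subdirectory that is absent from a segment.
# Usage: missing_parts <segment_dir> [part ...]
# With no parts given, a default list is used (an assumption, see above).
missing_parts() {
    segment=$1
    shift
    [ "$#" -gt 0 ] || set -- crawl_generate crawl_fetch crawl_parse parse_data parse_text
    for part in "$@"; do
        # Only directories count; a plain file with the same name is still "missing".
        [ -d "$segment/$part" ] || echo "$part"
    done
}
```

Run against the merged segment described above, this would be expected to print crawl_parse, parse_data and parse_text, and to print nothing for a segment that was parsed correctly.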
-
Re: parse data directory not found after merge
Dean Pullen 2012-01-06, 10:42
I'd like to reiterate that this all works in v1...

Dean
-
Re: parse data directory not found after merge
Dean Pullen 2012-01-06, 12:14
I've also tried nutch v1.3 with the same outcome (i.e. the parse_data directory is not found).
-
Re: parse data directory not found after merge
Lewis John Mcgibbney 2012-01-06, 14:33
Can you please post your script, or the commands (and parameters) you are passing? I suspect there may be something lurking that we could fix now, e.g. differences between the 1.0/1.3 commands and the current 1.4. If not, then you may have flagged up something which requires some TLC.

Thanks

-- Lewis
-
Re: parse data directory not found after merge
Dean Pullen 2012-01-06, 15:30
No problem Lewis, I appreciate you looking into it.

Firstly I have a seed URL XML document here:
http://www.ukcigarforums.com/injectlist.xml
This basically has 'http://www.ukcigarforums.com/content.php' as a URL within it.

Nutch's regex-urlfilter.txt contains this:

# allow urls in ukcigarforums.com domain
+http://([a-z0-9-A-Z]*.)*ukcigarforums.com/
# deny anything else
-.

Here's the procedure:

1) INJECT:
/opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/ /opt/nutch_1_4/data/seed/

2) GENERATE:
/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/ /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26

3) FETCH:
/opt/nutch_1_4/bin/nutch fetch /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15

4) PARSE:
/opt/nutch_1_4/bin/nutch parse /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15

5) UPDATE DB:
/opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/ /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter

Repeat steps 2 to 5 another 4 times, then:

6) MERGE SEGMENTS:
/opt/nutch_1_4/bin/nutch mergesegs /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize

Interestingly, this prints out:
"SegmentMerger: using segment data from: crawl_generate crawl_fetch crawl_parse parse_data parse_text"

The MERGEDsegments segment directory then has just two directories, instead of all of those listed in the last output, i.e. just: crawl_generate and crawl_fetch

(we then delete from the segments directory and copy the MERGEDsegments results into it)

Lastly we run invert links after merge segments:

7) INVERT LINKS:
/opt/nutch_1_4/bin/nutch invertlinks /opt/nutch_1_4/data/crawl/linkdb/ -dir /opt/nutch_1_4/data/crawl/segments/

Which produces:

"LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch_1_4/data/crawl/segments/20120106152527/parse_data"
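Steps 2 to 5 above can be sketched as one repeatable round. This is an illustration, not Dean's actual script: the function names are hypothetical, the flags are copied from the commands in the message, and choosing the last directory name in sort order as the newest segment assumes the timestamp naming shown in the paths above. A single CRAWL variable is used for every path so the crawldb and segments locations cannot drift apart.

```shell
# Both values can be overridden from the environment; the defaults mirror
# the paths used in the message above.
NUTCH="${NUTCH:-/opt/nutch_1_4/bin/nutch}"
CRAWL="${CRAWL:-/opt/nutch_1_4/data/crawl}"

# Print the newest segment under a segments directory. Globs expand in sorted
# order, so with timestamp names (e.g. 20120106152527) the last match is newest.
newest_segment() {
    newest=
    for dir in "$1"/*/; do
        [ -d "$dir" ] && newest=${dir%/}
    done
    [ -n "$newest" ] && echo "$newest"
}

# One generate/fetch/parse/updatedb round, stopping at the first failing step.
crawl_round() {
    "$NUTCH" generate "$CRAWL/crawldb" "$CRAWL/segments" -topN 10000 -adddays 26 || return 1
    # The segment name changes every round, so look it up rather than hard-code it.
    segment=$(newest_segment "$CRAWL/segments") || return 1
    "$NUTCH" fetch "$segment" -threads 15 || return 1
    "$NUTCH" parse "$segment" -threads 15 || return 1
    "$NUTCH" updatedb "$CRAWL/crawldb" "$segment" -normalize -filter || return 1
}
```

Five rounds would then be `for i in 1 2 3 4 5; do crawl_round || break; done`, with the merge and invertlinks steps run afterwards as in the message.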
-
Re: parse data directory not found after merge
Lewis John Mcgibbney 2012-01-06, 15:43
Hi Dean,

Without discussing any of your configuration properties, can you please try

6) MERGE SEGMENTS:
/opt/nutch_1_4/bin/nutch mergesegs /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir /opt/nutch_1_4/data/crawl/segments/* -filter -normalize

paying attention to the wildcard /* in -dir /opt/nutch_1_4/data/crawl/segments/*

Also, presumably when you mention that you repeat steps 2-5 another 4 times, you are not recursively generating, fetching, parsing and updating the WebDB with /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change with every iteration of the g/f/p/updatedb cycle.

Thanks

-- Lewis
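A note on the wildcard suggested above: an unquoted /* is expanded by the shell before Nutch starts, so the command receives one argument per segment directory rather than the pattern itself. The sketch below only makes that expansion visible; show_args is a hypothetical helper, and how mergesegs interprets the extra arguments after -dir is not something this example demonstrates.

```shell
# Print each argument on its own line, exactly as a command would receive it.
show_args() {
    for arg in "$@"; do
        printf '%s\n' "$arg"
    done
}
```

For example, `show_args -dir /opt/nutch_1_4/data/crawl/segments/* -filter -normalize` prints -dir, then one line per existing segment directory, then the two flags. If nothing matches the pattern, a POSIX shell passes the pattern through unchanged.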
-
Re: parse data directory not found after merge
Dean Pullen 2012-01-06, 16:08
Lewis,

Changing the merge to * returns a similar response:

LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files

And yes, your assumption was correct - it's a different segment directory each loop.

Many thanks,

Dean.
-
Re: parse data directory not found after merge
Lewis John Mcgibbney 2012-01-06, 16:17
Ok then,

How about your generate command:

2) GENERATE:
/opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/ /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26

Your <segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc, when everything else used within the crawl cycle points to an entirely different <segments_dir> path, which is /opt/nutch_1_4/data/crawl/segments/segment_date

Was this intentional?

-- Lewis
-
Re: parse data directory not found after merge
Dean Pullen 2012-01-06, 16:24
Good spot because all of that was meant to be removed! No, I'm afraid that's just a copy/paste problem.

Dean
-
Re: parse data directory not found after merge
Lewis John Mcgibbney 2012-01-06, 16:28
How about merging segs after every subsequent iteration of the crawl
cycle... surely this is a problem with producing the specific parse_data directory. If it doesn't work after two iterations then we know that it is happening early on in the crawl cycle. Have you manually checked that the directories exist after fetching and parsing? On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen <[EMAIL PROTECTED]> wrote: > Good spot because all of that was meant to be removed! No, I'm afraid that's > just a copy/paste problem. > > Dean > > On 06/01/2012 16:17, Lewis John Mcgibbney wrote: >> >> Ok then, >> >> How about your generate command: >> >> 2) GENERATE: >> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/ >> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26 >> >> Your<segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc, >> when everything else being utilised within the crawl cycle points to >> an entirely different<segment_dirs> path which is >> /opt/nutch_1_4/data/crawl/segments/segment_date >> >> Was this intentional? >> >> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen<[EMAIL PROTECTED]> >> wrote: >>> >>> Lewis, >>> >>> Changing the merge to * returns a similar response: >>> >>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern >>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files >>> >>> And yes, your assumption was correct - it's a different segment directory >>> each loop. >>> >>> Many thanks, >>> >>> Dean. 
>>> >>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote: >>>> >>>> Hi Dean, >>>> >>>> Without discussing any of your configuration properties can you please >>>> try >>>> >>>> 6) MERGE SEGMENTS: >>>> /opt/nutch_1_4/bin/nutch mergesegs >>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir >>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize >>>> >>>> paying attention to the wildcard /* in -dir >>>> /opt/nutch_1_4/data/crawl/segments/* >>>> >>>> Also presumably, when you mention you repeat steps 2-5 another 4 >>>> times, you are not recursively generating, fetching, parsing and >>>> updating the WebDB with >>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change >>>> with every iteration of the g/f/p/updatedb cycle. >>>> >>>> Thanks >>>> >>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<[EMAIL PROTECTED]> >>>> wrote: >>>>> >>>>> No problem Lewis, I appreciate you looking into it. >>>>> >>>>> >>>>> Firstly I have a seed URL XML document here: >>>>> http://www.ukcigarforums.com/injectlist.xml >>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL >>>>> within it. >>>>> >>>>> Nutch's regex-urlfilter.txt contains this: >>>>> >>>>> # allow urls in ukcigarforums.com domain >>>>> +http://([a-z0-9-A-Z]*.)*ukcigarforums.com/ >>>>> # deny anything else >>>>> -. 
>>>>> >>>>> >>>>> Here's the procedure: >>>>> >>>>> >>>>> 1) INJECT: >>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/ >>>>> /opt/nutch_1_4/data/seed/ >>>>> >>>>> 2) GENERATE: >>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/ >>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays >>>>> 26 >>>>> >>>>> 3) FETCH: >>>>> /opt/nutch_1_4/bin/nutch fetch >>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15 >>>>> >>>>> 4) PARSE: >>>>> /opt/nutch_1_4/bin/nutch parse >>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15 >>>>> >>>>> 5) UPDATE DB: >>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/ >>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter >>>>> >>>>> >>>>> Repeat steps 2 to 5 another 4 times, then: >>>>> >>>>> 6) MERGE SEGMENTS: >>>>> /opt/nutch_1_4/bin/nutch mergesegs >>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ >>>>> -dir /opt/nutch_1_4/data/crawl/segments/ -filter -normalize >>>>> >>>>> >>>>> Interestingly, this prints out: >>>>> "SegmentMerger: using segment data from: crawl_generate crawl_fetch Lewis
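Whether the directories exist after fetching and parsing can be checked mechanically rather than by eye. A minimal sketch, assuming the five segment parts named in the SegmentMerger output quoted in this thread; the function name `missing_parts` is made up for illustration and is not part of Nutch:

```shell
# Hypothetical helper (not part of Nutch): print the names of the segment
# parts that are absent from one segment directory.
# The part names are the ones listed in the SegmentMerger output quoted
# in this thread; a real segment may contain others.
missing_parts() {
  seg="$1"
  for part in crawl_generate crawl_fetch crawl_parse parse_data parse_text; do
    [ -d "$seg/$part" ] || printf '%s\n' "$part"
  done
}

# Example call (segment path taken from the thread):
#   missing_parts /opt/nutch_1_4/data/crawl/segments/20120106152527
```

Running it once after the parse step and once after mergesegs would show at which step the parts disappear.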
-
Re: parse data directory not found after merge
Dean Pullen 2012-01-06, 16:38
Two iterations do the same thing - the parse_data directory is missing.
Interestingly, just doing the mergesegs on ONE crawl also removes the parse_data dir etc! Dean. On 06/01/2012 16:28, Lewis John Mcgibbney wrote: > How about merging segs after every subsequent iteration of the crawl > cycle... surely this is a problem with producing the specific > parse_data directory. If it doesn't work after two iterations then we > know that it is happening early on in the crawl cycle. Have you > manually checked that the directories exist after fetching and > parsing? > > On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen<[EMAIL PROTECTED]> wrote: >> Good spot because all of that was meant to be removed! No, I'm afraid that's >> just a copy/paste problem. >> >> Dean >> >> On 06/01/2012 16:17, Lewis John Mcgibbney wrote: >>> Ok then, >>> >>> How about your generate command: >>> >>> 2) GENERATE: >>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/ >>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays 26 >>> >>> Your<segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc, >>> when everything else being utilised within the crawl cycle points to >>> an entirely different<segment_dirs> path which is >>> /opt/nutch_1_4/data/crawl/segments/segment_date >>> >>> Was this intentional? >>> >>> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen<[EMAIL PROTECTED]> >>> wrote: >>>> Lewis, >>>> >>>> Changing the merge to * returns a similar response: >>>> >>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern >>>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files >>>> >>>> And yes, your assumption was correct - it's a different segment directory >>>> each loop. >>>> >>>> Many thanks, >>>> >>>> Dean. 
>>>> >>>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote: >>>>> Hi Dean, >>>>> >>>>> Without discussing any of your configuration properties can you please >>>>> try >>>>> >>>>> 6) MERGE SEGMENTS: >>>>> /opt/nutch_1_4/bin/nutch mergesegs >>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir >>>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize >>>>> >>>>> paying attention to the wildcard /* in -dir >>>>> /opt/nutch_1_4/data/crawl/segments/* >>>>> >>>>> Also presumably, when you mention you repeat steps 2-5 another 4 >>>>> times, you are not recursively generating, fetching, parsing and >>>>> updating the WebDB with >>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change >>>>> with every iteration of the g/f/p/updatedb cycle. >>>>> >>>>> Thanks >>>>> >>>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<[EMAIL PROTECTED]> >>>>> wrote: >>>>>> No problem Lewis, I appreciate you looking into it. >>>>>> >>>>>> >>>>>> Firstly I have a seed URL XML document here: >>>>>> http://www.ukcigarforums.com/injectlist.xml >>>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a URL >>>>>> within it. >>>>>> >>>>>> Nutch's regex-urlfilter.txt contains this: >>>>>> >>>>>> # allow urls in ukcigarforums.com domain >>>>>> +http://([a-z0-9-A-Z]*.)*ukcigarforums.com/ >>>>>> # deny anything else >>>>>> -. 
>>>>>> >>>>>> >>>>>> Here's the procedure: >>>>>> >>>>>> >>>>>> 1) INJECT: >>>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/ >>>>>> /opt/nutch_1_4/data/seed/ >>>>>> >>>>>> 2) GENERATE: >>>>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/ >>>>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays >>>>>> 26 >>>>>> >>>>>> 3) FETCH: >>>>>> /opt/nutch_1_4/bin/nutch fetch >>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15 >>>>>> >>>>>> 4) PARSE: >>>>>> /opt/nutch_1_4/bin/nutch parse >>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -threads 15 >>>>>> >>>>>> 5) UPDATE DB: >>>>>> /opt/nutch_1_4/bin/nutch updatedb /opt/nutch_1_4/data/crawl/crawldb/ >>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527 -normalize -filter >>>>>> >>>>>> >>>>>> Repeat steps 2 to 5 another 4 times, then:
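The observation that mergesegs on a single segment already drops parse_data can be confirmed by comparing the input segment with the merged one. A rough sketch; `lost_in_merge` is a hypothetical helper, not a Nutch command, and it only compares directory names, not their contents:

```shell
# Hypothetical helper (not part of Nutch): print every subdirectory that
# exists in the input segment but is absent from the merged segment.
lost_in_merge() {
  input_seg="$1"
  merged_seg="$2"
  for dir in "$input_seg"/*/; do
    [ -d "$dir" ] || continue            # the glob matched nothing
    name=$(basename "$dir")
    [ -d "$merged_seg/$name" ] || printf '%s\n' "$name"
  done
}

# Example call; <input> and <merged> stand for one original segment and the
# timestamped segment that mergesegs wrote under MERGEDsegments:
#   lost_in_merge /opt/nutch_1_4/data/crawl/segments/<input> \
#                 /opt/nutch_1_4/data/crawl/MERGEDsegments/<merged>
```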
-
Re: parse data directory not found after merge
Lewis John Mcgibbney 2012-01-06, 16:41
Another thing which I have stupidly not asked yet, have you checked
your hadoop.log to see if there are any problems around the parse
phase? It should begin

LOG.info("ParseSegment: starting at " + sdf.format(start));
LOG.info("ParseSegment: segment: " + segment);
... if successful ...
LOG.info("Parsed (" + Long.toString(end - start) + "ms):" + url);
... if not then ...
LOG.warn("Error parsing: " etc

Any joy?

On Fri, Jan 6, 2012 at 4:38 PM, Dean Pullen <[EMAIL PROTECTED]> wrote: > Two iterations do the same thing - the parse_data directory is missing. > > Interestingly, just doing the mergesegs on ONE crawl also removes the > parse_data dir etc! > > Dean. > > > > On 06/01/2012 16:28, Lewis John Mcgibbney wrote: >> >> How about merging segs after every subsequent iteration of the crawl >> cycle... surely this is a problem with producing the specific >> parse_data directory. If it doesn't work after two iterations then we >> know that it is happening early on in the crawl cycle. Have you >> manually checked that the directories exist after fetching and >> parsing? >> >> On Fri, Jan 6, 2012 at 4:24 PM, Dean Pullen<[EMAIL PROTECTED]> >> wrote: >>> >>> Good spot because all of that was meant to be removed! No, I'm afraid >>> that's >>> just a copy/paste problem. >>> >>> Dean >>> >>> On 06/01/2012 16:17, Lewis John Mcgibbney wrote: >>>> >>>> Ok then, >>>> >>>> How about your generate command: >>>> >>>> 2) GENERATE: >>>> /opt/nutch_1_4/bin/nutch generate /opt/nutch_1_4/data/crawl/crawldb/ >>>> /opt/semantico/slot/nutch_1_4/data/crawl/segments/ -topN 10000 -adddays >>>> 26 >>>> >>>> Your<segments_dir> seems to point to /opt/semantico/slot/etc/etc/etc, >>>> when everything else being utilised within the crawl cycle points to >>>> an entirely different<segment_dirs> path which is >>>> /opt/nutch_1_4/data/crawl/segments/segment_date >>>> >>>> Was this intentional?
>>>> >>>> On Fri, Jan 6, 2012 at 4:08 PM, Dean Pullen<[EMAIL PROTECTED]> >>>> wrote: >>>>> >>>>> Lewis, >>>>> >>>>> Changing the merge to * returns a similar response: >>>>> >>>>> LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern >>>>> file:/opt/nutch_1_4/data/crawl/segments/*/parse_data matches 0 files >>>>> >>>>> And yes, your assumption was correct - it's a different segment >>>>> directory >>>>> each loop. >>>>> >>>>> Many thanks, >>>>> >>>>> Dean. >>>>> >>>>> On 06/01/2012 15:43, Lewis John Mcgibbney wrote: >>>>>> >>>>>> Hi Dean, >>>>>> >>>>>> Without discussing any of your configuration properties can you please >>>>>> try >>>>>> >>>>>> 6) MERGE SEGMENTS: >>>>>> /opt/nutch_1_4/bin/nutch mergesegs >>>>>> /opt/nutch_1_4/data/crawl/MERGEDsegments/ -dir >>>>>> /opt/nutch_1_4/data/crawl/segments/* -filter -normalize >>>>>> >>>>>> paying attention to the wildcard /* in -dir >>>>>> /opt/nutch_1_4/data/crawl/segments/* >>>>>> >>>>>> Also presumably, when you mention you repeat steps 2-5 another 4 >>>>>> times, you are not recursively generating, fetching, parsing and >>>>>> updating the WebDB with >>>>>> /opt/nutch_1_4/data/crawl/segments/20120106152527? This should change >>>>>> with every iteration of the g/f/p/updatedb cycle. >>>>>> >>>>>> Thanks >>>>>> >>>>>> On Fri, Jan 6, 2012 at 3:30 PM, Dean Pullen<[EMAIL PROTECTED]> >>>>>> wrote: >>>>>>> >>>>>>> No problem Lewis, I appreciate you looking into it. >>>>>>> >>>>>>> >>>>>>> Firstly I have a seed URL XML document here: >>>>>>> http://www.ukcigarforums.com/injectlist.xml >>>>>>> This basically has 'http://www.ukcigarforums.com/content.php' as a >>>>>>> URL >>>>>>> within it. >>>>>>> >>>>>>> Nutch's regex-urlfilter.txt contains this: >>>>>>> >>>>>>> # allow urls in ukcigarforums.com domain >>>>>>> +http://([a-z0-9-A-Z]*.)*ukcigarforums.com/ >>>>>>> # deny anything else >>>>>>> -. 
>>>>>>> >>>>>>> >>>>>>> Here's the procedure: >>>>>>> >>>>>>> >>>>>>> 1) INJECT: >>>>>>> /opt/nutch_1_4/bin/nutch inject /opt/nutch_1_4/data/crawl/crawldb/ >>>>>>> /opt/nutch_1_4/data/seed/ Lewis
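The hadoop.log check can be reduced to counting the parse-phase messages. A small sketch that relies only on the message prefixes quoted in this message; `parse_summary` is a hypothetical name, and the location of hadoop.log is an assumption that depends on the installation:

```shell
# Hypothetical helper (not part of Nutch): count parse-phase messages in a
# hadoop.log, using only the message prefixes quoted in this thread.
parse_summary() {
  log="$1"
  printf 'started: %s\n' "$(grep -cF 'ParseSegment: starting at' "$log")"
  printf 'parsed: %s\n'  "$(grep -cF 'Parsed (' "$log")"
  printf 'errors: %s\n'  "$(grep -cF 'Error parsing: ' "$log")"
}

# Example call (log path is an assumption):
#   parse_summary /opt/nutch_1_4/logs/hadoop.log
```

A "started" count of zero would suggest the parse step never ran, or that it logged to a different file.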
-
Re: parse data directory not found after merge
Dean Pullen 2012-01-06, 17:17
Only this:
2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at 2012-01-06 17:15:51
2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb: /opt/nutch_1_4/data/crawl/linkdb
2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true
2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment: file:/opt/nutch_1_4/data/crawl/segments/20120106171547
2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
    at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
    at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
    at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
    at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
    at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
    at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at 2012-01-06 17:15:52
2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: /opt/nutch_1_4/data/crawl/crawldb
2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: /opt/nutch_1_4/data/crawl/linkdb
2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: /opt/nutch_1_4/data/crawl/segments/20120106171547
2012-01-06 17:15:53,000 ERROR solr.SolrIndexer - org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/crawl_parse
Input path does not exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
Input path does not exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_text
2012-01-06 17:15:54,027 INFO crawl.CrawlDbReader - CrawlDb dump: starting
2012-01-06 17:15:54,028 INFO crawl.CrawlDbReader - CrawlDb db: /opt/nutch_1_4/data/crawl/crawldb/
2012-01-06 17:15:54,212 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
2012-01-06 17:15:55,603 INFO crawl.CrawlDbReader - CrawlDb dump: done
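For a longer log it may help to pull out just the paths Hadoop reports as missing. A sketch, assuming the "Input path does not exist:" wording shown in the log above and a grep that supports `-o` (GNU and BSD grep both do); `missing_inputs` is a hypothetical name:

```shell
# Hypothetical helper: list each distinct path that the log reports as a
# missing input, one per line, sorted.
missing_inputs() {
  grep -o 'Input path does not exist: [^ ]*' "$1" |
    sed 's/^Input path does not exist: //' |
    sort -u
}

# Example call (log path is an assumption):
#   missing_inputs /opt/nutch_1_4/logs/hadoop.log
```

For the log above this would list the crawl_parse, parse_data and parse_text paths of segment 20120106171547, i.e. exactly the parts that the merged segment lacks.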
-
Re: parse data directory not found after merge
Lewis John Mcgibbney 2012-01-06, 17:53
OK so now I think we're at the bottom of it. If you wish to create a
linkdb in >= Nutch 1.4 you need to specifically pass the linkdb
parameter. This was implemented as not everyone wishes to create a
linkdb.

Your invertlinks command should be passed as follows

bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir /path/to/segment/dirs

then

bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb path/to/linkdb -dir path/to/segment/dirs

If you are not passing the -linkdb path/to/linkdb explicitly you will
be thrown an exception as the linkdb is treated as a segment directory
now.

On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen <[EMAIL PROTECTED]> wrote:
> Only this:
>
> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at 2012-01-06 17:15:51
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb: /opt/nutch_1_4/data/crawl/linkdb
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true
> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true
> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment: file:/opt/nutch_1_4/data/crawl/segments/20120106171547
> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
> at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
> at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
> at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
> at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
>
> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at 2012-01-06 17:15:52
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: /opt/nutch_1_4/data/crawl/crawldb
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce: linkdb: /opt/nutch_1_4/data/crawl/linkdb
> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: /opt/nutch_1_4/data/crawl/segments/20120106171547
> 2012-01-06 17:15:53,000 ERROR solr.SolrIndexer - org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/crawl_parse
> Input path does not exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data
> Input path does not exist: file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_text
> 2012-01-06 17:15:54,027 INFO crawl.CrawlDbReader - CrawlDb dump: starting
> 2012-01-06 17:15:54,028 INFO crawl.CrawlDbReader - CrawlDb db: /opt/nutch_1_4/data/crawl/crawldb/
> 2012-01-06 17:15:54,212 WARN mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
> 2012-01-06 17:15:55,603 INFO crawl.CrawlDbReader - CrawlDb dump: done
>

--
Lewis
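Since the exception comes from a segment without parse_data, one option is to refuse to start invertlinks until every segment has that part. A sketch under stated assumptions: `safe_invertlinks` and the `NUTCH` variable are hypothetical, and the only Nutch invocation used is the `invertlinks <linkdb> -dir <segments>` form given in this message:

```shell
# NUTCH is an assumed variable naming the nutch launcher script.
NUTCH="${NUTCH:-bin/nutch}"

# Hypothetical wrapper (not part of Nutch): run invertlinks only when every
# segment under the given directory still has a parse_data part.
safe_invertlinks() {
  linkdb="$1"
  segments="$2"
  for seg in "$segments"/*/; do
    # $seg ends with a slash; an unmatched glob also fails this test.
    if [ ! -d "${seg}parse_data" ]; then
      echo "missing parse_data in $seg" >&2
      return 1
    fi
  done
  "$NUTCH" invertlinks "$linkdb" -dir "$segments"
}

# Example call (paths taken from the thread):
#   safe_invertlinks /opt/nutch_1_4/data/crawl/linkdb /opt/nutch_1_4/data/crawl/segments
```

This does not fix the merge; it only turns the stack trace into an early, readable failure.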
-
Re: parse data directory not found after merge
Dean Pullen 2012-01-07, 13:15
The -linkdb param isn't in the invertlinks docs
http://wiki.apache.org/nutch/bin/nutch_invertlinks (However it is in the solrindex docs) Adding it makes no difference to invertlinks. I think the problem is definitely with mergesegs, as opposed to invertlinks etc. Thanks again, Dean. On 06/01/2012 17:53, Lewis John Mcgibbney wrote: > OK so now I think were at the bottom of it. If you wish to create a > linkdb in>= Nutch 1.4 you need to specifically pass the linkdb > parameter. This was implemented as not everyone wishes to create a > linkdb. > > Your invertlinks command should be passed as follows > > bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir > /path/to/segment/dirs > then > bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb > path/to/linkdb -dir path/to/segment/dirs > > If you are not passing the -linkdb path/to/linkdb explicitly you will > be thrown an exception as the linkdb is treated as a segment directory > now. > > On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<[EMAIL PROTECTED]> wrote: >> Only this: >> >> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser >> for parsing the arguments. Applications should implement Tool for the same. >> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load >> native-hadoop library for your platform... 
using builtin-java classes where >> applicable >> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at 2012-01-06 >> 17:15:51 >> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb: >> /opt/nutch_1_4/data/crawl/linkdb >> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true >> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true >> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment: >> file:/opt/nutch_1_4/data/crawl/segments/20120106171547 >> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb: >> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: >> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data >> at >> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190) >> at >> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44) >> at >> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201) >> at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810) >> at >> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781) >> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) >> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) >> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175) >> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290) >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) >> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255) >> >> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at >> 2012-01-06 17:15:52 >> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce: >> crawldb: /opt/nutch_1_4/data/crawl/crawldb >> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce: >> linkdb: /opt/nutch_1_4/data/crawl/linkdb >> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduces: >> adding segment: 
/opt/nutch_1_4/data/crawl/segments/20120106171547 >> 2012-01-06 17:15:53,000 ERROR solr.SolrIndexer - >> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: >> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/crawl_parse >> Input path does not exist: >> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data >> Input path does not exist: >> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_text >> 2012-01-06 17:15:54,027 INFO crawl.CrawlDbReader - CrawlDb dump: starting >> 2012-01-06 17:15:54,028 INFO crawl.CrawlDbReader - CrawlDb db: >> /opt/nutch_1_4/data/crawl/crawldb/ >> 2012-01-06 17:15:54,212 WARN mapred.JobClient - Use GenericOptionsParser
-
Re: parse data directory not found after merge
Dean Pullen 2012-01-07, 13:18
Sorry, you did mean on solrindex - which I already do...
On 07/01/2012 13:15, Dean Pullen wrote: > The -linkdb param isn't in the invertlinks docs > http://wiki.apache.org/nutch/bin/nutch_invertlinks > > (However it is in the solrindex docs) > > Adding it makes no difference to invertlinks. > > I think the problem is definitely with mergesegs, as opposed to > invertlinks etc. > > Thanks again, > > Dean. > > On 06/01/2012 17:53, Lewis John Mcgibbney wrote: >> OK so now I think were at the bottom of it. If you wish to create a >> linkdb in>= Nutch 1.4 you need to specifically pass the linkdb >> parameter. This was implemented as not everyone wishes to create a >> linkdb. >> >> Your invertlinks command should be passed as follows >> >> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir >> /path/to/segment/dirs >> then >> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb >> path/to/linkdb -dir path/to/segment/dirs >> >> If you are not passing the -linkdb path/to/linkdb explicitly you will >> be thrown an exception as the linkdb is treated as a segment directory >> now. >> >> On Fri, Jan 6, 2012 at 5:17 PM, Dean >> Pullen<[EMAIL PROTECTED]> wrote: >>> Only this: >>> >>> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use >>> GenericOptionsParser >>> for parsing the arguments. Applications should implement Tool for >>> the same. >>> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load >>> native-hadoop library for your platform... 
using builtin-java >>> classes where >>> applicable >>> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at >>> 2012-01-06 >>> 17:15:51 >>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb: >>> /opt/nutch_1_4/data/crawl/linkdb >>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: >>> true >>> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true >>> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment: >>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547 >>> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb: >>> org.apache.hadoop.mapred.InvalidInputException: Input path does not >>> exist: >>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data >>> at >>> org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190) >>> >>> at >>> org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44) >>> >>> at >>> org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201) >>> >>> at >>> org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810) >>> at >>> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781) >>> >>> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) >>> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) >>> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175) >>> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290) >>> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) >>> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255) >>> >>> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: >>> starting at >>> 2012-01-06 17:15:52 >>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - >>> IndexerMapReduce: >>> crawldb: /opt/nutch_1_4/data/crawl/crawldb >>> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - >>> IndexerMapReduce: >>> linkdb: /opt/nutch_1_4/data/crawl/linkdb >>> 2012-01-06 17:15:52,782 INFO 
indexer.IndexerMapReduce - >>> IndexerMapReduces: >>> adding segment: /opt/nutch_1_4/data/crawl/segments/20120106171547 >>> 2012-01-06 17:15:53,000 ERROR solr.SolrIndexer - >>> org.apache.hadoop.mapred.InvalidInputException: Input path does not >>> exist: >>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/crawl_parse >>> Input path does not exist: >>> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data >>> Input path does not exist:
-
Re: parse data directory not found after merge
Lewis John Mcgibbney 2012-01-08, 14:08
Hi Dean, is this sorted?
On Saturday, January 7, 2012, Dean Pullen <[EMAIL PROTECTED]> wrote: > Sorry, you did mean on solrindex - which I already do... > > On 07/01/2012 13:15, Dean Pullen wrote: > > The -linkdb param isn't in the invertlinks docs http://wiki.apache.org/nutch/bin/nutch_invertlinks > > (However it is in the solrindex docs) > > Adding it makes no difference to invertlinks. > > I think the problem is definitely with mergesegs, as opposed to invertlinks etc. > > Thanks again, > > Dean. > > On 06/01/2012 17:53, Lewis John Mcgibbney wrote: > > OK so now I think were at the bottom of it. If you wish to create a > linkdb in>= Nutch 1.4 you need to specifically pass the linkdb > parameter. This was implemented as not everyone wishes to create a > linkdb. > > Your invertlinks command should be passed as follows > > bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir > /path/to/segment/dirs > then > bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb > path/to/linkdb -dir path/to/segment/dirs > > If you are not passing the -linkdb path/to/linkdb explicitly you will > be thrown an exception as the linkdb is treated as a segment directory > now. > > On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<[EMAIL PROTECTED]> wrote: > > Only this: > > 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser > for parsing the arguments. Applications should implement Tool for the same. > 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load > native-hadoop library for your platform... 
using builtin-java classes where > applicable > 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at 2012-01-06 > 17:15:51 > 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb: > /opt/nutch_1_4/data/crawl/linkdb > 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true > 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true > 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment: > file:/opt/nutch_1_4/data/crawl/segments/20120106171547 > 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb: > org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: > file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data > at > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190) > at > org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44) > at > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201) > at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810) > at > org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781) > at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) > at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) > at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175) > at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) > at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255) > > 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at > 2012-01-06 17:15:52 > 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce: > crawldb: /opt/nutch_1_4/data/crawl/crawldb > 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce: > linkdb: /opt/nutch_1_4/data/crawl/linkdb > -- *Lewis*
-
Re: parse data directory not found after merge
Dean Pullen 2012-01-08, 14:26
No Lewis, -linkdb was already being used for the solrindex command, so we
still have the same problem. Many thanks, Dean On 08/01/2012 14:08, Lewis John Mcgibbney wrote: > Hi dean is this sorted > > On Saturday, January 7, 2012, Dean Pullen<[EMAIL PROTECTED]> wrote: >> Sorry, you did mean on solrindex - which I already do... >> >> On 07/01/2012 13:15, Dean Pullen wrote: >> >> The -linkdb param isn't in the invertlinks docs > http://wiki.apache.org/nutch/bin/nutch_invertlinks >> (However it is in the solrindex docs) >> >> Adding it makes no difference to invertlinks. >> >> I think the problem is definitely with mergesegs, as opposed to > invertlinks etc. >> Thanks again, >> >> Dean. >> >> On 06/01/2012 17:53, Lewis John Mcgibbney wrote: >> >> OK so now I think were at the bottom of it. If you wish to create a >> linkdb in>= Nutch 1.4 you need to specifically pass the linkdb >> parameter. This was implemented as not everyone wishes to create a >> linkdb. >> >> Your invertlinks command should be passed as follows >> >> bin/nutch invertlinks path/you/wish/to/have/the/linkdb -dir >> /path/to/segment/dirs >> then >> bin/nutch solrindex http://solrUrl path/to/crawldb -linkdb >> path/to/linkdb -dir path/to/segment/dirs >> >> If you are not passing the -linkdb path/to/linkdb explicitly you will >> be thrown an exception as the linkdb is treated as a segment directory >> now. >> >> On Fri, Jan 6, 2012 at 5:17 PM, Dean Pullen<[EMAIL PROTECTED]> > wrote: >> Only this: >> >> 2012-01-06 17:15:47,972 WARN mapred.JobClient - Use GenericOptionsParser >> for parsing the arguments. Applications should implement Tool for the > same. >> 2012-01-06 17:15:48,692 WARN util.NativeCodeLoader - Unable to load >> native-hadoop library for your platform... 
using builtin-java classes > where >> applicable >> 2012-01-06 17:15:51,566 INFO crawl.LinkDb - LinkDb: starting at > 2012-01-06 >> 17:15:51 >> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: linkdb: >> /opt/nutch_1_4/data/crawl/linkdb >> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL normalize: true >> 2012-01-06 17:15:51,567 INFO crawl.LinkDb - LinkDb: URL filter: true >> 2012-01-06 17:15:51,576 INFO crawl.LinkDb - LinkDb: adding segment: >> file:/opt/nutch_1_4/data/crawl/segments/20120106171547 >> 2012-01-06 17:15:51,721 ERROR crawl.LinkDb - LinkDb: >> org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: >> file:/opt/nutch_1_4/data/crawl/segments/20120106171547/parse_data >> at >> > org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190) >> at >> > org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44) >> at >> > org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201) >> at > org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810) >> at >> org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781) >> at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730) >> at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249) >> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175) >> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:290) >> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) >> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255) >> >> 2012-01-06 17:15:52,714 INFO solr.SolrIndexer - SolrIndexer: starting at >> 2012-01-06 17:15:52 >> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce: >> crawldb: /opt/nutch_1_4/data/crawl/crawldb >> 2012-01-06 17:15:52,782 INFO indexer.IndexerMapReduce - IndexerMapReduce: >> linkdb: /opt/nutch_1_4/data/crawl/linkdb >>
-
Re: parse data directory not found after mergeDean Pullen 2012-01-08, 22:51
Where do we go from here? I can start looking/stepping through the
mergesegs code, but I'm reluctant due to its probable complexity.

Dean.
-
Re: parse data directory not found after mergeDean Pullen 2012-01-09, 13:31
Looking through the code, I'm seeing
org.apache.nutch.segment.SegmentMerger.reduce(..) only being called for
crawl_fetch and crawl_generate.

Prior to this, org.apache.nutch.segment.SegmentMerger.getRecordWriter(...)
gets called for all components, i.e. crawl_generate, crawl_fetch, crawl_parse,
parse_data and parse_text.

I'm not quite sure what's going on in between these two calls...

Dean.
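
A quick way to confirm which components survive a merge is to list the subdirectories of a segment before and after running mergesegs. The following is only a minimal sketch in Python, not part of Nutch: the function name missing_parts is made up for illustration, and the list of parts is the set of directories a Nutch 1.x segment normally holds, as named in this thread.

```python
import os
import sys

# Subdirectories a Nutch 1.x segment normally holds (names as used in this thread).
SEGMENT_PARTS = ("content", "crawl_generate", "crawl_fetch",
                 "crawl_parse", "parse_data", "parse_text")


def missing_parts(segment_dir, expected=SEGMENT_PARTS):
    """Return the expected subdirectories that are absent from segment_dir.

    Hypothetical helper for illustration; it only inspects the local
    filesystem and knows nothing about Hadoop or Nutch internals.
    """
    return [part for part in expected
            if not os.path.isdir(os.path.join(segment_dir, part))]


if __name__ == "__main__":
    # Usage: python check_segment.py /path/to/segments/20120106171547 ...
    for segment in sys.argv[1:]:
        gone = missing_parts(segment)
        print(segment, "OK" if not gone else "missing: " + ", ".join(gone))
```

Running it against each input segment and then against the merged segment should show whether parse_data is already absent before the merge or only afterwards.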
-
Re: parse data directory not found after mergeLewis John Mcgibbney 2012-01-09, 14:24
Hi Dean,

I'll have a look into this later today if I get a chance. Anyone else
experiencing problems using the mergesegs command or code?

Thanks for persisting with this, Dean; hopefully we will get to the
bottom of it soon.

Lewis
-
Re: parse data directory not found after mergeDean Pullen 2012-01-09, 14:28
No, thank you for taking the time to look at it! I'm still on the case
but am hoping you'll find the problem.

Dean.
-
Re: parse data directory not found after mergeDean Pullen 2012-01-09, 16:14
This is interesting, and something I've only just noticed in the logs:

2012-01-09 16:02:27,257 INFO org.apache.hadoop.mapred.TaskTracker: org.apache.hadoop.util.DiskChecker$DiskErrorException: Could not find taskTracker/jobcache/job_201201091558_0008/attempt_201201091558_0008_m_000006_0/output/file.out in any of the configured local directories

This is during the mergesegs job (and previous jobs)... but I'm not sure
what it means or if it's actually a problem.

mapred.local.dir is set to /opt/nutch_1_4/data/local - which exists.

It suggests that the map part of the hadoop job has not produced an output
file, or it's looking in the wrong place?

Dean
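
Since mapred.local.dir takes a comma-separated list of directories, one basic sanity check is to confirm that every entry exists and is writable by the user running the job. The sketch below is an assumption-laden illustration in Python: check_local_dirs is a hypothetical helper, and it does not reproduce what Hadoop's own DiskChecker does.

```python
import os


def check_local_dirs(value):
    """Map each directory in a comma-separated mapred.local.dir value to a status.

    Hypothetical helper for illustration only. Status is one of
    "missing", "not writable" or "ok".
    """
    report = {}
    for raw in value.split(","):
        path = raw.strip()
        if not path:
            continue  # ignore empty entries such as a trailing comma
        if not os.path.isdir(path):
            report[path] = "missing"
        elif not os.access(path, os.W_OK | os.X_OK):
            report[path] = "not writable"
        else:
            report[path] = "ok"
    return report


if __name__ == "__main__":
    # The value below is the setting quoted in the message above.
    for path, status in check_local_dirs("/opt/nutch_1_4/data/local").items():
        print(path, status)
```

If every entry reports ok, the directory configuration itself is probably not the cause of the missing file.out.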
-
Re: parse data directory not found after mergeLewis John Mcgibbney 2012-01-09, 16:41
How are you running Nutch: local or deploy mode? Which Hadoop version
are you using, 0.20.2? This appears to be an open issue with this version [1].

Also please have a look here [2] for a similar frustrating situation.

[1] https://issues.apache.org/jira/browse/HADOOP-6958
[2] http://lucene.472066.n3.nabble.com/org-apache-hadoop-util-DiskChecker-DiskErrorException-td1792797.html

-- Lewis
-
Re: parse data directory not found after mergeDean Pullen 2012-01-10, 11:33
I'm running in local mode (I believe) and using hadoop 0.20.2, as this
is the lib version shipped with nutch 1.4.

Dean.
-
Re: parse data directory not found after mergeDean Pullen 2012-01-10, 14:11
Upgraded to Hadoop 0.20.205.0 and the DiskErrorException disappears,
but the same result occurs, i.e. only the crawl_fetch and crawl_data
directories get merged; no parse_data directory exists.

Arghhhhhhhhh.

Dean.
-
Re: parse data directory not found after mergeDean Pullen 2012-01-10, 16:49
Pretty sure the same thing is happening with Hadoop 1.0...
-
Re: parse data directory not found after mergeMarkus Jelsma 2012-01-10, 16:59
I haven't followed the entire thread, but this is about the parse_data
directory disappearing after a merge? We have no issues with merges on small
crawls.

Do you still store content despite the parsing fetcher? Can you reproduce
this on a clean Nutch 1.4 build with an example crawl?

-- Markus Jelsma - CTO - Openindex
-
Re: parse data directory not found after mergeMarkus Jelsma 2012-01-10, 17:01
I might want to ask about your Hadoop temp dir since you seem to have disk
errors. Have you set it?

-- Markus Jelsma - CTO - Openindex
-
Re: parse data directory not found after mergeDean Pullen 2012-01-10, 17:05
The disk errors were solved by upgrading hadoop to 0.20.203 - they no
longer appear.

Dean.
-
Re: parse data directory not found after mergeDean Pullen 2012-01-10, 17:06
Yes, this is about the parse_data directory disappearing after a merge.

I've used a clean Nutch 1.4 multiple times; I've not yet used an example
crawl though.

Anything specific you recommend?

Dean.
-
Re: parse data directory not found after mergeMarkus Jelsma 2012-01-10, 17:25
Well, set up to crawl nutch.apache.org only, run some fetch cycles and see what
happens. If merging goes bad then I can reproduce and perhaps fix it. If not,
you may want to start debugging the thing step by step.

-- Markus Jelsma - CTO - Openindex
-
Re: parse data directory not found after merge Dean Pullen 2012-01-11, 11:09
A fresh Nutch 1.4/Hadoop 0.20.2 crawling nutch.apache.org does the same thing.

I've zipped up the nutch/hadoop dir with all config etc. Would either of you (Markus/Lewis) care to look at it?

Any help at this stage would be immensely appreciated.

Regards,

Dean.
-
Re: parse data directory not found after merge Dean Pullen 2012-01-11, 11:21
For further reference, below is the Hadoop job task log for the mergesegs command. You'll see that the merges for parse_data etc. are performed.

Completed Tasks

Task  Complete  Status  Start Time  Finish Time  Errors  Counters

task_201201111048_0031_m_000000  100.00%  file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_fetch/part-00000/data:0+259  11-Jan-2012 11:16:22  11-Jan-2012 11:16:25 (3sec)  9
task_201201111048_0031_m_000001  100.00%  file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_generate/part-00000:0+234  11-Jan-2012 11:16:22  11-Jan-2012 11:16:25 (3sec)  9
task_201201111048_0031_m_000002  100.00%  file:/opt/nutch_1_4/data/crawl/segments/20120111111422/content/part-00000/data:0+129  11-Jan-2012 11:16:25  11-Jan-2012 11:16:28 (3sec)  9
task_201201111048_0031_m_000003  100.00%  file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_parse/part-00000:0+129  11-Jan-2012 11:16:25  11-Jan-2012 11:16:28 (3sec)  9
task_201201111048_0031_m_000004  100.00%  file:/opt/nutch_1_4/data/crawl/segments/20120111111422/parse_data/part-00000/data:0+128  11-Jan-2012 11:16:28  11-Jan-2012 11:16:31 (3sec)  9
task_201201111048_0031_m_000005  100.00%  file:/opt/nutch_1_4/data/crawl/segments/20120111111422/parse_text/part-00000/data:0+128  11-Jan-2012 11:16:28  11-Jan-2012 11:16:31 (3sec)

And the parse_data job itself:

attempt_201201111048_0031_m_000004_0  /default-rack/dhcp-192-168-4-26.semantico.net  SUCCEEDED  100.00%  11-Jan-2012 11:16:28  11-Jan-2012 11:16:30 (1sec)
-
Re: parse data directory not found after merge Markus Jelsma 2012-01-11, 11:31
There is no zip. Anyway, I just did three fetch and parse cycles of nutch.apache.org with trunk. Trunk has no changes concerning segments etc. relative to 1.4.

I injected nutch.apache.org and then did two fetches of -topN 4 pages so I got 9 pages in three segments. I also configured the crawl to stay within the domain.

CrawlDb statistics start: crawl/crawldb/
Statistics for CrawlDb: crawl/crawldb/
TOTAL urls: 28
retry 0: 28
min score: 0.0010
avg score: 0.080714285
max score: 1.588
status 1 (db_unfetched): 19
status 2 (db_fetched): 9
CrawlDb statistics: done

crawl/segments/20120111122321/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:23 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:23 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:23 parse_text

crawl/segments/20120111122438/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:24 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:25 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:25 parse_text

crawl/segments/20120111122539/:
total 24
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 content
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 crawl_fetch
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:25 crawl_generate
drwxr-xr-x 2 markus markus 4096 2012-01-11 12:26 crawl_parse
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 parse_data
drwxr-xr-x 3 markus markus 4096 2012-01-11 12:26 parse_text

Let's merge the three segments into one:

$ bin/nutch mergesegs merged_segment -dir crawl/segments/
Merging 3 segments to merged_segment/20120111122826
SegmentMerger: adding file:/PATH/crawl/segments/20120111122539
SegmentMerger: adding file:/PATH/crawl/segments/20120111122438
SegmentMerger: adding file:/PATH/crawl/segments/20120111122321
SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text

It takes a while but finishes. Then I've got this:

$ ls merged_segment/20120111122826/
content crawl_fetch crawl_generate crawl_parse parse_data parse_text

I don't see the problem, although this should have reproduced it, as your steps are not really different from mine. Is it still the parse_data directory that is missing?

Why are you merging anyway? It is not mandatory at all.

On Wednesday 11 January 2012 12:09:57 Dean Pullen wrote:
> A fresh Nutch 1.4/Hadoop 0.20.2 crawling nutch.apache.org does the same
> thing.
>
> I've zipped up the nutch/hadoop dir with all config etc, would either of
> you (Markus/Lewis) care to look at it?
>
> Any help at this stage would be immensely appreciated.
>
> Regards,
>
> Dean.

--
Markus Jelsma - CTO - Openindex
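A minimal sketch (not from the thread) of how the directory check above could be automated instead of eyeballing `ls` output. It assumes only that a complete segment contains the six subdirectories shown in the listings; the names `SEGMENT_PARTS`, `missing_parts` and `check_segments` are hypothetical helpers, not part of Nutch.

```python
import os
import tempfile

# The six per-segment subdirectories that appear in the listings above.
SEGMENT_PARTS = ("content", "crawl_fetch", "crawl_generate",
                 "crawl_parse", "parse_data", "parse_text")


def missing_parts(segment_dir):
    """Return the expected subdirectories that are absent from one segment."""
    return [part for part in SEGMENT_PARTS
            if not os.path.isdir(os.path.join(segment_dir, part))]


def check_segments(segments_root):
    """Map each segment directory under segments_root to its missing parts."""
    report = {}
    for name in sorted(os.listdir(segments_root)):
        path = os.path.join(segments_root, name)
        if os.path.isdir(path):
            report[name] = missing_parts(path)
    return report


if __name__ == "__main__":
    # Demo on a throwaway directory: one complete segment and one
    # segment that lacks parse_data (the symptom described in this thread).
    root = tempfile.mkdtemp()
    for part in SEGMENT_PARTS:
        os.makedirs(os.path.join(root, "complete", part))
        if part != "parse_data":
            os.makedirs(os.path.join(root, "broken", part))
    for segment, missing in check_segments(root).items():
        print(segment, "OK" if not missing else "missing: " + ", ".join(missing))
```

Running `check_segments` on the segments directory before the merge and on the merged output afterwards would show at which step a part such as parse_data goes missing.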
-
Re: parse data directory not found after merge Markus Jelsma 2012-01-11, 11:33
I only ran the merge locally. I've never merged on a Hadoop cluster since we don't need it there.

On Wednesday 11 January 2012 12:21:20 Dean Pullen wrote:
> For further reference, below is the Hadoop job task log for the
> mergesegs command.
> You'll see that parse_data etc merges are performed.
>
> Completed Tasks
>
> Task  Complete  Status  Start Time  Finish Time  Errors  Counters
>
> task_201201111048_0031_m_000000  100.00%  file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_fetch/part-00000/data:0+259  11-Jan-2012 11:16:22  11-Jan-2012 11:16:25 (3sec)  9
> task_201201111048_0031_m_000001  100.00%  file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_generate/part-00000:0+234  11-Jan-2012 11:16:22  11-Jan-2012 11:16:25 (3sec)  9
> task_201201111048_0031_m_000002  100.00%  file:/opt/nutch_1_4/data/crawl/segments/20120111111422/content/part-00000/data:0+129  11-Jan-2012 11:16:25  11-Jan-2012 11:16:28 (3sec)  9
> task_201201111048_0031_m_000003  100.00%  file:/opt/nutch_1_4/data/crawl/segments/20120111111422/crawl_parse/part-00000:0+129  11-Jan-2012 11:16:25  11-Jan-2012 11:16:28 (3sec)  9
> task_201201111048_0031_m_000004  100.00%  file:/opt/nutch_1_4/data/crawl/segments/20120111111422/parse_data/part-00000/data:0+128  11-Jan-2012 11:16:28  11-Jan-2012 11:16:31 (3sec)  9
> task_201201111048_0031_m_000005  100.00%  file:/opt/nutch_1_4/data/crawl/segments/20120111111422/parse_text/part-00000/data:0+128  11-Jan-2012 11:16:28  11-Jan-2012 11:16:31 (3sec)
>
> And the parse_data job itself:
>
> attempt_201201111048_0031_m_000004_0  /default-rack/dhcp-192-168-4-26.semantico.net  SUCCEEDED  100.00%  11-Jan-2012 11:16:28  11-Jan-2012 11:16:30 (1sec)

--
Markus Jelsma - CTO - Openindex
-
Re: parse data directory not found after merge Dean Pullen 2012-01-11, 11:37
Markus,

I didn't include the zip; I was just saying I have it if you would like to see/use it! Shall I send it?

Can you zip up and send me what you've just done? Presumably it must be a config thing?!

I know mergesegs isn't needed, but as I believed there was a problem with it, I've been trying to track the problem down for the sake of it...

Dean. |