Re: parse existing segments
Ferdy Galema 2011-11-03, 13:11
What are you trying to achieve? The crawl command does not invoke any
plugins by itself; it merely chains several Nutch jobs together. The Nutch
jobs themselves, or more specifically the mappers and reducers, make use of
the plugin repository.

On 11/03/2011 01:47 PM, Ashish M wrote:
> What method in crawl.java would trigger the invocation of plugins?
>
> Sent from my iPhone. Please ignore the typos.
>
> On Nov 3, 2011, at 5:30 AM, Markus Jelsma<[EMAIL PROTECTED]> wrote:
>
>> remove *parse* in the segment and you're good to go.
>>
>> On Thursday 03 November 2011 13:16:40 Ashish Mehrotra wrote:
>>> Hi All,
>>>
>>> I am trying to parse already crawled segments using the method --
>>> ParseSegment.parse(seg);
>>>
>>> seg is the Path to the existing segment.
>>> This internally fires a new job and the error thrown is --
>>>
>>> Exception in thread "main" java.io.IOException: Segment already parsed!
>>>     at org.apache.nutch.parse.ParseOutputFormat.checkOutputSpecs(ParseOutputFormat.java:80)
>>>     at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:772)
>>>     at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
>>>     at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
>>>     at org.apache.nutch.parse.ParseSegment.parse(ParseSegment.java:156)
>>>
>>> What I am trying to do here is parse the already fetched data to test my
>>> HTML Parse Filter. Looks like the above method of ParseSegment gets called
>>> in the normal workflow of crawl, fetch, parse ...
>>>
>>> What I have done is modified the org.apache.nutch.crawl.Crawl.run() to
>>> call only ParseSegment and commented the injector, generator and fetcher
>>> parts. I am calling ParseSegment.parse(segment) in the run() method. I am
>>> passing the segment name in the command line.
>>>
>>> Should I be calling some other method to test my HTML parser filter plugin
>>> without crawling again?
>>>
>>> Any pointers should be helpful.
>>>
>>> Thanks,
>>> Ashish
>>
>> --
>> Markus Jelsma - CTO - Openindex
>> http://www.linkedin.com/in/markus17
>> 050-8536620 / 06-50258350
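
For anyone landing on this thread: Markus's suggestion to remove *parse* in the segment could be sketched roughly as below. This is only an illustration, assuming the segment sits on the local filesystem (Nutch in local mode); a segment on HDFS would need the Hadoop filesystem tools instead. The class and method names (ClearParseOutput, clearParseDirs, deleteRecursively) are made up for this sketch and are not part of Nutch. As far as I know, a parsed Nutch 1.x segment usually holds crawl_parse, parse_data and parse_text next to the fetch output, and the glob below matches exactly those while leaving content, crawl_fetch and crawl_generate alone.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Nutch: removes old parse output from a
// segment on the LOCAL filesystem so the segment can be parsed again.
public class ClearParseOutput {

    // Deletes every direct child of the segment whose name matches *parse*
    // and returns how many entries were removed.
    public static int clearParseDirs(Path segment) throws IOException {
        List<Path> matches = new ArrayList<>();
        // Collect first, delete afterwards, so the listing is not modified
        // while it is still being iterated.
        try (DirectoryStream<Path> children =
                 Files.newDirectoryStream(segment, "*parse*")) {
            for (Path child : children) {
                matches.add(child);
            }
        }
        for (Path match : matches) {
            deleteRecursively(match);
        }
        return matches.size();
    }

    // Deletes a file or a whole directory tree.
    public static void deleteRecursively(Path root) throws IOException {
        List<Path> paths;
        try (Stream<Path> walk = Files.walk(root)) {
            // Reverse order puts children before their parent directories.
            paths = walk.sorted(Comparator.reverseOrder())
                        .collect(Collectors.toList());
        }
        for (Path p : paths) {
            Files.delete(p);
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ClearParseOutput <segment-dir>");
            return;
        }
        int removed = clearParseDirs(Paths.get(args[0]));
        System.out.println("removed " + removed + " parse entries");
    }
}
```

After this, calling ParseSegment.parse(seg) on the same segment should no longer trip the "Segment already parsed!" check, and the fetched content is left untouched, so nothing has to be crawled again. Try it on a copy of the segment first, since the deletion cannot be undone.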