blunderboy 2012-05-22, 10:40
Re: Get Parent of URLs fetched by nutch
Julien Nioche 2012-05-22, 11:03
Implement your own scoring filter and add the URL of the source to the
targets' metadata. See https://issues.apache.org/jira/browse/NUTCH-1331 for
something (vaguely) related

On 22 May 2012 11:40, blunderboy <[EMAIL PROTECTED]> wrote:

> As I run Apache Nutch 1.4 crawler, I want to store some additional
> information. I want to store the parent of every URL.
>
> For example, I want to crawl a page a.html that has 2 anchor links to
> b.html and c.html So when I crawl a.html, I should get something like
> this :-
>
> a.html null
> b.html a.html
> c.html a.html
>
> I want to store something like this. I have read how nutch works and have
> run nutch in eclipse too. I also read fetcher.java and logged where it
> fetched content. But I got no success in knowing where Nutch fetches the
> child URLs of a given page. I think this step takes place after parsing
> step.
>
> --
> View this message in context:
> http://lucene.472066.n3.nabble.com/Get-Parent-of-URLs-fetched-by-nutch-tp3985369.html
> Sent from the Nutch - User mailing list archive at Nabble.com.
>

--
Open Source Solutions for Text Engineering

http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
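To make the suggestion above a little more concrete, here is a minimal sketch of the idea in plain Java. It deliberately does not use the Nutch API: `ParentTagger`, `tagOutlinks` and the `parent.url` key are all hypothetical names chosen for illustration, and the maps stand in for whatever per-URL metadata the crawler keeps. In Nutch 1.x the place to do this would, as far as I understand it, be the scoring filter hook that distributes score from a parsed page to its outlinks, but please check that against the version you run.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of "add the source URL to the targets' metadata".
// Nothing here is a Nutch class; all names are hypothetical.
class ParentTagger {
    // Hypothetical metadata key under which the parent URL is stored.
    static final String PARENT_KEY = "parent.url";

    // Builds one metadata map per outlink, each carrying the source URL.
    static Map<String, Map<String, String>> tagOutlinks(String sourceUrl,
                                                        List<String> outlinks) {
        Map<String, Map<String, String>> targets = new LinkedHashMap<>();
        for (String outlink : outlinks) {
            Map<String, String> metadata = new LinkedHashMap<>();
            metadata.put(PARENT_KEY, sourceUrl);
            targets.put(outlink, metadata);
        }
        return targets;
    }

    public static void main(String[] args) {
        // The seed page has no parent, as in the example from the question.
        System.out.println("a.html null");
        Map<String, Map<String, String>> targets =
                tagOutlinks("a.html", List.of("b.html", "c.html"));
        // Print each child URL followed by the parent stored in its metadata.
        for (Map.Entry<String, Map<String, String>> e : targets.entrySet()) {
            System.out.println(e.getKey() + " " + e.getValue().get(PARENT_KEY));
        }
    }
}
```

Running `main` prints the seed with a `null` parent and then each outlink next to `a.html`, which is the listing asked for in the original question. In a real filter the value would be written into each target's metadata during the update step rather than printed.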
blunderboy 2012-05-22, 11:55