Re: Getting seed url
Sebastian Nagel 2012-06-11, 21:45
Hi Sandeep,
tracking the seed(s) for a document could be done by a scoring filter.
The seed URL must be passed:

0 into CrawlDatum's meta by injectedScore()
  (alternatively, use additional fields in the seed file:
   <url> <tab> seed=<url>
   see Injector Javadoc)
1 in passScoreBeforeParsing(): from CrawlDatum to Content
2 in passScoreAfterParsing(): from Content to ParseData
3 in distributeScoreToOutlinks(): from source ParseData
  to all target/outlink CrawlDatum objects
4 in updateDbScore(): resolve inlinks from multiple seeds

Point 4 shows a little problem: a page may be reachable from multiple seeds.
The web is a graph, not a forest of trees each with one seed as root!

Finally: amazon.com is definitely linked from apache.org but it is not a
"project" site. Wouldn't a mapping <domain name> -> <meta data> be more
reliable (though notoriously incomplete)?

Best,
Sebastian

On 06/11/2012 08:09 PM, Sandeep C R wrote:
> Hello,
>
> I am trying to find a way in which I can get the seed url of current url
> being parsed. I have many URLs in seed.txt. I am trying to add additional
> metadata for each URL crawled. The metadata depends on the seed URL of the
> current URL. This metadata will be later picked by the indexer. I have
> written a custom plugin for this purpose. However I am unable to get the
> seed url of the current url being parsed.
>
> Ex: This is my seed.txt
>
> http://apache.org
> http://amazon.com
> http://w3.org
>
> For all URLs crawled for every seed URL, I want to add metadata. The value
> of metadata will depend on seed URL. I have a properties file which will
> map seed url to metadata value. If seed url is http://apache.org then my
> metadata will be something like "project". If it is http://amazon.com then
> it will be "estore". I have written a plugin which will add metadata. This
> plugin extends HtmlParserFilter. However I am not able to find a way to get
> the seed url of current url. If http://nutch.apache.org is being parsed
> currently, then how do we know the seed url (http://apache.org) of this url?
> Is there any API which I could use in my plugin? Or is there any better way
> to achieve this?
>
> Regards,
> Sandeep
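
To make the problem in point 4 of the reply concrete, here is a minimal sketch of seed propagation over a link graph. It is a plain-Java model only and does not use Nutch's ScoringFilter API. The class name SeedPropagation, the method propagate() and the outlinks map are hypothetical names chosen for illustration. Every seed is tagged with itself, and each outlink target inherits the union of the seed sets of all pages that link to it.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Plain-Java model of seed propagation; NOT the Nutch ScoringFilter API.
// All names here are hypothetical and exist only for this illustration.
class SeedPropagation {

    /**
     * Returns, for every URL reachable from the seeds, the set of seeds
     * it can be reached from.
     *
     * @param seeds    the injected seed URLs
     * @param outlinks assumed link graph: source URL -> target URLs
     */
    static Map<String, Set<String>> propagate(List<String> seeds,
                                              Map<String, List<String>> outlinks) {
        Map<String, Set<String>> seedsOf = new HashMap<>();
        Deque<String> queue = new ArrayDeque<>();

        // Corresponds to step 0: at injection, each seed is tagged with itself.
        for (String seed : seeds) {
            seedsOf.computeIfAbsent(seed, k -> new TreeSet<>()).add(seed);
            queue.add(seed);
        }

        // Steps 1-3 collapsed into one hop: the seed set travels from a
        // source page to each of its outlink targets.
        while (!queue.isEmpty()) {
            String source = queue.poll();
            Set<String> sourceSeeds = seedsOf.get(source);
            for (String target : outlinks.getOrDefault(source, List.of())) {
                Set<String> targetSeeds =
                        seedsOf.computeIfAbsent(target, k -> new TreeSet<>());
                // Step 4: merge seeds arriving over several inlinks.
                // Revisit the target only if it learned about a new seed.
                if (targetSeeds.addAll(sourceSeeds)) {
                    queue.add(target);
                }
            }
        }
        return seedsOf;
    }
}
```

With the three seeds from the question and an assumed link from http://apache.org to http://amazon.com, the page http://amazon.com ends up tagged with both seeds, so a lookup of "the" seed is ambiguous. That ambiguity is what makes a <domain name> -> <meta data> mapping attractive in this case.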