|
Adriana Farina
2011-11-28, 11:14
Lewis John Mcgibbney
2011-11-30, 16:24
Adriana Farina
2011-12-01, 08:57
Lewis John Mcgibbney
2011-12-01, 20:17
Adriana Farina
2011-12-02, 09:20
Arkadi.Kosmynin@...
2011-12-01, 21:43
Adriana Farina
2011-12-02, 09:19
alxsss@...
2011-12-01, 22:48
Lewis John Mcgibbney
2011-12-01, 22:59
alxsss@...
2011-12-01, 23:15
|
-
Fetching just some urls outside domain
Adriana Farina 2011-11-28, 11:14
Hello,
I have been using Nutch 1.3 for just a month, so I'm not an expert. I configured it so that it doesn't fetch pages outside a specific domain. However, I now need to let it fetch pages outside the domain I chose, but only for some URLs (not for all the URLs I have to crawl). How can I do this? Do I have to write a new plugin?
Thanks.
-
Re: Fetching just some urls outside domain
Lewis John Mcgibbney 2011-11-30, 16:24
Hi Adriana,
This should be achievable through fine-grained URL filters. It is kind of hard to elaborate on this without you providing some examples of the type of stuff you're trying to do!
Lewis
-
Re: Fetching just some urls outside domain
Adriana Farina 2011-12-01, 08:57
Hi!
Thank you for your answer. You're right, maybe an example would explain better what I need to do.
I have to perform the following task. I have to explore a specific domain (.gov.it) and I have an initial set of seeds, for example www.aaa.it, www.bbb.gov.it, www.ccc.it. I configured Nutch so that it doesn't fetch pages outside that domain. However, some resources I need to download (documents) are stored on web sites that are not inside the domain I'm interested in.
For example: www.aaa.it/subfolder/albi redirects to www.somesite.it (where www.somesite.it is not inside "my" domain). Nutch will not fetch that page since I told it to behave that way, but I need to download documents stored on www.somesite.it. So I need Nutch to go outside the domain I specified only when it sees the words "albi" or "albo" inside the URL, since those words identify the documents I need. How can I do this?
I hope I've been clear. :)
-
Re: Fetching just some urls outside domain
Lewis John Mcgibbney 2011-12-01, 20:17
If you also provide the settings from nutch-site.xml which restrict your Nutchbot from crawling outside some specified domain, that would be helpful.
At this stage I think that if your restrictions completely deny Nutch from following outlinks to other domains, then the use of regex filters is pointless. This is not what you wish to be configuring. Instead, you want to allow Nutch to crawl outlinks to other domains but limit which domains it may crawl.
I think it should be possible to add the filters in your regex file like

    # accept the following but block everything else
    +^http://([a-z0-9]*\.)*somesite\.it/
    +^http://([a-z0-9]*\.)*aaa\.it/
    +^http://([a-z0-9]*\.)*bbb\.it/
    etc

I don't think you will need to explicitly deny everything else. However, you'll only find out by doing a number of small test crawls to check whether your regex filters are working.
HTH
Lewis
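To make the rule ordering concrete, here is a minimal Python sketch of how a first-match-wins filter list behaves on the example from this thread. It is not Nutch code: the `RULES` text only mirrors the `+`/`-` prefix style of the regex filter file, the `parse_rules` and `accepts` helpers are hypothetical, Python's `re` stands in for Java's regex engine, and the assumption that a URL matching no rule is dropped should be confirmed with a small test crawl.

```python
import re

# Hypothetical rules in the +/- prefix style of the regex filter file.
# Assumption: rules are tried top to bottom and the first match decides.
RULES = r"""
# always stay inside the gov.it domain
+^http://([a-z0-9-]+\.)*gov\.it/
# a seed site from the example
+^http://([a-z0-9-]+\.)*aaa\.it/
# leave the domain only for URLs that mention albi or albo
+^http://.*alb[io]
"""

def parse_rules(text):
    """Turn rule text into (accept, compiled_pattern) pairs, skipping comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rules.append((line[0] == "+", re.compile(line[1:])))
    return rules

def accepts(url, rules):
    """Return True if the first matching rule is a '+' rule, False otherwise."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False  # assumption: no matching rule means the URL is dropped

rules = parse_rules(RULES)
for url in ("http://www.bbb.gov.it/index.html",
            "http://www.somesite.it/albo/doc.pdf",
            "http://www.somesite.it/news.html"):
    print(url, accepts(url, rules))
```

A rule like the last one only looks at the URL being filtered. A redirect target such as www.somesite.it whose own URL does not contain "albi" or "albo" would still be rejected, so that host may need its own `+` line as suggested above.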
-
Re: Fetching just some urls outside domain
Adriana Farina 2011-12-02, 09:20
I set up nutch-site.xml in the following way:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>

<!-- Put site-specific property overrides in this file. -->

<configuration>

  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>

  <property>
    <name>file.crawl.parent</name>
    <value>false</value>
    <description>The crawler is not restricted to the directories that you
    specified in the Urls file but it is jumping into the parent directories
    as well. For your own crawlings you can change this behavior (set to
    false) the way that only directories beneath the directories that you
    specify get crawled.</description>
  </property>

  <property>
    <name>http.robots.agents</name>
    <value>My Nutch Spider, *</value>
    <description>The agent strings we'll look for in robots.txt files,
    comma-separated, in decreasing order of precedence. You should put the
    value of http.agent.name as the first agent name, and keep the default *
    at the end of the list. E.g.: BlurflDev,Blurfl,*
    </description>
  </property>

  <property>
    <name>http.max.delays</name>
    <value>10</value>
    <description>The number of times a thread will delay when trying to
    fetch a page. Each time it finds that a host is busy, it will wait
    fetcher.server.delay. After http.max.delays attempts, it will give up on
    the page for now.</description>
  </property>

  <property>
    <name>http.accept.language</name>
    <value>it, en;q=0.7,*;q=0.3</value>
    <description>Value of the "Accept-Language" request header field. This
    allows selecting non-English language as default one to retrieve. It is
    a useful setting for search engines built for certain national group.
    </description>
  </property>

  <property>
    <name>http.verbose</name>
    <value>true</value>
    <description>If true, HTTP will log more verbosely.</description>
  </property>

  <property>
    <name>http.redirect.max</name>
    <value>3</value>
    <description>The maximum number of redirects the fetcher will follow
    when trying to fetch a page. If set to negative or 0, fetcher won't
    immediately follow redirected URLs, instead it will record them for
    later fetching.
    </description>
  </property>

  <property>
    <name>db.max.outlinks.per.page</name>
    <value>1000</value>
    <description>The maximum number of outlinks that we'll process for a
    page. If this value is nonnegative (>=0), at most
    db.max.outlinks.per.page outlinks will be processed for a page;
    otherwise, all outlinks will be processed.
    </description>
  </property>

  <property>
    <name>fetcher.server.delay</name>
    <value>2.0</value>
    <description>The number of seconds the fetcher will delay between
    successive requests to the same server.</description>
  </property>

  <property>
    <name>fetcher.threads.fetch</name>
    <value>8</value>
    <description>The number of FetcherThreads the fetcher should use. This
    also determines the maximum number of requests that are made at once
    (each FetcherThread handles one connection).</description>
  </property>

  <property>
    <name>fetcher.threads.per.host</name>
    <value>5</value>
    <description>This number is the maximum number of threads that should be
    allowed to access a host at one time.</description>
  </property>

  <property>
    <name>fetcher.verbose</name>
    <value>true</value>
    <description>If true, fetcher will log more verbosely.</description>
  </property>

  <property>
    <name>fetcher.parse</name>
    <value>true</value>
    <description>If true, fetcher will parse content. Default is false,
    which means that a separate parsing step is required after fetching is
    finished.</description>
  </property>

  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|pdf|doc)|index-(basic|anchor)|urlmeta|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
    <description>Regular expression naming plugin directory names to
    include. Any plugin not matching this expression is excluded. In any
    case you need at least include the nutch-extensionpoints plugin. By
    default Nutch includes crawling just HTML and plain text via HTTP, and
    basic indexing and search plugins. In order to use HTTPS please enable
    protocol-httpclient, but be aware of possible intermittent problems with
    the underlying commons-httpclient library.
    </description>
  </property>

  <property>
    <name>http.content.limit</name>
    <value>-1</value>
    <description>The length limit for downloaded content using the http://
    protocol, in bytes. If this value is nonnegative (>=0), content longer
    than it will be truncated; otherwise, no truncation at all. Do not
    confuse this setting with the file.content.limit setting.
    </description>
  </property>

  <property>
    <name>urlmeta.tags</name>
    <value>idadmin</value>
    <description>To be used in conjunction with features introduced in
    NUTCH-655, which allows for custom metatags to be injected alongside
    your crawl URLs. Specifying those custom tags here will allow for their
    propagation into a pages outlinks, as well as allow for them to be
    included as part of an index. Values should be comma-delimited.
    ("tag1,tag2,tag3") Do not pad the tags with white-space at their
    boundaries, if you are using anything earlier than Hadoop-0.21.
    </description>
  </property>

</configuration>
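When comparing a file like this against the question of what restricts the crawl, it can help to list just the properties of interest. The following is a minimal Python sketch, not part of Nutch: `SITE_XML` is a trimmed, hypothetical copy with two of the properties from the post, `load_properties` is an illustrative helper, and `db.ignore.external.links` is included as a property worth checking on the assumption that it is the setting that ignores outlinks to external hosts in this Nutch version.

```python
import xml.etree.ElementTree as ET

# A trimmed, hypothetical nutch-site.xml with two properties from the post.
SITE_XML = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.redirect.max</name>
    <value>3</value>
  </property>
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-regex|parse-(html|tika|pdf|doc)</value>
  </property>
</configuration>
"""

def load_properties(xml_text):
    """Return a dict mapping each <name> to its <value> text."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props

props = load_properties(SITE_XML)
# Properties that may influence whether the crawl can leave the seed hosts.
for key in ("db.ignore.external.links", "http.redirect.max", "plugin.includes"):
    print(key, "=", props.get(key, "<not set in this file>"))
```

If no property in the file restricts the crawl by host or domain, and `urlfilter-regex` appears in `plugin.includes`, the restriction most likely lives in the regex filter file rather than in nutch-site.xml.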
-
RE: Fetching just some urls outside domain
Arkadi.Kosmynin@... 2011-12-01, 21:43
Hi Adriana,
You can try Arch for this:
http://www.atnf.csiro.au/computing/software/arch
You can configure it to crawl your web sites plus sets of miscellaneous URLs, called "bookmarks" in Arch. Arch is a free extension of Nutch. Right now, only Arch based on Nutch 1.2 is available for download. We are about to release Arch based on Nutch 1.4.
Regards,
Arkadi
-
Re: Fetching just some urls outside domain
Adriana Farina 2011-12-02, 09:19
I'll download Nutch 1.2 and try Arch, it seems interesting. Thank you.
I think I need to run some tests to try all the solutions you suggested.
-
Re: Fetching just some urls outside domain
alxsss@... 2011-12-01, 22:48
Hello,
I would be interested to know how one can put a filter on outlinks. I mean, if I have a regex, in which file should I put it? For example, I want Nutch to ignore outlinks ending with .info.
Thanks.
Alex.
-
Re: Fetching just some urls outside domain
Lewis John Mcgibbney 2011-12-01, 22:59
Nutch comes packed with quite a few URL filters out of the box. They just need some tuning. Have a look in NUTCH_HOME/conf. Also have a look at the corresponding plugins.
Realistically you should really start a new thread for new questions :0)
I think you're looking for the urlfilter-domain plugin.
Lewis
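For the .info case, one option with the regex filter is a single deny rule. The sketch below is a Python check of such a pattern, under stated assumptions: "ending with .info" is read as "host name in the .info top-level domain", the pattern and the `is_info_outlink` helper are hypothetical, and Python's `re` is used in place of Java's regex engine, so the rule should still be verified with a small test crawl.

```python
import re

# Hypothetical deny rule, as it might be written in the regex filter file:
#   -^https?://([a-z0-9-]+\.)+info(:[0-9]+)?(/|$)
INFO_HOST = re.compile(r"^https?://([a-z0-9-]+\.)+info(:[0-9]+)?(/|$)")

def is_info_outlink(url):
    """True when the URL's host name ends in the .info top-level domain."""
    return INFO_HOST.search(url.lower()) is not None

for url in ("http://www.example.info/page.html",
            "http://example.info",
            "http://www.example.com/about.info.html",
            "http://myinfo.com/"):
    print(url, is_info_outlink(url))
```

Anchoring the pattern to the host part keeps paths such as `/about.info.html` on other domains from being dropped by accident. In a first-match-wins rule list, a deny line like this would need to appear before any broader `+` rule that would otherwise accept the URL.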
-
Re: Fetching just some urls outside domain
alxsss@... 2011-12-01, 23:15
If I understand you correctly, you are saying that even though my question is related to the current thread, I must still open a new one?
|