Nutch, mail # user - Fetching just some urls outside domain


Re: Fetching just some urls outside domain
Adriana Farina 2011-12-01, 08:57
Hi!

Thank you for your answer. You're right, maybe an example would explain
better what I need to do.

I have to perform the following task. I have to explore a specific domain
(.gov.it), and I have an initial set of seeds, for example www.aaa.it,
www.bbb.gov.it, www.ccc.it. I configured Nutch so that it doesn't fetch
pages outside that domain. However, some resources I need to download
(documents) are stored on web sites that are not inside the domain I'm
interested in.
For example: www.aaa.it/subfolder/albi redirects to www.somesite.it (where
www.somesite.it is not inside "my" domain). Nutch will not fetch that page,
since I told it to behave that way, but I need to download the documents
stored on www.somesite.it. So I need Nutch to go outside the domain I
specified only when it sees the words "albi" or "albo" inside the URL, since
those words identify the documents I need. How can I do this?

I hope I've been clear. :)
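
To make the rule above concrete, here is a minimal, standalone sketch of the
accept/reject logic being asked for. It borrows the "+regex" / "-regex",
first-match-wins convention used by regex URL filter files, but it is NOT
Nutch's own code: the class name UrlRuleSketch, the accepts() helper, and the
three patterns are all hypothetical and would need adapting to a real setup.

```java
import java.util.List;
import java.util.regex.Pattern;

// Standalone illustration only; none of these names come from Nutch.
class UrlRuleSketch {
    // Hypothetical rules, evaluated top to bottom; the first match decides.
    static final List<String> RULES = List.of(
        // 1. Accept any URL that contains "albi" or "albo", whatever the host.
        "+^https?://.*alb[io]",
        // 2. Accept hosts under gov.it (assumed to be the "inside" domain).
        "+^https?://([a-z0-9-]+\\.)*gov\\.it(/|$)",
        // 3. Reject everything else.
        "-."
    );

    // Returns true if the first rule whose pattern is found in the URL is a "+" rule.
    static boolean accepts(String url) {
        for (String rule : RULES) {
            boolean allow = rule.charAt(0) == '+';
            Pattern pattern = Pattern.compile(rule.substring(1));
            if (pattern.matcher(url).find()) {
                return allow;
            }
        }
        return false; // no rule matched at all
    }

    public static void main(String[] args) {
        List<String> samples = List.of(
            "http://www.bbb.gov.it/page",
            "http://www.somesite.it/albi/doc.pdf",
            "http://www.somesite.it/news");
        for (String url : samples) {
            System.out.println(accepts(url) + "  " + url);
        }
    }
}
```

Rule order matters here: the "albi"/"albo" exception has to come before the
catch-all reject, otherwise it is never reached. Note that a bare substring
pattern such as alb[io] would also match unrelated URLs that merely contain
those letters, so a real rule would probably need tightening. This sketch also
assumes the domain restriction is done by URL patterns; if it was set up some
other way (for example a setting that ignores external links altogether), the
exception may need to be handled there as well.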

2011/11/30 Lewis John Mcgibbney <[EMAIL PROTECTED]>

> Hi Adriana,
>
> This should be achievable through fine-grained URL filters. It is kind of
> hard to elaborate on this without you providing some examples of the
> type of stuff you're trying to do!
>
> Lewis
>
> On Mon, Nov 28, 2011 at 11:14 AM, Adriana Farina <[EMAIL PROTECTED]> wrote:
>
> > Hello,
> >
> > I've been using Nutch 1.3 for just a month, so I'm not an expert. I
> > configured it so that it doesn't fetch pages outside a specific domain.
> > However, now I need to let it fetch pages outside the domain I chose,
> > but only for some URLs (not for all the URLs I have to crawl). How can I
> > do this? Do I have to write a new plugin?
> >
> > Thanks.
> >
>
>
>
> --
> *Lewis*
>