Re: Using Nutch for Web Site Mirroring
Tolga 2012-05-25, 12:26
Do you have to use Nutch for this purpose? I believe you can use wget -m
http://www.example.com and get everything in a much more structured way.
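Incidentally, since the question below also asks about rewriting absolute links and getting a local directory tree, wget can do both directly. A sketch (flags from memory, so double-check man wget):

```shell
# -m (--mirror): recursive download with timestamping, suited to mirroring;
#     wget recreates the site's directory tree on disk as it goes
# -k (--convert-links): rewrite links in the downloaded pages to point at the
#     local copies, so absolute hrefs become relative where possible
# -p (--page-requisites): also fetch images, CSS, etc. needed to render pages
# -E (--adjust-extension): save text/html responses with an .html extension
wget -m -k -p -E http://www.example.com/
```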
On 25 May 2012 11:07, vlad.paunescu <[EMAIL PROTECTED]> wrote:
> I am currently trying to use Nutch as a web site mirroring tool. To be more
> explicit, I only need to download the pages, not to index them (I do not
> intend to use it as a search engine). I couldn't figure out a simpler way to
> accomplish my task, so what I do now is:
> - crawl the site, using the url;
> - merge the segments;
> - read segments (dump) and make it show the content.
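(For reference, the three steps above map roughly onto Nutch 1.x commands like the following; the urls/ seed directory, crawl/ output directory, depth, and topN values are assumptions, so adjust them to your setup:)

```shell
# 1. Crawl, starting from the seed list in urls/ (one URL per line).
bin/nutch crawl urls -dir crawl -depth 3 -topN 1000
# 2. Merge the per-round segments into a single segment under crawl/merged.
bin/nutch mergesegs crawl/merged -dir crawl/segments
# 3. Dump the merged segment, keeping the fetched page content.
bin/nutch readseg -dump crawl/merged/* dump -nogenerate -noparse -noparsedata -noparsetext
```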
> However, I didn't manage to configure Nutch to rewrite absolute links as
> local ones (e.g. href="http://www.example.com/dir/pag.html" transformed
> into href="dir/pag.html"). I found URLNormalizer, but I don't understand
> what it does: whether it only rewrites the URL of the page being crawled,
> or whether it scans the content of the crawled page and modifies its href
> and src attributes.
> I would also like to know whether Nutch can be configured to create a
> directory tree of all the pages it crawled. Right now I only have the
> dumped content, which has to be parsed by a Java program I am currently
> writing in order to create a directory tree matching the site's structure.
> Any help will be much appreciated! Thank you!
> Sent from the Nutch - User mailing list archive at Nabble.com.