Re: crawling a website
alessio crisantemi 2012-04-02, 23:20
Dear Remi,

Thank you for your reply, but that doesn't work in my case: the first rule stops my crawl at the first section, and the second stops it right at the start point.

I see that each section of my website has, as its first page, a URL with 'index.php' (e.g. http://ww.mywebsite.com/beta/index.php). So, to crawl this whole section (http://ww.mywebsite.com/beta) without parsing the http://ww.mywebsite.com/beta/index.php page, which is the correct rule? Maybe the following, or something similar?

*"- ^http://ww.mywebsite.com/index-php$"*

Thanks,
alessio

On 2 April 2012 at 11:40, remi tassing <[EMAIL PROTECTED]> wrote:

> It depends on the structure of your site and you can modify
> "regex-urlfilter.txt" to reach your goal.
>
> From the examples you gave, you can do this:
> *"- ^http://ww.mywebsite.com/[^/]*$"*
> it will exclude http://ww.mywebsite.com/alpha,
> http://ww.mywebsite.com/beta, http://ww.mywebsite.com/gamma
>
> *"- ^http://ww.mywebsite.com/.*/$"*
> This will exclude any URL that ends with "/"
>
> I would suggest you get familiar with regular expressions (in case you
> aren't yet)
>
> Remi
>
> On Sun, Apr 1, 2012 at 6:27 PM, alessio crisantemi <
> [EMAIL PROTECTED]> wrote:
>
> > Dear All,
> > I would like to change my crawling operation, but I don't know how to
> > do it.
> >
> > To crawl my website I used the following command:
> >
> > $ bin/nutch crawl urls -solr http://localhost:8983/solr -threads 3 -depth 35 -topN 10
> >
> > to crawl with Nutch and index the results in Solr.
> >
> > However, I don't want to crawl the section pages of my website, only
> > the individual pages.
> >
> > For example, consider a site, www.mywebsite.com, composed of 3 sections:
> >
> > http://ww.mywebsite.com/alpha
> > http://ww.mywebsite.com/beta
> > http://ww.mywebsite.com/gamma
> >
> > In my results I want only the individual pages of my articles, and
> > not the lists of articles in these directories.
> >
> > So I would like, for example, to parse the files:
> >
> > http://ww.mywebsite.com/alpha/artcle1.html
> > http://ww.mywebsite.com/alpha/artcle3.html
> > ...
> >
> > and I don't want to parse the parent section:
> >
> > http://ww.mywebsite.com/alpha/
> >
> > How can I do this? Any suggestions?
> >
> > Sorry if this is not all clear.
> >
> > Thank you,
> > alessio
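
A quick way to check a candidate rule before putting it in "regex-urlfilter.txt" is to try it against a few sample URLs. The following is only a minimal sketch, not Nutch code. It assumes that rules are tried top to bottom, that the first pattern found in the URL decides, and that a URL matching no rule is rejected. The names CANDIDATE, VARIANT, RULES and accepts are made up for this illustration, and VARIANT is one possible pattern, not a confirmed answer to the question in this thread.

```python
import re

# Pattern proposed in the message above, copied as written.
CANDIDATE = r"^http://ww.mywebsite.com/index-php$"

# Hypothetical alternative: "index.php" directly under any one section.
VARIANT = r"^http://ww\.mywebsite\.com/[^/]+/index\.php$"

# Hypothetical rule list: (sign, pattern). "-" rejects, "+" accepts.
RULES = [
    ("-", VARIANT),                          # skip section front pages
    ("+", r"^http://ww\.mywebsite\.com/"),   # keep everything else on the site
]


def accepts(url, rules=RULES):
    """Return True if the first rule whose pattern is found in url is a '+' rule."""
    for sign, pattern in rules:
        if re.search(pattern, url):
            return sign == "+"
    # Assumption: a URL that matches no rule is rejected.
    return False


if __name__ == "__main__":
    samples = [
        "http://ww.mywebsite.com/beta/index.php",
        "http://ww.mywebsite.com/beta/artcle1.html",
        "http://ww.mywebsite.com/beta",
    ]
    for url in samples:
        print(url, accepts(url), bool(re.search(CANDIDATE, url)))
```

Under these assumptions, CANDIDATE never matches http://ww.mywebsite.com/beta/index.php, because it has no "/beta/" path segment and spells the file name "index-php". The sketch only models the accept/reject decision for a single URL. It does not show whether links found on a rejected page are still followed.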