Re: Unable to fetch contents from this particular URL
Sebastian Nagel 2012-06-19, 21:20

Hi Sandeep,

>>> However, there is just relative url like this
>>> /research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx

You don't have to care about relative URLs. Nutch converts them
to absolute URLs, and URL filters operate exclusively on absolute URLs.

>>> all the pages which starts with
>>> http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/

You can use either
- urlfilter-prefix
  by adding this prefix to conf/prefix-urlfilter.txt
  (don't forget to enable the plugin via property plugin.includes)
- urlfilter-regex
  by replacing the last two lines of conf/regex-urlfilter.txt
    # accept anything else
    +.
  by
    +^http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/
  see http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website

On 06/19/2012 10:51 PM, Sandeep C R wrote:
> Hi Sebastian,
>
> You are right. After setting it to -1 it worked. I am able to get all the
> text. Thank you.
>
> It will be really helpful if you/others can guide me with relative url's
> and regular expression problem which I have mentioned in main post.
>
> Regards,
> Sandeep
>
> On Tue, Jun 19, 2012 at 4:28 PM, Sebastian Nagel <[EMAIL PROTECTED]> wrote:
>
>> Hi Sandeep,
>>
>>> It just fetches text "Analytical Cytometry".
>>
>> It looks like the property http.content.limit
>> is still on its default (64kB) which causes the
>> document to be truncated right after "Analytical
>> Cytometry".
>> Unfortunately, truncated content is not logged
>> to make it easier to locate the reason, see
>> http://wiki.apache.org/nutch/DebugTool
>> https://issues.apache.org/jira/browse/NUTCH-1389
>>
>> You should increase the value in your nutch-site.xml
>> and use parsechecker for a quick trial:
>> % nutch parsechecker -dumpText
>> http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx
>>
>> Sebastian
>>
>> On 06/19/2012 09:37 PM, Sandeep C R wrote:
>>> Hello,
>>>
>>> Some how Nutch is unable to fetch contents from the below website. It just
>>> fetches text "Analytical Cytometry". All other text is skipped. I am not
>>> sure why this is happening. Nutch is able to crawl and fetch all other
>>> websites. I am using Nutch 1.4 version.
>>>
>>> http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx
>>>
>>> And also, all the links within this page are relative url's.
>>>
>>> Ex: I want to fetch this url which is within the above url.
>>> http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx
>>>
>>> However, there is just relative url like this
>>> /research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx<http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx>
>>>
>>> Will nutch crawl/fetch websites with relatives url's by default i.e with no
>>> additional configurations? Also I am not sure how to set regular expression
>>> so these pages will be fetched. I want to fetch all the pages which starts
>>> with http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/ .
>>> Thank you.
>>>
>>> Regards,
>>> Sandeep
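
To illustrate the point about relative URLs and filtering, here is a rough Python sketch (not Nutch code). It resolves the relative link from the thread against the page it was found on, then checks the resulting absolute URL against a prefix-style regex rule. The names BASE, RELATIVE, RULES and accepts are made up for this example. The rule handling assumes what the comments in regex-urlfilter.txt describe: the first matching pattern decides, and a URL that matches no pattern is ignored. The dots are escaped here for strictness; the unescaped pattern from the mail accepts the same URLs.

```python
import re
from urllib.parse import urljoin

# Page that was fetched and a relative link found on it (both from the thread).
BASE = "http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx"
RELATIVE = "/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx"

# Hypothetical stand-in for the rules in conf/regex-urlfilter.txt:
# (sign, pattern) pairs, checked in order.
RULES = [
    ("+", re.compile(r"^http://cancer\.osu\.edu/research/cancerresearch/sharedresources/ac/")),
]


def accepts(url, rules=RULES):
    """Return True if the first matching rule is '+'; reject if nothing matches."""
    for sign, pattern in rules:
        if pattern.search(url):
            return sign == "+"
    return False


# The relative link is resolved to an absolute URL before any filter sees it.
absolute = urljoin(BASE, RELATIVE)
print(absolute)
print(accepts(absolute))
print(accepts("http://cancer.osu.edu/patients/"))
```

The filter only ever sees the absolute form, so a pattern anchored at ^http:// is enough; there is no need to write a separate rule for the relative path.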