Nutch, mail # user - Unable to fetch contents from this particular URL


Re: Unable to fetch contents from this particular URL
Sebastian Nagel 2012-06-19, 21:20
Hi Sandeep,

>>> However, there is just a relative URL like this
>>> /research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx
You don't have to care about relative URLs. They are converted by Nutch
to absolute URLs and URL filters operate exclusively on absolute URLs.
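
To illustrate the conversion, here is a minimal sketch in Python (not Nutch's own Java code) that resolves the relative link from the original post against the URL of the page it was found on. The helper name resolve is hypothetical; urljoin is the standard-library function doing the work.

```python
from urllib.parse import urljoin

# Page on which the relative link was found (URL from the original post).
base = "http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx"
# Relative link as it appears in that page's HTML.
relative = "/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx"


def resolve(base_url: str, link: str) -> str:
    """Hypothetical helper: turn a link found on base_url into an absolute URL."""
    return urljoin(base_url, link)


absolute = resolve(base, relative)
print(absolute)
```

The resulting absolute URL is the form a URL filter would be matched against, which is why the filter rules only need to describe absolute URLs.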

>>> all the pages which start with
>>> http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/
You can use either
- urlfilter-prefix, by adding this prefix to conf/prefix-urlfilter.txt
  (don't forget to enable the plugin via the property plugin.includes), or
- urlfilter-regex, by replacing the last two lines of conf/regex-urlfilter.txt
    # accept anything else
    +.
  with
    +^http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/
  See http://wiki.apache.org/nutch/NutchTutorial#A3._Crawl_your_first_website
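
As a rough illustration of how such a rule list behaves, here is a simplified Python model (an assumption-laden sketch, not the plugin's actual implementation). It assumes that each rule line starts with + (accept) or - (reject) followed by a regular expression, that the first matching rule decides, and that a URL matching no rule is rejected. The names RULES, parse_rules and accepts are hypothetical.

```python
import re

# Simplified stand-in for the contents of a regex-urlfilter.txt file.
RULES = """
# accept only the Analytical Cytometry section
+^http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/
"""


def parse_rules(text):
    """Parse '+regex' / '-regex' lines, skipping blank lines and comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """First matching rule wins; a URL that matches no rule is rejected."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


rules = parse_rules(RULES)
print(accepts("http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx", rules))
print(accepts("http://cancer.osu.edu/about/", rules))
```

With the catch-all rule removed, anything outside the given prefix falls through the list and is dropped, which is the effect you want here.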

On 06/19/2012 10:51 PM, Sandeep C R wrote:
> Hi Sebastian,
>
> You are right. After setting it to -1 it worked. I am able to get all the
> text. Thank you.
>
> It would be really helpful if you or others could guide me on the relative
> URL and regular expression problems that I mentioned in the main post.
>
> Regards,
> Sandeep
>
> On Tue, Jun 19, 2012 at 4:28 PM, Sebastian Nagel <[EMAIL PROTECTED]> wrote:
>
>> Hi Sandeep,
>>
>>> It just fetches text "Analytical Cytometry".
>> It looks like the property http.content.limit
>> is still at its default (64 kB), which causes the
>> document to be truncated right after "Analytical
>> Cytometry".
>> Unfortunately, truncation is not logged, which would
>> make it easier to find the cause; see
>>  http://wiki.apache.org/nutch/DebugTool
>>  https://issues.apache.org/jira/browse/NUTCH-1389
>>
>> You should increase the value in your nutch-site.xml
>> and use parsechecker for a quick trial:
>> % nutch parsechecker -dumpText http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx
>>
>> Sebastian
>>
>> On 06/19/2012 09:37 PM, Sandeep C R wrote:
>>> Hello,
>>>
>>> Somehow Nutch is unable to fetch the contents of the website below. It just
>>> fetches the text "Analytical Cytometry". All other text is skipped. I am not
>>> sure why this is happening. Nutch is able to crawl and fetch all other
>>> websites. I am using Nutch 1.4.
>>>
>>> http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/pages/index.aspx
>>>
>>> Also, all the links within this page are relative URLs.
>>>
>>> Ex: I want to fetch this URL, which is linked from the page above:
>>>
>>> http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx
>>>
>>> However, there is just a relative URL like this
>>> /research/cancerresearch/sharedresources/ac/expertise/pages/index.aspx
>>>
>>>
>>> Will Nutch crawl/fetch websites with relative URLs by default, i.e. with no
>>> additional configuration? Also, I am not sure how to set the regular expression
>>> so that these pages will be fetched. I want to fetch all the pages which start
>>> with http://cancer.osu.edu/research/cancerresearch/sharedresources/ac/ .
>>> Thank you.
>>>
>>> Regards,
>>> Sandeep
>>>
>>
>>
>