-
http.redirect.max - Rafael Pappert, 2011-11-16, 19:17
Hello List,
Is it possible to follow HTTP 301 redirects immediately? I set http.redirect.max to 3, but the page is still not indexed. readdb still shows 1 page as unfetched / db_redir_perm, and I can't find the redirection target in the crawldb. How does Nutch handle redirects?
Thanks in advance, Rafael.
-
Re: http.redirect.max - Rafael Pappert, 2011-11-17, 13:52
Hi,
After some investigation I found the problem. I had db.ignore.external.links set to true, which is why the fetcher isn't following the redirect from domain.com to www.domain.com.
Rafael.
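The effect Rafael describes can be illustrated with a small model. This is only a sketch under the assumption that the check compares the host names of the source and target URLs; the class and method names (ExternalRedirectCheck, hostOf, wouldFollow) are hypothetical and are not taken from the Nutch source.

```java
import java.net.URI;
import java.util.Locale;

/** Simplified model, not Nutch source: is a redirect target treated as "external"? */
final class ExternalRedirectCheck {

    /** Returns the lower-cased host of a URL, or "" if the URL has no host part. */
    static String hostOf(String url) {
        String host = URI.create(url).getHost();
        return host == null ? "" : host.toLowerCase(Locale.ROOT);
    }

    /**
     * Models the behaviour reported in the thread: when ignoreExternalLinks is
     * true, a redirect is followed only if source and target hosts are identical.
     * (Assumption: hosts are compared as whole strings, so "www." matters.)
     */
    static boolean wouldFollow(String from, String to, boolean ignoreExternalLinks) {
        if (!ignoreExternalLinks) {
            return true;
        }
        return hostOf(from).equals(hostOf(to));
    }

    public static void main(String[] args) {
        // The case from the thread: domain.com redirecting to www.domain.com.
        System.out.println(wouldFollow("http://domain.com/", "http://www.domain.com/", true));
        System.out.println(wouldFollow("http://domain.com/", "http://www.domain.com/", false));
    }
}
```

Under this model, seeding the crawl with the www form of the URL, or turning the setting off, would both avoid the blocked redirect; whether that matches a given Nutch version should be checked against its configuration documentation.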
-
Re: http.redirect.max - Ferdy Galema, 2011-11-17, 14:00
Thanks for updating the list.
-
Re: http.redirect.max - Lewis John Mcgibbney, 2011-11-17, 14:48
Hi Rafael,
The honest truth is that the wiki needs comprehensive documentation on the way Nutch handles redirects. This question has gone unanswered for some time. That's just the way it is, I suppose. I'll get my head around everything and try to get a wiki page up and running ASAP. In the meantime, can you advise whether there is anything over and above nutch-default.xml and the o.a.n.protocol package that you would like to see documented?
Thanks
-- Lewis
-
Re: http.redirect.max - alxsss@..., 2011-11-17, 22:05
Hi,
Is this issue resolved in https://issues.apache.org/jira/browse/NUTCH-1044 for the case when db.ignore.external.links is set to true?
Thanks. Alex.
-
Re: http.redirect.max - Rafael Pappert, 2011-11-18, 18:19
Hi Alex,
This is not really a bug. It's an "undocumented" feature. db.ignore.external.links prevents the fetcher from breaking out of your set of domains, and this is what you need if you don't want to crawl the whole web.
Best regards, Rafael.
-
Re: http.redirect.max - Rafael Pappert, 2011-11-18, 19:17
Hi Lewis,
> The honest truth is that there needs to be comprehensive documentation on
> the wiki for the way that Nutch handles redirects. This is a question that
> has gone fully unanswered for sometime.
That's true.
> In the meantime, can you advise if there is anything over and above the
> files in nutch-default.xml and o.a.n.protocol package which you would
> like to see documented?
I guess the poor documentation of Nutch/Hadoop is the biggest problem for beginners like me. I started with Nutch ~4-6 months ago (not full time, but several hours every week). At first I wrote some plugins (parser/indexer). This was a bit tricky because I had to learn directly from the source, since most of the tutorials and documents were outdated (<1.0) or simply wrong.
My crawler is now running and I need to scale it up. The current version runs in local mode, but that's not really fast. So I started to set up a Hadoop cluster (4 nodes) to run Nutch in deploy mode. This is where I am today, and my current questions are:
- I will buy some new hardware for the Hadoop cluster, but I'm not sure about the configuration. Is Nutch I/O or CPU heavy? http://www.cloudera.com/blog/2010/03/clouderas-support-team-shares-some-basic-hardware-recommendations/
- What is the difference between protocol-httpclient and protocol-http? Just SSL and authentication? What about performance?
- What is a good value for the following configuration parameters?
  - fetcher.threads.fetch
  - fetcher.threads.per.queue
  - mapred.tasktracker.map.tasks.maximum
  - mapred.tasktracker.reduce.tasks.maximum
  - mapred.map.tasks
  - mapred.reduce.tasks
  My current hardware is a 4-node cluster of dual-CPU machines (quad-core Xeon), 32GB RAM, 2*2TB SATA HDD. I know it's impossible to define the "always right" value, but a rule of thumb to use as a starting value would be a great thing and would save me a lot of trial-and-error investigation.
- What's the difference between fetcher.threads.fetch in the configuration and the -threads option of the crawl command?
- Is it possible to follow external links only on 301 redirects?
- What happens if a page is marked as db_redir_temp / db_redir_perm? Is it refetched after db.fetch.interval.default?
I found loads of tutorials, and all of them have the "same" content: only the very basics (how to do your first crawl). I guess comprehensive documentation would be a big step for the amazing Nutch/Hadoop project.
Thanks in advance, Rafael.
-
Re: http.redirect.max - Lewis John Mcgibbney, 2011-11-21, 16:49
Hi Rafael,
The page we are talking about will be linked from
http://wiki.apache.org/nutch/InternalDocumentation
and will be available at
http://wiki.apache.org/nutch/RedirectHandling
> I guess the poor documentation of nutch/hadoop is the biggest problem for
> beginners like me. [...] most of the tutorials/documents were outdated
> (<1.0) or simply wrong.
Please note we are trying to remove as much duplicate documentation regarding Nutch and Hadoop as possible. The Nutch wiki has been updated recently and this is ongoing work, so hopefully we can improve it further in the near future. As Nutch focuses purely on web crawling, the Hadoop material can be viewed directly in the Hadoop wiki. I've added a link to it on our wiki's Nutch Hadoop Tutorial.
> - I will buy some new hardware for the Hadoop cluster, but I'm not sure
> about the configuration. Is Nutch I/O or CPU heavy?
On a brand new hardware configuration I have not heard of anyone blowing gaskets or anything similar. If there is something wrong, it can usually be fixed by improving the configuration.
> - What is the difference between protocol-httpclient and protocol-http?
> Just SSL and authentication? What about performance?
protocol-httpclient is broken; please see the Jira issue that has been filed. You will also need to have a look at the code for this, as I am by no means an expert on the protocol-httpclient material.
> - What is a good value for the following configuration parameters?
> - fetcher.threads.fetch
> - fetcher.threads.per.queue
> - mapred.tasktracker.map.tasks.maximum
> - mapred.tasktracker.reduce.tasks.maximum
> - mapred.map.tasks
> - mapred.reduce.tasks
Impossible to say. This varies significantly with the crawl, the network, the nature of the crawl data and so on. You simply need to experiment and read as much existing documentation as possible. Sorry about this one.
> My current hardware is a 4-node cluster of dual-CPU machines (quad-core
> Xeon), 32GB RAM, 2*2TB SATA HDD. [...] a rule of thumb to use as a
> starting value would be a great thing
Unfortunately this is open source software you are using. Maybe Cloudera or some of the other commercially motivated experts can help you with this stuff. This is outwith my experience. Try here: http://wiki.apache.org/nutch/Support
> - What's the difference between fetcher.threads.fetch in the configuration
> and the -threads option of the crawl command?
This depends on how you wish to monitor/schedule your Nutch crawls. As you know, running individual commands gives you more flexibility and control over how Nutch does the work for you.
> - Is it possible to follow external links only on 301 redirects?
Not got a clue, but I will definitely include this type of material in the wiki page I created above. Maybe you can do a bit of investigation and help me out when I get round to writing this stuff up.
> - What happens if a page is marked as db_redir_temp / db_redir_perm?
> Is it refetched after db.fetch.interval.default?
Again, we will need to work together to get our heads around this. If you have a look at the code, then maybe we can get something written up in due course.
Sorry about the vague answers; however, it's a pretty large task to answer everything fully, considering there are ~5-10 questions all in. I'm sure there must be some material in the user@ archives, so please have a look there as well.
hth
Lewis
-
Re: http.redirect.max - xuyuanme, 2012-02-23, 04:08
Thanks for the information. But I found that the wiki page http://wiki.apache.org/nutch/RedirectHandling still doesn't have much content about Nutch redirects.
I found that even if I set http.redirect.max=2 and db.ignore.external.links=false, the crawler still can't get redirected pages. With further digging, I found that the plugin lib-http (in Nutch 1.1) contains the following code:
Java file: org.apache.nutch.protocol.http.api.HttpBase
    public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
      ......
      response = getResponse(u, datum, false); // make a request
      ......
    }

    protected abstract Response getResponse(URL url,
                                            CrawlDatum datum,
                                            boolean followRedirects)
      throws ProtocolException, IOException;
After I changed the call to getResponse(u, datum, true) and recompiled the plugin, the crawler fetched redirected pages as expected.
So is this a bug in the lib-http library, or have I misunderstood how redirects work?
Thanks!
-
Re: http.redirect.max - remi tassing, 2012-02-23, 04:49
Would you give Nutch-1.4 a try? Maybe this bug is already solved?
Remi
-
Re: http.redirect.max - xuyuanme, 2012-02-23, 05:07
Just checked the latest code in 1.4, but it's the same. See line 138 at the link below:
http://svn.apache.org/viewvc/nutch/branches/branch-1.4/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java?view=markup
The method just calls getResponse() and sets the followRedirects parameter to false.
So I guess the http.redirect.max setting has no effect on it?
-
Re: http.redirect.max - Lewis John Mcgibbney, 2012-02-23, 11:38
Hi,
Can you post your nutch-site.xml and I will give it a spin.
Thank you
Lewis
-
Re: http.redirect.max - xuyuanme, 2012-02-23, 12:11
Thanks! The config file can be downloaded here:
http://dl.dropbox.com/u/6614015/temp/config.zip
-
Re: http.redirect.max - Lewis John Mcgibbney, 2012-02-23, 15:13
OK, for starters we don't use crawl-urlfilter.txt anymore; this is deprecated as of Nutch 1.2 iirc.
Secondly, what are you trying to achieve here? Your url filter includes
+^http://www\.accessdata\.fda\.gov/scripts/cder/drugsatfda/index\.cfm\?fuseaction=Search\.SearchResults_Browse&DrugInitial=B$
+^http://www\.accessdata\.fda\.gov/scripts/cder/drugsatfda/index\.cfm\?fuseaction=Search\.Overview&DrugName=BACIGUENT$
Your seed URLs are also not exactly what I would expect for a seed list.
One last thing: your fetcher.threads.per.host is pretty aggressive. I wouldn't personally set it this high unless it was my own server I was communicating with.
So what exactly is it that you are having problems with?
Lewis
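To make the two filter lines above concrete, here is a minimal sketch of how a +/- regex URL filter can be evaluated. It assumes first-match-wins semantics with rejection when nothing matches, which is how I understand Nutch's regex URL filter to behave; the class SimpleRegexUrlFilter is hypothetical and written only with java.util.regex, not with any Nutch API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Simplified model of a +/- regex URL filter; not Nutch source. */
final class SimpleRegexUrlFilter {

    /** One filter line: '+' accepts, '-' rejects, the rest is the regex. */
    private static final class Rule {
        final boolean accept;
        final Pattern pattern;

        Rule(boolean accept, Pattern pattern) {
            this.accept = accept;
            this.pattern = pattern;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Parses filter lines; blank lines and lines starting with '#' are skipped. */
    SimpleRegexUrlFilter(List<String> lines) {
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            char sign = line.charAt(0);
            if (sign != '+' && sign != '-') {
                throw new IllegalArgumentException("Rule must start with + or -: " + line);
            }
            rules.add(new Rule(sign == '+', Pattern.compile(line.substring(1))));
        }
    }

    /** The first rule whose regex is found in the URL decides; no match means reject. */
    boolean accepts(String url) {
        for (Rule rule : rules) {
            if (rule.pattern.matcher(url).find()) {
                return rule.accept;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        String base = "http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm";
        SimpleRegexUrlFilter filter = new SimpleRegexUrlFilter(Arrays.asList(
            "+^http://www\\.accessdata\\.fda\\.gov/scripts/cder/drugsatfda/index\\.cfm"
                + "\\?fuseaction=Search\\.SearchResults_Browse&DrugInitial=B$"));
        System.out.println(filter.accepts(base + "?fuseaction=Search.SearchResults_Browse&DrugInitial=B"));
        System.out.println(filter.accepts(base + "?fuseaction=Search.SearchResults_Browse&DrugInitial=C"));
    }
}
```

Because both patterns are anchored with ^ and $, only those two exact URLs pass. One possible consequence, which the thread does not confirm, is that a redirect target whose URL differs from both patterns would be filtered out before it could be fetched.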
-
Re: http.redirect.max - Lewis John Mcgibbney, 2012-02-23, 15:18
Additionally, regarding your nutch-site.xml: we don't maintain any query-(plugins), and there is no parse-text plugin either.
-
Re: http.redirect.max - Lewis John Mcgibbney, 2012-02-23, 19:09
I've checked working with redirects and everything seems to work fine for me.
The site I checked, http://www.scotland.gov.uk, temp-redirects to http://home.scotland.gov.uk/home
Nutch gets this fine when I do some tweaking in nutch-site.xml: redirects property -1 (just to demonstrate; I would usually not set it so).
Lewis
-
Re: http.redirect.max - xuyuanme, 2012-02-24, 09:30
The config file is used for some proof-of-concept testing, so the content might be confusing; please ignore the incorrect parts.
Yes, from my end I can see that the crawl of http://www.scotland.gov.uk is redirected as expected.
However, the website I am trying to crawl is a bit more tricky. Here's what I want to do:
1. Set http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B as the seed page.
2. Try to crawl one of its links (http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.Overview&DrugName=BACIGUENT) as a test.
If you click the link, you'll find that the website uses redirects and cookies to control page navigation. So I used the protocol-httpclient plugin instead of protocol-http to handle the cookie.
However, the redirect does not happen as expected. The only way I can fetch the second link is to manually change the call "response = getResponse(u, datum, false)" to "response = getResponse(u, datum, true)" in org.apache.nutch.protocol.http.api.HttpBase.java and recompile the lib-http plugin.
So my issue is related to this specific site:
http://www.accessdata.fda.gov/scripts/cder/drugsatfda/index.cfm?fuseaction=Search.SearchResults_Browse&DrugInitial=B
-
Re: http.redirect.max - alxsss@..., 2012-03-01, 20:09
Hello,
I tried 1, 2 and -1 for http.redirect.max, but Nutch still postpones redirected URLs to later depths. What is the correct config setting to have Nutch crawl redirected URLs immediately? I need this because I have a restriction that the depth be at most 2.
Thanks. Alex.
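The values tried in this message can be compared with a small model of a fetch-time redirect loop. It rests on an assumption worth verifying against your version: as far as I recall, the description of http.redirect.max in nutch-default.xml says that a positive value is the number of redirects followed within the same fetch, while zero or a negative value makes the fetcher record the target for a later round. The names RedirectModel, Outcome and fetch are hypothetical, and the redirect table stands in for real HTTP responses.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified model of fetch-time redirect handling; not Nutch source. */
final class RedirectModel {

    enum Outcome { FETCHED, DEFERRED, GAVE_UP }

    static final class Result {
        final Outcome outcome;
        final String url;    // page fetched, or the redirect target left unfetched
        final int followed;  // redirects followed within this fetch

        Result(Outcome outcome, String url, int followed) {
            this.outcome = outcome;
            this.url = url;
            this.followed = followed;
        }
    }

    /**
     * redirects maps a URL to its redirect target; URLs absent from the map
     * are treated as ordinary pages. maxRedirects plays the role of
     * http.redirect.max under the assumption stated in the lead-in.
     */
    static Result fetch(String startUrl, Map<String, String> redirects, int maxRedirects) {
        String current = startUrl;
        int followed = 0;
        while (true) {
            String target = redirects.get(current);
            if (target == null) {
                return new Result(Outcome.FETCHED, current, followed);
            }
            if (maxRedirects <= 0) {
                // Assumed behaviour: target is only recorded for a later round.
                return new Result(Outcome.DEFERRED, target, followed);
            }
            if (followed >= maxRedirects) {
                // Limit reached; what a real crawler records here is not modeled.
                return new Result(Outcome.GAVE_UP, target, followed);
            }
            current = target;
            followed++;
        }
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new HashMap<>();
        redirects.put("http://a.example/", "http://b.example/");
        for (int max : new int[] {-1, 1, 2}) {
            Result r = fetch("http://a.example/", redirects, max);
            System.out.println(max + " -> " + r.outcome + " " + r.url);
        }
    }
}
```

In this model a single redirect is followed immediately for both 1 and 2, so the postponement reported for positive values would have to come from something the model leaves out, such as URL filtering or per-host delays.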
-
Re: http.redirect.max - Lewis John Mcgibbney, 2012-03-02, 11:21
Hi Alex,
Can you please have a look at NUTCH-1042? Might it be the case that your redirect has a crawl-delay which then falls into the boundary case we witness in that issue? You may want to change your log properties to debug for a while and run some small crawls on your problem URLs. Maybe try adding some LOG.debug statements to see what kind of conditions are being satisfied around the fetcher areas mentioned in NUTCH-1042.
hth
-- Lewis