-
Nutch and Sharepoint authentication remi tassing 2011-11-22, 05:27
Hello guys,

I read the wiki page "HttpAuthenticationSchemes" <http://wiki.apache.org/nutch/HttpAuthenticationSchemes>. I previously managed to make Nutch crawl local folders and websites (with SSL authentication). However, I'm now trying to crawl some sites in a corporate intranet environment running under MS SharePoint. I have been unsuccessful so far, and I believe it's because of authentication.

- Is Nutch able to crawl SharePoint? If yes, do you have a link/mail tutorial on this?

I recently became aware of the ManifoldCF initiative, and it seems to be a possible solution to my problem. But it's currently poorly documented (as far as the SharePoint connector is concerned).

- Do you have any recommendation in this regard?

Thanks in advance for your help, I'll really appreciate it!

--
Remi Tassing
-
Re: Nutch and Sharepoint authentication Lewis John Mcgibbney 2011-11-22, 10:42
Hi,

From what I have read in the Nutch user@ archives [1], it is possible to crawl an MS SharePoint server, which includes setting up NTLM authentication for your crawler. It is becoming a pretty major problem now that the protocol-httpclient plugin is unstable; there are Jira issues open for this.

Unfortunately, as ManifoldCF is in incubation status, it can only be expected that they might not have completed all documentation yet. However, I advise you to try there as well and ask them about the SharePoint configuration/documentation if it is not possible for you to work with Nutch protocol-httpclient.

hth

[1] http://www.mail-archive.com/search?q=sharepoint&l=user%40nutch.apache.org

--
*Lewis*
-
RE: Nutch and Sharepoint authentication Arkadi.Kosmynin@... 2011-11-24, 00:48
Hi,

I am crawling a SharePoint server with no major problems. I do have to use protocol-httpclient for this. Here is an extract from my httpclient-auth.xml file, if it helps:

<auth-configuration>
  <credentials username="myusername" password="mypassword">
    <default realm="myrealm" />
  </credentials>
</auth-configuration>

Regards,
Arkadi
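Arkadi's extract only takes effect if the protocol-httpclient plugin is actually enabled in place of the default HTTP plugin. A minimal sketch of that override in nutch-site.xml follows. Everything in the value other than protocol-httpclient is an assumption made for illustration: copy the plugin.includes value from your own nutch-default.xml and change only protocol-http to protocol-httpclient.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: sketch only. Every plugin named in the value
     except protocol-httpclient is illustrative; start from the
     plugin.includes value in your own nutch-default.xml. -->
<configuration>
  <property>
    <name>plugin.includes</name>
    <value>protocol-httpclient|urlfilter-regex|parse-(text|html)|index-(basic|anchor)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
```

Keeping the override in nutch-site.xml rather than editing nutch-default.xml makes it easier to see what was changed when comparing against a stock installation.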
-
Re: Nutch and Sharepoint authentication Lewis John Mcgibbney 2011-11-24, 11:29
Hi Arkadi,

Are you saying that this has been solved and that you are successfully able to crawl the server?

Thanks

--
*Lewis*
-
Re: Nutch and Sharepoint authentication Alexander Aristov 2011-11-24, 17:17
hi

One available solution is to set up WebDAV and crawl the resources as files, e.g. via file://. But it won't remove the need for authentication.

Alexander

--
Best Regards
Alexander Aristov
-
RE: Nutch and Sharepoint authentication Arkadi.Kosmynin@... 2011-11-25, 07:16
Hi Lewis,

I am saying that my configuration works with our SharePoint server. The authentication scheme is NTLM. Two versions of Nutch are working: a snapshot of Nutch 1.4 in my development environment and Nutch 1.2, which is being used in production.

I have to admit that it took some tweaking to get authentication working.

Regards,
Arkadi
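Arkadi mentions that the scheme in use is NTLM. As a hedged sketch, the credentials in httpclient-auth.xml can be narrowed to a single server and scheme with an authscope element, as described on the HttpAuthenticationSchemes wiki page. The host, port, and domain below are placeholders, not values from this thread, and the assumption that the realm attribute carries the Windows domain for NTLM should be checked against the wiki for your Nutch version.

```xml
<?xml version="1.0"?>
<!-- conf/httpclient-auth.xml: illustrative values only -->
<auth-configuration>
  <credentials username="myusername" password="mypassword">
    <!-- hypothetical SharePoint host; realm is assumed to be the NTLM domain -->
    <authscope host="sharepoint.example.com" port="443"
               realm="MYDOMAIN" scheme="NTLM" />
  </credentials>
</auth-configuration>
```

Scoping the credentials this way is meant to keep them from being offered to other hosts that happen to issue an authentication challenge during the crawl.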
-
Re: Nutch and Sharepoint authentication remi tassing 2011-11-25, 07:26
Hey guys,

Thanks very much for the feedback. It's been a while since I did my test... I gave up back then. I remember that authentication worked but Nutch couldn't get any file crawled (I assumed it was due to authentication, but maybe not), but I don't remember the details.

I'll retry very soon and let you know the result.

Thanks again guys!

Remi

--
Remi Tassing
-
Re: Nutch and Sharepoint authentication Lewis John Mcgibbney 2011-11-25, 13:34
Yes, thanks for the feedback, Arkadi.

I know this is possibly outside the scope of your work, but it would be really great if you could add some of your experience to http://wiki.apache.org/nutch/HttpAuthenticationSchemes

This is an area that has been unclear to some users for some time. If you are happy with your working implementation, your thoughts would be greatly appreciated by the rest of the community.

Thank you, and glad to hear that things are working.

--
*Lewis*
-
Re: Nutch and Sharepoint authentication Susam Pal 2011-11-25, 17:16
Hi Remi,

I am sorry I was not able to reply to you earlier. I have been pretty busy this week.

I have never tried crawling SharePoint with Nutch, so I am not sure whether it works. My work on authentication assumes that a website is properly configured to challenge the client or crawler with NTLM authentication.

In case it doesn't work, I would suggest that you follow the "Need Help?" section at http://wiki.apache.org/nutch/HttpAuthenticationSchemes#Need_Help.3F accurately and post the relevant information to [EMAIL PROTECTED] (with me in CC if possible, since I am not actively monitoring the mailing list), and we as a community might be able to help you out.

Once again, I am sorry I couldn't help you sooner, and good luck with this experiment.

Regards,
Susam Pal
-
Re: Nutch and Sharepoint authentication remi tassing 2011-11-27, 11:11
Hello guys,

Following your advice, I tried tweaking the config files over the weekend and ran into a problem I couldn't solve (I'm running nutch-1.2; I couldn't get nutch-1.3 to run under Cygwin).

A sample of my log file can be found below. I have two concerns:

- How do I know if the NTLM login worked?
- How do I debug the HTTP 500 error code? I suspect it might be due to cookies...

Thanks in advance for your help

...
2011-11-27 18:54:02,298 DEBUG auth.AuthChallengeProcessor - Supported authentication schemes in the order of preference: [ntlm, digest, basic]
2011-11-27 18:54:02,300 INFO auth.AuthChallengeProcessor - ntlm authentication scheme selected
DEBUG auth.AuthChallengeProcessor - Using authentication scheme: ntlm
DEBUG auth.AuthChallengeProcessor - Authorization challenge processed
INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
INFO fetcher.Fetcher - -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0
INFO fetcher.Fetcher - fetch of https://URL failed with: Http code=500, url=https://URL
INFO fetcher.Fetcher - -finishing thread FetcherThread, activeThreads=0
INFO fetcher.Fetcher - -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0
INFO fetcher.Fetcher - -activeThreads=0
...

--
Remi Tassing
-
Re: Nutch and Sharepoint authentication Susam Pal 2011-11-27, 13:17
From the logs, Nutch did attempt NTLM authentication, but the server returned HTTP 500. The log says nothing about whether the NTLM authentication itself succeeded or failed; it only indicates that the fetch failed, which suggests that an internal error happened in SharePoint.

Now, this can happen for a variety of reasons. I don't know much about how to troubleshoot this on the SharePoint side. Perhaps you should look into the IIS logs, Event Viewer, etc. to figure out why SharePoint rejected the request. Most likely it is some kind of configuration problem in either SharePoint or IIS that makes the NTLM authentication cause trouble.

Even though it is outside the scope of Nutch, from my very limited experience working with SharePoint, I can say that it might be a good idea to get Microsoft technical support involved while trying to troubleshoot this.

Regards,
Susam Pal
http://susam.in/
-
RE: Nutch and Sharepoint authentication Arkadi.Kosmynin@... 2011-11-30, 05:47
Hi Lewis,

Thank you for the nice invitation. I don't consider myself an expert in the area, but I have added a small section on troubleshooting, which will hopefully help people pinpoint and fix their problems more quickly. Please feel free to add more or correct my text.

Regards,
Arkadi
-
Re: Nutch and Sharepoint authenticationremi tassing 2011-11-30, 06:16
Thanks for the tips, Susam!
Unfortunately, I don't have much support on the server side. A friend tipped me off that the server may be purposely blocking crawlers. So how can I make Nutch impersonate a browser? I tried the tip in the following link, but it didn't work: http://osdir.com/ml/nutch-user.lucene.apache.org/2009-06/msg00022.html

Remi
-
Re: Nutch and Sharepoint authenticationremi tassing 2011-12-01, 01:21
Hello Alexander,
I'm considering trying your suggestion. I have one question, though: after WebDAV does the crawling and saves the files locally, does it keep the link intact?

Remi
-
Nutch and Sharepoint authenticationremi tassing 2011-12-17, 06:42
Hi,
According to the link below, IIS gives an HTTP 500 response when the server expects NTLMv2 but receives an older version of the protocol. I would guess that the HttpClient in Nutch doesn't support NTLMv2.

I would also guess that it worked for Arkadi because his server doesn't use NTLMv2.

Again according to the reference, Sun JRE 5 or higher fully supports NTLMv2. I wonder why it wasn't used for Nutch.

Reference: http://oaklandsoftware.com/papers/ntlm.html
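Before blaming the NTLM version, it can help to confirm which authentication schemes the server actually advertises. The following is a minimal sketch, not part of Nutch: the class name `AuthSchemeProbe` and the helper `offeredSchemes` are hypothetical, and the URL is whatever you pass on the command line. It sends one request without credentials and lists the scheme token of each `WWW-Authenticate` header. It assumes each challenge arrives in its own header (several challenges combined in one comma-separated header are not split), and it cannot tell NTLMv1 from NTLMv2, since that is only decided later in the handshake.

```java
import java.net.HttpURLConnection;
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class AuthSchemeProbe {

    /** Returns the lower-cased scheme token of each WWW-Authenticate value, without duplicates. */
    static List<String> offeredSchemes(List<String> headerValues) {
        List<String> schemes = new ArrayList<>();
        if (headerValues == null) {
            return schemes;
        }
        for (String value : headerValues) {
            if (value == null || value.trim().isEmpty()) {
                continue;
            }
            // The scheme is the first whitespace-delimited token, e.g. "NTLM" or "Basic".
            String scheme = value.trim().split("\\s+", 2)[0].toLowerCase(Locale.ROOT);
            if (!schemes.contains(scheme)) {
                schemes.add(scheme);
            }
        }
        return schemes;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java AuthSchemeProbe <url>");
            return;
        }
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(args[0]).toURL().openConnection();
        conn.setRequestMethod("GET");
        int status = conn.getResponseCode();

        // Collect WWW-Authenticate values; header names are compared case-insensitively.
        List<String> challenges = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : conn.getHeaderFields().entrySet()) {
            if (e.getKey() != null && e.getKey().equalsIgnoreCase("WWW-Authenticate")) {
                challenges.addAll(e.getValue());
            }
        }
        System.out.println("HTTP status: " + status);
        System.out.println("Offered schemes: " + offeredSchemes(challenges));
        conn.disconnect();
    }
}
```

Note that on Windows the JRE may try to authenticate transparently with the logged-in user's credentials, in which case the initial challenge might not be visible from this probe.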
-
Re: Nutch and Sharepoint authenticationremi tassing 2011-12-17, 13:07
How can I make Nutch use HttpURLConnection instead of HttpClient in a painless way? It's been 8 years since I wrote any Java code :-/
-
Re: Nutch and Sharepoint authenticationremi tassing 2011-12-20, 08:49
Hi,
I tried the code snippet from the link below and it worked! I just need to figure out how to integrate that into Nutch. Any help?

[1] http://developer-resource.blogspot.com/2008/06/ntlm-authentication-from-java.html

Remi Tassing
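For readers who want to reproduce this kind of standalone test, here is a minimal sketch of fetching a page through `HttpURLConnection` with a default `java.net.Authenticator`. It is not necessarily identical to the linked snippet, and it is not Nutch code: the class name `NtlmFetchSketch`, its helper methods, and the command-line arguments are all hypothetical. It relies on the JRE's built-in handling of the NTLM challenge, with the domain given as a `DOMAIN\user` prefix on the user name. Whether the JRE ends up speaking NTLMv2 depends on the JRE version and platform, so treat a successful status as evidence for your setup only.

```java
import java.io.InputStream;
import java.net.Authenticator;
import java.net.HttpURLConnection;
import java.net.PasswordAuthentication;
import java.net.URI;

class NtlmFetchSketch {

    /** Builds credentials in the DOMAIN\\user form that the JRE's NTLM support accepts. */
    static PasswordAuthentication credentials(String domain, String user, char[] password) {
        return new PasswordAuthentication(domain + "\\" + user, password);
    }

    /** Returns an Authenticator that answers every challenge with the given credentials. */
    static Authenticator authenticatorFor(String domain, String user, char[] password) {
        final PasswordAuthentication creds = credentials(domain, user, password);
        return new Authenticator() {
            @Override
            protected PasswordAuthentication getPasswordAuthentication() {
                return creds;
            }
        };
    }

    /** Fetches the URL and returns the final HTTP status code after any auth handshake. */
    static int fetchStatus(String url) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("GET");
        int status = conn.getResponseCode();
        // Drain the body so the underlying connection can be reused or closed cleanly.
        InputStream body = status < 400 ? conn.getInputStream() : conn.getErrorStream();
        if (body != null) {
            try (InputStream in = body) {
                byte[] buf = new byte[8192];
                while (in.read(buf) != -1) {
                    // discard
                }
            }
        }
        return status;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 4) {
            System.err.println("usage: java NtlmFetchSketch <url> <domain> <user> <password>");
            return;
        }
        // The default Authenticator is JVM-wide; fine for a test, questionable inside a crawler.
        Authenticator.setDefault(authenticatorFor(args[1], args[2], args[3].toCharArray()));
        System.out.println("HTTP status: " + fetchStatus(args[0]));
    }
}
```

Because `Authenticator.setDefault` is global to the JVM, carrying this approach into Nutch would need more thought than this sketch shows, for example per-host credentials and interaction with the existing protocol plugins.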
-
Re: Nutch and Sharepoint authenticationremi tassing 2012-01-18, 14:47
I logged a JIRA issue for this [1]. I wasn't sure whether it was a bug or an improvement, but HttpURLConnection does work for NTLMv2, so the remaining problem is integrating it into Nutch.

[1] https://issues.apache.org/jira/browse/NUTCH-1254