|
|
-
Writing a Book on Nutch
Dennis Kubes 2010-05-17, 01:27
Hi Everyone,
It has been a long time coming but I have finally started to write a book on Nutch. It will be self published and should be available in PDF / paperback form in less than a month hopefully.

A while back we discussed a Nutch training seminar on the list. I am not ready to do a full on seminar yet but I will be putting up some training and tutorial videos in the next few weeks. I will update the list as those become available.

I already have a general outline but it would help me to know the following:

1) What types of things you would want explained in a book / videos on Nutch?
2) What are the biggest problems you face using Nutch?
3) Anything special you would like answered or explained?

Thanks in advance for any responses.

Dennis
-
Re: Writing a Book on Nutch
Alex Basa 2010-05-17, 04:18
Dennis,
One topic that took me a long time to figure out, and that lots of people have been having issues with, is doing an incremental index. I don't think it is documented anywhere, and it would be great if you could cover it.

Thanks,
Alex
-
Re: Writing a Book on Nutch
Ron Shigeta 2010-05-17, 14:28
I'd like to second this. Tie-ins to Hadoop and other ways to analyze your index are a big mystery to me when dealing with Nutch!
-
Re: Writing a Book on Nutch
Mark Bennett 2010-05-17, 05:42
Wow, really glad to see this moving forward. With Manning I'm guessing??
My top advice:
* Debugging, Debugging, DEBUGGING!!!!!!

I imagine you'd have a lot of ideas on this. In addition, I'd suggest:
* Systematically break different parts of the system and record the symptoms, error messages, etc.

Also:
* I agree with Alex about incremental indexing
* Setting up spidering for a lot of specific sites: how do you handle rules for hundreds of sites?
* As above, but also deduping www vs non-www prefix URLs from the same site
* Detailed setup on Windows, including an outline of the Cygwin install and the different path syntaxes
* And/or perhaps a rewrite in Windows CMD
* Integrating with Solr. Yes, Nutch 1.0 had some prelim integration, and Lucid Imagination has an article on it. HOWEVER there needs to be a lot more info, like metadata fields, etc. Tradeoffs
* Managing from a web GUI
* A WARNING to always carefully check the Nutch matches Google brings back. For some reason it obsesses about 0.7 pages, but of course things changed quite a bit in 0.8.
* A complete walkthrough of setting up a debugging environment with Eclipse. To do real work you'd need 3 Eclipse projects set up (Lucene, Solr and Nutch), with project linkages and sync'd source versions. And when you check out Java code from ASF you can't just use the ant file to import into Eclipse; it doesn't work right.
* Also a bit about using patches and the patch submission process, again assuming Eclipse and covering any differences on Windows and Linux
* Integrating with Open Pipeline or UIMA or whatever other flexible pipeline you like
* Complex encoded URLs
* Spider traps
* Automatic restart on reboot, for Linux, Windows and Mac
* Integrating filter packs for old and new MS Office and PDF files
* CACHING with Squid or Apache or something, so that when you need to re-run over and over and over again to debug your document processing, you don't have to keep hitting the sites. I've seen two instances where this was attempted but it didn't seem to work as expected, though I never found out why.
* Benchmarking approximations: assuming decent Internet connectivity, how much can you do with a single Nutch box (pick some stock configuration)?
* It'd be nice if you could include benchmarks comparing stock SATA drives to fast SCSI / RAID / Fiber. My advice is to stick with stock drives unless a project demonstrates that it needs caviar-level storage, BUT I could be wrong, and this is certainly counter to what some of the enterprise search vendors advise. Some projects are small enough that they actually DON'T need high scalability; maybe they only need to index 10,000 pages on a LAN.
* SATA RAID vs non-RAIDed SATA drives: some RAID can actually slow down writes.
* Overhead (or benefit) of NAS, SAN, iSCSI
* What is the initial hit in performance when going from a single box to a multibox configuration? In other words, there is some overhead in distributing work. In one system I was consulted on it actually seemed to be *VERY* high, though I didn't have access to that system so never got to the bottom of it, but in many setups the client reported MUCH FASTER spidering with a 1 box setup than a 3 box setup, and this seemed pretty consistent for them. I'm not suggesting you debug this per se; I'm just suggesting you do actual benchmarks in your book. Believe no one!
* Setting up Nutch in the Amazon Cloud, and specific issues with the various temp directories, "local drives" and persistent drives. Benchmarks.
* Issues (if any) with VMware, Xen and Microsoft Hyper-V virtual machines
* Grabbing enough security metadata at spider/index time to do early binding security. Basically fetching the ACL info and injecting it in. This needs coordination on the client side, like from Solr.
* Aging of your Nutch segments. When do you really need to blow away everything and start from scratch?
* How do you recover from an interrupted / crashed spider / index run that took days or weeks to run (so you don't want to "just start over")?

Mark Bennett / New Idea Engineering, Inc. / [EMAIL PROTECTED]
Direct: 408-733-0387 / Main: 866-IDEA-ENG / Cell: 408-829-6513
-
Re: Writing a Book on Nutch
Davide Del Vecchio 2010-05-17, 10:00
Nice to hear: this book can be very helpful.
I totally agree with the points that Mark shared. I especially feel that the point about describing "Grabbing enough security metadata at spider/index time to do early binding" is urgent, and possibly also what the extension points are for writing to a different index (not Lucene/Solr).
That brings up the topic of configuring a development environment with a proper Eclipse setup.

Good news!
-
Re: Writing a Book on Nutch
Emmanuel de Castro Santan... 2010-05-17, 11:40
"re-crawling and controlling that process seems like an issue in need of
covering to me"

I am also very interested in understanding that better. I would also like to see better strategies for crawling a single site, and some benchmarks linking configuration to performance.

"... configuring a development environment with a proper Eclipse set up"
"Automatic restart on reboot..."

Those also interest me.

Looking forward to it.
--
Emmanuel de Castro Santana
-
Re: Writing a Book on Nutch
Hemanth Yamijala 2010-05-18, 01:30
Hi,

> "re-crawling and controlling that process seems like an issue in need of
> covering to me"
>
> I am also very interested in knowing that better ..
> But also better strategies for crawling a single site and some benchmarks,
> linking configuration to performance.

+1 for information on benchmarks and performance tuning in all phases starting from crawl. If the focus could be on production quality deployments - typical hardware, configuration settings and the like - that would help a lot.

Thanks
Hemanth
-
RE: Writing a Book on Nutch
Arkadi.Kosmynin@... 2010-05-18, 03:58
Hi Dennis,
I think you should include info on:
- Data structures and data flow in Nutch, since understanding these helps in understanding other things better;
- Common problems, solutions, troubleshooting and tuning, because everyone working with Nutch faces these issues sooner or later.

Regards,
Arkadi
-
Re: Writing a Book on Nutch
Ninad Raut 2010-05-17, 08:26
I would like one chapter on how to configure Nutch for focused crawling: best practices and strategies, especially to avoid host blocking.
-
Re: Writing a Book on Nutch
Dennis Kubes 2010-05-18, 15:09
I wanted to thank everyone for all the great responses. It really helps in putting together information that will be useful to everyone.

I am also in the process of launching a blog about Nutch/Hadoop and am working to get the first post (with video) done and up. I will update the list when that is finished.

Dennis
-
Re: Writing a Book on Nutch
Mambe Churchill Nanje 2010-05-18, 16:59
I need to know how to integrate Nutch with Solr, and also how to track the index time of an article in Solr and then sort on it. If your book can use such a case study, you've got my buy on that.

Mambe Churchill Nanje
237 77545907,
AfroVisioN Founder, President,CEO
http://www.afrovisiongroup.com | http://mambenanje.blogspot.com
skypeID: mambenanje
www.twitter.com/mambenanje
-
Re: Writing a Book on Nutch
Markus Jelsma 2010-11-02, 12:30
Hello Dennis,
How's it going?

Cheers,
--
Markus Jelsma - CTO - Openindex
http://www.linkedin.com/in/markus17
050-8536600 / 06-50258350
-
Re: Writing a Book on Nutch
nitin hardeniya 2010-11-02, 13:44
Writing plugins is one of the most important topics, and one on which not many comprehensive tutorials are available. We also don't have any video tutorials on it. Also, if you add Nutch + Hadoop, that would be very cool.

I will be available for any help.

--
Nitin Kumar Hardeniya
M.Tech Computational Linguistics
IIIT Hyderabad
-
Re: Writing a Book on Nutch
cong liu 2010-11-04, 13:24
I would like to know how the fetcher schedules its work. Is it perhaps based on graph theory?
-
Re: Writing a Book on Nutch
Alexander Aristov 2010-05-17, 07:32
I would definitely want to see answers to questions about distributed search.

Starting from crawling (how to run it in distributed mode, and where to store collected pages and indexes) and ending with questions about the relevancy of results obtained from different search servers.

Best Regards
Alexander Aristov
-
Re: Writing a Book on Nutch
Piet van Remortel 2010-05-17, 07:40
re-crawling and controlling that process seems like an issue in need of covering to me

Thanks

Piet
Belgium
--
PvR
-
Re: Writing a Book on Nutch
Kevin Chen 2010-05-18, 03:22
Second this. Best practices in a production system: how to keep re-crawling without bloating the whole system.
-
Re: Writing a Book on Nutch
Doğacan Güney 2010-05-17, 09:01
Hey,
Awesome news, Dennis! Looking forward to ordering my copy :)

--
Doğacan Güney
|