Dominique Bejean 2011-03-02, 00:25
Hi,
I would like to announce Crawl Anywhere. Crawl Anywhere is a Java web crawler. It includes:
* a crawler
* a document processing pipeline
* a Solr indexer
The crawler has a web administration interface for managing the web sites to be crawled. Each web site crawl is configured with many possible parameters (not all mandatory):
* number of simultaneous items crawled per site
* recrawl period rules based on item type (HTML, PDF, …)
* item type inclusion / exclusion rules
* item path inclusion / exclusion / strategy rules
* max depth
* web site authentication
* language
* country
* tags
* collections
* ...
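As an illustration only, the per-site parameters listed above could be modelled roughly as follows. This is a minimal sketch, not Crawl Anywhere's actual classes: the name `SiteCrawlConfig`, its fields, and the rule that exclusions win over inclusions are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Pattern;

// Hypothetical model of one site's crawl settings (not the product's real API).
class SiteCrawlConfig {
    int maxSimultaneousItems = 2;   // items fetched in parallel for this site
    int maxDepth = 5;               // link depth limit counted from the start URL
    int defaultRecrawlDays = 30;    // fallback when no per-type rule exists
    final Map<String, Integer> recrawlDaysByType = new LinkedHashMap<>();
    final List<Pattern> includePaths = new ArrayList<>();
    final List<Pattern> excludePaths = new ArrayList<>();

    // Assumed policy: an exclusion rule always wins; with no inclusion
    // rules, every path that is not excluded is allowed.
    boolean isPathAllowed(String path) {
        for (Pattern p : excludePaths) {
            if (p.matcher(path).find()) {
                return false;
            }
        }
        if (includePaths.isEmpty()) {
            return true;
        }
        for (Pattern p : includePaths) {
            if (p.matcher(path).find()) {
                return true;
            }
        }
        return false;
    }

    boolean isDepthAllowed(int depth) {
        return depth <= maxDepth;
    }

    // Recrawl period for an item type such as "html" or "pdf" (case-insensitive).
    int recrawlDaysFor(String itemType) {
        Integer days = recrawlDaysByType.get(itemType.toLowerCase(Locale.ROOT));
        return days != null ? days : defaultRecrawlDays;
    }

    public static void main(String[] args) {
        SiteCrawlConfig cfg = new SiteCrawlConfig();
        cfg.recrawlDaysByType.put("html", 7);
        cfg.recrawlDaysByType.put("pdf", 90);
        cfg.includePaths.add(Pattern.compile("^/docs/"));
        cfg.excludePaths.add(Pattern.compile("\\.zip$"));
        System.out.println(cfg.isPathAllowed("/docs/guide.html"));
        System.out.println(cfg.isPathAllowed("/docs/archive.zip"));
        System.out.println(cfg.recrawlDaysFor("PDF"));
    }
}
```

Keeping the rules as data on a per-site object is one way to let a web administration screen edit them without touching crawler code.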
The pipeline includes various ready-to-use stages (text extraction, language detection, a writer producing Solr-ready XML for indexing, ...).
Everything is highly configurable and extensible, either by scripting or by Java coding.
With scripting, you can help the crawler handle JavaScript links, or help the pipeline extract a relevant title and clean up the HTML pages (remove menus, headers, footers, ...).
With Java coding, you can develop your own pipeline stages.
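To give a feel for what a custom stage might look like, here is a small sketch of a stage-based pipeline. The `Stage` interface, the `text` and `word_count` field names, and both example stages are hypothetical; the real stage API is described in the Crawl Anywhere wiki and may look quite different.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PipelineSketch {
    // Hypothetical stage contract (not Crawl Anywhere's real interface):
    // each stage reads and enriches the fields of one document.
    interface Stage {
        void process(Map<String, String> doc);
    }

    // Example custom stage: collapses runs of whitespace in the extracted text.
    static class WhitespaceCleanupStage implements Stage {
        @Override
        public void process(Map<String, String> doc) {
            String text = doc.get("text");
            if (text != null) {
                doc.put("text", text.trim().replaceAll("\\s+", " "));
            }
        }
    }

    // Example custom stage: stores a word count as an extra field.
    static class WordCountStage implements Stage {
        @Override
        public void process(Map<String, String> doc) {
            String text = doc.get("text");
            String trimmed = text == null ? "" : text.trim();
            int count = trimmed.isEmpty() ? 0 : trimmed.split("\\s+").length;
            doc.put("word_count", String.valueOf(count));
        }
    }

    // Runs the stages in order on the same document and returns it.
    static Map<String, String> run(List<Stage> stages, Map<String, String> doc) {
        for (Stage stage : stages) {
            stage.process(doc);
        }
        return doc;
    }

    public static void main(String[] args) {
        List<Stage> stages = new ArrayList<>();
        stages.add(new WhitespaceCleanupStage());
        stages.add(new WordCountStage());
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("text", "  Crawl   Anywhere \n announcement ");
        System.out.println(run(stages, doc));
    }
}
```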
The Crawl Anywhere web site provides good explanations and screenshots. Everything is documented in a wiki.
The current version is 1.1.4. You can download it and try it out here: www.crawl-anywhere.com
Regards
Dominique
Re: [ANNOUNCE] Web Crawler
David Smiley 2011-03-02, 04:41
Dominique,
The obvious number one question is, of course, why you re-invented this wheel when there are several existing crawlers to choose from. Your website says the reason is that the UIs of existing crawlers (e.g. Nutch, Heritrix, ...) weren't sufficiently user-friendly or didn't have the site-specific configuration you wanted. Well, if that is the case, why didn't you add or enhance such capabilities in an existing crawler?
~ David Smiley
-----
Author: https://www.packtpub.com/solr-1-4-enterprise-search-server/book
--
View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607831p2608956.html
Sent from the Solr - User mailing list archive at Nabble.com.
Re: [ANNOUNCE] Web Crawler
Dominique Bejean 2011-03-02, 10:59
David,
The UI was not the only reason that made me choose to write a totally new crawler. After eliminating candidate crawlers for various reasons (inactive project, ...), Nutch and Heritrix were the two crawlers on my short list of possible candidates. In my mind, the crawler and the pipeline have to be totally disconnected from the target repository (Solr, ...). This ruled out Nutch. In the end, I found Heritrix too far from the solution architecture I imagined.
Dominique
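The decoupling argument above can be pictured as a narrow interface between the pipeline and whatever repository receives the documents. The sketch below is an assumption-laden illustration: `TargetWriter` and both implementations are invented names, not Crawl Anywhere connectors. The only external fact relied on is the general shape of Solr's XML update message (`<add><doc><field name="...">`).

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ConnectorSketch {
    // Hypothetical boundary: the pipeline hands over a document and does not
    // know which repository, if any, sits behind the writer.
    interface TargetWriter {
        String write(Map<String, String> doc);
    }

    // One possible target: a Solr XML update message.
    static class SolrXmlWriter implements TargetWriter {
        @Override
        public String write(Map<String, String> doc) {
            StringBuilder sb = new StringBuilder("<add><doc>");
            for (Map.Entry<String, String> e : doc.entrySet()) {
                sb.append("<field name=\"").append(escape(e.getKey())).append("\">")
                  .append(escape(e.getValue())).append("</field>");
            }
            return sb.append("</doc></add>").toString();
        }
    }

    // A second target, to show that the pipeline side stays unchanged.
    static class KeyValueWriter implements TargetWriter {
        @Override
        public String write(Map<String, String> doc) {
            List<String> pairs = new ArrayList<>();
            for (Map.Entry<String, String> e : doc.entrySet()) {
                pairs.add(e.getKey() + "=" + e.getValue());
            }
            return String.join("\n", pairs);
        }
    }

    // Minimal XML escaping; the ampersand must be replaced first.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", "http://example.org/a?x=1&y=2");
        doc.put("title", "Tom & Jerry");
        TargetWriter[] targets = { new SolrXmlWriter(), new KeyValueWriter() };
        for (TargetWriter target : targets) {
            System.out.println(target.write(doc));
        }
    }
}
```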
Re: [ANNOUNCE] Web Crawler
Rosa 2011-03-02, 08:36
Nice job!
It would be good to be able to extract specific data from a given page via XPath, though.
Regards,
Re: [ANNOUNCE] Web Crawler
Dominique Bejean 2011-03-02, 11:04
Rosa,
In the pipeline, there is a stage that extracts the text from the original document (PDF, HTML, ...). It is possible to plug in scripts (Java 6 compliant) in order to keep only the relevant parts of the document.
See http://www.wiizio.com/confluence/display/CRAWLUSERS/DocTextExtractor+stage
Dominique
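For readers wondering what XPath-based extraction of the kind Rosa asks about could look like, here is a minimal sketch using only the JDK's `javax.xml.xpath` package. It is not the DocTextExtractor stage's scripting API, and it assumes the page is already well-formed XHTML with no DOCTYPE or namespace; real-world HTML would normally need a tidying step first. The class and method names are made up for the example.

```java
import java.io.StringReader;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.xml.sax.InputSource;

class XPathExtractSketch {
    // Returns the string value of the first node matched by the expression,
    // or an empty string when nothing matches.
    static String extract(String xhtml, String expression) {
        try {
            XPath xpath = XPathFactory.newInstance().newXPath();
            InputSource source = new InputSource(new StringReader(xhtml));
            return xpath.evaluate(expression, source).trim();
        } catch (XPathExpressionException e) {
            throw new IllegalArgumentException(
                    "Bad XPath or malformed input: " + expression, e);
        }
    }

    public static void main(String[] args) {
        String page = "<html><body>"
                + "<div id=\"menu\">Home | About</div>"
                + "<div id=\"content\"><h1>Release notes</h1>"
                + "<p>Version 1.1.4 is out.</p></div>"
                + "</body></html>";
        // Keep only the relevant parts and ignore the menu.
        System.out.println(extract(page, "//div[@id='content']/h1"));
        System.out.println(extract(page, "//div[@id='content']/p"));
    }
}
```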
Re: [ANNOUNCE] Web Crawler
Geert-Jan Brits 2011-03-02, 11:20
Hi Dominique,
This looks nice. In the past, I've been interested in (semi-)automatically inducing a scheme/wrapper from a set of example web pages (often called 'wrapper induction' in the scientific field). This would allow for fast scheme creation, which could be used as a basis for extraction.
Lately I've been looking for crawlers that incorporate this technology, but without success. Any plans on incorporating this?
Cheers,
Geert-Jan
Re: [ANNOUNCE] Web Crawler
Dominique Bejean 2011-03-02, 11:28
Hi,
The crawler comes with an extensible document processing pipeline. If you know of Java libraries or web services for 'wrapper induction' processing, it is possible to implement a dedicated stage in the pipeline.
Dominique
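As a rough idea of the logic such a dedicated stage might wrap, the sketch below learns a single left/right delimiter pair from example pages whose target value is known, then applies it to a new page. This is a deliberately simplified, single-field variant of delimiter-based wrapper induction, not a full algorithm from the literature and not part of Crawl Anywhere; all names are hypothetical.

```java
class DelimiterWrapperSketch {
    final String left;   // text that immediately precedes the value in every example
    final String right;  // text that immediately follows the value in every example

    DelimiterWrapperSketch(String left, String right) {
        this.left = left;
        this.right = right;
    }

    // Learns the delimiters as the longest common suffix of the text before
    // each labelled value and the longest common prefix of the text after it.
    static DelimiterWrapperSketch induce(String[] pages, String[] values) {
        String left = null;
        String right = null;
        for (int i = 0; i < pages.length; i++) {
            int start = pages[i].indexOf(values[i]);
            if (start < 0) {
                throw new IllegalArgumentException("value not found in page " + i);
            }
            String before = pages[i].substring(0, start);
            String after = pages[i].substring(start + values[i].length());
            left = (left == null) ? before : commonSuffix(left, before);
            right = (right == null) ? after : commonPrefix(right, after);
        }
        return new DelimiterWrapperSketch(left, right);
    }

    static String commonPrefix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return a.substring(0, i);
    }

    static String commonSuffix(String a, String b) {
        int n = Math.min(a.length(), b.length());
        int i = 0;
        while (i < n && a.charAt(a.length() - 1 - i) == b.charAt(b.length() - 1 - i)) {
            i++;
        }
        return a.substring(a.length() - i);
    }

    // Returns the text between the learned delimiters, or null if they do not apply.
    String extract(String page) {
        if (left == null || right == null || left.isEmpty() || right.isEmpty()) {
            return null;
        }
        int l = page.indexOf(left);
        if (l < 0) {
            return null;
        }
        int from = l + left.length();
        int r = page.indexOf(right, from);
        return r < 0 ? null : page.substring(from, r);
    }

    public static void main(String[] args) {
        String[] pages = {
            "<html><h1>Shop</h1><span class=\"price\">12.50</span><p>a</p></html>",
            "<html><h1>Store</h1><span class=\"price\">8.00</span><div>b</div></html>"
        };
        String[] values = { "12.50", "8.00" };
        DelimiterWrapperSketch wrapper = induce(pages, values);
        System.out.println(wrapper.extract(
            "<html><h1>Mall</h1><span class=\"price\">3.99</span><p>c</p></html>"));
    }
}
```

With only two examples the learned delimiters can be over-specific; more examples shrink them toward the markup that is truly constant across pages.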
Re: [ANNOUNCE] Web Crawler
Paul Libbrecht 2011-03-02, 11:36
Viewing the indexing result, which I think is part of what you are describing, is a nice job for such an indexing framework. Do you guys know whether such a feature is already out there?
paul
Re: [ANNOUNCE] Web Crawler
Lukáš Vlček 2011-03-02, 09:01
Hi,
Is there any plan to open source it?
Regards, Lukas
[OT] I tried HuriSearch and entered "Java" in the search field; it returned a lot of references to ColdFusion error pages. Maybe a recrawl would help?
Re: [ANNOUNCE] Web Crawler
Dominique Bejean 2011-03-02, 11:20
Lukas,
I am thinking about it, but no decision has been made yet. Anyway, in the next release, I will provide the source code of the pipeline stages and connectors as samples.
Dominique
Re: [ANNOUNCE] Web Crawler
findbestopensource 2011-03-02, 09:02
Hello Dominique Bejean,
Good job.
We identified almost 8 open source web crawlers: http://www.findbestopensource.com/tagged/webcrawler
I don't know how different yours is from the rest.
Your license states that it is not open source but is free for personal use.
Regards
Aditya
www.findbestopensource.com
Re: [ANNOUNCE] Web Crawler
Dominique Bejean 2011-03-02, 11:21
Aditya,
The crawler is not open source and won't be in the near future. Anyway, I have to change the license, because it can be used for any personal or commercial project.
Sincerely,
Dominique
SivaKarthik 2013-01-27, 10:11
Re: [ANNOUNCE] Web Crawler
O. Klein 2013-01-27, 13:26
This is actually showing that it works. crawlerws is used by the Crawl Anywhere UI, which will pass it the correct arguments when needed.
SivaKarthik wrote
> Hi,
> I'm trying to configure Crawl Anywhere version 3.0.3 on my local system.
> I'm following the steps from the page
> http://www.crawl-anywhere.com/installation-v300/
> but crawlerws is failing and showing the error message below in the browser:
> http://localhost:8080/crawlerws/
> <error>
> <errno>1</errno>
> <errmsg>Missing action</errmsg>
> </error>
> I'm not sure what I'm doing wrong. Could you please help me resolve the problem? Thank you.
--
View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607831p4036520.html
Sent from the Solr - User mailing list archive at Nabble.com.
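The reply quoted above is a small, structured XML error, which is why it indicates that the service is up and merely received no arguments. As an illustration, a client could read such a reply as sketched below. Only the `<error>`, `<errno>` and `<errmsg>` elements come from the thread; the class name, the method names and the handling of non-error replies are assumptions, since the thread does not show what a successful crawlerws reply looks like.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class CrawlerWsReplySketch {
    // Summarises a reply: "errno=..., errmsg=..." for an <error> document,
    // "no error" for anything else (the success format is not shown in the thread).
    static String describe(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Reply is not well-formed XML", e);
        }
        if (!"error".equals(doc.getDocumentElement().getNodeName())) {
            return "no error";
        }
        return "errno=" + text(doc, "errno") + ", errmsg=" + text(doc, "errmsg");
    }

    // Trimmed text of the first element with the given tag, or "" if absent.
    static String text(Document doc, String tag) {
        NodeList nodes = doc.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? "" : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String reply = "<error> <errno> 1 </errno> <errmsg> Missing action </errmsg> </error>";
        System.out.println(describe(reply));
    }
}
```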
SivaKarthik 2013-01-29, 06:54
Re: [ANNOUNCE] Web Crawler
SivaKarthik 2013-01-29, 08:28
Hi,
I resolved the issue "Access denied for user 'crawler'@'localhost' (using password: YES)". The MySQL user crawler/crawler was created and privileges were added as mentioned in the tutorial.
Thank you.
--
View this message in context: http://lucene.472066.n3.nabble.com/ANNOUNCE-Web-Crawler-tp2607831p4036978.html
Sent from the Solr - User mailing list archive at Nabble.com.
RE: [ANNOUNCE] Web Crawler
Thumuluri, Sai 2011-03-02, 14:04
Dominique,
Does your crawler support NTLM2 authentication? We have content under SiteMinder, which uses NTLM2, and that is posing challenges with Nutch.
Re: [ANNOUNCE] Web Crawler
Dominique Bejean 2011-03-02, 14:46
Hi,
No, it doesn't. It looks like an Apache HttpClient 3.x limitation.
https://issues.apache.org/jira/browse/HTTPCLIENT-579
Dominique
Re: [ANNOUNCE] Web Crawler
Nestor Oviedo 2011-03-02, 15:29
Hi everyone!
I've been following this thread, and I realized we've built something similar to "Crawl Anywhere". The main difference is that our project is oriented toward digital libraries and digital repositories: specifically, collecting metadata from multiple sources, improving the information, and storing it in multiple destinations.
So far, I can share only an article about the project, because the code is on our development machines and testing servers. If everything goes well, we plan to make it open source in the near future.
I'd be glad to hear your comments and opinions about it. There is no need to be polite.
Thanks in advance.
Best regards.
Nestor