Julien Nioche
2010-04-06, 13:43
Andrzej Bialecki
2010-04-06, 17:23
Julien Nioche
2010-04-07, 08:14
Doğacan Güney
2010-04-07, 16:54
Enis Söztutar
2010-04-07, 17:24
Enis Söztutar
2010-04-07, 17:31
Andrzej Bialecki
2010-04-07, 17:32
Andrzej Bialecki
2010-04-07, 17:35
MilleBii
2010-04-07, 18:19
Doğacan Güney
2010-04-08, 07:42
Doğacan Güney
2010-04-08, 07:44
MilleBii
2010-04-08, 18:11
Doğacan Güney
2010-04-08, 20:20
lewis john mcgibbney
2011-07-02, 00:19
-
Nutch 2.0 roadmap - Julien Nioche, 2010-04-06, 13:43
Hi guys,

I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be based on what is currently referred to as NutchBase. Shall we create a branch for 2.0 in the Nutch SVN repository and have a label accordingly for JIRA so that we can file issues / feature requests on 2.0? Do you think that the current NutchBase could be used as a basis for the 2.0 branch?

Talking about features, what else would we add apart from:

* support for HBase: via ORM or not (see NUTCH-808 <https://issues.apache.org/jira/browse/NUTCH-808>)
* plugin cleanup: Tika only for parsing - get rid of everything else?
* remove index / search and delegate to SOLR
* new functionalities, e.g. sitemap support, canonical tag etc.

I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an update?

I look forward to hearing your thoughts on this.

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com
-
Re: Nutch 2.0 roadmap - Andrzej Bialecki, 2010-04-06, 17:23
On 2010-04-06 15:43, Julien Nioche wrote:
> Hi guys,
>
> I gather that we'll jump straight to 2.0 after 1.1 and that 2.0 will be
> based on what is currently referred to as NutchBase. Shall we create a
> branch for 2.0 in the Nutch SVN repository and have a label accordingly for
> JIRA so that we can file issues / feature requests on 2.0? Do you think that
> the current NutchBase could be used as a basis for the 2.0 branch?

I'm not sure what is the status of the nutchbase - it's missed a lot of fixes and changes in trunk since it's been last touched ...

> Talking about features, what else would we add apart from :
>
> * support for HBase : via ORM or not (see
> NUTCH-808<https://issues.apache.org/jira/browse/NUTCH-808>)

This IMHO is promising, this could open the doors to small-to-medium installations that are currently too cumbersome to handle.

> * plugin cleanup : Tika only for parsing - get rid of everything else?

Basically, yes - keep only stuff like HtmlParseFilters (probably with a different API) so that we can post-process the DOM created in Tika from whatever original format.

Also, the goal of the crawler-commons project is to provide APIs and implementations of stuff that is needed for every open source crawler project, like: robots handling, url filtering and url normalization, URL state management, perhaps deduplication. We should coordinate our efforts, and share code freely so that other projects (bixo, heritrix, droids) may contribute to this shared pool of functionality, much like Tika does for the common need of parsing complex formats.

> * remove index / search and delegate to SOLR

+1 - we may still keep a thin abstract layer to allow other indexing/search backends, but the current mess of indexing/query filters and competing indexing frameworks (lucene, fields, solr) should go away. We should go directly from DOM to a NutchDocument, and stop there.

Regarding search - currently the search API is too low-level, with the custom text and query analysis chains. This needlessly introduces the (in)famous Nutch Query classes and Nutch query syntax limitations. We should get rid of it and simply leave this part of the processing to the search backend. Probably we will use the SolrCloud branch that supports sharding and global IDF.

> * new functionalities e.g. sitemap support, canonical tag etc...

Plus a better handling of redirects, detecting duplicated sites, detection of spam cliques, tools to manage the webgraph, etc.

> I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
> update?

Definitely. :)

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
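
To make the "thin abstract layer" idea above a little more concrete, here is a minimal Python sketch of a crawler that only hands finished documents to a pluggable backend and leaves analysis and querying to that backend. It is an illustration under assumptions, not Nutch code: the `NutchDocument` class below is a simplified stand-in for the real Java class of the same name, and `IndexBackend`, `InMemoryBackend` and `index_all` are hypothetical names invented for this example. Python is used only for brevity.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Dict, Iterable, List


@dataclass
class NutchDocument:
    """Simplified stand-in (assumption): a URL plus multi-valued fields."""
    url: str
    fields: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, name: str, value: str) -> None:
        # Fields are multi-valued, so repeated adds accumulate.
        self.fields.setdefault(name, []).append(value)


class IndexBackend(ABC):
    """Hypothetical thin layer: the crawler only writes and commits."""

    @abstractmethod
    def write(self, doc: NutchDocument) -> None:
        ...

    @abstractmethod
    def commit(self) -> None:
        ...


class InMemoryBackend(IndexBackend):
    """Toy backend standing in for Solr or any other search server."""

    def __init__(self) -> None:
        self.pending: List[NutchDocument] = []
        self.committed: List[NutchDocument] = []

    def write(self, doc: NutchDocument) -> None:
        self.pending.append(doc)

    def commit(self) -> None:
        # Make all pending documents visible, then empty the buffer.
        self.committed.extend(self.pending)
        self.pending.clear()


def index_all(docs: Iterable[NutchDocument], backend: IndexBackend) -> int:
    """Send every document to the backend, commit once, return the count."""
    count = 0
    for doc in docs:
        backend.write(doc)
        count += 1
    backend.commit()
    return count
```

With this shape, supporting another backend means writing one more `IndexBackend` subclass; nothing on the crawler side needs to know about query syntax or analysis chains.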
-
Re: Nutch 2.0 roadmap - Julien Nioche, 2010-04-07, 08:14
Hi,

> I'm not sure what is the status of the nutchbase - it's missed a lot of
> fixes and changes in trunk since it's been last touched ...

Yes, maybe we should start the 2.0 branch from 1.1 instead. Dogacan - what do you think?

BTW I see there is now a 2.0 label under JIRA, thanks to whoever added it.

> Also, the goal of the crawler-commons project is to provide APIs and
> implementations of stuff that is needed for every open source crawler
> project, like: robots handling, url filtering and url normalization, URL
> state management, perhaps deduplication. We should coordinate our
> efforts, and share code freely so that other projects (bixo, heritrix,
> droids) may contribute to this shared pool of functionality, much like
> Tika does for the common need of parsing complex formats.

Definitely.

> +1 - we may still keep a thin abstract layer to allow other
> indexing/search backends, but the current mess of indexing/query filters
> and competing indexing frameworks (lucene, fields, solr) should go away.
> We should go directly from DOM to a NutchDocument, and stop there.

I think that separating the parsing filters from the indexing filters can have its merits, e.g. combining the metadata generated by 2 or more different parsing filters into a single field in the NutchDocument, keeping only a subset of the available information etc.

> > I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
> > update?

I have created a new page to serve as a support for discussion: http://wiki.apache.org/nutch/Nutch2Roadmap

Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com
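
The point about keeping parsing filters and indexing filters separate can be sketched as follows. This is a hedged illustration only: `title_filter`, `keyword_filter`, `run_parse_filters` and `to_index_fields` are hypothetical names, the filters are toy logic, and none of this mirrors the actual Nutch plugin API. It shows two parse-time filters producing metadata independently, and a later indexing step that merges both into a single field while dropping everything else.

```python
from typing import Callable, Dict, Iterable, List

ParseMetadata = Dict[str, List[str]]


def title_filter(text: str) -> ParseMetadata:
    """Hypothetical parse filter: treat the first line as the title."""
    lines = text.splitlines()
    first_line = lines[0] if lines else ""
    return {"title": [first_line.strip()]}


def keyword_filter(text: str) -> ParseMetadata:
    """Hypothetical parse filter: record which watched keywords occur."""
    watched = ("nutch", "hbase", "solr")
    lowered = text.lower()
    return {"keywords": [word for word in watched if word in lowered]}


def run_parse_filters(
    text: str, filters: Iterable[Callable[[str], ParseMetadata]]
) -> ParseMetadata:
    """Parsing stage: collect the metadata of every filter, key by key."""
    merged: ParseMetadata = {}
    for parse_filter in filters:
        for key, values in parse_filter(text).items():
            merged.setdefault(key, []).extend(values)
    return merged


def to_index_fields(metadata: ParseMetadata) -> Dict[str, List[str]]:
    """Indexing stage: merge two metadata keys into one field, drop the rest."""
    tags = metadata.get("title", []) + metadata.get("keywords", [])
    return {"tags": tags}
```

Because the indexing step sees the combined output of all parse filters, it can decide what ends up in the document without any parse filter knowing about the index schema.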
-
Re: Nutch 2.0 roadmap - Doğacan Güney, 2010-04-07, 16:54
Hey everyone,

On Tue, Apr 6, 2010 at 20:23, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> On 2010-04-06 15:43, Julien Nioche wrote:
>> Do you think that
>> the current NutchBase could be used as a basis for the 2.0 branch?
>
> I'm not sure what is the status of the nutchbase - it's missed a lot of
> fixes and changes in trunk since it's been last touched ...

I know... But I still intend to finish it, I just need to schedule some time for it.

My vote would be to go with nutchbase.

>> * support for HBase : via ORM or not (see
>> NUTCH-808<https://issues.apache.org/jira/browse/NUTCH-808>)
>
> This IMHO is promising, this could open the doors to small-to-medium
> installations that are currently too cumbersome to handle.

Yeah, there is already a simple ORM within nutchbase that is avro-based and should be generic enough to also support MySQL, cassandra and berkeleydb. But any good ORM will be a very good addition.

>> * remove index / search and delegate to SOLR
>
> +1 - we may still keep a thin abstract layer to allow other
> indexing/search backends, but the current mess of indexing/query filters
> and competing indexing frameworks (lucene, fields, solr) should go away.
> We should go directly from DOM to a NutchDocument, and stop there.

Agreed. I would like to add support for katta and other indexing backends at some point, but NutchDocument should be our canonical representation. The rest should be up to indexing backends.

--
Doğacan Güney
-
Re: Nutch 2.0 roadmap - Enis Söztutar, 2010-04-07, 17:24
Hi,

On 04/07/2010 07:54 PM, Doğacan Güney wrote:
>> I'm not sure what is the status of the nutchbase - it's missed a lot of
>> fixes and changes in trunk since it's been last touched ...
>
> I know... But I still intend to finish it, I just need to schedule
> some time for it.
>
> My vote would be to go with nutchbase.

A suggestion would be to continue with trunk until nutchbase is stable. Once it is, we can merge the nutchbase branch to trunk (after the 1.1 split), at which point trunk becomes nutchbase plus the other merged issues. Then, when the time comes, we can fork branch-2.0 and release when the blockers are done. I strongly advise against having both a trunk and a 2.0 branch for development.

> Yeah, there is already a simple ORM within nutchbase that is
> avro-based and should
> be generic enough to also support MySQL, cassandra and berkeleydb. But
> any good ORM will
> be a very good addition.

The current ORM code is merged with the nutchbase code, but I think the sooner we split it the better, since development will be much clearer and simpler this way. I have opened NUTCH-808 to explore the alternatives, but we might as well continue with the current implementation. I intend to share my findings in a couple of days.

>> Also, the goal of the crawler-commons project is to provide APIs and
>> implementations of stuff that is needed for every open source crawler
>> project, like: robots handling, url filtering and url normalization, URL
>> state management, perhaps deduplication. We should coordinate our
>> efforts, and share code freely so that other projects (bixo, heritrix,
>> droids) may contribute to this shared pool of functionality, much like
>> Tika does for the common need of parsing complex formats.

So, it seems that at some point we need to bite the bullet and refactor the plugins, dropping backwards compatibility.
-
Re: Nutch 2.0 roadmap - Enis Söztutar, 2010-04-07, 17:31
Forgot to say that, at Hadoop, it is the convention that big issues, like the ones under discussion, come with a design document, so that a solid design is agreed upon for the work. We can apply the same pattern at Nutch.
-
Re: Nutch 2.0 roadmap - Andrzej Bialecki, 2010-04-07, 17:32
On 2010-04-07 18:54, Doğacan Güney wrote:
>> I'm not sure what is the status of the nutchbase - it's missed a lot of
>> fixes and changes in trunk since it's been last touched ...
>
> I know... But I still intend to finish it, I just need to schedule
> some time for it.
>
> My vote would be to go with nutchbase.

Hmm .. this puzzles me, do you think we should port changes from 1.1 to nutchbase? I thought we should do it the other way around, i.e. merge nutchbase bits to trunk.

>>> * support for HBase : via ORM or not (see
>>> NUTCH-808<https://issues.apache.org/jira/browse/NUTCH-808>)
>>
>> This IMHO is promising, this could open the doors to small-to-medium
>> installations that are currently too cumbersome to handle.
>
> Yeah, there is already a simple ORM within nutchbase that is
> avro-based and should
> be generic enough to also support MySQL, cassandra and berkeleydb. But
> any good ORM will
> be a very good addition.

Again, the advantage of DataNucleus is that we don't have to handcraft all the mid- to low-level mappings, just the mid-level ones (JOQL or whatever) - the cost of maintenance is lower, and the number of backends that are supported out of the box is larger. Of course, this is just IMHO - we won't know for sure until we try to use both your custom ORM and DataNucleus...

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
-
Re: Nutch 2.0 roadmap - Andrzej Bialecki, 2010-04-07, 17:35
On 2010-04-07 19:24, Enis Söztutar wrote:
>>> Also, the goal of the crawler-commons project is to provide APIs and
>>> implementations of stuff that is needed for every open source crawler
>>> project, like: robots handling, url filtering and url normalization, URL
>>> state management, perhaps deduplication. We should coordinate our
>>> efforts, and share code freely so that other projects (bixo, heritrix,
>>> droids) may contribute to this shared pool of functionality, much like
>>> Tika does for the common need of parsing complex formats.
>
> So, it seems that at some point, we need to bite the bullet, and
> refactor plugins, dropping backwards compatibility.

Right, that was my point - now is the time to break it, with the cut-over to 2.0, and leaving 1.1 branch in a good shape, to serve well enough in the interim period.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com
-
Re: Nutch 2.0 roadmap - MilleBii, 2010-04-07, 18:19
Just a question:

Will the new HBase implementation allow more sophisticated crawling strategies than the current score-based one?

To give you a few examples of what I'd like to do:

* Define different crawling frequencies for different sets of URLs, say weekly for some URLs, monthly or more for others.
* Select URLs to re-crawl based on attributes previously extracted. Just one example: recrawl URLs that contained a certain keyword (or set of keywords).
* Select URLs that have not yet been crawled, and are therefore at the frontier of the crawl.

-MilleBii-
-
Re: Nutch 2.0 roadmap - Doğacan Güney, 2010-04-08, 07:42
On Wed, Apr 7, 2010 at 20:32, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
>> My vote would be to go with nutchbase.
>
> Hmm .. this puzzles me, do you think we should port changes from 1.1 to
> nutchbase? I thought we should do it the other way around, i.e. merge
> nutchbase bits to trunk.

Hmm, I am a bit out of touch with the latest changes but I know that the differences between trunk and nutchbase are unfortunately rather large right now. If merging nutchbase back into trunk would be easier then sure, let's do that.

> Again, the advantage of DataNucleus is that we don't have to handcraft
> all the mid- to low-level mappings, just the mid-level ones (JOQL or
> whatever) - the cost of maintenance is lower, and the number of backends
> that are supported out of the box is larger. Of course, this is just
> IMHO - we won't know for sure until we try to use both your custom ORM
> and DataNucleus...

I am obviously a bit biased here but I have no strong feelings really. DataNucleus is an excellent project. What I like about the avro-based approach is the essentially free MapReduce support we get and the fact that supporting another language is easy. So, we can expose partial hbase data through a server and a python-client can easily read/write to it, thanks to avro.

That being said, I am all for DataNucleus or something else.

--
Doğacan Güney
-
Re: Nutch 2.0 roadmap - Doğacan Güney, 2010-04-08, 07:44
Hi,

On Wed, Apr 7, 2010 at 21:19, MilleBii <[EMAIL PROTECTED]> wrote:
> Just a question ?
> Will the new HBase implementation allow more sophisticated crawling
> strategies than the current score based.
>
> Give you a few example of what I'd like to do :
> Define different crawling frequency for different set of URLs, say
> weekly for some url, monthly or more for others.
>
> Select URLs to re-crawl based on attributes previously extracted.Just
> one example: recrawl urls that contained a certain keyword (or set of)
>
> Select URLs that have not yet been crawled, at the frontier of the
> crawl therefore

At some point, it would be nice to change the generator so that it is only a handful of methods and a pig (or something else) script. So, we would provide most of the functions you may need during generation (accessing various data), but the actual generation would be a pig process. This way, anyone can easily change generation any way they want (even split it into more than 2 jobs if they want more complex schemes).

--
Doğacan Güney
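
The "handful of methods plus a script" idea can be illustrated with a small sketch. Note the assumptions: the message proposes Pig (or something else) for the script part, whereas this example uses plain Python as a stand-in, and every name in it (`UrlRecord`, `is_due`, `is_frontier`, `has_attribute`, `any_of`, `generate`) is hypothetical rather than part of Nutch. The building blocks are small predicates over a crawl-table row; the "script" is just the way a user combines them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional

DAY = 24 * 60 * 60  # seconds


@dataclass
class UrlRecord:
    """Hypothetical row of the crawl table."""
    url: str
    score: float = 0.0
    last_fetch: Optional[int] = None  # epoch seconds; None means never fetched
    interval: int = 30 * DAY          # per-URL refetch interval
    attributes: Dict[str, str] = field(default_factory=dict)


Predicate = Callable[[UrlRecord, int], bool]


def is_due(rec: UrlRecord, now: int) -> bool:
    """Already fetched, and its own refetch interval has elapsed."""
    return rec.last_fetch is not None and now - rec.last_fetch >= rec.interval


def is_frontier(rec: UrlRecord, now: int) -> bool:
    """Known but never fetched, i.e. at the frontier of the crawl."""
    return rec.last_fetch is None


def has_attribute(name: str, value: str) -> Predicate:
    """Build a predicate on an attribute extracted during an earlier parse."""
    def check(rec: UrlRecord, now: int) -> bool:
        return rec.attributes.get(name) == value
    return check


def any_of(*predicates: Predicate) -> Predicate:
    """Combine predicates: select a record if at least one of them matches."""
    return lambda rec, now: any(p(rec, now) for p in predicates)


def generate(records: Iterable[UrlRecord], select: Predicate,
             now: int, top_n: int) -> List[str]:
    """Return up to top_n selected URLs, highest score first."""
    chosen = [rec for rec in records if select(rec, now)]
    chosen.sort(key=lambda rec: rec.score, reverse=True)
    return [rec.url for rec in chosen[:top_n]]
```

A user who wants the three strategies from the question would then write something like `generate(records, any_of(is_frontier, is_due, has_attribute("keyword", "nutch")), now, top_n=1000)`, and could swap in a different combination without touching the generator itself.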
-
Re: Nutch 2.0 roadmap
MilleBii 2010-04-08, 18:11
Not sure what you mean by a Pig script, but I'd like to be able to make a
multi-criteria selection of URLs for fetching...

The scoring method forces a kind of one-dimensional approach, which is
not really easy to deal with.

The regex filters are good, but they assume you want to select URLs based
on data that is in the URL itself... Pretty limited, in fact.

I would basically like to do 'content'-based crawling. Say, for
example, that I'm interested in "topic A".
I'd like to label URLs that match "topic A" (user-supplied logic).
Later on I would want to crawl "topic A" URLs at a certain frequency,
and unlabeled URLs in a different way, for exploring.

This looks hard to do right now.

2010/4/8, Doğacan Güney <[EMAIL PROTECTED]>:
> Hi,
>
> On Wed, Apr 7, 2010 at 21:19, MilleBii <[EMAIL PROTECTED]> wrote:
>> Just a question ?
>> Will the new HBase implementation allow more sophisticated crawling
>> strategies than the current score based.
>>
>> Give you a few example of what I'd like to do :
>> Define different crawling frequency for different set of URLs, say
>> weekly for some url, monthly or more for others.
>>
>> Select URLs to re-crawl based on attributes previously extracted. Just
>> one example: recrawl urls that contained a certain keyword (or set of)
>>
>> Select URLs that have not yet been crawled, at the frontier of the
>> crawl therefore
>>
>
> At some point, it would be nice to change generator so that it is only a
> handful of methods and a pig (or something else) script. So, we would
> provide most of the functions you may need during generation (accessing
> various data) but actual generation would be a pig process. This way,
> anyone can easily change generate any way they want (even make it more
> jobs than 2 if they want more complex schemes).
-MilleBii-
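
[Editor's illustration] The content-based selection described in this message (label pages with user-supplied logic, then re-crawl labeled and unlabeled URLs at different frequencies, with never-fetched frontier URLs always eligible) can be sketched roughly as below. This is a minimal sketch in plain Python, not Nutch code: `PageRecord`, `INTERVALS`, `label_page`, `is_due` and `select_for_fetch` are hypothetical names, and the weekly/monthly intervals are assumed values taken only from the examples given in the thread.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Callable, List, Optional


@dataclass
class PageRecord:
    """Hypothetical per-URL record; not a Nutch or HBase schema."""
    url: str
    text: str = ""                         # previously extracted content
    label: Optional[str] = None            # set by user-supplied logic
    last_fetch: Optional[datetime] = None  # None means never fetched


# Assumed re-crawl intervals per label; None is the default for unlabeled URLs.
INTERVALS = {"topic A": timedelta(days=7), None: timedelta(days=30)}


def label_page(page: PageRecord, matches_topic: Callable[[str], bool]) -> PageRecord:
    """Attach the "topic A" label when the user-supplied predicate matches."""
    if page.text and matches_topic(page.text):
        page.label = "topic A"
    return page


def is_due(page: PageRecord, now: datetime) -> bool:
    """A page is due if it was never fetched or its label's interval has elapsed."""
    if page.last_fetch is None:
        return True  # frontier URL
    interval = INTERVALS.get(page.label, INTERVALS[None])
    return now - page.last_fetch >= interval


def select_for_fetch(pages: List[PageRecord], now: datetime) -> List[str]:
    """Return the URLs that should go into the next fetch list."""
    return [p.url for p in pages if is_due(p, now)]
```

The point of the sketch is only that selection depends on several attributes of a record (label, last fetch time, fetch status) rather than on a single score.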
-
Re: Nutch 2.0 roadmap
Doğacan Güney 2010-04-08, 20:20
On Thu, Apr 8, 2010 at 21:11, MilleBii <[EMAIL PROTECTED]> wrote:
> Not sure what u mean by pig script, but I'd like to be able to make a
> multi-criteria selection of Url for fetching...

I mean a query language like Pig (http://hadoop.apache.org/pig/). If we
expose the data correctly, then you should be able to generate on any
criteria that you want.

> The scoring method forces into a kind of mono dimensional approach
> which is not really easy to deal with.
>
> The regex filters are good but it assumes you want select URLs on data
> which is in the URL... Pretty limited in fact
>
> I basically would like to do 'content' based crawling. Say for
> example: that I'm interested in "topic A".
> I'd'like to label URLs that match "Topic A" (user supplied logic).
> Later on I would want to crawl "topic A" urls at a certain frequency
> and non labeled urls for exploring in a different way.
>
> This looks like hard to do right now
>
> 2010/4/8, Doğacan Güney <[EMAIL PROTECTED]>:
>> Hi,
>>
>> On Wed, Apr 7, 2010 at 21:19, MilleBii <[EMAIL PROTECTED]> wrote:
>>> Just a question ?
>>> Will the new HBase implementation allow more sophisticated crawling
>>> strategies than the current score based.
>>>
>>> Give you a few example of what I'd like to do :
>>> Define different crawling frequency for different set of URLs, say
>>> weekly for some url, monthly or more for others.
>>>
>>> Select URLs to re-crawl based on attributes previously extracted. Just
>>> one example: recrawl urls that contained a certain keyword (or set of)
>>>
>>> Select URLs that have not yet been crawled, at the frontier of the
>>> crawl therefore
>>>
>>
>> At some point, it would be nice to change generator so that it is only a
>> handful of methods and a pig (or something else) script. So, we would
>> provide most of the functions you may need during generation (accessing
>> various data) but actual generation would be a pig process. This way,
>> anyone can easily change generate any way they want (even make it more
>> jobs than 2 if they want more complex schemes).
Doğacan Güney
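
[Editor's illustration] The idea above, a generator reduced to a handful of provided functions plus a short user-editable script, might look roughly like the following. It is a sketch in plain Python rather than Pig, and nothing here is an existing Nutch or Pig API: the row layout (`url`, `host`, `score`, `fetched`) and the helpers `where`, `cap_per_host` and `generate` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Row = Dict[str, object]  # hypothetical row: url, host, score, fetched


def where(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Provided helper: keep only the rows matching a user-supplied predicate."""
    return [r for r in rows if predicate(r)]


def cap_per_host(rows: List[Row], limit: int) -> List[Row]:
    """Provided helper: keep at most `limit` rows per host, preserving order."""
    counts = defaultdict(int)
    kept = []
    for r in rows:
        if counts[r["host"]] < limit:
            counts[r["host"]] += 1
            kept.append(r)
    return kept


def generate(rows: List[Row], predicate: Callable[[Row], bool],
             per_host: int, n: int) -> List[str]:
    """The user-editable 'script': filter, rank by score, cap per host, take n."""
    candidates = where(rows, predicate)
    ranked = sorted(candidates, key=lambda r: r["score"], reverse=True)
    return [r["url"] for r in cap_per_host(ranked, per_host)[:n]]
```

Only `generate` would need to change to try a different scheme; the helpers stay fixed, which is the division of labour the message proposes.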
-
Nutch 2.0 roadmap
lewis john mcgibbney 2011-07-02, 00:19
Hi,
This is addressed to all devs, although I am referring mainly to Julien (as he established and last edited the wiki page).

The roadmap, which is slightly dated in places, can currently be found here [1]. I was wondering if we could give it an overhaul/update, as it would give a more robust overview of where trunk is going. Most of the points you make are still in development; however, some have been achieved and integrated into trunk builds. Is there anything else we can add to this page to reflect current initiatives (major or minor) in development on trunk?

You make a lot of good points in your Berlin Buzzwords presentation, Julien. Would it be possible to initiate further discussion amongst devs on these points? I noticed another point you mentioned was that we are thin on documentation for trunk... this is very much true.

It would be great to get an up-to-date roadmap for trunk. As we plan to release this year, it is essential that this is seen to moving forward.

N.B. I moved the old Nutch 2.0 roadmap to the legacy and archive section of the wiki in an attempt to disambiguate data and future intentions.

Thanks

[1] http://wiki.apache.org/nutch/Nutch2Roadmap

--
*Lewis*