-Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java
Andrzej Bialecki 2010-07-20, 19:01
On 2010-07-20 20:29, Julien Nioche wrote:
> I meant putting the migration code and 1.x Nutch jars in the contrib
> directory of the trunk - that shouldn't require a different committers
> list or should it?
I don't feel strongly about contrib... there is a different precedent:
for a while there were migration tools in the main tree for conversion
between 0.8 and 0.9+.
> A. branch cleaned up, SVN commits, etc., stable working
> B. at some point, branch ready to be merged (assumption: branch
> devel stops)
> C. define branch merge into 3-5 patches
Due to the total API incompatibility (CrawlDatum is replaced by WebPage,
content and link storage are different, and the way we run jobs in
nutchbase is also different) I don't expect more than 2 patches, of
which the first will contain 90% of the API changes...
> D. foreach patch in C:
> create JIRA issue for patch
> call for review of patch
> if no objections, then commit in 24-48 hours
> E. trunk now ready for 2.0 development
> F. schedule current open issues for 2.0, grab any low-hanging fruit (1-2
> G. all other issues pushed out to 2.1
> H. release 2.0
> Andrzej and myself are in the process of porting the last missing tests
> in NutchBase and debugging Gora along the way. There is just a handful
> of plugins which have not been ported, and I should have that finished
> pretty quickly. Hopefully we'll get to (A) soonish and can then follow
> the plan above.
> However, we still need to address the issue raised by Dogacan, i.e.
> shall we provide tools to convert from 1.x structures to 2.0, and if so,
> how shall we organise it. Again - some things have been removed from
> NutchBase for the sake of clarity, but since they are in the trunk they
> are not lost and we can decide what to do with them later.
IMO it would take enormous effort to implement runtime compatibility
between 1.x and 2.x, so users will have to either convert or recrawl. I
think that, at a minimum, we should provide a clear procedure for
exporting the old crawldb and importing it into a new db.
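As a rough illustration of what such a procedure might look like (a minimal sketch only, not a supported tool - the paths, the dump record layout, and the re-injection step are all assumptions; fetch status and metadata from the old db are simply lost here), one could dump the 1.x crawldb to text with `bin/nutch readdb`, pull out the URLs, and feed them to the 2.0 injector as a fresh seed list:

```shell
# 1.x side (assumed invocation): dump the old crawldb to plain text
#   bin/nutch readdb crawl/crawldb -dump crawldb-dump
# Simulate two dump record header lines (assumed format: URL <tab> "Version: ..."):
printf 'http://example.com/\tVersion: 7\nStatus: 2 (db_fetched)\nhttp://example.org/\tVersion: 7\n' > part-00000

# Extract the unique URLs as a seed list for re-injection
awk -F'\t' '/^http/ {print $1}' part-00000 | sort -u > urls.txt
cat urls.txt

# 2.x side (assumed invocation): inject the seed list into the new storage
#   bin/nutch inject <dir containing urls.txt>
```

This recovers only the URL set, which may be acceptable given that the scores and fetch history would have to be recomputed under 2.0 anyway.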
If there's a strong desire for a tool to convert 1.x segments into the
new crawl job data format we could also implement one - but I don't
expect there will be ... after all, segments are throwaway by nature,
with a limited time to live...
Andrzej Bialecki <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  || |   Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com