Nutch, mail # dev - Re: svn commit: r965815 - in /nutch/branches/nutchbase/src: java/org/apache/nutch/parse/ParseStatus.java java/org/apache/nutch/parse/ParseText.java test/org/apache/nutch/parse/TestParseText.java


Andrzej Bialecki 2010-07-20, 19:01
On 2010-07-20 20:29, Julien Nioche wrote:

> I meant putting the migration code and 1.x Nutch jars in the contrib
> directory of the trunk - that shouldn't require a different committers
> list or should it?

I don't feel strongly about contrib... there is a different precedent:
for a while there were migration tools in the main tree for conversion
between 0.8 and 0.9+.

>     A. branch cleaned up, SVN commits, etc., stable working
>     B. at some point, branch ready to be merged (assumption: branch
>     devel stops)
>     C. define branch merge into 3-5 patches

Due to a total API incompatibility (CrawlDatum is replaced by a WebPage,
content and link storage is different, the way we run jobs in nutchbase
is also different) I don't expect more than 2 patches, of which the
first one will contain 90% of API changes...

>     D. foreach patch in C:
>        create JIRA issue for patch
>        call for review of patch
>        if no objections, then commit in 24-48 hours
>
>     E. trunk now ready for 2.0 development
>     F. schedule current open issues for 2.0, grab any low hanging fruit (1-2
>     days)
>     G. all other issues pushed out to 2.1
>     H. release 2.0
>
>
> Andrzej and myself are in the process of porting the last missing tests
> in NutchBase and debugging Gora along the way. There is just a handful
> of plugins which have not been ported and I should have finished that
> pretty quickly. Hopefully we'll get to (A) soonish and can then follow
> the plan above.
>
> However we still need to address the issue raised by Dogacan, i.e. shall we
> provide tools to convert from 1.x structures to 2.0 and if so how shall
> we organise it. Again - some things have been removed from NutchBase for
> the sake of clarity but since they are in the trunk they are not lost
> and we can decide what to do with them later.

IMO it would take enormous effort to implement runtime compatibility
between 1.x and 2.x, so users will have to either convert or recrawl. I
think that at a minimum we should provide a clear procedure on how to
export the old crawldb and import into a new db.

If there's a strong desire to have a tool to convert 1.x segments into
the new crawl job data format we could also implement this - but I don't
expect there would be ... after all, segments are a throwaway property
with a limited time to live...

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com