-
Towards 1.0 - Defining backwards compatibility guarantees
Isabel Drost 2011-10-30, 03:45

Mahout seems to be at a stage where we have covered most of the interesting machine learning problems and where it is being used in production by quite a few developers - hey, we even have a book that is now available in print.

Maybe it's time to start taking the first steps towards a 1.0 release. One* important step, in my opinion, is to define what kind of backwards compatibility guarantees we want to give our users - and what guarantees our users really need - after releasing 1.0.

Just a rough list below - feel free to extend, shrink and change:

1) Data input formats - people probably do not want to re-generate vectors from their original data every time they use a new Mahout version.

2) Model formats - people probably do not want to have to retrain a model only to make it work with the latest and greatest features of a new Mahout release.

3) Model output - when upgrading, users probably want to receive model output that integrates into their system the same way as with the older release.

4) APIs - I don't see us keeping all interfaces or even abstract classes stable. However, users should know which APIs we consider "public facing" and will likely keep stable. Maybe an annotation makes that clear?

5) Command line scripts - is there a user base relying on the bin/mahout script that is significant enough to warrant working towards keeping it stable between releases?

Most likely I've forgotten about other vital pieces - just wanted to kick off that discussion.

Isabel

* though not the only one - others include, but are not limited to, the time frame for which we offer support for any given release.
-
Re: Towards 1.0 - Defining backwards compatibility guarantees
Lance Norskog 2011-10-31, 02:01

2) Model formats

Proposal: a few common structures with higher-level conventions about how to compose them.

For matrix data, the R "dataframe" is a time-tested format for dense vectors, matrices and tensors. Something like this that also handles most sparsity cases would allow ditching a lot of hard-coded formats.

We would need a counterpart format for discrete data structures like graphs, fpgrowth etc. If there are none in the public sphere, here is one: an object with two lists, each with a label. This can represent one node or edge of a graph. To read in the graph you would need to fill hashtables from the labels. Add a double and you have a weighted graph. Call it a "bundle".

FPGrowth uses a more complex data structure. This provides 2 use cases:
1) a hard use case for composing its data with a simpler object, because you have to save the simple objects with metadata that lets you read and reconstitute.
2) a simpler use case is saving "flattened" variations of the full data structure as a stream of bundles.

--
Lance Norskog
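One possible reading of the "bundle" proposal above, as a minimal Java sketch. The class names `Bundle` and `BundleGraph`, the field names, and the choice of string labels are all assumptions made for illustration; nothing here exists in Mahout. Each bundle holds two labelled lists plus a weight, and the reader fills nested hash maps from a stream of bundles to rebuild a weighted graph.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical "bundle": two labelled lists plus a weight (the "add a double" part).
final class Bundle {
    final String firstLabel;   // e.g. "from"
    final List<String> first;
    final String secondLabel;  // e.g. "to"
    final List<String> second;
    final double weight;

    Bundle(String firstLabel, List<String> first,
           String secondLabel, List<String> second, double weight) {
        this.firstLabel = firstLabel;
        this.first = first;
        this.secondLabel = secondLabel;
        this.second = second;
        this.weight = weight;
    }
}

final class BundleGraph {
    // Rebuilds a weighted adjacency map by filling hash tables from the bundles.
    // Assumption: every entry of the first list is linked to every entry of the second.
    static Map<String, Map<String, Double>> read(List<Bundle> bundles) {
        Map<String, Map<String, Double>> adjacency = new HashMap<>();
        for (Bundle b : bundles) {
            for (String from : b.first) {
                Map<String, Double> targets =
                        adjacency.computeIfAbsent(from, k -> new HashMap<>());
                for (String to : b.second) {
                    targets.put(to, b.weight);
                }
            }
        }
        return adjacency;
    }
}
```

A stream of such bundles is also what the "flattened" use case would write out; a more complex structure such as an FP-tree would need extra metadata that this sketch does not attempt to model.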
-
Re: Towards 1.0 - Defining backwards compatibility guarantees
Lance Norskog 2011-10-31, 02:08

6) Quick access to the online algorithms

The servlet implementation in Taste is simple. It should be possible to package a lot of the online algorithms in one big servlet. Call it "Mahout Online"?

One problem here is that uploading and downloading data for each operation is not practical. MO would be very useful if it had direct file system access to user data. Yes, this is a security problem :)

Lance
-
RE: Towards 1.0 - Defining backwards compatibility guarantees
Jeff Eastman 2011-10-31, 20:31

I think users would benefit a lot from 1) to 3) and would be dismayed if we could not maintain data consistency between releases (maybe just point releases?). This could require us to build and ship migration tools along with any releases which change these formats.

4) and 5) are related, and it is a question of which is more important if we can't do both. Since a lot of users are using the CLI I think backwards compatibility is pretty important there. This is especially the case for the MiA (Mahout in Action) examples. The book is really our user manual and many people will be turned off if gratuitous API changes make the book obsolete as a learning tool. Of course, the book has plenty of API usage examples which need to keep compatibility too.

Our 1.0 release will have a lot of solid implementations of scalable machine learning software, but not everything is at the same level of maturity. I think it is critical that we adopt a maturity scheme so that we can realistically make changes to evolving algorithms while making reasonable guarantees about stable code. Moving still-evolving implementations to a separate source tree would certainly make their status visible, but I wonder about the mechanics: do we need a parallel contrib universe (with math, core, integration, examples subtrees?) or would the annotations work better? I kind of favor the annotations as the former seems like too much dependency plumbing.

And, of course, defining the content of 1.0 is still something we need to do. That is a separate thread TBD.
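The annotation-based maturity scheme discussed above could look roughly like the following Java sketch. The annotation name `Maturity`, its three levels, and the example classes are hypothetical; the thread does not settle on any names, and treating unannotated code as experimental is just one possible default.

```java
import java.lang.annotation.Documented;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

// Hypothetical maturity marker; not an existing Mahout annotation.
@Documented
@Retention(RetentionPolicy.RUNTIME)
@Target({ElementType.TYPE, ElementType.METHOD})
@interface Maturity {
    Level value();

    enum Level { STABLE, EVOLVING, EXPERIMENTAL }
}

// Illustrative classes only: one covered by compatibility guarantees, one not.
@Maturity(Maturity.Level.STABLE)
class StableClusterer { }

@Maturity(Maturity.Level.EXPERIMENTAL)
class ExperimentalClassifier { }

final class MaturityCheck {
    // Assumption of this sketch: unannotated types count as EXPERIMENTAL.
    static Maturity.Level levelOf(Class<?> type) {
        Maturity marker = type.getAnnotation(Maturity.class);
        return marker == null ? Maturity.Level.EXPERIMENTAL : marker.value();
    }
}
```

Because the marker is retained at runtime and is `@Documented`, it shows up in Javadoc and can also be checked by a build-time or test-time tool, all within a single source tree.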
-
Re: Towards 1.0 - Defining backwards compatibility guarantees
Isabel Drost 2011-11-01, 08:04

On 31.10.2011 Jeff Eastman wrote:
> I think users would benefit a lot from 1) to 3) and would be dismayed if we could not maintain data consistency between releases (maybe just point releases?).

Good point that I forgot to define in the original mail: levels of back-compat should depend on which type of release is being built.

> 4) and 5) are related, and it is a question of which is more important if we can't do both.

I think for minor releases we should do both. However, it might be easier to do 4) if we could restrict it to a subset of the code only - meaning only code that is intended to be used by external code.

> Since a lot of users are using the CLI I think backwards compatibility is pretty important there. This is especially the case for the MiA examples. The book is really our user manual and many people will be turned off if gratuitous API changes make the book obsolete as a learning tool. Of course, the book has plenty of API usage examples which need to keep compatibility too.
>
> Our 1.0 release will have a lot of solid implementations of scalable machine learning software, but not everything is at the same level of maturity. I think it is critical that we adopt a maturity scheme so that we can realistically make changes to evolving algorithms while making reasonable guarantees about stable code. Moving still-evolving implementations to a separate source tree would certainly make their status visible, but I wonder about the mechanics: do we need a parallel contrib universe (with math, core, integration, examples subtrees?) or would the annotations work better? I kind of favor the annotations as the former seems like too much dependency plumbing.

Personally, I am currently quite undecided here - annotations have the advantage of keeping everything in one source tree and module. Keeping stuff in contrib modules could give us the chance of lowering the bar to committership substantially - the question is, does that really work out well or will it just cause overhead and trouble? Any experience from the Lucene world that we can build on here?

> And, of course, defining the content of 1.0 is still something we need to do. That is a separate thread TBD.

+1 for taking one step at a time.

Isabel
-
Re: Towards 1.0 - Defining backwards compatibility guarantees
Grant Ingersoll 2011-11-01, 12:09

FWIW, in Lucene, we do the following:

1. All minor versions within a major release can read a prior version's index within the same major release. That is, 3.4 can read a 3.3 index. However, 3.3 cannot read a 3.4 index. When a user reads a 3.3 index w/ 3.4, it is silently upgraded to 3.4. I think this versioning scheme should work well for us too when it comes to models. In the new 4.x line, we have a Codec system which will make it fairly easy for any version to read any other version.

2. For APIs, we typically mark things as @lucene.experimental if we think they may change within minor releases. We also mark things as deprecated that are going away. Deprecated items are then removed on the next major release. The upgrade path is usually to go to x.9, remove all deprecations and then go to x+1.0.

We also communicate to users via release notes when we purposefully break back compat.

For the most part this works and I would recommend we take similar steps. First steps would be to start versioning our models and perhaps our input formats. I suspect we could simply take the Lucene code for this (it's time stamp plus something else that I forget, I think)

-Grant

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
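The read rule described for Lucene in point 1 (a newer minor version reads an older one within the same major version, the reverse is rejected, and reading upgrades silently) can be illustrated for a model file with a small Java sketch. This is not Lucene's code and not an existing Mahout format: the class `VersionedModel`, the header layout of two ints followed by the weights, and the exception type are assumptions.

```java
import java.nio.ByteBuffer;

// Hypothetical versioned model: [major:int][minor:int][count:int][weights:double...].
final class VersionedModel {
    final int major;
    final int minor;
    final double[] weights;

    VersionedModel(int major, int minor, double[] weights) {
        this.major = major;
        this.minor = minor;
        this.weights = weights;
    }

    // Serializes the model with its format version as a header.
    byte[] write() {
        ByteBuffer buf = ByteBuffer.allocate(3 * Integer.BYTES + weights.length * Double.BYTES);
        buf.putInt(major).putInt(minor).putInt(weights.length);
        for (double w : weights) {
            buf.putDouble(w);
        }
        return buf.array();
    }

    // Reads data as a reader of version readerMajor.readerMinor.
    // Same major and an equal or older minor is accepted; anything else is rejected.
    static VersionedModel read(byte[] data, int readerMajor, int readerMinor) {
        ByteBuffer buf = ByteBuffer.wrap(data);
        int major = buf.getInt();
        int minor = buf.getInt();
        if (major != readerMajor || minor > readerMinor) {
            throw new IllegalArgumentException("Cannot read model format " + major + "." + minor
                    + " with reader " + readerMajor + "." + readerMinor);
        }
        double[] weights = new double[buf.getInt()];
        for (int i = 0; i < weights.length; i++) {
            weights[i] = buf.getDouble();
        }
        // "Silent upgrade": the in-memory model carries the reader's version,
        // so the next write() emits the newer format.
        return new VersionedModel(readerMajor, readerMinor, weights);
    }
}
```

Whether a trained model can always be upgraded in place like this is an open question in the thread; the sketch only covers the case where the payload layout is unchanged between minor versions.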
-
Re: Towards 1.0 - Defining backwards compatibility guarantees
Grant Ingersoll 2011-11-01, 12:58

On Nov 1, 2011, at 8:09 AM, Grant Ingersoll wrote:
> FWIW, in Lucene, we do the following:
>
> 1. All minor versions within a major release can read a prior version's index within the same major release. That is, 3.4 can read a 3.3 index. However, 3.3 cannot read a 3.4 index. When a user reads a 3.3 index w/ 3.4, it is silently upgraded to 3.4. I think this versioning scheme should work well for us too when it comes to models. In the new 4.x line, we have a Codec system which will make it fairly easy for any version to read any other version.

This assumes, of course, that a model is upgradeable in format; I haven't thought about whether that applies to us or not.

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
-
Re: Towards 1.0 - Defining backwards compatibility guarantees
Ted Dunning 2011-11-01, 16:15

I think the trend is away from an explicit version in serialized data and toward systems like protobufs or avro which allow much more flexibility.

On Nov 1, 2011, at 5:09, Grant Ingersoll wrote:
> For the most part this works and I would recommend we take similar steps. First steps would be to start versioning our models and perhaps our input formats. I suspect we could simply take the Lucene code for this (it's time stamp plus something else that I forget, I think)
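The flexibility mentioned above comes from readers that tolerate fields they do not know, rather than from a version number. The Java sketch below imitates that idea with a hand-rolled tagged record; it is deliberately not the protobuf or Avro wire format, and the class `TaggedRecord`, the tag constants, and the layout of tag, length and payload are assumptions for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical tagged record: repeated [tag:int][length:int][UTF-8 payload].
final class TaggedRecord {
    static final int TAG_NAME = 1;     // known to old and new readers
    static final int TAG_COMMENT = 2;  // imagined as added in a later release

    static byte[] write(Map<Integer, String> fields) {
        Map<Integer, byte[]> encoded = new LinkedHashMap<>();
        int size = 0;
        for (Map.Entry<Integer, String> e : fields.entrySet()) {
            byte[] payload = e.getValue().getBytes(StandardCharsets.UTF_8);
            encoded.put(e.getKey(), payload);
            size += 2 * Integer.BYTES + payload.length;
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        for (Map.Entry<Integer, byte[]> e : encoded.entrySet()) {
            buf.putInt(e.getKey()).putInt(e.getValue().length).put(e.getValue());
        }
        return buf.array();
    }

    // Reads only the tags this reader knows; unknown fields are skipped by length.
    static Map<Integer, String> read(byte[] data, Set<Integer> knownTags) {
        Map<Integer, String> fields = new LinkedHashMap<>();
        ByteBuffer buf = ByteBuffer.wrap(data);
        while (buf.hasRemaining()) {
            int tag = buf.getInt();
            int length = buf.getInt();
            if (knownTags.contains(tag)) {
                byte[] payload = new byte[length];
                buf.get(payload);
                fields.put(tag, new String(payload, StandardCharsets.UTF_8));
            } else {
                buf.position(buf.position() + length); // skip a field from a newer writer
            }
        }
        return fields;
    }
}
```

An older reader that only knows `TAG_NAME` can still read data produced by a newer writer, which is the property a single explicit version number does not give by itself.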
-
Re: Towards 1.0 - Defining backwards compatibility guarantees
Grant Ingersoll 2011-11-01, 16:52

On Nov 1, 2011, at 12:15 PM, Ted Dunning wrote:
> I think the trend is away from an explicit version in serialized data and toward systems like protobufs or avro which allow much more flexibility.

+1

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
-
Re: Towards 1.0 - Defining backwards compatibility guarantees
Isabel Drost 2011-11-01, 17:17

On 01.11.2011 Grant Ingersoll wrote:
> On Nov 1, 2011, at 12:15 PM, Ted Dunning wrote:
> > I think the trend is away from an explicit version in serialized data and toward systems like protobufs or avro which allow much more flexibility.
>
> +1

+1
-
Re: Towards 1.0 - Defining backwards compatibility guarantees
Isabel Drost 2011-11-01, 17:18

On 01.11.2011 Grant Ingersoll wrote:
> FWIW, in Lucene, we do the following:
>
> 1. All minor versions within a major release can read a prior version's index within the same major release. That is, 3.4 can read a 3.3 index. However, 3.3 cannot read a 3.4 index. When a user reads a 3.3 index w/ 3.4, it is silently upgraded to 3.4. I think this versioning scheme should work well for us too when it comes to models. In the new 4.x line, we have a Codec system which will make it fairly easy for any version to read any other version.
>
> 2. For APIs, we typically mark things as @lucene.experimental if we think they may change within minor releases. We also mark things as deprecated that are going away. Deprecated items are then removed on the next major release. The upgrade path is usually to go to x.9, remove all deprecations and then go to x+1.0.
>
> We also communicate to users via release notes when we purposefully break back compat.

Sounds all good to me. +1

Isabel
-
Re: Towards 1.0 - Defining backwards compatibility guarantees
Isabel Drost 2011-11-04, 03:10

On 31.10.2011 Lance Norskog wrote:
> 6) Quick access to the online algorithms
>
> The servlet implementation in taste is simple. It should be possible to package a lot of the online algorithms in one big servlet. Call it "Mahout Online"?

Just a thought: Is that something that a) Mahout users expect from our library, and b) anyone interested in contributing to Mahout is interested in working on? If we have to answer "no" to either of these questions, maybe we should consider postponing it or leaving it to projects entirely independent of core Mahout?

Isabel
-
Re: Towards 1.0 - Defining backwards compatibility guarantees
Isabel Drost 2011-11-07, 21:27

On 01.11.2011 Grant Ingersoll wrote:
> FWIW, in Lucene, we do the following:
>
> [...]

For documentation purposes I added a summary of the discussion to our wiki (including a disclaimer that we are still working on the draft):

https://cwiki.apache.org/confluence/display/MAHOUT/Downloads

(Section "Backwards compatibility of releases")

Feel free to refine, shorten or extend,
Isabel