Jeff Eastman
2012-02-11, 19:01
Frank Scholten
2012-02-11, 21:29
Lance Norskog
2012-02-12, 04:45
Jeff Eastman
2012-02-12, 08:02
Jeff Eastman
2012-02-12, 08:20
John Conwell
2012-02-13, 17:31
Ted Dunning
2012-02-13, 18:51
Jeff Eastman
2012-02-14, 18:46
Sean Owen
2012-02-14, 19:25
Jeff Eastman
2012-02-14, 20:29
Dmitriy Lyubimov
2012-02-14, 20:56
Lance Norskog
2012-02-15, 03:37
Ioan Eugen Stan
2012-02-23, 14:03
Ted Dunning
2012-02-23, 15:53
Ioan Eugen Stan
2012-02-23, 20:13
Grant Ingersoll
2012-02-24, 18:42
-
Goals for Mahout 0.7 Jeff Eastman 2012-02-11, 19:01
Now that 0.6 is in the box, it seems a good time to start thinking about 0.7, from a high level goal perspective at least. Here are a couple that come to mind:

* Target code freeze date August 1, 2012
* Get Jenkins working for us again
* Complete clustering refactoring and classification convergence
* ...
-
Re: Goals for Mahout 0.7 Frank Scholten 2012-02-11, 21:29
I'd like to add solving ClassNotFoundException problems with third party jars in some jobs.

I experimented with having seq2sparse upload a third party jar containing an analyzer and add it to the DistributedCache. Uploading works, but I didn't yet get it working inside the Mappers. I have some code lying around for this that can be used as a starting point, including a separate project that has dependencies on Mahout and on an analyzer to test things out.

Another thing would be adding or improving the integration tools. For example, adding a mysql2seq to cluster text from a SQL database.

On Sat, Feb 11, 2012 at 8:01 PM, Jeff Eastman <[EMAIL PROTECTED]> wrote:
> Complete clustering refactoring and classification convergence

What kind of clustering refactoring do you mean here? I did some work on creating bean configurations in the past (MAHOUT-612). I underestimated the amount of work required to do the entire refactoring. If this can be contributed and committed on a per-job basis I would like to help out.
-
Re: Goals for Mahout 0.7 Lance Norskog 2012-02-12, 04:45
For incremental improvements: usability and correctness of algorithms. The "new" Naive Bayes and SGD algorithms both seem to have trouble classifying.

Also, interpretation of results. It is hard to summarize the quality of results. I often feel like the math-savvy implementors print a bunch of numbers and say "that looks right", and the rest of us struggle to get an intuition of what's going on and why.

For new features, "Mahout Online" would be great: a web service that packages all of the "online" algorithms (tractable speed and memory use).

--
Lance Norskog
[EMAIL PROTECTED]
-
Re: Goals for Mahout 0.7 Jeff Eastman 2012-02-12, 08:02
+ users@

These are great ideas, and are just the kinds of high level conversations I was hoping to engender. From my agile background, I'd hope to define 0.7 by a small number of "epic stories", in a subset of our overall capabilities, which could focus our attention on a set of derivative JIRAs that will give Mahout a quantum step forward in some functional area from our users' perspective.

I think maybe 2-3 such "epics" are all we can handle in a release. I don't necessarily think mine are the right ones either, but they are there to prime the pump.

If we could only do 2-3 epics, what would they be? Where would the biggest contributions lie?
-
Re: Goals for Mahout 0.7 Jeff Eastman 2012-02-12, 08:20
On 2/11/12 2:29 PM, Frank Scholten wrote:
> What kind of clustering refactoring do you mean here?

We have a couple of JIRAs that relate here:

* MAHOUT-930: factor all the (-cl) classification steps out of all of the driver classes and into a separate job, to remove duplicated code.
* MAHOUT-931: add a pluggable outlier removal capability to this job.
* MAHOUT-933: factor all the iteration mechanics from each driver class into the ClusterIterator, which uses a ClusterClassifier which is itself an OnlineLearner.

This will hopefully allow semi-supervised classifier applications to be constructed by feeding cluster-derived models into the classification process. Still kind of fuzzy at this point but promising too.
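
One way to picture the "factor the iteration mechanics out of each driver" idea is the plain-Java sketch below. It is only an illustration under stated assumptions: `IterationPolicy`, `NearestMeanPolicy`, and `SketchIterator` are hypothetical names invented here, the data is one-dimensional, and nothing in it reflects the actual signatures of Mahout's ClusterIterator or ClusterClassifier.

```java
import java.util.Arrays;

// Hypothetical policy interface: the algorithm-specific parts of one pass.
interface IterationPolicy {
    // Index of the center that should receive this point.
    int assign(double[] centers, double point);

    // New center value computed from the points assigned to it.
    double update(double oldCenter, double sum, int count);
}

// A k-means flavoured policy: nearest center wins, centers move to the mean.
final class NearestMeanPolicy implements IterationPolicy {
    @Override
    public int assign(double[] centers, double point) {
        int best = 0;
        for (int i = 1; i < centers.length; i++) {
            if (Math.abs(point - centers[i]) < Math.abs(point - centers[best])) {
                best = i;
            }
        }
        return best;
    }

    @Override
    public double update(double oldCenter, double sum, int count) {
        // An empty cluster keeps its previous center.
        return count == 0 ? oldCenter : sum / count;
    }
}

// The shared loop: written once, reused by any policy.
final class SketchIterator {
    private SketchIterator() {}

    static double[] iterate(double[] points, double[] initialCenters,
                            IterationPolicy policy, int maxIterations) {
        double[] centers = initialCenters.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sums = new double[centers.length];
            int[] counts = new int[centers.length];
            for (double p : points) {
                int c = policy.assign(centers, p);
                sums[c] += p;
                counts[c]++;
            }
            double[] next = new double[centers.length];
            for (int i = 0; i < centers.length; i++) {
                next[i] = policy.update(centers[i], sums[i], counts[i]);
            }
            if (Arrays.equals(next, centers)) {
                break; // nothing moved, so the loop has converged
            }
            centers = next;
        }
        return centers;
    }

    public static void main(String[] args) {
        double[] points = {1, 2, 3, 10, 11, 12};
        double[] centers = iterate(points, new double[] {0, 5}, new NearestMeanPolicy(), 10);
        System.out.println(Arrays.toString(centers)); // prints [2.0, 11.0]
    }
}
```

The point of the split is that a second algorithm would only supply another `IterationPolicy`; the loop, the convergence check, and the iteration limit stay in one place.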
-
Re: Goals for Mahout 0.7 John Conwell 2012-02-13, 17:31
From my perspective, I'd really like to see the Mahout API migrate away from the command-line-centric design it currently uses, and move towards a library-centric API design. I think this would go a long way in getting Mahout adopted into real life commercial applications.

While there might be a few algorithm drivers that you interact with by creating an instance of a class and calling some method(s) on the instance (I haven't actually seen one like that, but there might be a few), many algorithms are invoked by calling some static function on a class that takes ~37 typed arguments. But what's worse, many drivers are invoked by having to create a String array with ~37 arguments as string values, and calling the static main function on the class.

Now I'm not saying that having a static main function available to invoke an algorithm from the command line isn't useful. It is, when you're testing an algorithm. But once you want to integrate the algorithm into a commercial workflow it kind of sucks.

For example, imagine if the API for invoking Math.max was designed the way many of the Mahout algorithms currently are. You'd have something like this:

    String[] args = new String[3];
    args[0] = "max";
    args[1] = "7";
    args[2] = "4";
    int max = Math.main(args);

It makes your code a horrible mess and very hard to maintain, as well as very prone to bugs.

When I see a bunch of static main functions as the only way to interact with a library, no matter what the quality of the library is, my initial impression is that this has to be some minimally supported effort by a few PhD candidates still in academia, who will drop the project as soon as they graduate. And while this might not be the case, it is one of the first impressions it gives, and can lead a company to drop the library from consideration before they do any due diligence into its quality and utility.

I think as Mahout matures and gets closer to a 1.0 release, this kind of API re-design will become more and more necessary, especially if you want a higher Mahout integration rate into commercial applications and workflows.

Also, I hope I don't sound too negative. I'm very impressed with Mahout and its capabilities. I really like that there is a well thought out class library of primitives for designing new serial and distributed machine learning algorithms. And I think it has a high utility for integration into highly visible commercial projects. But its high level public API really is a barrier to entry when trying to design commercial applications.

--
Thanks,
John C
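
The contrast between the two invocation styles can be sketched in plain Java. This is a toy illustration, not Mahout code: `MaxJob` and `MaxDriver` are hypothetical names made up for the example, and the only assumption is that a command-line entry point can be kept as a thin parser in front of a typed class.

```java
import java.util.Arrays;

// Hypothetical library-style entry point: typed parameters, checked by the compiler.
final class MaxJob {
    private final int a;
    private final int b;

    MaxJob(int a, int b) {
        this.a = a;
        this.b = b;
    }

    int run() {
        return Math.max(a, b);
    }
}

// Hypothetical command-line wrapper: parses strings, then delegates to the typed class.
final class MaxDriver {
    private MaxDriver() {}

    static int run(String[] args) {
        if (args.length != 3 || !"max".equals(args[0])) {
            throw new IllegalArgumentException(
                "usage: max <a> <b>, got " + Arrays.toString(args));
        }
        return new MaxJob(Integer.parseInt(args[1]), Integer.parseInt(args[2])).run();
    }

    public static void main(String[] args) {
        // String-array style: a wrong index or a typo only shows up at run time.
        System.out.println(run(new String[] {"max", "7", "4"}));
        // Typed style: the same mistake would not compile.
        System.out.println(new MaxJob(7, 4).run());
    }
}
```

With this layering the command line stays available for testing, while application code calls `MaxJob` directly and never builds a `String[]`.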
-
Re: Goals for Mahout 0.7 Ted Dunning 2012-02-13, 18:51
John,

This is well said and is a critical need.

There are some beginnings to this. The recommender side of the house already works the way you say. The classifier and hashed encoding APIs are beginning to work that way. The naive Bayes classifiers pretty much do not and the classifier APIs are just beginning to have an API-centric form.
-
Re: Goals for Mahout 0.7 Jeff Eastman 2012-02-14, 18:46
+1 I think this is an excellent goal. The current code base does not approach its Java APIs in a uniform manner, nor are we where we had hoped to be on CLI API uniformity. There's a lot to do here in both areas.

In the Java API area, we do have some notable successes, with the recommender APIs truly being designed for this kind of invocation. In the clustering drivers, we have tried to support native Java access as well, though there are a lot of arguments required for most invocations. Other drivers have really only been written for CLI access, as you note, and a large amount of rather simple refactoring would be required to present a usable Java API.

The challenge here is that the Java API must account for all of the optional CLI arguments of every algorithm. This leads either to ~37 typed arguments (hyperbole) or to a set of helper methods which provide useful defaults for common situations. Another approach is to implement configuration beans which contain all the argument values required for full specification. In the current clustering refactoring under way to utilize the ClusterClassifier, arguments are to be provided in ClusteringPolicy objects, so I'm biased towards the latter approach.

We ought to agree on which style we want to use to take this goal forward, but I am 100% behind it.

Jeff
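
A configuration bean with built-in defaults could look roughly like the sketch below. It is an assumption-laden illustration only: `ClusteringConfig`, its fields, and the default values are hypothetical and are not taken from Mahout's ClusteringPolicy or any existing driver.

```java
// Hypothetical configuration bean; field names and defaults are illustrative only.
final class ClusteringConfig {
    private final String inputPath;
    private final int numClusters;
    private final int maxIterations;
    private final double convergenceDelta;

    private ClusteringConfig(Builder b) {
        this.inputPath = b.inputPath;
        this.numClusters = b.numClusters;
        this.maxIterations = b.maxIterations;
        this.convergenceDelta = b.convergenceDelta;
    }

    // Only the required argument is positional; everything else has a default.
    static Builder forInput(String inputPath) {
        return new Builder(inputPath);
    }

    String inputPath() { return inputPath; }
    int numClusters() { return numClusters; }
    int maxIterations() { return maxIterations; }
    double convergenceDelta() { return convergenceDelta; }

    static final class Builder {
        private final String inputPath;
        private int numClusters = 10;            // assumed default
        private int maxIterations = 20;          // assumed default
        private double convergenceDelta = 0.001; // assumed default

        private Builder(String inputPath) {
            if (inputPath == null || inputPath.isEmpty()) {
                throw new IllegalArgumentException("inputPath is required");
            }
            this.inputPath = inputPath;
        }

        Builder numClusters(int k) {
            if (k < 1) {
                throw new IllegalArgumentException("numClusters must be >= 1");
            }
            this.numClusters = k;
            return this;
        }

        Builder maxIterations(int n) {
            if (n < 1) {
                throw new IllegalArgumentException("maxIterations must be >= 1");
            }
            this.maxIterations = n;
            return this;
        }

        Builder convergenceDelta(double delta) {
            if (delta <= 0) {
                throw new IllegalArgumentException("convergenceDelta must be > 0");
            }
            this.convergenceDelta = delta;
            return this;
        }

        ClusteringConfig build() {
            return new ClusteringConfig(this);
        }
    }

    public static void main(String[] args) {
        ClusteringConfig config = ClusteringConfig.forInput("data/vectors").numClusters(5).build();
        System.out.println(config.numClusters() + " clusters, " + config.maxIterations() + " iterations");
    }
}
```

A caller names only the options it cares about, which covers both of the alternatives discussed: the defaults play the role of the helper methods, and the built object is the bean handed to the algorithm.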
-
Re: Goals for Mahout 0.7 Sean Owen 2012-02-14, 19:25
When 0.6 was released, there was an all-time record of open JIRAs -- something like 90-100 (I closed maybe 10 quickly.) It's just math: there is a certain level of interest and rate of new requests and issues. There is some level of committer time and energy available to work on them. The former is just getting larger and the latter is shrinking. Neither of these things is the problem per se, and neither is something to be fixed; you can't ask people to not have ideas or issues, and you can't tell people they should be contributing more here.

But I do think it means that it's more urgent than ever to have some strategy to tackle the JIRA, rather than talk about more green-field plans. This has been discussed before, and there were ideas like new JIRA tags, but I don't think it's been more than some labeling of the problem. There haven't been new committers, and JIRA rot is discouraging new ones, which makes it worse.

JIRA is really a symptom; there is just a lot of sprawl and cruft to the project that's not being talked about or addressed.

I can't say don't write down any new plans in JIRA. I can only point out what's happened many times: big ideas go half implemented if at all. Writing them down isn't really useful work. Meanwhile, I can see ten JIRAs from new contributors that have been ignored, and many new bug reports are avoidable, just symptoms of scattered un-unified code that was never refined. It won't be different if this cycle is repeated. It's not going to kill this project but it's not going to get out of AAA to the Major Leagues at this rate, and that is frustrating.

Fortunately, I think this remains pretty solvable. More work on existing issues sure helps, but nobody can count on that. It's then a question of scope: narrowing scope to something maintainable, making that scope clear, turning down JIRAs that don't fit, focusing attention on actionable JIRAs that do. Yes, you have to be able to not-do things in a project as well as do things, even in open source.

I think that scope is still large at "maintaining what exists already, and fixing it up". Since I think this is the only realistic approach to a next version, in this conversation I could not support any approach that pretends to do five more things in the next version -- at least not unless accompanied by some plan to address the contributions already in line in JIRA. It's not OK to be implicitly rejecting so much from the community by not planning to fix that first and foremost.
-
Re: Goals for Mahout 0.7 Jeff Eastman 2012-02-14, 20:29
+users@

Just to be clear, I'm not advocating replacing the JIRA process with a new set of green-field goals. Rather, IMHO, having a small number of overarching goals for a release *could* help us focus our efforts (triage our feature JIRAs) and *might* suggest some missing JIRAs that would give that release more completeness, usability and "sizzle" in those few areas. Hopefully more completeness and usability and sizzle than we might otherwise obtain using a scattered, bottom-up approach.

It's the sort of release planning and priority setting I've observed product managers doing in my many past lives. Of course, fixing defects has a higher priority than adding new features, but giving each release some focus and coherence is a mark of a mature product program. An 80% solution in three areas is not as good as a 100% solution in one. At HP, we used to say "Do a few things well". We've been saying "Well, let's do a few more things" too long.
-
Re: Goals for Mahout 0.7 Dmitriy Lyubimov 2012-02-14, 20:56
My company and I have allocated some time to create a mixed environment of R and other "stuff", and, in particular, Mahout. I am thinking of a "contributed" project with R where R is enabled to take on the following roles:

#1 Mahout's front end driver, mixing Mahout computations and R vectors/matrices;
#2 data vectorization/preparation routines loaded into the backend of Mahout's abstract job and adapted to write DRM;
#3 perhaps some routines allowing subsampling and subsequent visualization of Mahout results for prototyping and control purposes.

#2 comes close to what the R-Hadoop project does with their mapreduce package, but unfortunately it looks like that project focuses on a particular way of serializing R objects, and adaptation for DRM serialization doesn't seem plausible at this time. Besides, I am thinking that it's not so difficult to run R from inside a mapper (R-Hadoop uses streaming, but I think it's worth trying the R inverse java package instead of streaming, bypassing the whole text/parse routine completely).

Rapid prototyping and visualization of results is, I think, one of the bigger barriers to Mahout adoption. Enabling a mixed environment for CPU-laden computations in R is a huge leap towards prototyping big data pipelines, IMO. Or at least it seems so from the vantage point of the problems I am currently dealing with. Rapid prototyping of Mahout pipelines may be a huge help, especially as new methods become available to try and validate.

-d
-
Re: Goals for Mahout 0.7 Lance Norskog 2012-02-15, 03:37
Yes! Connecting R and Mahout within the same JVM is an awesome idea. Approaching Mahout as a non-mathematician user is frustrating because of the difficulty in visualizing and tuning results. I've done some hacky things with KNIME and Excel, but the ability to do math-heavy post-processing and visualization directly would be excellent.

--
Lance Norskog
[EMAIL PROTECTED]
-
Re: Goals for Mahout 0.7 Ioan Eugen Stan 2012-02-23, 14:03
> String[] args = new String[3];
> args[0] = "max";
> args[1] = "7";
> args[2] = "4";
> int max = Math.main(args);

A more elegant solution is:

    List<String> argList = new LinkedList<String>();
    argList.add("-t");
    argList.add(INPUT_TABLE);
    argList.add("-m");
    argList.add(MAIL_ACCOUNT_ID);

    String[] args = argList.toArray(new String[argList.size()]);

Cheers,

--
Ioan Eugen Stan
http://ieugen.blogspot.com
-
Re: Goals for Mahout 0.7 Ted Dunning 2012-02-23, 15:53
Is this a joke?

    new String[] {"-t", INPUT_TABLE, "-m", MAIL_ACCOUNT_ID}

seems better than farting around with lists.
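
For a fixed set of arguments the two forms produce the same array, which the small sketch below checks. The constant values are placeholders invented for the example (the thread never gives them), and `ArgArrayDemo` is a hypothetical class name.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ArgArrayDemo {
    // Placeholder values standing in for the constants named in the thread.
    static final String INPUT_TABLE = "mail_table";
    static final String MAIL_ACCOUNT_ID = "account-42";

    private ArgArrayDemo() {}

    // List-based construction: useful mainly when arguments are added conditionally.
    static String[] viaList() {
        List<String> argList = new ArrayList<>();
        argList.add("-t");
        argList.add(INPUT_TABLE);
        argList.add("-m");
        argList.add(MAIL_ACCOUNT_ID);
        return argList.toArray(new String[0]);
    }

    // Array initializer: one expression when the arguments are known up front.
    static String[] viaLiteral() {
        return new String[] {"-t", INPUT_TABLE, "-m", MAIL_ACCOUNT_ID};
    }

    public static void main(String[] args) {
        System.out.println(Arrays.equals(viaList(), viaLiteral())); // prints true
    }
}
```

Either way the caller is still assembling strings by hand, so both forms remain subject to the objection raised earlier in the thread about string-array invocation.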
-
Re: Goals for Mahout 0.7 Ioan Eugen Stan 2012-02-23, 20:13
2012/2/23 Ted Dunning <[EMAIL PROTECTED]>:
> Is this a joke?
>
> new String[] {"-t", INPUT_TABLE, "-m", MAIL_ACCOUNT_ID}
>
> seems better than farting around with lists.

True, thank you.

--
Ioan Eugen Stan
http://ieugen.blogspot.com/
-
Re: Goals for Mahout 0.7 Grant Ingersoll 2012-02-24, 18:42
One of our top goals, in my mind, has to be speeding up our tests! I only wish I knew how, given that basic attempts at parallelism and Maven have failed miserably.

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
There haven't been new committers, and JIRA rot is >> discouraging new ones, which makes it worse. >> >> JIRA is really a symptom; there is just a lot of sprawl and cruft to >> the project that's not being talked about or addressed. >> >> I can't say don't write down any new plans in JIRA. I can only point >> out what's happened many times: big ideas go half implemented if at >> all. Writing them down isn't really useful work. Meanwhile, I can see >> ten JIRAs from new contributors that have been ignored, and, many new >> bug reports are avoidable, jsut symptoms of scattered un-unified code >> that was never refined. It won't be different if this cycle is >> repeated. It's not going to kill this project but it's not going to >> get out of AAA to the Major Leagues at this rate, and that is >> frustrating. >> >> Fortunately, I think this remains pretty solvable. More work on >> existing issues sure helps, but nobody can count on that. It's then a >> question of scope: narrowing scope to something maintainable, making >> that scope clear, turning down JIRAs that don't fit, focusing >> attention on actionable JIRAs that do. Yes, you have to be able to >> not-do things in a project as well as do things, even in open source. >> >> I think that scope is still large at "maintaining what exists already, >> and fixing it up". Since I think this is the only realistic approach >> to a next version, in this conversation I could not support anything >> approach that pretends to do five more things in the next version -- >> at least not unless accompanied by some plan to address the >> contributions already in line in JIRA. It's not OK to be implicitly >> rejecting so much from the community by not planning to fix that first >> and foremost. >> >> > -------------------------------------------- Grant Ingersoll http://www.lucidimagination.com |