|
Isabel Drost
2012-07-04, 18:25
Dawid Weiss
2012-07-04, 18:43
Sean Owen
2012-07-04, 20:19
Dawid Weiss
2012-07-04, 20:24
Sean Owen
2012-07-04, 20:29
Dawid Weiss
2012-07-04, 20:50
Ted Dunning
2012-07-05, 06:22
Frank Scholten
2012-07-06, 07:28
Isabel Drost
2012-07-06, 13:26
Isabel Drost
2012-07-06, 13:30
Isabel Drost
2012-07-06, 13:33
Frank Scholten
2012-07-07, 12:56
|
-
Swedish refactoringIsabel Drost 2012-07-04, 18:25
Hi, recently I spend a week in Sweden - getting there involved spending 6h on a ferry. I spend that time on some (potentially crazy) refactoring idaes I wanted to try out with Mahout for quite some time already. This is to share the results with you and maybe get some feedback. When looking at the final result, keep in mind that this is the result of a spike - so it's more a proof of concept than anything clean and polished. The goals I wanted to accomplish were two-fold: a) Running the Mahout tests takes ages at least on my machine (I admit that that one is pretty dated, however even running them on a faster laptop with ssd and three times more memory did not help). b) When telling the story of Mahout I am faced with comments along the lines of "I don't want to install Hadoop to run it" over and over again. It seems non- obvious that Mahout actually aims to be a stable, scalable Java library that comes with the added benefit of having quite a few of it's algorithms implemented on Hadoop. The third goal was pretty selfish, namely to get some experience with the random testing framework used over in Lucene-land. Here's what I did: - introduce the RandomTestRunner - fix all issues that bubbled up as a result (I ran into quite a few problems due to threads not being shut down either as part of the test (not too critical) or even during regular computation - happy to isolate what I ran into here and file separate issues (including fixes obviously) for each of these. - mark all long running tests with the nightly annotation - my goal here is not to switch them off forever but rather draw contributors' attention to those running particularly long (>20s) and fix them - convert our current core module into a parent, move any code in that into a submodule called stuff - move anything out of stuff and into module write that concerns serialization and is reasonably algorithm independent - move anything out of stuff and into module hadoop that really needs mapreduce to run - move anything out of stuff and into cli that offers just a command line interface to implementations (I might have missed some jobs here that still contain logic in addition to the command line stuff, all I did was to go through and fix failing tests, for several jobs I factored the parameters into separate beans to deal with default values, I suppose some of Frank's work could come in handy when doing that right.) - factor some of the unit testing utils into their own modules (those two could be collapsed actually) to avoid depending on running the tests just for compiling all the source code. Here's where you can look at the results: https://github.com/MaineC/mahout/tree/swedish-refactoring Unfortunately there's one huge commit on June 26th - I was so naive to believe in "let me just do a tiny refactoring and commit the successfully building code back" kind of assumption... If there's any interest in any of the results above, I'd be happy to go through the refactoring work in a non-feasibility-study mode and change thing piece by piece. There are still several open questions - like enable running tests in parallel (the changes I made to get that in the parent pom are right now disabled as I ran into some errors due to tests failing when running in parallel to other tests), our tests could benefit a lot from using more of the random testing functionality. Naming is more than sub-optimal. In the end I'd love to have a "all mahout included" jar in addition to the individual modules, etc. Feedback welcome - also feel free to ignore in case that stuff is just too far from the current roadmap, Isabel PS: If you read that - Alan, thanks for getting us to drive to the coast instead of into the mountains close to the arctic circle - that recommendation was awesome!
-
Re: Swedish refactoringDawid Weiss 2012-07-04, 18:43
> - introduce the RandomTestRunner
Awesome :) > > - fix all issues that bubbled up as a result (I ran into quite a few problems > due to threads not being shut down either as part of the test (not too critical) > or even during regular computation - happy to isolate what I ran into here and > file separate issues (including fixes obviously) for each of these. You can always mark the superclass with: @ThreadLeaks(failTestIfLeaking = false) For the most part thread leaks should be handled gracefully but there are situations in which recovery from a thread leak is hard. I'm currently fighting the subject in the Lucene land and I expect a few refactorings to take place -- hope it's not going to be problematic for you. If you have any questions or problems -- fire at me directly (don't want to miss something on the list). > - mark all long running tests with the nightly annotation - my goal here is not > to switch them off forever but rather draw contributors' attention to those > running particularly long (>20s) and fix them You can also design your own thread group and disable it by default. Look at: LuceneTestCase.@Slow annotation -- I just recently marked a lot of tests as slow. They can be bulk enabled or disabled with a system property. Dawid
-
Re: Swedish refactoringSean Owen 2012-07-04, 20:19
On Wed, Jul 4, 2012 at 9:25 PM, Isabel Drost <[EMAIL PROTECTED]> wrote:
> b) When telling the story of Mahout I am faced with comments along the lines of > "I don't want to install Hadoop to run it" over and over again. It seems non- > obvious that Mahout actually aims to be a stable, scalable Java library that > comes with the added benefit of having quite a few of it's algorithms > implemented on Hadoop. Personally, I think that Mahout equals Hadoop-based. I don't think it's so useful to pretend otherwise, and do not see an issue with this identity. It does not have consistent implementations based on anything else. Implementations based on a very different framework sounds like a different project in itself. > - mark all long running tests with the nightly annotation - my goal here is not > to switch them off forever but rather draw contributors' attention to those > running particularly long (>20s) and fix them There's a scene in a movie I love called Kicking and Screaming (not the Will Ferrell one), where someone drops a glass on the kitchen floor. The housemates are quintessentially lazy, and so its left to the next day. But one takes action the next day -- he places a carefully lettered sign reading "Watch Out" over the broken glass and moves on. Annotations with this purpose remind me of this scene. These aren't going to get fixed; these half-hour tests have been in for a year with notes to fix. They're kinda broken tests; fix them by removing them. It's less good than fixing, but better than people not running tests since they take like an hour to finish. > - convert our current core module into a parent, move any code in that into a > submodule called stuff > > - move anything out of stuff and into module write that concerns serialization > and is reasonably algorithm independent > > - move anything out of stuff and into module hadoop that really needs mapreduce > to run > > - move anything out of stuff and into cli that offers just a command line > interface to implementations (I might have missed some jobs here that still > contain logic in addition to the command line stuff, all I did was to go through > and fix failing tests, for several jobs I factored the parameters into separate > beans to deal with default values, I suppose some of Frank's work could come in > handy when doing that right.) > > - factor some of the unit testing utils into their own modules (those two could > be collapsed actually) to avoid depending on running the tests just for > compiling all the source code. I think I'm against this -- it's complex, and i think it supposes that the code is meaningfully separable along these lines. I don't think it is. Everything is fairly tied in with Hadoop and its serialization. Weigh that against the complication of making users -- who are going to want all the Hadoop stuff -- now import 4-5 JARs / artifacts instead of 2-3, which already feels like 1-2 too many. There is a lot of messy stuff that needs to be refactored, and maybe these ideas encompass some of them, but I don't think any I can see are at the module level. It's dirtier and harder than that.
-
Re: Swedish refactoringDawid Weiss 2012-07-04, 20:24
> These aren't going to get fixed;
I think the idea is to allow long-running tests on build servers where they don't really annoy people so much and have a quick run for developers to check before they commit etc. Dawid
-
Re: Swedish refactoringSean Owen 2012-07-04, 20:29
Sure, does this annotation make that happen automatically? so it
doesn't run unless I set a certain flag, that the Hudson build sets? makes sense but I didn't know whether that was the idea, or just to flag these things. It's the just flagging things that I am not so keen on. They've been flagged for a long time in JIRA. On Wed, Jul 4, 2012 at 11:24 PM, Dawid Weiss <[EMAIL PROTECTED]> wrote: >> These aren't going to get fixed; > > I think the idea is to allow long-running tests on build servers where > they don't really annoy people so much and have a quick run for > developers to check before they commit etc. > > Dawid
-
Re: Swedish refactoringDawid Weiss 2012-07-04, 20:50
> Sure, does this annotation make that happen automatically? so it
> doesn't run unless I set a certain flag, that the Hudson build sets? Yes. Test groups are controlled by system properties (you pick the names). The only one predefined is @Nightly which is enabled with "-Dtests.nightly=on". See the definition here and you'll see how to build up your own: https://github.com/carrotsearch/randomizedtesting/blob/master/randomized-runner/src/main/java/com/carrotsearch/randomizedtesting/annotations/Nightly.java Lucene's test groups are nested static class inside LuceneTestCase. D.
-
Re: Swedish refactoringTed Dunning 2012-07-05, 06:22
Frankly, the Hadoop dependency doesn't really cause a problem here. I
disagree with Sean about the essential identity of Hadoop and Mahout, but having the Hadoop dependency doesn't cause me any anxiety when doing non-Hadoop stuff. On Wed, Jul 4, 2012 at 1:50 PM, Dawid Weiss <[EMAIL PROTECTED]>wrote: > > Sure, does this annotation make that happen automatically? so it > > doesn't run unless I set a certain flag, that the Hudson build sets? > > Yes. Test groups are controlled by system properties (you pick the > names). The only one predefined is @Nightly which is enabled with > "-Dtests.nightly=on". See the definition here and you'll see how to > build up your own: > > > https://github.com/carrotsearch/randomizedtesting/blob/master/randomized-runner/src/main/java/com/carrotsearch/randomizedtesting/annotations/Nightly.java > > Lucene's test groups are nested static class inside LuceneTestCase. > > D. >
-
Re: Swedish refactoringFrank Scholten 2012-07-06, 07:28
Yeah, I think that would be useful to have. An alternative is to use
Maven profiles and use conventions like *IntegrationTest or *MapReduceTest as part of the name of the test class. Users have the option of running only the unit test, MR tests or do a full build by activating a Maven profile. Another benefit of these naming conventions is it integrates nicely in an IDE because with autocompletion and so on you can quickly see what kind of tests are available. The nightly on Jenkins will run a full build. Would it make a difference in test speed if we used MRUnit for some of the tests? Just curious. Frank On Wed, Jul 4, 2012 at 10:24 PM, Dawid Weiss <[EMAIL PROTECTED]> wrote: >> These aren't going to get fixed; > > I think the idea is to allow long-running tests on build servers where > they don't really annoy people so much and have a quick run for > developers to check before they commit etc. > > Dawid >
-
Re: Swedish refactoringIsabel Drost 2012-07-06, 13:26
On Fri, Jul 6, 2012 at 9:28 AM, Frank Scholten <[EMAIL PROTECTED]> wrote:
> Yeah, I think that would be useful to have. An alternative is to use > Maven profiles and use conventions like *IntegrationTest or > *MapReduceTest as part of the name of the test class. Jepp - true. One thing though: I never managed to convince my IDE to not run tests named *ITest* when right-clicking on the package or folder containing the tests. (And yes, I have to admit that I also like some other features the random testing component provides - e.g. tracking leaking threads, executing tests out of order etc. and needed an excuse to play with them during my holiday.) > Would it make a difference in test speed if we used MRUnit for some of > the tests? Just curious. I haven't looked too deeply into what our tests do - the ones I did mark @Nightly mostly looked like integration tests that check the whole workflow instead of the individual units. Which does not mean that using MRUnit for some of our tests might not actually speed things up. Isabel
-
Re: Swedish refactoringIsabel Drost 2012-07-06, 13:30
On Wed, Jul 4, 2012 at 10:19 PM, Sean Owen <[EMAIL PROTECTED]> wrote:
> There's a scene in a movie I love called Kicking and Screaming (not > the Will Ferrell one), where someone drops a glass on the kitchen > floor. The housemates are quintessentially lazy, and so its left to > the next day. But one takes action the next day -- he places a > carefully lettered sign reading "Watch Out" over the broken glass and > moves on. > > Annotations with this purpose remind me of this scene. <irony>Isn't that what we are already doing by telling our users to run the build with tests set to skip?</irony> This is not to say that taking long-running stuff out of the build by an annotation is a whole lot better - committers won't be annoyed by long running tests anymore as a result and might be even less likely to fix them. That part of my branch just helped me to iterate more quickly by having a build that fails quickly in case of trivial mistakes (due to the faster tests failing) and having a few slow final iterations to fix the rest. Isabel
-
Re: Swedish refactoringIsabel Drost 2012-07-06, 13:33
On Wed, Jul 4, 2012 at 8:43 PM, Dawid Weiss
<[EMAIL PROTECTED]> wrote: >> - introduce the RandomTestRunner > > Awesome :) Yeah - I really liked what I saw in Poznan. >> - fix all issues that bubbled up as a result (I ran into quite a few problems >> due to threads not being shut down either as part of the test (not too critical) >> or even during regular computation - happy to isolate what I ran into here and >> file separate issues (including fixes obviously) for each of these. > > You can always mark the superclass with: > > @ThreadLeaks(failTestIfLeaking = false) Jepp, I know that one. However in my case I really wanted them to fail as there were a few issues with executor pools being left open etc. - so having that part did help. > For the most part thread leaks should be handled gracefully but there > are situations in which recovery from a thread leak is hard. I'm > currently fighting the subject in the Lucene land and I expect a few > refactorings to take place -- hope it's not going to be problematic > for you. If you have any questions or problems -- fire at me directly > (don't want to miss something on the list). Will certainly do! >> - mark all long running tests with the nightly annotation - my goal here is not >> to switch them off forever but rather draw contributors' attention to those >> running particularly long (>20s) and fix them > > You can also design your own thread group and disable it by default. Look at: > > LuceneTestCase.@Slow > > annotation -- I just recently marked a lot of tests as slow. They can > be bulk enabled or disabled with a system property. Awesome. Thanks for pointing that out. Isabel
-
Re: Swedish refactoringFrank Scholten 2012-07-07, 12:56
On Fri, Jul 6, 2012 at 3:26 PM, Isabel Drost <[EMAIL PROTECTED]> wrote:
> On Fri, Jul 6, 2012 at 9:28 AM, Frank Scholten <[EMAIL PROTECTED]> wrote: >> Yeah, I think that would be useful to have. An alternative is to use >> Maven profiles and use conventions like *IntegrationTest or >> *MapReduceTest as part of the name of the test class. > > Jepp - true. One thing though: I never managed to convince my IDE to > not run tests named *ITest* when right-clicking on the package or > folder containing the tests. This can be solved by having subpackages like 'integration' and 'mapreduce' in the test source tree. > (And yes, I have to admit that I also> like some other features the random testing component provides - e.g. > tracking leaking threads, executing tests out of order etc. and needed > an excuse to play with them during my holiday.) > >> Would it make a difference in test speed if we used MRUnit for some of >> the tests? Just curious. > > I haven't looked too deeply into what our tests do - the ones I did > mark @Nightly mostly looked like integration tests that check the > whole workflow instead of the individual units. Which does not mean > that using MRUnit for some of our tests might not actually speed > things up. I think there are also a few tests that use more iterations / data then necessary. > > > Isabel > |