Parallel tests in ANT, experiment volunteers welcome :)
Dawid Weiss 2011-12-30, 12:32
I've been quietly working on an ANT task that would run tests in isolated JVMs, similar to what Lucene build files do using macros and selectors. It's been fun and I was finally able to integrate a few other features I've always wanted (like detailed progress listeners), but it's another story.
If you have a multi-multi-core machine (or if you don't and want to provide some feedback) then please run the following script/commands:
# get the code from my github fork:
git clone [EMAIL PROTECTED]:dweiss/lucene-solr.git --depth 1 -b junit4
cd lucene-solr/lucene

# Get a baseline for core tests running on trunk/ant macros.
git checkout trunk
ant test-core -Dtestcase=compileonly
# This is a single run and it depends on the seed, but we'll consider it a baseline
# -> write down the execution time or remember it.
ant test-core -Dtests.seed=random

# Switch over to junit4 branch and recompile.
git checkout junit4
ant test-core -Dtestcase=compileonly

# An initial pass collects statistics; these can be stored with the project to bootstrap,
# but for now they're zero. Adjust the number of CPUs to your system:
# typically, you'll want the physical number of cores - 1 (reserved for the aggregator).
# tests.seed is set to an empty value because junit4 and ltc use a different format! So there
# will be some variability across test executions (and that's good because estimates will vary).
ant test-core -Dtests.seed= -Dtests.cpus=4
ant test-core -Dtests.seed= -Dtests.cpus=4
ant test-core -Dtests.seed= -Dtests.cpus=4

# with each run the estimates (shown up front) should be getting closer to the real execution
# time for each slave. They will not be exact because of randomness, but should be fairly close. For example
# I get:
# [junit4] Expected execution time on slave 0: 233.94s
# [junit4] Expected execution time on slave 3: 233.94s
# [junit4] Expected execution time on slave 1: 233.95s
# [junit4] Expected execution time on slave 2: 233.95s
#
# and the real times:
#
I would very much appreciate feedback on (including but not limited to):
1) If something is not working. The tests hung on my machine once, the slave JVM wasn't responsive, it didn't even dump a stack trace, didn't react to kill -QUIT, nothing.
2) Is test execution faster than the baseline? By how much? For multi-multi-cores, if you have time how does execution time correlate with tests.cpus (I assume memory bandwidth or disk will be the bottleneck at some point).
3) Did you enjoy the sweet hum of cpu fans? For zero-noise systems: you better crank up those pumps or put something cold on the cpu unit :)
Thanks, Dawid
---------------------------------------------------------------------
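The script comments above say that an initial pass collects per-suite timings and that later runs use them to estimate each slave's load. The thread does not say how suites are assigned to slaves, so the following is only a minimal sketch of one common approach (greedy, longest suite first, always to the least-loaded slave). It is not the junit4 task's actual algorithm, and the suite names and timings are made up for illustration.

```python
import heapq


def balance(suite_times, slaves):
    """Assign suites to slaves, longest first, always to the least-loaded slave."""
    # Heap of (expected_load, slave_index); the least-loaded slave pops first.
    heap = [(0.0, i) for i in range(slaves)]
    heapq.heapify(heap)
    assignment = {i: [] for i in range(slaves)}
    # Sort by descending time; ties are broken by suite name for determinism.
    ordered = sorted(suite_times.items(), key=lambda kv: (-kv[1], kv[0]))
    for suite, seconds in ordered:
        load, idx = heapq.heappop(heap)
        assignment[idx].append(suite)
        heapq.heappush(heap, (load + seconds, idx))
    expected = {idx: load for load, idx in heap}
    return assignment, expected


# Hypothetical timings (seconds) recorded by a previous run.
timings = {
    "TestA": 40.0, "TestB": 30.0, "TestC": 20.0,
    "TestD": 10.0, "TestE": 10.0, "TestF": 10.0,
}
assignment, expected = balance(timings, 2)
for idx in sorted(expected):
    print("Expected execution time on slave %d: %.2fs" % (idx, expected[idx]))
```

With zero statistics (the first run) every suite looks equally cheap, so a scheme like this degenerates to an even split by count; the estimates only become useful once real timings have been recorded.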
Re: Parallel tests in ANT, experiment volunteers welcome :)
Dawid Weiss 2011-12-30, 12:36
Darn, Gmail likes to line wrap... I've put the script here too:
https://gist.github.com/1539653

Dawid
---------------------------------------------------------------------
Re: Parallel tests in ANT, experiment volunteers welcome :)
Dawid Weiss 2011-12-30, 13:01
Robert just informed me that there is an exception coming out from ANT if you run it with ANT 1.7.1. Don't know if it's a known issue, but I use ANT 1.8.x and the problem is not present there.

Dawid
---------------------------------------------------------------------
Re: Parallel tests in ANT, experiment volunteers welcome :)
Dawid Weiss 2011-12-30, 17:22
I updated the code and it works with Ant 1.7.1 now. I also noticed parameters are parsed slightly differently (maybe it's Windows), so you need to quote to pass an empty parameter, as in:

ant test-core "-Dtests.seed=" -Dtests.cpus=7

Dawid
---------------------------------------------------------------------
Re: Parallel tests in ANT, experiment volunteers welcome :)
Robert Muir 2011-12-30, 19:44
Here's a couple runs from my machine... but I think some of this is some wild swings in the tests (bad apples).

[junit4] Slave 0: 0.15 .. 95.85 = 95.70s
[junit4] Slave 1: 0.17 .. 76.62 = 76.45s
[junit4] Slave 2: 0.14 .. 47.33 = 47.19s
[junit4] Slave 3: 0.13 .. 48.20 = 48.08s
[junit4] Execution time total: 95.90s
[junit4] Tests summary: 278 suites, 1550 tests, 9 ignored (6 assumptions)

[junit4] Slave 0: 0.16 .. 61.38 = 61.22s
[junit4] Slave 1: 0.16 .. 84.89 = 84.74s
[junit4] Slave 2: 0.16 .. 59.31 = 59.15s
[junit4] Slave 3: 0.16 .. 77.67 = 77.51s
[junit4] Execution time total: 84.95s
[junit4] Tests summary: 278 suites, 1550 tests, 5 ignored (2 assumptions)

[junit4] Slave 0: 0.17 .. 69.68 = 69.50s
[junit4] Slave 1: 0.16 .. 67.49 = 67.33s
[junit4] Slave 2: 0.16 .. 64.00 = 63.84s
[junit4] Slave 3: 0.16 .. 72.68 = 72.51s
[junit4] Execution time total: 72.73s
[junit4] Tests summary: 278 suites, 1550 tests, 7 ignored (4 assumptions)

[junit4] Slave 0: 0.16 .. 64.94 = 64.78s
[junit4] Slave 1: 0.19 .. 67.69 = 67.50s
[junit4] Slave 2: 0.16 .. 62.59 = 62.43s
[junit4] Slave 3: 0.21 .. 66.12 = 65.91s
[junit4] Execution time total: 67.74s
[junit4] Tests summary: 278 suites, 1550 tests, 17 ignored (14 assumptions)

[junit4] Slave 0: 0.19 .. 57.03 = 56.84s
[junit4] Slave 1: 0.17 .. 65.57 = 65.40s
[junit4] Slave 2: 0.18 .. 77.44 = 77.26s
[junit4] Slave 3: 0.15 .. 64.90 = 64.74s
[junit4] Execution time total: 77.48s
[junit4] Tests summary: 278 suites, 1550 tests, 6 ignored (3 assumptions)

[junit4] Slave 0: 0.15 .. 73.56 = 73.41s
[junit4] Slave 1: 0.15 .. 70.84 = 70.69s
[junit4] Slave 2: 0.15 .. 97.94 = 97.79s
[junit4] Slave 3: 0.18 .. 66.66 = 66.47s
[junit4] Execution time total: 97.99s
[junit4] Tests summary: 278 suites, 1550 tests, 13 ignored (10 assumptions)

-- lucidimagination.com
---------------------------------------------------------------------
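Since the total run time is set by the slowest slave, one rough way to read the runs above is to compare the slowest slave with the mean across slaves. The sketch below does that arithmetic for the first and fourth runs Robert posted; the `imbalance` helper is a made-up name for illustration, not part of the junit4 task.

```python
def imbalance(slave_seconds):
    """Return (slowest, mean, slowest / mean) for one run's per-slave times."""
    slowest = max(slave_seconds)
    mean = sum(slave_seconds) / len(slave_seconds)
    return slowest, mean, slowest / mean


# Per-slave times (seconds) copied from the first and fourth runs above.
runs = {
    "run 1": [95.70, 76.45, 47.19, 48.08],
    "run 4": [64.78, 67.50, 62.43, 65.91],
}
for name, times in runs.items():
    slowest, mean, ratio = imbalance(times)
    print("%s: slowest %.2fs, mean %.2fs, ratio %.2f" % (name, slowest, mean, ratio))
```

A ratio close to 1 means the estimates split the work evenly. The first run's slowest slave takes roughly 1.43 times the mean, while the fourth is at roughly 1.04, which fits the "wild swings" reading.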
Re: Parallel tests in ANT, experiment volunteers welcome :)
Dawid Weiss 2011-12-30, 20:45
Thanks Robert. Yes, the variation in certain suites is pretty large -- if you open the generated execution times cache you can see the timings for each test suite. I've seen differences going into tens of seconds depending on the seed (and the environment?). What are your timings for ant-based splits? Roughly the same?

Dawid
---------------------------------------------------------------------
Re: Parallel tests in ANT, experiment volunteers welcome :)
Robert Muir 2011-12-30, 20:49
On Fri, Dec 30, 2011 at 3:45 PM, Dawid Weiss <[EMAIL PROTECTED]> wrote:
> Thanks Robert. Yes, the variation in certain suites is pretty large --
> if you open the generated execution times cache you can see the
> timings for each test suite. I've seen differences going into tens of
> seconds depending on the seed (and the environment?). What are your
> timing for ant-based splits? Roughly the same?
>
I think i got to the bottom of this. Depending upon your seed, 95% of the time a test gets "RamDirectory" but 5% of the time it gets a file-system backed implementation.
Because of this, depending upon environment, test times swing wildly because of fsync(). For example in the last nightly build we fsynced over 7,000 times in tests.
This is really crazy and I want to prolong the life of my SSD: see my latest comment with a fix on LUCENE-3667. With that patch my times are no longer swinging wildly.
(easy way to see what i am talking about: just run tests with -Dtests.directory=MMapDirectory or something like that)
-- lucidimagination.com
---------------------------------------------------------------------
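To make the seed dependence concrete: if the directory implementation is drawn from a seeded random source, the same seed always gives the same choice, and only a small share of seeds lands on a file-system backed one. The sketch below only illustrates the 95%/5% split Robert describes. It is not Lucene's test-framework code, and the `pick_directory` helper and the list of file-system backed names are assumptions.

```python
import random


def pick_directory(seed, ram_probability=0.95):
    """Pick a directory implementation deterministically from the seed."""
    rng = random.Random(seed)
    if rng.random() < ram_probability:
        return "RAMDirectory"
    # Assumed stand-ins for the file-system backed implementations.
    return rng.choice(["MMapDirectory", "NIOFSDirectory", "SimpleFSDirectory"])


picks = [pick_directory(seed) for seed in range(10000)]
fs_backed = sum(1 for p in picks if p != "RAMDirectory")
print("file-system backed: %d of %d seeds" % (fs_backed, len(picks)))
```

Pinning the choice, as with -Dtests.directory above, removes this source of variance from timing comparisons.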
Re: Parallel tests in ANT, experiment volunteers welcome :)
Michael McCandless 2012-01-04, 15:22
This looks cool!

I ran this a few times:

ant test-core -Dtests.seed=0:0:0 -Dtests.cpus=20 -Dtests.directory=RAMDirectory -Dtests.codec=Lucene40

I fixed seed & RAMDir to reduce variance...

[junit4] Slave 16: 0.29 .. 24.65 = 24.36s
[junit4] Slave 17: 0.36 .. 30.62 = 30.26s
[junit4] Slave 18: 0.44 .. 30.84 = 30.41s
[junit4] Slave 19: 0.50 .. 28.65 = 28.15s
[junit4] Execution time total: 36.69s
[junit4] Tests summary: 278 suites, 1550 tests, 3 ignored

[junit4] Slave 16: 0.44 .. 29.61 = 29.17s
[junit4] Slave 17: 0.55 .. 31.59 = 31.04s
[junit4] Slave 18: 0.30 .. 25.85 = 25.54s
[junit4] Slave 19: 0.31 .. 32.64 = 32.33s
[junit4] Execution time total: 37.12s
[junit4] Tests summary: 278 suites, 1550 tests, 3 ignored

[junit4] Slave 16: 0.28 .. 25.70 = 25.42s
[junit4] Slave 17: 0.23 .. 29.83 = 29.60s
[junit4] Slave 18: 0.28 .. 27.50 = 27.22s
[junit4] Slave 19: 0.37 .. 27.67 = 27.30s
[junit4] Execution time total: 35.23s
[junit4] Tests summary: 278 suites, 1550 tests, 1 failure, 3 ignored

[junit4] Slave 16: 0.38 .. 28.99 = 28.61s
[junit4] Slave 17: 0.41 .. 30.79 = 30.38s
[junit4] Slave 18: 0.48 .. 30.05 = 29.57s
[junit4] Slave 19: 0.35 .. 30.71 = 30.36s
[junit4] Execution time total: 38.46s
[junit4] Tests summary: 278 suites, 1550 tests, 3 ignored

[junit4] Slave 16: 0.27 .. 29.56 = 29.29s
[junit4] Slave 17: 0.44 .. 32.64 = 32.21s
[junit4] Slave 18: 0.40 .. 31.99 = 31.60s
[junit4] Slave 19: 0.27 .. 32.64 = 32.37s
[junit4] Execution time total: 37.70s
[junit4] Tests summary: 278 suites, 1550 tests, 3 ignored

Does the "Execution time total" include compilation, or is it just the actual test runtime?

Can this change run "across" the different groups of tests we have (core, modules/*, contrib/*, solr/*, etc.)? I found that to be a major bottleneck in the current "ant test"'s concurrency, ie we have a pinch point after each group of tests (must wait for all JVMs to finish before moving on to next group...), but I think fixing that in ant is going to be hard?

When I use the hacked up Python test runner (runAllTests.py in luceneutil), running only core tests w/ RAMDir and Lucene40 codec it takes ~30 seconds; I think it's doing roughly the same thing as this change (balancing the tests across JVMs). BUT: that's on current trunk, vs your git clone which is somewhat old by now... so it's an apples/pears comparison ;)

Mike McCandless
http://blog.mikemccandless.com
---------------------------------------------------------------------
Re: Parallel tests in ANT, experiment volunteers welcome :)
Dawid Weiss 2012-01-04, 22:11
Thanks Mike. Answers/comments in-line below.

> [junit4] Slave 16: 0.29 .. 24.65 = 24.36s
> [junit4] Slave 17: 0.36 .. 30.62 = 30.26s
> [junit4] Slave 18: 0.44 .. 30.84 = 30.41s
> [junit4] Slave 19: 0.50 .. 28.65 = 28.15s
> [junit4] Execution time total: 36.69s
> [junit4] Tests summary: 278 suites, 1550 tests, 3 ignored

I forgot how nasty your beast computer is... 20 slaves?! Remind me how many actual (real) cores do you have? Did you experiment with different slave numbers? I ask because I noticed that:

1) it makes little sense to run cpu-intense tests on hyper-cores, doesn't yield much if anything,
2) you should leave some room for system vm threads (GC, compilers); the more VMs, the more room you'll need.

> Does the "Execution time total" include compilation, or is it just the
> actual test runtime?

The total is calculated before slave VMs are launched and after they complete, so even launch time is included. It's here:
https://github.com/carrotsearch/randomizedtesting/blob/master/ant-junit4/src/main/java/com/carrotsearch/ant/tasks/junit4/JUnit4.java

> Can this change run "across" the different groups of tests we have
> (core, modules/*, contrib/*, solr/*, etc.)? I found that to be a
> major bottleneck in the current "ant test"'s concurrency, ie we have a
> pinch point after each group of tests (must wait for all JVMs to
> finish before moving on to next group...), but I think fixing that in
> ant is going to be hard?

If I understand you correctly the problem is that ANT in Lucene/Solr is calling to sub-module ANT scripts and these in turn invoke the test macro. So running everything from a single test task would be possible if we had a master-level test script, it's not directly related to how the tests are actually executed. That JUnit4 task supports globbing in suite selectors so it could be executed with, say, -Dtests.class=org.apache.lucene.blah.* to restrict tests to run just a certain section of all tests, but include everything by default. Don't know how it affects modularization though -- the tests will run faster but they'll be more difficult to maintain I guess.

> When I use the hacked up Python test runner (runAllTests.py in luceneutil),

This was my inspiration -- Robert pointed me at that, very helpful although you need your kind of machine to run so many SSH sessions :D

> change (balancing the tests across JVMs). BUT: that's on current
> trunk, vs your git clone which is somewhat old by now... so it's an
> apples/pears comparison ;)

Oh, come on, my fork is only a few days behind! :) I've pulled the current trunk and merged. I'd appreciate if you could re-run again, this time with, say, 5, 10, 15 and 20 threads. I wonder what the speedup/overhead is. Thanks.

Dawid
---------------------------------------------------------------------
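The suite-selector globbing mentioned above (-Dtests.class=org.apache.lucene.blah.*) can be pictured as a filter over fully qualified suite names that lets everything through by default. This is a rough sketch using Python's fnmatch. The JUnit4 task's real matching rules may differ, and the `select_suites` helper and the suite list are illustrative assumptions.

```python
from fnmatch import fnmatchcase


def select_suites(suites, pattern="*"):
    """Keep suites whose fully qualified name matches the glob (default: all)."""
    return [name for name in suites if fnmatchcase(name, pattern)]


# Illustrative suite names only.
suites = [
    "org.apache.lucene.index.TestIndexWriter",
    "org.apache.lucene.search.TestBooleanQuery",
    "org.apache.solr.core.TestConfig",
]
print(select_suites(suites, "org.apache.lucene.*"))
print(len(select_suites(suites)))
```

In fnmatch a `*` also matches dots, so `org.apache.lucene.*` covers every sub-package; whether the Ant task treats package separators the same way is not stated in the thread.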
Re: Parallel tests in ANT, experiment volunteers welcome :)
Michael McCandless 2012-01-04, 23:45
On Wed, Jan 4, 2012 at 5:11 PM, Dawid Weiss <[EMAIL PROTECTED]> wrote:
> I forgot how nasty your beast computer is... 20 slaves?! Remind me how
> many actual (real) cores do you have?

Beast has two 6-core CPUs (x5680 xeons), so 12 real cores (24 with hyperthreading).

> Did you experiment with
> different slave numbers? I ask because I noticed that:
>
> 1) it makes little sense to run cpu-intense tests on hyper-cores,
> doesn't yield much if anything,
> 2) you should leave some room for system vm threads (GC, compilers);
> the more VMs, the more room you'll need.

In the past I found somewhere around 20 was good w/ the Python runner... but I went and tested again!

With the Python runner I see these run times on just lucene core tests:

2 cpus: 72.2 sec
5 cpus: 35.0 sec
10 cpus: 28.1 sec
15 cpus: 26.2 sec
20 cpus: 26.0 sec
25 cpus: 27.5 sec

So seems like after 15 cores it's not helping much... but then I ran on all tests (well minus a few intermittently failing tests):

10 cpus: 88.3 sec
15 cpus: 80.2 sec
20 cpus: 77.4 sec
25 cpus: 76.7 sec

The above were just running on beast, but the Python runner can send jobs (hacked up, just using ssh) to other machines... I have two other non-beasts, on which I ran 3 jvms each:

25 + 3 + 3 cpus: 64.7 sec

With the new ant runner:

2 cpus:
[junit4] Slave 0: 0.16 .. 50.68 = 50.52s
[junit4] Slave 1: 0.16 .. 49.58 = 49.42s
[junit4] Execution time total: 50.73s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

5 cpus:
[junit4] Slave 0: 0.19 .. 21.87 = 21.68s
[junit4] Slave 1: 0.16 .. 21.86 = 21.70s
[junit4] Slave 2: 0.16 .. 29.31 = 29.15s
[junit4] Slave 3: 0.16 .. 26.64 = 26.48s
[junit4] Slave 4: 0.19 .. 29.82 = 29.63s
[junit4] Execution time total: 29.89s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

10 cpus:
[junit4] Slave 0: 0.21 .. 14.62 = 14.41s
[junit4] Slave 1: 0.22 .. 17.21 = 16.99s
[junit4] Slave 2: 0.23 .. 18.79 = 18.56s
[junit4] Slave 3: 0.23 .. 22.99 = 22.76s
[junit4] Slave 4: 0.20 .. 27.39 = 27.19s
[junit4] Slave 5: 0.19 .. 27.23 = 27.04s
[junit4] Slave 6: 0.23 .. 20.40 = 20.17s
[junit4] Slave 7: 0.19 .. 26.52 = 26.33s
[junit4] Slave 8: 0.24 .. 26.42 = 26.18s
[junit4] Slave 9: 0.22 .. 23.57 = 23.35s
[junit4] Execution time total: 27.52s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

15 cpus:
[junit4] Slave 0: 0.29 .. 5.16 = 4.87s
[junit4] Slave 1: 0.26 .. 15.36 = 15.10s
[junit4] Slave 2: 0.26 .. 12.99 = 12.73s
[junit4] Slave 3: 0.29 .. 24.20 = 23.92s
[junit4] Slave 4: 0.26 .. 27.00 = 26.74s
[junit4] Slave 5: 0.33 .. 19.97 = 19.63s
[junit4] Slave 6: 0.31 .. 25.29 = 24.98s
[junit4] Slave 7: 0.24 .. 28.92 = 28.68s
[junit4] Slave 8: 0.33 .. 23.67 = 23.34s
[junit4] Slave 9: 0.43 .. 24.43 = 24.00s
[junit4] Slave 10: 0.40 .. 27.61 = 27.21s
[junit4] Slave 11: 0.22 .. 21.77 = 21.56s
[junit4] Slave 12: 0.22 .. 26.78 = 26.56s
[junit4] Slave 13: 0.26 .. 25.92 = 25.66s
[junit4] Slave 14: 0.35 .. 27.77 = 27.42s
[junit4] Execution time total: 28.98s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

20 cpus:
[junit4] Slave 0: 0.35 .. 23.32 = 22.97s
[junit4] Slave 1: 0.30 .. 24.32 = 24.02s
[junit4] Slave 2: 0.35 .. 21.35 = 21.00s
[junit4] Slave 3: 0.37 .. 23.63 = 23.26s
[junit4] Slave 4: 0.38 .. 20.74 = 20.35s
[junit4] Slave 5: 0.30 .. 19.74 = 19.44s
[junit4] Slave 6: 0.36 .. 26.39 = 26.03s
[junit4] Slave 7: 0.46 .. 23.64 = 23.18s
[junit4] Slave 8: 0.43 .. 22.44 = 22.02s
[junit4] Slave 9: 0.30 .. 24.05 = 23.76s
[junit4] Slave 10: 0.41 .. 24.75 = 24.33s
[junit4] Slave 11: 0.30 .. 22.66 = 22.36s
[junit4] Slave 12: 0.30 .. 24.93 = 24.62s
[junit4] Slave 13: 0.40 .. 24.39 = 24.00s
[junit4] Slave 14: 0.24 .. 24.47 = 24.23s
[junit4] Slave 15: 0.45 .. 25.23 = 24.78s
[junit4] Slave 16: 0.34 .. 23.06 = 22.72s
[junit4] Slave 17: 0.23 .. 23.50 = 23.28s
[junit4] Slave 18: 0.30 .. 24.27 = 23.97s
[junit4] Slave 19: 0.30 .. 24.91 = 24.61s
[junit4] Execution time total: 26.52s
[junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

I only ran once each and results are likely noisy... so it's hard to pick a best CPU count...

Hmm so does that include compile time (my numbers don't)? Sounds like no? I'm also measuring from first launch to last finish.

Yes I think that's the problem! Ideally ant would just gather up all "jobs" to run and then we'd aggregate/distribute across JVMs.

Cool.

Hmm... can we somehow keep today's directory structure but have ant treat it as a single "module"? Or is the problem that we need to change the JVM settings (eg CLASSPATH) per test module we have today so we must make separate modules for that...?

OK cool :) Actually it doesn't open any SSH sessions unless you give it remote machines to use -- for the "local" JVMs it just forks.

I re-ran above -- looks like the times came down some so the new ant runner is basically the same as the Python runner (on core tests): great!

Mike McCandless
http://blog.mikemccandless.com
---------------------------------------------------------------------
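To put the numbers above on one scale, the sketch below computes speedup relative to the 2-JVM run for both runners on the core tests. The timings are copied from Mike's message (single runs, so noisy); the `relative_speedup` helper is a made-up name for illustration.

```python
def relative_speedup(times):
    """Speedup of each configuration relative to the smallest JVM count measured."""
    base = times[min(times)]
    return {cpus: base / secs for cpus, secs in sorted(times.items())}


# Wall-clock seconds for lucene core tests, keyed by number of JVMs.
python_runner = {2: 72.2, 5: 35.0, 10: 28.1, 15: 26.2, 20: 26.0, 25: 27.5}
ant_runner = {2: 50.73, 5: 29.89, 10: 27.52, 15: 28.98, 20: 26.52}

for label, times in (("python", python_runner), ("ant", ant_runner)):
    for cpus, speedup in relative_speedup(times).items():
        print("%-6s %2d cpus: %.2fx" % (label, cpus, speedup))
```

Going from 2 to 20 JVMs (ten times as many) yields less than a threefold speedup for either runner, with most of the gain already reached by 10 JVMs on a machine with 12 physical cores.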
Re: Parallel tests in ANT, experiment volunteers welcome :)
Dawid Weiss 2012-01-05, 07:37
> With the Python runner I see these run times on just lucene core tests: > 2 cpus: 72.2 sec 5 cpus: 35.0 sec 10 cpus: 28.1 sec 15 cpus: > 26.2 sec 20 cpus: 26.0 sec 25 cpus: 27.5 sec
I would say this is aligned with my intuition -- after you exceed the physical number of cores things don't speed up anymore.
> 10 cpus: 88.3 sec 15 cpus: 80.2 sec 20 cpus: 77.4 sec 25 cpus: 76.7 sec > The above were just running on beast, but the Python runner can
This is probably because some tests don't add anything to CPU load (they're disk bound or use the network)? The speedup is also not that significant -- adding 15 cpus only yielded about 10 secs.
> Hmm so does that include compile time (my numbers don't)? Sounds
> like no? I'm also measuring from first launch to last finish.
Oh, you mean ANT compile/ execution time before actual testing? No, I don't include that -- the execution time is actual spawned jvms.
> Yes I think that's the problem!
> Ideally ant would just gather up all "jobs" to run and then
> we'd aggregate/distribute across JVMs.
Could be done by emitting test class/ classpath names from each module and then running a final testing task that would execute whatever was appended to the current run... but it seems clumsy to me, don't know how to do it better though.
> to change the JVM settings (eg CLASSPATH) per test module we have today
> so we must make separate modules for that...?
Yeah, that would be one thing -- different classpaths/ vm properties etc. This could be problematic.
> I re-ran above -- looks like the times came down some so the new
> ant runner is basically the same as the Python runner (on core tests):
> great!
Thanks. I'm still working on the rough edges (like reporting a jvm crash, there were problems with ibm j9) and Stanislaw is preparing a nice(r) test report. We will contribute a patch once this is done and if there is interest we would love to contribute this in.
Dawid
---------------------------------------------------------------------
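The idea floated above (each module emits its test classes and classpath, and a final task runs whatever was collected) could look roughly like the sketch below. Jobs are grouped by classpath so that every forked JVM still sees only one module's classpath. All names and paths here are hypothetical, and this does not describe how the Lucene build or the junit4 task actually works.

```python
from collections import defaultdict


def collect_jobs(modules):
    """Flatten per-module suite lists into (classpath, suite) jobs."""
    return [(m["classpath"], suite) for m in modules for suite in m["suites"]]


def batches_by_classpath(jobs):
    """Group jobs so each forked JVM is launched with a single module's classpath."""
    grouped = defaultdict(list)
    for classpath, suite in jobs:
        grouped[classpath].append(suite)
    return dict(grouped)


# Hypothetical modules; the paths and suite names are placeholders.
modules = [
    {"classpath": "core/classes", "suites": ["TestA", "TestB"]},
    {"classpath": "contrib/foo/classes", "suites": ["TestC"]},
]
jobs = collect_jobs(modules)
for classpath, suite_names in batches_by_classpath(jobs).items():
    print(classpath, suite_names)
```

The trade-off Dawid points at remains: the more distinct classpaths or VM properties there are, the less freedom a scheduler has to mix suites from different modules in one JVM.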
Re: Parallel tests in ANT, experiment volunteers welcome :)
Michael McCandless 2012-01-05, 15:53
On Thu, Jan 5, 2012 at 2:37 AM, Dawid Weiss <[EMAIL PROTECTED]> wrote:
>> With the Python runner I see these run times on just lucene core tests:
>> 2 cpus: 72.2 sec 5 cpus: 35.0 sec 10 cpus: 28.1 sec 15 cpus:
>> 26.2 sec 20 cpus: 26.0 sec 25 cpus: 27.5 sec
>
> I would say this is aligned with my intuition -- after you exceed the
> physical number of cores things don't speed up anymore.
>
>> 10 cpus: 88.3 sec 15 cpus: 80.2 sec 20 cpus: 77.4 sec 25 cpus: 76.7 sec
>> The above were just running on beast, but the Python runner can
>
> This is probably because some tests don't add anything to CPU load
> (they're disk bound or use the network)? The speedup is also not that
> significant -- adding 15 cpus only yielded about 10 secs.

Right... looks like most of the gains are by 10 CPUs. Still I'll take 10 seconds ;)

>> Hmm so does that include compile time (my numbers don't)? Sounds
>> like no? I'm also measuring from first launch to last finish.
>
> Oh, you mean ANT compile/ execution time before actual testing? No, I
> don't include that -- the execution time is actual spawned jvms.

OK good.

>> Yes I think that's the problem!
>> Ideally ant would just gather up all "jobs" to run and then
>> we'd aggregate/distribute across JVMs.
>
> Could be done by emitting test class/ classpath names from each module
> and then running a final testing task that would execute whatever was
> appended to the current run... but it seems clumsy to me, don't know
> how to do it better though.

OK.... we lose a lot because of this. Though, I haven't tried w/ your git clone -- can it run a top-level "ant test" and it does the load balancing by module...?

>> to change the JVM settings (eg CLASSPATH) per test module we have today
>> so we must make separate modules for that...?
>
> Yeah, that would be one thing -- different classpaths/ vm properties
> etc. This could be problematic.

The Python runner completely cheats here, which is bad (because we may pick up a dep we didn't intend to, and never catch it)... just takes the union of all CLASSPATHS.

>> I re-ran above -- looks like the times came down some so the new
>> ant runner is basically the same as the Python runner (on core tests):
>> great!
>
> Thanks. I'm still working on the rough edges (like reporting a jvm
> crash, there were problems with ibm j9) and Stanislaw is preparing a
> nice(r) test report. We will contribute a patch once this is done and
> if there is interest we would love to contribute this in.

Awesome!

Mike McCandless
http://blog.mikemccandless.com
---------------------------------------------------------------------
Re: Parallel tests in ANT, experiment volunteers welcome :)
Chris Hostetter 2012-01-05, 19:03
: > Yeah, that would be one thing -- different classpaths/ vm properties
: > etc. This could be problematic.
:
: The Python runner completely cheats here, which is bad (because we may
: pick up a dep we didn't intend to, and never catch it)... just takes
: the union of all CLASSPATHS.

as long as the default "ant test" does recursive testing of "ant test" in the individual modules with their isolated classpaths to ensure no dependency bleedover, a special case top level "ant run-all-tests-parallel" target that unions all the classpaths seems like it might be acceptable for things like continuously randomized test only jenkins builds.

but i wonder if reproducibility might be a problem? if you don't get the same classpath, and some classes are loaded in a diff order, would you be able to "cd modules/foo && ant test -D..." and see the same failures?

-Hoss
---------------------------------------------------------------------
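The bleedover and reproducibility worry can be illustrated with how a flat classpath is searched: the first entry that provides a class wins, so a union of module classpaths can resolve a class from a different jar than the module's isolated classpath would. The sketch below simulates that lookup; the jar names, the class name, and the `resolve` helper are all hypothetical.

```python
def resolve(class_name, classpath, contents):
    """Return the first classpath entry that provides class_name, or None."""
    for entry in classpath:
        if class_name in contents.get(entry, ()):
            return entry
    return None


# Hypothetical jars that both ship the same class.
contents = {
    "modules/foo/lib/util-1.0.jar": {"com.example.Util"},
    "modules/bar/lib/util-2.0.jar": {"com.example.Util"},
}
foo_cp = ["modules/foo/lib/util-1.0.jar"]
bar_cp = ["modules/bar/lib/util-2.0.jar"]
union_cp = bar_cp + foo_cp  # the order of the union decides which jar wins

print(resolve("com.example.Util", foo_cp, contents))
print(resolve("com.example.Util", union_cp, contents))
```

Under these assumptions a failure seen with the union classpath might not reproduce with "cd modules/foo && ant test", because the isolated run resolves the class from a different jar.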
Re: Parallel tests in ANT, experiment volunteers welcome :)
Mark Miller 2012-01-05, 20:31
On Jan 5, 2012, at 2:37 AM, Dawid Weiss wrote:
> if there is interest we would love to contribute this in.
+1! I've been itching to work on something like this since parallel tests were first put in - can't wait to see it go in.
- Mark Miller lucidimagination.com
---------------------------------------------------------------------
Re: Parallel tests in ANT, experiment volunteers welcome :)
Dawid Weiss 2012-01-05, 20:37
I like this too and I'm sure there'll be plenty of places where helping hands will be more than welcome :)
Side note -- Maven surefire has built-in support for parallel builds (forked) too, didn't have the time to check how they handled some of the issues we mentioned.
Dawid
---------------------------------------------------------------------
Re: Parallel tests in ANT, experiment volunteers welcome :)
Michael McCandless 2012-01-04, 23:52
Trying again... hopefully this time NOT hitting this nasty Chrome bug: http://code.google.com/p/chromium/issues/detail?id=102407
-
Re: Parallel tests in ANT, experiment volunteers welcome :)
Michael McCandless 2012-01-04, 23:58
Maybe... 3rd time's the charm...? (This time from Opera).

On Wed, Jan 4, 2012 at 5:11 PM, Dawid Weiss <[EMAIL PROTECTED]> wrote:

> I forgot how nasty your beast computer is... 20 slaves?! Remind me how
> many actual (real) cores do you have?

Beast has two 6-core CPUs (x5680 xeons), so 12 real cores (24 with hyperthreading).

> Did you experiment with
> different slave numbers? I ask because I noticed that:
>
> 1) it makes little sense to run cpu-intense tests on hyper-cores,
> doesn't yield much if anything,
> 2) you should leave some room for system vm threads (GC, compilers);
> the more VMs, the more room you'll need.

In the past I found somewhere around 20 was good w/ the Python runner... but I went and tried again!

With the Python runner I see these run times on just lucene core tests:

  2 cpus: 72.2 sec
  5 cpus: 35.0 sec
  10 cpus: 28.1 sec
  15 cpus: 26.2 sec
  20 cpus: 26.0 sec
  25 cpus: 27.5 sec

So seems like after 15 cores it's not helping much... but then I ran on all tests (well minus a few intermittently failing tests):

  10 cpus: 88.3 sec
  15 cpus: 80.2 sec
  20 cpus: 77.4 sec
  25 cpus: 76.7 sec

The above were just running on beast, but the Python runner can send jobs (hacked up, just using ssh) to other machines... I have two other non-beasts, and which I ran 3 jvms on each:

  25 + 3 + 3 cpus: 64.7 sec

With the new ant runner:

2 cpus:

  [junit4] Slave 0: 0.16 .. 50.68 = 50.52s
  [junit4] Slave 1: 0.16 .. 49.58 = 49.42s
  [junit4] Execution time total: 50.73s
  [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

5 cpus:

  [junit4] Slave 0: 0.19 .. 21.87 = 21.68s
  [junit4] Slave 1: 0.16 .. 21.86 = 21.70s
  [junit4] Slave 2: 0.16 .. 29.31 = 29.15s
  [junit4] Slave 3: 0.16 .. 26.64 = 26.48s
  [junit4] Slave 4: 0.19 .. 29.82 = 29.63s
  [junit4] Execution time total: 29.89s
  [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

10 cpus:

  [junit4] Slave 0: 0.21 .. 14.62 = 14.41s
  [junit4] Slave 1: 0.22 .. 17.21 = 16.99s
  [junit4] Slave 2: 0.23 .. 18.79 = 18.56s
  [junit4] Slave 3: 0.23 .. 22.99 = 22.76s
  [junit4] Slave 4: 0.20 .. 27.39 = 27.19s
  [junit4] Slave 5: 0.19 .. 27.23 = 27.04s
  [junit4] Slave 6: 0.23 .. 20.40 = 20.17s
  [junit4] Slave 7: 0.19 .. 26.52 = 26.33s
  [junit4] Slave 8: 0.24 .. 26.42 = 26.18s
  [junit4] Slave 9: 0.22 .. 23.57 = 23.35s
  [junit4] Execution time total: 27.52s
  [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

15 cpus:

  [junit4] Slave 0: 0.29 .. 5.16 = 4.87s
  [junit4] Slave 1: 0.26 .. 15.36 = 15.10s
  [junit4] Slave 2: 0.26 .. 12.99 = 12.73s
  [junit4] Slave 3: 0.29 .. 24.20 = 23.92s
  [junit4] Slave 4: 0.26 .. 27.00 = 26.74s
  [junit4] Slave 5: 0.33 .. 19.97 = 19.63s
  [junit4] Slave 6: 0.31 .. 25.29 = 24.98s
  [junit4] Slave 7: 0.24 .. 28.92 = 28.68s
  [junit4] Slave 8: 0.33 .. 23.67 = 23.34s
  [junit4] Slave 9: 0.43 .. 24.43 = 24.00s
  [junit4] Slave 10: 0.40 .. 27.61 = 27.21s
  [junit4] Slave 11: 0.22 .. 21.77 = 21.56s
  [junit4] Slave 12: 0.22 .. 26.78 = 26.56s
  [junit4] Slave 13: 0.26 .. 25.92 = 25.66s
  [junit4] Slave 14: 0.35 .. 27.77 = 27.42s
  [junit4] Execution time total: 28.98s
  [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

20 cpus:

  [junit4] Slave 0: 0.35 .. 23.32 = 22.97s
  [junit4] Slave 1: 0.30 .. 24.32 = 24.02s
  [junit4] Slave 2: 0.35 .. 21.35 = 21.00s
  [junit4] Slave 3: 0.37 .. 23.63 = 23.26s
  [junit4] Slave 4: 0.38 .. 20.74 = 20.35s
  [junit4] Slave 5: 0.30 .. 19.74 = 19.44s
  [junit4] Slave 6: 0.36 .. 26.39 = 26.03s
  [junit4] Slave 7: 0.46 .. 23.64 = 23.18s
  [junit4] Slave 8: 0.43 .. 22.44 = 22.02s
  [junit4] Slave 9: 0.30 .. 24.05 = 23.76s
  [junit4] Slave 10: 0.41 .. 24.75 = 24.33s
  [junit4] Slave 11: 0.30 .. 22.66 = 22.36s
  [junit4] Slave 12: 0.30 .. 24.93 = 24.62s
  [junit4] Slave 13: 0.40 .. 24.39 = 24.00s
  [junit4] Slave 14: 0.24 .. 24.47 = 24.23s
  [junit4] Slave 15: 0.45 .. 25.23 = 24.78s
  [junit4] Slave 16: 0.34 .. 23.06 = 22.72s
  [junit4] Slave 17: 0.23 .. 23.50 = 23.28s
  [junit4] Slave 18: 0.30 .. 24.27 = 23.97s
  [junit4] Slave 19: 0.30 .. 24.91 = 24.61s
  [junit4] Execution time total: 26.52s
  [junit4] Tests summary: 279 suites, 1546 tests, 4 ignored

I only ran once each and results are likely noisy... so it's hard to pick a best CPU count...

Hmm so does that include compile time (my numbers don't)? Sounds like no? I'm also measuring from first launch to last finish.

Yes I think that's the problem! Ideally ant would just gather up all "jobs" to run and then we'd aggregate/distribute across JVMs.

Cool.

Hmm... can we somehow keep today's directory structure but have ant treat it as a single "module"? Or is the problem that we need to change the JVM settings (eg CLASSPATH) per test module we have today so we must make separate modules for that...?

OK cool :) Actually it doesn't open any SSH sessions unless you give it remote machines to use -- for the "local" JVMs it just forks.

I re-ran above -- looks like the times came down some so the new ant runner is basically the same as the Python runner (on core tests): great!

Mike McCandless
http://blog.mikemccandless.com
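To make the "not helping much after 15" observation concrete, here is a small sketch that turns the Python-runner core-test timings quoted above into speedups relative to the 2-cpu run. The numbers are copied from the message; the names `PYTHON_RUNNER_CORE` and `speedups` are made up for this illustration, and since each configuration was run only once the results should be read as rough.

```python
# Core-test wall-clock times in seconds for the Python runner,
# copied from the message above (single runs, so noisy).
PYTHON_RUNNER_CORE = {2: 72.2, 5: 35.0, 10: 28.1, 15: 26.2, 20: 26.0, 25: 27.5}


def speedups(times, baseline=2):
    """Return {cpu_count: speedup relative to the baseline cpu count}."""
    base = times[baseline]
    return {n: base / t for n, t in sorted(times.items())}


if __name__ == "__main__":
    for n, s in speedups(PYTHON_RUNNER_CORE).items():
        print(f"{n:2d} cpus: {PYTHON_RUNNER_CORE[n]:5.1f}s  speedup vs 2 cpus: {s:.2f}x")
```

Going from 15 to 20 cpus only moves the time from 26.2 to 26.0 seconds, and 25 cpus is slower again, which matches the hint about leaving room for GC and compiler threads.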
-
Re: Parallel tests in ANT, experiment volunteers welcome :)
Dawid Weiss 2012-01-05, 07:43
> 15 cpus:
>
> [junit4] Slave 0: 0.29 .. 5.16 = 4.87s
...
> [junit4] Slave 3: 0.29 .. 24.20 = 23.92s
> [junit4] Slave 4: 0.26 .. 27.00 = 26.74s
This is weird -- such a discrepancy shouldn't happen once it has some initial timings, unless there was a really skewed test case inside. I do all per-VM suite balancing beforehand and don't adjust once execution is in progress (adjusting would probably mean job stealing); maybe this is a mistake that should be corrected. But then the order of suites would have to be reported in case of a failure, and if you have 20 slaves this would be a fairly large log ;)
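For readers unfamiliar with up-front balancing, one common approach is greedy longest-first packing: sort suites by estimated time and always hand the next one to the least-loaded slave. The sketch below illustrates that idea only; the thread does not say the junit4 task uses exactly this algorithm, and the function name `balance` and the suite names in the usage note are hypothetical.

```python
import heapq


def balance(estimates, slaves):
    """Assign suites to slaves before execution starts (greedy longest-first).

    estimates: {suite_name: estimated_seconds}, e.g. from previous runs.
    Returns one (suite_list, expected_seconds) pair per slave.
    """
    heap = [(0.0, i) for i in range(slaves)]  # (current load, slave index)
    heapq.heapify(heap)
    plan = [[] for _ in range(slaves)]
    # Longest suites first; ties broken by name so the plan is deterministic.
    for name, secs in sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0])):
        load, i = heapq.heappop(heap)  # least-loaded slave so far
        plan[i].append(name)
        heapq.heappush(heap, (load + secs, i))
    loads = [0.0] * slaves
    for load, i in heap:
        loads[i] = load
    return list(zip(plan, loads))
```

For example, `balance({"A": 10, "B": 8, "C": 5, "D": 4, "E": 3}, 2)` puts A and D on one slave and B, C and E on the other. Because the plan is fixed in advance, a suite whose real time differs a lot from its estimate cannot be compensated for at run time, which is one way a slave could finish far earlier than the rest.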
Dawid
-
Re: Parallel tests in ANT, experiment volunteers welcome :)
Michael McCandless 2012-01-05, 15:57
On Thu, Jan 5, 2012 at 2:43 AM, Dawid Weiss <[EMAIL PROTECTED]> wrote:

>> 15 cpus:
>>
>> [junit4] Slave 0: 0.29 .. 5.16 = 4.87s
> ...
>> [junit4] Slave 3: 0.29 .. 24.20 = 23.92s
>> [junit4] Slave 4: 0.26 .. 27.00 = 26.74s
>
> This is weird -- such discrepancy shouldn't happen after it has some
> initial timings unless there was a really skewed test case inside. I
> do all per-vm suite balancing beforehand and don't adjust once the
> execution is in progress (probably using job stealing); maybe this is
> a mistake that should be corrected. Then the order of suites should be
> reported in case of a failure and if you have 20 slaves this would be
> a fairly large log ;)

It is strange... because I'm running w/ fixed seed, RAMDir and Lucene40 codec. There shouldn't be much variance...

The Python runner pre-aggregates the tests into a JVM run, but, it tries to put ~ 30 seconds worth of tests per JVM, and then front-loads for any tests that take > 30 seconds (that test runs alone in the JVM). So then it's just pulling from that priority queue...

This is somewhat wasteful in that the Python runner is running more JVMs than the new ant runner, but I do this because the tests can have such variability on run time... so I think the net effect is just like job stealing except the Python runner is launching new JVMs to "steal".

Mike McCandless
http://blog.mikemccandless.com
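The batching scheme described in this message can be sketched roughly as follows. This is one reading of the description, not the actual Python runner: the 30-second target comes from the message, while `build_jobs`, `next_job` and the input format are assumptions made for illustration.

```python
import heapq

TARGET_SECS = 30.0  # approximate work per JVM launch, per the message


def build_jobs(estimates, target=TARGET_SECS):
    """Group tests into jobs of roughly `target` seconds; slow tests run alone.

    estimates: {test_name: estimated_seconds} (hypothetical input format).
    Returns a heap of (-job_seconds, job_id, [test_names]); longest job pops first.
    """
    jobs = []
    batch, batch_secs = [], 0.0
    for name, secs in sorted(estimates.items(), key=lambda kv: (-kv[1], kv[0])):
        if secs > target:
            jobs.append((secs, [name]))  # long test gets a JVM to itself
            continue
        if batch and batch_secs + secs > target:
            jobs.append((batch_secs, batch))  # close the current batch
            batch, batch_secs = [], 0.0
        batch.append(name)
        batch_secs += secs
    if batch:
        jobs.append((batch_secs, batch))
    heap = [(-secs, job_id, tests) for job_id, (secs, tests) in enumerate(jobs)]
    heapq.heapify(heap)
    return heap


def next_job(heap):
    """Called whenever a JVM slot frees up: take the longest remaining job."""
    neg_secs, _, tests = heapq.heappop(heap)
    return tests, -neg_secs
```

Each free slot simply pulls the next job, so a slot that finishes early picks up more work. That gives the job-stealing effect mentioned above, at the cost of launching more JVMs.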
-
Re: Parallel tests in ANT, experiment volunteers welcome :)
Dawid Weiss 2012-01-05, 19:29
> It is strange... because I'm running w/ fixed seed, RAMDir and
> Lucene40 codec. There shouldn't be much variance...
I don't think it's running with a fixed seed. The problem is that junit4 has the same property to control seed (tests.seed) but a different seed format; that's why I suggested a few runs and specifying an empty seed (which is compatible with junit4 and ltc).
> for any tests that take > 30 seconds (that test runs alone in the
> JVM). So then it's just pulling from that priority queue...
I was thinking of doing job stealing, but then comes the issue of reproducibility (the order of suites sent to a particular JVM) in case the JVM crashes or something. Technically it's easy to do, but after some deliberation I opted for a fixed list of suites per slave (then you can re-run with the same list because it's on disk, passed as a parameter).
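The reproducibility argument for a fixed per-slave list can be illustrated with a minimal sketch: write each slave's list to disk in execution order, then read the same file back to re-run that slave exactly. The file naming (`slave-N.suites`) and both function names are hypothetical; the thread only says the list is on disk and passed as a parameter.

```python
from pathlib import Path


def save_plan(plan, directory):
    """Write one file per slave, listing its suites in execution order."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    paths = []
    for slave, suites in enumerate(plan):
        path = directory / f"slave-{slave}.suites"  # hypothetical file name
        path.write_text("\n".join(suites) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


def load_suites(path):
    """Read back the exact suite order for re-running a single slave."""
    text = Path(path).read_text(encoding="utf-8")
    return [line for line in text.splitlines() if line]
```

With job stealing, the order seen by a given JVM depends on timing and would have to be logged during the run instead.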
@Hoss:
> but i wonder if reproducibility might be a problem? if you don't get the
> same classpath, and some classes are loaded in a diff order, would you be
> able to "cd modules/foo && ant test -D..." and see the same failures?
Most likely not. Classpath variations will be an issue. Now that I think of it even load-balancing will be an issue if it's to be calculated from repeatedly updated data. On the other hand, if balancing is calculated from a fixed set of precomputed statistics the quality may vary from system to system... again no good solutions for this I guess.
D.