Yeah - we are really hoping for a simple explanation, but we have yet to find it :)
We have tried profiling the queries on our new 5.4 test environment. Here is an example readout for one test query from one shard (we picked the slowest shard to show here...)
The "took" time for this query was 15.6 seconds, but the times recorded in the profile for each shard add up to only 5.3 seconds, and that assumes the shards worked in serial, which isn't the case. We weren't sure what to make of this: most of the "took" time can't be accounted for by the times in the profile stats on each shard. Let me know if you have any ideas.
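
For anyone who wants to repeat the comparison, here is a minimal sketch of the arithmetic in Python. It assumes the search response has been parsed into a dict with a top-level "took" in milliseconds and a "profile.shards[].searches[]" structure whose "query" and "collector" entries carry a "time_in_nanos" field. Those field names are an assumption and may differ by version (some 5.x responses may only report a formatted "time" string). The sample numbers are made up to match the 15.6 s and 5.3 s totals above and are not our real per-shard output. As far as we understand, the profile does not cover the fetch phase, time spent waiting in queues, network latency, or the merge on the coordinating node, so a gap like this would be expected to land in one of those.

```python
def shard_profile_seconds(shard):
    """Sum the profiled query and collector time for one shard, in seconds.

    Only top-level nodes are summed, because a parent's recorded time
    already includes the time of its children.
    """
    total_nanos = 0
    for search in shard.get("searches", []):
        for node in search.get("query", []):
            total_nanos += int(node["time_in_nanos"])  # assumed field name
        for collector in search.get("collector", []):
            total_nanos += int(collector["time_in_nanos"])  # assumed field name
    return total_nanos / 1e9


def compare_took_to_profile(response):
    """Compare the overall "took" time with the per-shard profile totals."""
    per_shard = [shard_profile_seconds(s) for s in response["profile"]["shards"]]
    took = response["took"] / 1000.0  # "took" is reported in milliseconds
    serial_sum = sum(per_shard)       # upper bound: shards run one after another
    slowest = max(per_shard, default=0.0)  # lower bound: shards run in parallel
    return {
        "took": took,
        "serial_sum": serial_sum,
        "slowest_shard": slowest,
        "unaccounted": took - serial_sum,
    }


# Hypothetical response with two shards, for illustration only.
sample_response = {
    "took": 15600,
    "profile": {
        "shards": [
            {"id": "shard-a", "searches": [{
                "query": [{"type": "BooleanQuery", "time_in_nanos": 2900000000}],
                "collector": [{"name": "SimpleTopScoreDocCollector",
                               "time_in_nanos": 100000000}],
            }]},
            {"id": "shard-b", "searches": [{
                "query": [{"type": "BooleanQuery", "time_in_nanos": 2300000000}],
                "collector": [],
            }]},
        ]
    },
}

summary = compare_took_to_profile(sample_response)
print(summary)
```

Even with the generous serial sum, "unaccounted" stays large here, which is the same pattern we are seeing.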