Thanks for the feedback. We dug deeper and found some more interesting details.
First, we tried your suggestion of running the query against just one shard. It still did not run quickly, which suggests our guess that the slowness was at the transport level may not be right.
We then switched the cluster logging to TRACE and watched the timing as the query went through the 4 stages (query, fetch, expand, response) on each node. The query stage was fast, around 10 ms. The major slowness was in the fetch stage, but not on all nodes. Fetch runs quickly (10-100 ms) on many nodes, and then a random one or two nodes take 10-15 seconds to run the fetch stage. If we run the same query again, a different couple of nodes have a slow fetch stage. Once in a while all nodes run quickly and the query returns quickly.
The expand and response stages are consistently quick.
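In case it helps anyone trying to reproduce this kind of comparison, here is a minimal sketch of how per-node stage timings like the ones described above could be tabulated to spot the slow nodes. It assumes the timings have already been pulled out of the TRACE logs by hand into (node, stage, milliseconds) tuples. The tuple layout, the node names, the sample values, and the 1-second threshold are all assumptions for illustration, not taken from our actual logs.

```python
from collections import defaultdict

# Hypothetical sample data: (node, stage, duration_ms).
# Values are illustrative only, picked to mirror the ranges described above.
SAMPLES = [
    ("node-1", "query", 10),
    ("node-1", "fetch", 40),
    ("node-2", "query", 10),
    ("node-2", "fetch", 12000),
    ("node-3", "query", 10),
    ("node-3", "fetch", 90),
]


def slow_nodes(samples, threshold_ms=1000):
    """Group entries slower than threshold_ms by stage.

    Returns a dict mapping stage name to a list of (node, duration_ms).
    The threshold is an arbitrary cut-off chosen for this sketch.
    """
    slow = defaultdict(list)
    for node, stage, ms in samples:
        if ms > threshold_ms:
            slow[stage].append((node, ms))
    return dict(slow)


if __name__ == "__main__":
    for stage, hits in slow_nodes(SAMPLES).items():
        for node, ms in hits:
            print(f"{stage}: {node} took {ms} ms")
```

Running this once per query attempt and comparing the flagged nodes would show whether the slow fetch really does move between nodes from run to run.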
So we are now drilling in to understand what could be causing the fetch stage to run slowly. Let me know if you have any more ideas.