It turns out this is not a regression. The way the test is being carried out is different in this case.
so, I worked out why this Java failover isn't working. it's related to using kill -STOP
Here's the current behavior:
there's a per-node continuous operation timeout threshold
after a given node times out a bunch, the client will drop the connection to that node
then it'll try to reestablish it
meanwhile, there's another counter for how often we can't find an established connection to a node the config says we should be using
for that second one, the algorithm is: 10 failures to find the node within a 10 second window triggers a re-bootstrap
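
A minimal sketch of the two counters described above, to make the interaction easier to follow. The class and method names are hypothetical and are not taken from the client; the sketch also assumes that "continuous" means consecutive timeouts that a successful operation resets, and that the 10-in-10-seconds rule is a sliding window. Only the 10 failures and the 10 second window come from the description.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch: these are illustrative classes, not the client's real ones.

/** Counts consecutive operation timeouts for a single node. */
class ContinuousTimeoutCounter {
    private final int threshold;
    private int consecutiveTimeouts = 0;

    ContinuousTimeoutCounter(int threshold) {
        this.threshold = threshold;
    }

    /** Assumption: any successful operation resets the run of timeouts. */
    void onSuccess() {
        consecutiveTimeouts = 0;
    }

    /** Returns true when the connection should be dropped and reestablished. */
    boolean onTimeout() {
        consecutiveTimeouts++;
        if (consecutiveTimeouts >= threshold) {
            consecutiveTimeouts = 0; // the replacement connection starts from zero
            return true;
        }
        return false;
    }
}

/** Tracks "no established connection to a configured node" events in a sliding window. */
class MissingNodeWindow {
    private final int maxFailures;
    private final long windowMillis;
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    MissingNodeWindow(int maxFailures, long windowMillis) {
        this.maxFailures = maxFailures;
        this.windowMillis = windowMillis;
    }

    /** Records a failure at nowMillis; returns true when a re-bootstrap is due. */
    boolean recordFailure(long nowMillis) {
        failureTimes.addLast(nowMillis);
        // Forget failures that have aged out of the window.
        while (!failureTimes.isEmpty()
                && nowMillis - failureTimes.peekFirst() >= windowMillis) {
            failureTimes.removeFirst();
        }
        if (failureTimes.size() >= maxFailures) {
            failureTimes.clear(); // start counting afresh after re-bootstrapping
            return true;
        }
        return false;
    }
}
```

With `new MissingNodeWindow(10, 10_000L)`, ten lookups that fail within ten seconds trip the re-bootstrap, while a lookup that fails only now and then never does.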
so, the problem...
is that when we kill -STOP (instead of an actual cable pull)
you can still establish new connections to 11210
so, we drop and reestablish, send a bunch of stuff, then drop and reestablish quickly
but this algorithm, which I'd tested with actual cable pulls, works in that case but won't work (without big changes) in the SIGSTOP case
because we consider the connection "good" at the time it's established, not at the time of sending data
maybe that's incorrect to do
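
One possible direction, sketched under assumptions: treat a connection as "good" only once a response has actually come back on it, so that a freshly accepted socket to a stopped process never counts as an established connection. `NodeConnectionState` and its methods are hypothetical names for illustration; this is not how the client currently behaves, and it is not a worked-out fix.

```java
// Hypothetical sketch: not the client's real API.
class NodeConnectionState {
    enum Health { DISCONNECTED, CONNECTED_UNVERIFIED, VERIFIED }

    private Health health = Health.DISCONNECTED;

    /** The TCP connect completed; with kill -STOP this still succeeds. */
    void onConnectEstablished() {
        health = Health.CONNECTED_UNVERIFIED;
    }

    /** A response arrived, so the server process is really servicing the socket. */
    void onResponseReceived() {
        if (health != Health.DISCONNECTED) {
            health = Health.VERIFIED;
        }
    }

    /** The connection was dropped, for example after too many timeouts. */
    void onDropped() {
        health = Health.DISCONNECTED;
    }

    /** Only verified connections count as "found" for the re-bootstrap counter. */
    boolean isUsable() {
        return health == Health.VERIFIED;
    }
}
```

Under this assumption, a stopped node would stay in `CONNECTED_UNVERIFIED` after every reconnect, so the failures to find an established connection would keep accumulating toward the re-bootstrap threshold, the same way they do with a cable pull.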