Affects Version/s: 2.7.3
Fix Version/s: 2.7.4
Environment: Linux ip-10-0-129-240 4.4.11-23.53.amzn1.x86_64 #1 SMP Wed Jun 1 22:22:50 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
When running cbc-pillowfight, if a node is stopped and a failover is triggered, cbc-pillowfight does not recover.
Version tested as failing: 2.7.3
Version tested as working: 2.6.4
Steps to reproduce:
Note: a brief video showing the reproduction steps is here:
- Create 3 node cluster, 1 bucket
- From a separate application server, run cbc-pillowfight
- On one node in the cluster, run:
- service couchbase-server stop
- The node will show up as 'down' in the UI
- There are fewer than 1024 active vbuckets.
- Pillowfight's traffic load will become sporadic due to timeouts (this is expected, I think).
- Trigger a failover of the down node from the UI
- There are now 1024 active vbuckets
- Pillowfight does not recover, traffic remains sporadic.
- ... ...
- Start the couchbase node again
- service couchbase-server start
- Add the node back in with 'delta recovery' option.
- Issue rebalance
- As soon as the rebalance is started (and before it is complete!), pillowfight appears to run full traffic loads successfully again.
- I found the same behaviour regardless of whether couchbase was shut down cleanly or had a 'hard stop'.
- My tests showed that this worked successfully on 2.6.4. I will collect additional logs for this version and attach them to the ticket.
- pillowfight.log: a -vvv log from the pillowfight client, run during the full process outlined above, with version 2.7.3.
- Screenshots showing the traffic load at different stages (pre node down, during node down/failover, after rebalance started)
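The vbucket counts quoted in the steps above could be cross-checked against the cluster map rather than read off the UI. The following is a minimal sketch, assuming the bucket configuration JSON has already been fetched and parsed, and that it carries a `vBucketServerMap` with `serverList` and `vBucketMap` fields, where the first entry of each chain is the index of the active node and -1 means no owner. The function name and the sample data are hypothetical (8 vbuckets instead of 1024). It counts map ownership only and does not check whether the owning node is reachable, so it may differ from the UI figure while a node is down but not yet failed over.

```python
def active_vbuckets_per_node(bucket_config):
    """Count active vbuckets per server from a parsed bucket config dict.

    Returns (per_node_counts, unowned), where `unowned` is the number of
    vbuckets whose active slot is -1 (no node assigned).
    """
    vb_map = bucket_config["vBucketServerMap"]
    servers = vb_map["serverList"]
    counts = {server: 0 for server in servers}
    unowned = 0
    for chain in vb_map["vBucketMap"]:
        active = chain[0]  # first entry in each chain is the active copy
        if active < 0:
            unowned += 1
        else:
            counts[servers[active]] += 1
    return counts, unowned


# Hypothetical 3-node map with 8 vbuckets; addresses are made up.
sample = {
    "vBucketServerMap": {
        "serverList": ["10.0.0.1:11210", "10.0.0.2:11210", "10.0.0.3:11210"],
        "vBucketMap": [
            [0, 1], [0, 2], [1, 0], [1, 2],
            [2, 0], [2, 1], [0, 1], [-1, 2],
        ],
    }
}

per_node, unowned = active_vbuckets_per_node(sample)
print(per_node, "unowned:", unowned)
```

After a failover, the sum of the per-node counts would be expected to return to the full vbucket total with `unowned` at 0; if the client's view of the map does not match, that would point at the client not picking up the new configuration.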
Server-side logs are available: