While investigating an issue in the Java client library with retrying operations based on not-my-vbucket responses, I noticed that at the end of a rebalance that removes a server, the server being removed drops the connection while operations are in flight.
Ideally, there would be a period of time after the takeover, while the bucket transitions from active to dead, during which it would respond only with not-my-vbucket replies.
Unfortunately, the current behavior means that application code, at best, needs to handle more complex failure logic. At worst, if the application does not handle the failure, it could lead to data loss.
The challenge here is determining how long that period of time should be. Some clients do not disconnect, and the server has no polite hangup.
The attached log demonstrates the issue, and the attached test program can be used to observe it. The test was carried out as follows:
1) Set up a 3-node cluster with a default bucket of the Couchbase type
2) Start the test program. The first argument is the number of seconds to run; the arguments after that are the hostnames/IPs of the nodes in the cluster
3) Remove a node from the cluster
Expected behavior: All operations sent to the server receive a not-my-vbucket reply and are rescheduled as we receive config updates from the server.
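To make the expected behavior concrete, here is a minimal sketch of a client that resends an operation after a not-my-vbucket reply once it has picked up a new config. This is a simplified model, not the client library's code: the class names, the map-based config, and the `putAll` call standing in for a server config update are all assumptions made for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical model of the expected behavior; these are not the real client's classes. */
final class NotMyVbucketRetrySketch {

    enum Reply { SUCCESS, NOT_MY_VBUCKET }

    /** Cluster-side truth: which node currently owns each vbucket. */
    static final class Cluster {
        final Map<Integer, String> owners = new HashMap<>();

        Reply handle(String node, int vbucket) {
            boolean isOwner = node != null && node.equals(owners.get(vbucket));
            return isOwner ? Reply.SUCCESS : Reply.NOT_MY_VBUCKET;
        }
    }

    /** Client side: holds a possibly stale copy of the vbucket map. */
    static final class Client {
        final Map<Integer, String> config = new HashMap<>();
        int notMyVbucketReplies = 0;

        boolean send(Cluster cluster, int vbucket, int maxResends) {
            for (int attempt = 0; attempt <= maxResends; attempt++) {
                Reply reply = cluster.handle(config.get(vbucket), vbucket);
                if (reply == Reply.SUCCESS) {
                    return true;
                }
                // not-my-vbucket is a definite "not applied here", so the
                // operation can be rescheduled without risk of a double apply.
                notMyVbucketReplies++;
                // Assumption: stands in for a config update received from the server.
                config.putAll(cluster.owners);
            }
            return false;
        }
    }

    public static void main(String[] args) {
        Cluster cluster = new Cluster();
        cluster.owners.put(7, "node-b");   // vbucket 7 has moved off node-a

        Client client = new Client();
        client.config.put(7, "node-a");    // the client's map is stale

        boolean ok = client.send(cluster, 7, 3);
        System.out.println(ok + " after " + client.notMyVbucketReplies + " not-my-vbucket reply(ies)");
    }
}
```

The point of the model is that every reply carries a definite status, so the client never has to guess whether an operation was applied.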
Observed behavior: At the end of the remove server/rebalance cycle, the connection is dropped and the client cancels the in-flight operations, since it cannot know the status of those operations.
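As a rough illustration of the extra failure logic this pushes onto applications, the sketch below re-issues a write whose future was canceled. It uses the JDK's `CompletableFuture` as a stand-in for whatever future type the client library returns, and `writeWithReissue` is a hypothetical helper, not part of any library. Re-issuing is only reasonable for idempotent operations, since a canceled write may or may not have been applied.

```java
import java.util.concurrent.CancellationException;
import java.util.concurrent.CompletableFuture;
import java.util.function.Supplier;

/** Hypothetical application-side handling; CompletableFuture stands in for the client's future type. */
final class CanceledOperationSketch {

    /**
     * Issues a write and re-issues it if the returned future was canceled.
     * Assumption: the write is idempotent, because a canceled write has
     * unknown status and may already have been applied by the server.
     */
    static boolean writeWithReissue(Supplier<CompletableFuture<Boolean>> issueWrite, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                // join() throws CancellationException if the future was canceled.
                return issueWrite.get().join();
            } catch (CancellationException e) {
                // Status unknown: fall through and issue the write again.
            }
        }
        // Still canceled after maxAttempts; the caller has to treat the write as failed.
        return false;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Simulated write: canceled on the first call (connection dropped), succeeds afterwards.
        Supplier<CompletableFuture<Boolean>> write = () -> {
            CompletableFuture<Boolean> future = new CompletableFuture<>();
            if (calls[0]++ == 0) {
                future.cancel(false);
            } else {
                future.complete(true);
            }
            return future;
        };
        System.out.println(writeWithReissue(write, 3));
    }
}
```

An application that omits this kind of handling simply loses the canceled writes, which is the data-loss case described above.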