Details
- Bug
- Resolution: Fixed
- Critical
- 6.5.0
- Untriaged
- Yes
Description
When the rebalance fails with a mover crash error, we don't see any corresponding entries in the memcached logs. Also, the memcached log ends abruptly.
This was found when analysing rebalance failures with Jepsen tests. The test does the following:
- Setup a 6 node cluster
- Load 30 documents and keep updating them continuously, with the durability level set to replicate_to_majority
- Remove a node and start a rebalance.
- The rebalance fails with mover crash error
The following is the end of the memcached log file, which stops mid-line:
2019-07-03T01:10:27.138141-07:00 WARNING 50: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.104.255->ns_1@172.23.105.3:default - (vb:71) Setting stream to dead state, last_seqno is 0, unAckedBytes is 0, status is The stream closed early because the conn was disconnected |
2019-07-03T01:10:27.138147-07:00 WARNING 50: (default) DCP (Cons |
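A quick way to check the collected logs for this kind of cut-off is to look at whether the file ends in a newline. The sketch below is only a heuristic and assumes that memcached terminates every complete log entry with a newline; the function name find_truncated_tail is hypothetical and not part of any Couchbase tooling.

```python
from typing import Optional


def find_truncated_tail(log_bytes: bytes) -> Optional[str]:
    """Return the trailing partial line of a log, or None if it looks complete.

    Assumption: every complete log entry is terminated by a newline, so a
    file whose last byte is not a newline was cut off mid-entry.
    """
    if not log_bytes or log_bytes.endswith(b"\n"):
        return None
    # Everything after the last newline is the incomplete entry.
    tail = log_bytes.rsplit(b"\n", 1)[-1]
    return tail.decode("utf-8", errors="replace")
```

It can be run over the raw bytes of each memcached log file in the collected archive (for example, the result of open(path, "rb").read()); a non-None result points at a file that ended mid-entry, as in the excerpt above.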
Attaching the logs we collected from the tests. CB version: 6.5.0-3644
Another interesting pair of entries in the memcached log is:
2019-07-03T01:10:26.610863-07:00 ERROR (default) VBucket::abort (vb:439) failed as HashTable value is not CommittedState::Pending - <ud> SV @0x7f2a39d5f810 ..J ..R.Cp temp: seq:4 rev:1 cas:1562141406129225728 key:"cid:0x0:jepsen0022, size:b" exp:0 age:2 nru:0 fc:4 vallen:1 val age:2 :"8"</ud> |
2019-07-03T01:10:26.610874-07:00 WARNING 54: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.105.2->ns_1@172.23.105.3:default - PassiveStream::processAbort: vb:439 Got error 'invalid arguments' while trying to process abort |
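To see how many vbuckets hit this abort failure, the log can be scanned for the two messages above. This is a rough sketch: the marker strings and the vb:<id> pattern are taken from the two entries quoted here, and the helper name count_abort_failures is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Substrings taken from the two log entries quoted above.
ABORT_MARKERS = ("VBucket::abort", "PassiveStream::processAbort")
VB_RE = re.compile(r"vb:(\d+)")


def count_abort_failures(lines: Iterable[str]) -> Counter:
    """Count abort-related log entries per vbucket id."""
    counts = Counter()
    for line in lines:
        if any(marker in line for marker in ABORT_MARKERS):
            match = VB_RE.search(line)
            if match:
                counts[int(match.group(1))] += 1
    return counts
```

Passing the lines of a memcached log file to count_abort_failures gives a per-vbucket tally, which makes it easier to tell whether the failure is isolated to one vbucket or spread across many.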
To rerun the test, start a job at http://qa.sc.couchbase.com/job/jepsen-durability-trigger/ with the parameters cb_version=6.5.0 and cb_build=<latest build>. Wait 10 minutes for the sanity job to finish, then trigger another job from http://qa.sc.couchbase.com/job/jepsen-durability-rebalance-daily with the same parameters.