I think this is completely normal, correct behavior; see https://docs.couchbase.com/server/current/cli/cbcli/couchbase-cli-setting-autoreprovision.html.
"Auto-reprovisioning" can happen for ephemeral buckets when memcached restarts - and it happened here. If memcached restarts it will lose all data for active vbuckets in the ephemeral bucket on that node that restarted. If the janitor onlines the bucket on that node, the replicas will connect to it and immediately drop all their data.
So what happens is that the janitor checks to see if the memcached vbuckets are in the "missing" state and if they are and the bucket is of type ephemeral, ns_server won't online vbuckets on the node which restarted but will rather activate replica vbuckets elsewhere in the cluster. A rebalance is required to get things back to normal.
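For reference, this behavior is tunable through the setting-autoreprovision command documented at the link above. A minimal sketch of raising the limit (the host, credentials, and max-nodes value of 2 are placeholders for this example, not values taken from this cluster):

# Allow auto-reprovisioning of ephemeral buckets for up to 2 restarted nodes.
couchbase-cli setting-autoreprovision -c 172.23.106.228:8091 \
  -u Administrator -p password \
  --enabled 1 --max-nodes 2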
In this case, .102 goes down:
2021-03-01T10:42:08.868-08:00, ns_node_disco:5:warning:node down(ns_1@172.23.105.152) - Node 'ns_1@172.23.105.152' saw that node 'ns_1@172.23.123.102' went down. Details: [{nodedown_reason,connection_closed}]
2021-03-01T10:42:08.869-08:00, ns_node_disco:5:warning:node down(ns_1@172.23.98.215) - Node 'ns_1@172.23.98.215' saw that node 'ns_1@172.23.123.102' went down. Details: [{nodedown_reason,connection_closed}]
2021-03-01T10:42:08.869-08:00, ns_node_disco:5:warning:node down(ns_1@172.23.107.52) - Node 'ns_1@172.23.107.52' saw that node 'ns_1@172.23.123.102' went down. Details: [{nodedown_reason,connection_closed}]
...
The rebalance starts:
2021-03-01T10:42:30.361-08:00, ns_orchestrator:0:info:message(ns_1@172.23.106.228) - Starting rebalance, KeepNodes = ['ns_1@172.23.106.228','ns_1@172.23.123.102'], EjectNodes = ['ns_1@172.23.123.100','ns_1@172.23.123.101','ns_1@172.23.107.52','ns_1@172.23.105.152','ns_1@172.23.98.215'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 8a3dc656fb73df01a78d3f9678c0c550
.102 comes back up:
2021-03-01T10:42:35.883-08:00, ns_node_disco:4:info:node up(ns_1@172.23.105.152) - Node 'ns_1@172.23.105.152' saw that node 'ns_1@172.23.123.102' came up. Tags: []
2021-03-01T10:42:35.886-08:00, ns_node_disco:4:info:node up(ns_1@172.23.123.102) - Node 'ns_1@172.23.123.102' saw that node 'ns_1@172.23.105.152' came up. Tags: []
2021-03-01T10:42:35.900-08:00, ns_node_disco:4:info:node up(ns_1@172.23.106.228) - Node 'ns_1@172.23.106.228' saw that node 'ns_1@172.23.123.102' came up. Tags: []
2021-03-01T10:42:35.906-08:00, ns_node_disco:4:info:node up(ns_1@172.23.123.102) - Node 'ns_1@172.23.123.102' saw that node 'ns_1@172.23.106.228' came up. Tags: []
...
Before rebalancing, ns_server checks whether onlining a node that has been restarted would lose data in any ephemeral bucket; here it would, so the rebalance correctly stops:
2021-03-01T10:42:37.211-08:00, ns_orchestrator:0:critical:message(ns_1@172.23.106.228) - Rebalance exited with reason {pre_rebalance_janitor_run_failed,"default",{error,unsafe_nodes,['ns_1@172.23.123.102']}}.
Subsequently, the janitor runs and auto-reprovisions the ephemeral bucket's vbuckets on .102:
2021-03-01T10:42:41.487-08:00, auto_reprovision:0:info:message(ns_1@172.23.106.228) - Bucket "default" has been reprovisioned on following nodes: ['ns_1@172.23.123.102']. Nodes on which the data service restarted: ['ns_1@172.23.123.102'].
2021-03-01T10:42:41.521-08:00, auto_reprovision:0:info:message(ns_1@172.23.106.228) - auto-reprovision is disabled as maximum number of nodes (1) that can be auto-reprovisioned has been reached.
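As noted above, a rebalance is what brings things back to normal after the reprovision (and after the max-nodes limit has been hit). A minimal sketch of kicking one off via the CLI (host and credentials are placeholders):

# Rebalance the cluster so replicas are rebuilt for the reprovisioned bucket.
couchbase-cli rebalance -c 172.23.106.228:8091 -u Administrator -p password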
I think this should be resolved as "feature is working correctly".
Can you update with cbcollects?