Description
After I added the --stdin argument for memcached (and the writing of bootstrap DEKs to its stdin), I noticed that our cluster tests started to hang intermittently.
Example:
https://cv.jenkins.couchbase.com/job/ns-server-cluster-tests/9554/console
In those tests we see that node n_12 starts leaving the cluster at 04:30:15:
[cluster:debug,2024-08-21T04:30:15.147Z,n_12@127.0.0.1:ns_cluster<0.277.0>:ns_cluster:leave_body:475]Leaving cluster
One of the steps of that process is a memcached restart:
[ns_server:debug,2024-08-21T04:30:21.935Z,babysitter_of_n_12@cb.local:ns_ports_manager<0.141.0>:ns_ports_manager:handle_call:68]Restart of port memcached is requested
Right after the restart, memcached_config_mgr writes the bootstrap DEKs to memcached's stdin.
The string that is written looks like this:
BOOTSTRAP_DEK=<BootstrapKeysJson>\nDONE\n
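As an illustration of the framing described above, here is a minimal Python sketch that builds the same string. It is not the ns_server code (which is Erlang); the function name `build_bootstrap_message` and the empty key set are hypothetical, and the real structure of the keys JSON is not shown in this ticket.

```python
import json


def build_bootstrap_message(bootstrap_keys) -> bytes:
    """Frame bootstrap keys the way the description shows:
    BOOTSTRAP_DEK=<BootstrapKeysJson>\\nDONE\\n

    `bootstrap_keys` is a placeholder for whatever JSON value is really
    sent; its schema is an assumption and is not taken from the ticket.
    """
    # json.dumps without `indent` never emits raw newlines, so the JSON
    # stays on one line and "\n" can safely act as the terminator.
    payload = json.dumps(bootstrap_keys, separators=(",", ":"))
    return f"BOOTSTRAP_DEK={payload}\nDONE\n".encode("utf-8")


# Hypothetical "encryption is off" case: an empty JSON object.
message = build_bootstrap_message({})
print(message)
```

Since both this message and the later "shutdown" command travel over the same stdin stream, any reader on the memcached side has to consume the data up to DONE before it can see the command that follows.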
That happens at 04:30:22.177:
[ns_server:debug,2024-08-21T04:30:22.177Z,n_12@127.0.0.1:memcached_config_mgr<0.30212.0>:memcached_config_mgr:write_bootstrap_keys:140]0 bootstrap config keys were written to memcached's stdin (encryption is off)
Since that was the last test for this cluster, shutdown of this node is initiated immediately:
[ns_server:info,2024-08-21T04:30:22.185Z,babysitter_of_n_12@cb.local:<0.9.0>:ns_babysitter_bootstrap:stop:34]50068: got shutdown request. Terminating.
This triggers the shutdown of memcached:
[ns_server:debug,2024-08-21T04:30:22.190Z,babysitter_of_n_12@cb.local:<0.1436.0>:ns_port_server:terminate:207]Shutting down port memcached
[ns_server:debug,2024-08-21T04:30:22.190Z,babysitter_of_n_12@cb.local:<0.1436.0>:ns_port_server:port_shutdown:357]Shutdown command: "shutdown"
After that, ns_server waits more than 4 hours for memcached to stop. Memcached finally exits only because the Jenkins job hits its timeout and terminates everything:
[ns_server:info,2024-08-21T08:45:54.138Z,babysitter_of_n_12@cb.local:<0.1436.0>:ns_port_server:handle_info:152]Got {exit_status,0} from port memcached. Exiting normally
So it looks like the shutdown command was ignored by memcached.
At first I thought ns_server was sending shutdown before writing BootstrapKeys (which can indeed happen in the current implementation), but according to the logs that is not the case here: the keys are written at 04:30:22.177 and the shutdown command is sent at 04:30:22.190.
Note: the changes that add --stdin and bootstrap DEKs are not merged to master yet. Here is the Gerrit change where the problem happened: https://review.couchbase.org/c/ns_server/+/214623
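To double-check the ordering, the timestamps from the log lines quoted above can be compared directly. This is a small Python sketch; the variable names are mine and only the three timestamps come from the logs.

```python
from datetime import datetime, timedelta

# Timestamp format used in the ns_server log lines quoted in this ticket.
LOG_TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_ts(text: str) -> datetime:
    """Parse a log timestamp such as 2024-08-21T04:30:22.177Z."""
    return datetime.strptime(text, LOG_TS_FORMAT)


keys_written = parse_ts("2024-08-21T04:30:22.177Z")    # write_bootstrap_keys
shutdown_sent = parse_ts("2024-08-21T04:30:22.190Z")   # port_shutdown
memcached_exit = parse_ts("2024-08-21T08:45:54.138Z")  # {exit_status,0}

write_to_shutdown = shutdown_sent - keys_written
shutdown_to_exit = memcached_exit - shutdown_sent

# The shutdown command was sent after the keys were written.
assert shutdown_sent > keys_written
# memcached took more than 4 hours to exit after the shutdown command.
assert shutdown_to_exit > timedelta(hours=4)

print(write_to_shutdown)  # 0:00:00.013000
print(shutdown_to_exit)   # 4:15:31.948000
```

So the shutdown command followed the key write by only 13 ms, and memcached exited about 4 hours 15 minutes later.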