Details
-
Bug
-
Resolution: Fixed
-
Critical
-
6.6.1, 6.6.2, 6.6.5, 7.0.0, 7.0.1
-
Triaged
-
1
-
No
-
KV-Engine Sprint 2021 August
Description
Summary
If warmup is cancelled before it gets past the "populateVBucketMap" phase, and there are outstanding client requests which are blocked until populateVBucketMap completes, then KV-Engine never notifies the waiting client connections.
As such, any subsequent deletion of the affected bucket is blocked forever, as the KV-Engine frontend is waiting in vain for those connections to finish their current operation.
Details
This kind of scenario can occur during delta-node recovery; if ns_server doesn't read vBucket states for all warming up buckets within the prepare_delta_recovery_bucket timeout value - default 180s. If that timeout is reached, it cancels the delta recovery and deletes the Buckets on the affected node. If that "DELETE_BUCKET" command is received before ep-engine has finished the populateVBucketMap phase of warmup then the waiting GET_VBSTATE command from ns_server is stuck forever and delete cannot complete.
This looks like the following in the logs:
2021-08-01T00:01:44.901492-04:00 INFO 4 Delete bucket [Customers]. Notifying engine
|
2021-08-01T00:01:44.901662-04:00 INFO 4 Delete bucket [Customers]. Engine ready for shutdown
|
2021-08-01T00:01:44.901669-04:00 INFO 4 Delete bucket [Customers]. Wait for 4 clients to disconnect
|
2021-08-01T00:02:14.911226-04:00 INFO 4 Delete bucket [Customers]. Still waiting: 1 clients connected: ...
|
2021-08-01T00:02:44.917036-04:00 INFO 4 Delete bucket [Customers]. Still waiting: 1 clients connected (state is unchanged).
|
2021-08-01T00:03:14.923323-04:00 INFO 4 Delete bucket [Customers]. Still waiting: 1 clients connected (state is unchanged).
|
And the waiting connection is typically an ns_server STATS(vbucket} request:
2021-08-01T00:02:14.911226-04:00 INFO 4 Delete bucket [Customers]. Still waiting: 1 clients connected: {"420":{"agent_name":"regular","bucket_index":1,"clustermap_revno":"unknown","connection":"0x00006fe0ae035580","connection_id":"127.0.0.1:43663","cookies":[{"aiostat":"would block","connection":"[ 127.0.0.1:43663 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x0000000000000000","ewouldblock":true,"packet":{"bodylen":7,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>vbucket</ud>","keylen":7,"magic":"ClientRequest","opaque":0,"opcode":"STAT","vbucket":0},"refcount":0}],"datatype":"json","dcp":false,"dcp_no_value":false,"dcp_xattr_aware":false,"features":["tcp nodelay"],"internal":true,"iov":{"size":10,"used":0},"itemlist":{"size":0},"libevent":{"ev_flags":{"decoded":["read","persist"],"raw":"0x0012"},"registered":false,"which":{"decoded":["read"],"raw":"0x0002"}},"max_reqs_per_event":20,"max_sched_time":"31252","min_sched_time":"887","msglist":{"bytes":0,"curr":0,"used":1},"nevents":20,"parent_port":11209,"peername":"127.0.0.1:43663","priority":"Medium","protocol":"memcached","read":{"buffer":123256223688704,"empty":false,"read_head":0,"size":2048,"write_head":31},"refcount":1,"sasl_enabled":true,"socket":420,"sockname":"127.0.0.1:11209","ssl":{"enabled":false},"state":"execute","temp_alloc_list":{"size":0},"thread":47,"total_cpu_time":"3010287","total_recv":262,"total_send":28610,"username":"@ns_server","write_and_go":"new_cmd","yields":0}}
|