Couchbase Server / MB-56067

Delete bucket task can block NonIO tasks and deadlock if any are needed for bucket shutdown


    Description

      In MB-46729 (patch in 7.1.3), Core_DeleteBucketTask became a NonIO task, having previously executed on memcached's own separate thread pool.

      This task calls BucketManager::destroy, which eventually sleeps until all remaining connections to the bucket have disconnected.

      This blocks a NonIO thread for the entire duration.

      If any DCP connections are present, they block bucket deletion until they release their cookie, which happens when their ConnHandler is destroyed.

      ConnHandlers are destroyed by the ConnManager task, which runs once a second and is itself a NonIO task.

      If the number of buckets being destroyed equals or exceeds the number of NonIO threads, Core_DeleteBucketTask instances occupy every NonIO thread. ConnManager then never gets to run, the DCP connections never release their cookies, and bucket deletion never progresses: a deadlock.
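
      To make the failure mode concrete, here is a minimal standalone C++ sketch of the same starvation pattern (illustrative only, not kv_engine code; all names are invented): every thread of a fixed-size pool is occupied by a task blocking on a condition that only another task queued on the same pool could satisfy.

      #include <atomic>
      #include <chrono>
      #include <condition_variable>
      #include <functional>
      #include <iostream>
      #include <mutex>
      #include <queue>
      #include <thread>
      #include <vector>

      // Fixed-size worker pool, standing in for the NonIO pool.
      class FixedPool {
      public:
          explicit FixedPool(size_t n) {
              for (size_t i = 0; i < n; ++i) {
                  workers.emplace_back([this] { run(); });
              }
          }
          ~FixedPool() {
              {
                  std::lock_guard<std::mutex> lock(mtx);
                  stop = true;
              }
              cv.notify_all();
              for (auto& w : workers) {
                  w.join();
              }
          }
          void submit(std::function<void()> task) {
              {
                  std::lock_guard<std::mutex> lock(mtx);
                  tasks.push(std::move(task));
              }
              cv.notify_one();
          }

      private:
          void run() {
              for (;;) {
                  std::function<void()> task;
                  {
                      std::unique_lock<std::mutex> lock(mtx);
                      cv.wait(lock, [this] { return stop || !tasks.empty(); });
                      if (tasks.empty()) {
                          return; // stopping, no work left
                      }
                      task = std::move(tasks.front());
                      tasks.pop();
                  }
                  task();
              }
          }
          std::mutex mtx;
          std::condition_variable cv;
          std::queue<std::function<void()>> tasks;
          std::vector<std::thread> workers;
          bool stop = false;
      };

      int main() {
          FixedPool pool(1); // num_nonio_threads=1, as in the repro below
          std::atomic<int> clients{1};

          // Analogue of Core_DeleteBucketTask: blocks its thread waiting
          // for all clients to disconnect (bounded so the demo terminates).
          pool.submit([&] {
              for (int i = 0; i < 10 && clients.load() != 0; ++i) {
                  std::cout << "Delete bucket: still waiting for clients\n";
                  std::this_thread::sleep_for(std::chrono::milliseconds(100));
              }
              if (clients.load() != 0) {
                  std::cout << "Deadlocked: the release task never ran\n";
              }
          });
          // Analogue of ConnManager releasing the cookie: queued behind the
          // blocking task on the fully-occupied pool, so it never runs while
          // the wait above is in progress.
          pool.submit([&] { clients.store(0); });
      }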

      The default number of NonIO threads is floor(30% of the CPU core count), clamped to the range [2, 8]. Nodes with fewer than 10 cores therefore get 2 threads, so just 2 concurrent bucket deletions are enough to trigger the deadlock.
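
      A minimal sketch of that sizing rule, reconstructed from the description above (not the actual kv_engine code):

      #include <algorithm>
      #include <iostream>

      // floor(30% of the core count), clamped to [2, 8].
      size_t defaultNonIoThreads(size_t cores) {
          return std::clamp<size_t>((cores * 3) / 10, 2, 8);
      }

      int main() {
          for (size_t cores : {4, 8, 10, 16, 24, 32, 40}) {
              std::cout << cores << " cores -> "
                        << defaultNonIoThreads(cores) << " NonIO threads\n";
          }
      }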

      This can easily be demonstrated by manually setting num_nonio_threads to 1 and attempting to drop a bucket:

      2023-03-21T15:57:14.482788+00:00 INFO 53: Delete bucket [default]. Wait for 5 clients to disconnect
      2023-03-21T15:59:15.082490+00:00 INFO 53: Delete bucket [default]. Still waiting: 1 clients connected: {"68":{"agent_name":"proxy","bucket_index":1,"clustermap":{"epoch":-1,"revno":0},"connection":"0x000000010de9fa00","connection_id":"127.0.0.1:53664","cookies":[{"aiostat":"success","ewouldblock":false,"packet":null,"refcount":1,"throttled":false}],"datatype":"snappy,json,xattr","dcp":true,"dcp_deleted_user_xattr":true,"dcp_no_value":false,"dcp_xattr_aware":true,"features":["tcp nodelay","collections"],"internal":true,"last_used":"1007 ms","max_reqs_per_event":20,"max_sched_time":"2131","min_sched_time":"835","nevents":20,"peername":{"ip":"127.0.0.1","port":53664},"priority":"Medium","protocol":"memcached","refcount":1,"sasl_enabled":false,"sendqueue":{"last":4667080928373459,"size":0,"term":true},"socket":68,"sockname":{"ip":"127.0.0.1","port":11999},"ssl":false,"state":"pending close","thread":1,"total_cpu_time":"361361555","total_queued_send":68338,"total_recv":51754,"total_send":68338,"type":"consumer","user":{"domain":"local","name":"@ns_server"},"yields":0}}
      2023-03-21T16:00:15.401406+00:00 INFO 53: Delete bucket [default]. Still waiting: 1 clients connected: {"68":{"agent_name":"proxy","bucket_index":1,"clustermap":{"epoch":-1,"revno":0},"connection":"0x000000010de9fa00","connection_id":"127.0.0.1:53664","cookies":[{"aiostat":"success","ewouldblock":false,"packet":null,"refcount":1,"throttled":false}],"datatype":"snappy,json,xattr","dcp":true,"dcp_deleted_user_xattr":true,"dcp_no_value":false,"dcp_xattr_aware":true,"features":["tcp nodelay","collections"],"internal":true,"last_used":"1003 ms","max_reqs_per_event":20,"max_sched_time":"1200","min_sched_time":"835","nevents":20,"peername":{"ip":"127.0.0.1","port":53664},"priority":"Medium","protocol":"memcached","refcount":1,"sasl_enabled":false,"sendqueue":{"last":4667080928373459,"size":0,"term":true},"socket":68,"sockname":{"ip":"127.0.0.1","port":11999},"ssl":false,"state":"pending close","thread":1,"total_cpu_time":"361479482","total_queued_send":68338,"total_recv":51754,"total_send":68338,"type":"consumer","user":{"domain":"local","name":"@ns_server"},"yields":0}}
      2023-03-21T16:01:15.703727+00:00 INFO 53: Delete bucket [default]. Still waiting: 1 clients connected: {"68":{"agent_name":"proxy","bucket_index":1,"clustermap":{"epoch":-1,"revno":0},"connection":"0x000000010de9fa00","connection_id":"127.0.0.1:53664","cookies":[{"aiostat":"success","ewouldblock":false,"packet":null,"refcount":1,"throttled":false}],"datatype":"snappy,json,xattr","dcp":true,"dcp_deleted_user_xattr":true,"dcp_no_value":false,"dcp_xattr_aware":true,"features":["tcp nodelay","collections"],"internal":true,"last_used":"1005 ms","max_reqs_per_event":20,"max_sched_time":"1646","min_sched_time":"835","nevents":20,"peername":{"ip":"127.0.0.1","port":53664},"priority":"Medium","protocol":"memcached","refcount":1,"sasl_enabled":false,"sendqueue":{"last":4667080928373459,"size":0,"term":true},"socket":68,"sockname":{"ip":"127.0.0.1","port":11999},"ssl":false,"state":"pending close","thread":1,"total_cpu_time":"361593458","total_queued_send":68338,"total_recv":51754,"total_send":68338,"type":"consumer","user":{"domain":"local","name":"@ns_server"},"yields":0}}
      ... 
      bucket deletion never proceeds
      

      This is severely detrimental. It was first seen as a repeated rebalance failure (when trying to remove buckets from a node), but it would equally affect explicit bucket drop attempts. Other NonIO tasks, including those belonging to other buckets, will similarly be impeded:

      TASK(NotifyHighPriorityReqTask, NONIO_TASK_IDX, 0)
      TASK(DurabilityCompletionTask, NONIO_TASK_IDX, 1)
      TASK(DurabilityTimeoutTask, NONIO_TASK_IDX, 1)
      TASK(DurabilityTimeoutVisitor, NONIO_TASK_IDX, 1)
      TASK(ActiveStreamCheckpointProcessorTask, NONIO_TASK_IDX, 3)
      TASK(CheckpointDestroyerTask, NONIO_TASK_IDX, 6)
      TASK(CheckpointMemRecoveryTask, NONIO_TASK_IDX, 6)
      TASK(VBucketMemoryDeletionTask, NONIO_TASK_IDX, 6)
      TASK(SeqnoPersistenceNotifyTask, NONIO_TASK_IDX, 1)
      

      So this could also affect, for all buckets on a node:

      • memory recovery from deleting checkpoints, or cursor dropping
      • sync write completion/timeout
      • seqno persistence
      • vbucket deletion
      • in-memory streaming of items

      In master, this will also prevent fetching any per-bucket stats via CMD_STAT:

      TASK(Core_StatsBucketTask, NONIO_TASK_IDX, 10)
      

      and potentially block any SASL-authenticated connection:

      TASK(Core_SaslStartTask, NONIO_TASK_IDX, 10)
      TASK(Core_SaslStepTask, NONIO_TASK_IDX, 10)
      

      This state will not resolve itself; to recover, the number of NonIO threads can be increased, or memcached can be restarted.

      This is a regression; previously the task was executed in a separate thread pool.
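
      For illustration, a sketch of the pre-regression arrangement (hypothetical stand-in functions, not the actual memcached code): the blocking wait ran on its own dedicated thread, which could sleep indefinitely without starving shared task threads.

      #include <chrono>
      #include <iostream>
      #include <string>
      #include <thread>

      // Hypothetical stand-ins for the real shutdown steps.
      void waitForConnectionsToDisconnect(const std::string& bucket) {
          std::this_thread::sleep_for(std::chrono::milliseconds(50));
          std::cout << "all clients disconnected from " << bucket << '\n';
      }
      void completeBucketDeletion(const std::string& bucket) {
          std::cout << "deleted " << bucket << '\n';
      }

      int main() {
          // A dedicated thread may block for as long as needed; NonIO
          // threads stay free to run ConnManager and release DCP cookies.
          std::thread deleter([] {
              waitForConnectionsToDisconnect("default");
              completeBucketDeletion("default");
          });
          deleter.join();
      }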


People

    Assignee: James Harrison (Inactive)
    Reporter: James Harrison (Inactive)
    Votes: 0
    Watchers: 8
