Due to a bug in how one of the Ephemeral bucket background tasks is scheduled; nonIO tasks can be arbitrarily delayed in being executed. This can result in DCP connections with NO-OPs enabled getting disconnected incorrectly; as the DCP Connection Notifier which checks for NO-OP responses incorrectly believes they haven't sent a NO-OP in the required time period.
The consequences of this bug include DCP clients such as GSI having their DCP connections abruptly disconnected, causing index building to fail.
While this bug requires at least one Ephemeral bucket to be present, the nonIO task scheduling bug affects nonIO tasks from all buckets.
The EphTombstoneStaleItemDeleter task is supposed to be woken up and scheduled once the paired EphTombstoneHTCleaner task
completes. However, due to a mis-use of the GlobalTask::wakeUp() function the task was not actually scheduled - only it's desired wakeupTime was updated.
This manifested in the logs as a super-slow scheduling time - e.g.
An immediate consequence is the task wasn't run when desired. However, it also had a knock-on effect of breaking the ordering of the
futureQueue (a std::priority_queue); as that uses the wakeTime as the ordering property.
Because the expected invariants of the futureQueue have been broken, the ordering of Tasks in the FutureQueue is also broken (as we use
std::push_heap to add elements which assumes the queue is correctly sorted). As a result, other nonIO tasks can fail to be scheduled at the desired time.
This can affect all nonIO tasks which are periodically scheduled; a non-exhaustive list of which includes:
- 'Adjusting hash table sizes.'
- 'Connection Manager'
- 'DCP connection notifier'
- 'Eph tombstone hashtable cleaner'
- 'Eph tombstone stale item deleter'
- 'Hashtable resizer'
- 'Item Compressor'
- 'Memory defragmenter'
- 'Monitoring a workload pattern'
- 'Paging out items.'
- 'Removing closed unreferenced checkpoints from memory'
Note this includes the DCP connection notifier which is responsible for checking if DCP no-op timeouts have been exceeded, and can cause DCP connections to be incorrectly considered to have exceeded the no-op interval and hence disconnected - for example: