Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-32181

DCP connections can be prematurely disconnected if an Ephemeral bucket is present

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 5.5.3, 6.0.1
    • 5.0.0, 5.0.1, 5.1.0, 5.1.1, 5.1.2, 5.1.3, 5.5.0, 5.5.1, 5.5.2, 6.0.0
    • couchbase-bucket
    • Untriaged
    • Yes

    Description

      Summary

      Due to a bug in how one of the Ephemeral bucket background tasks is scheduled; nonIO tasks can be arbitrarily delayed in being executed. This can result in DCP connections with NO-OPs enabled getting disconnected incorrectly; as the DCP Connection Notifier which checks for NO-OP responses incorrectly believes they haven't sent a NO-OP in the required time period.

      The consequences of this bug include DCP clients such as GSI having their DCP connections abruptly disconnected, causing index building to fail.

      While this bug requires at least one Ephemeral bucket to be present, the nonIO task scheduling bug affects nonIO tasks from all buckets.

      Details

      The EphTombstoneStaleItemDeleter task is supposed to be woken up and scheduled once the paired EphTombstoneHTCleaner task
      completes. However, due to a mis-use of the GlobalTask::wakeUp() function the task was not actually scheduled - only it's desired wakeupTime was updated.

      This manifested in the logs as a super-slow scheduling time - e.g.

      2018-11-29T13:55:18.566925Z WARNING (ephemeral-1) Slow scheduling for NON_IO task 'Eph tombstone stale item deleter' on thread nonIO_worker_1. Schedule overhead: 10 s
      

      An immediate consequence is the task wasn't run when desired. However, it also had a knock-on effect of breaking the ordering of the
      futureQueue (a std::priority_queue); as that uses the wakeTime as the ordering property.

      Because the expected invariants of the futureQueue have been broken, the ordering of Tasks in the FutureQueue is also broken (as we use
      std::push_heap to add elements which assumes the queue is correctly sorted). As a result, other nonIO tasks can fail to be scheduled at the desired time.

      This can affect all nonIO tasks which are periodically scheduled; a non-exhaustive list of which includes:

      • 'Adjusting hash table sizes.'
      • 'Connection Manager'
      • 'DCP connection notifier'
      • 'Eph tombstone hashtable cleaner'
      • 'Eph tombstone stale item deleter'
      • 'Hashtable resizer'
      • 'Item Compressor'
      • 'Memory defragmenter'
      • 'Monitoring a workload pattern'
      • 'Paging out items.'
      • 'Removing closed unreferenced checkpoints from memory'

      Note this includes the DCP connection notifier which is responsible for checking if DCP no-op timeouts have been exceeded, and can cause DCP connections to be incorrectly considered to have exceeded the no-op interval and hence disconnected - for example:

      2018-11-29T15:03:13.029478Z INFO (ephemeral-4) DCP (Consumer) eq_dcpq:replication:ns_1@data-1->ns_1@data-2:ephemeral-4 - Disconnecting because a message has not been received for 360s. lastMessageTime:361
      2018-11-29T15:03:13.032533Z WARNING (ephemeral-6) Slow scheduling for NON_IO task 'DCP connection notifier' on thread nonIO_worker_1. Schedule overhead: 554 s
      

      Attachments

        Activity

          People

            anitha.kuberan Anitha Kuberan
            drigby Dave Rigby (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              PagerDuty