Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-53695

[CBBS] Running task can have status waiting due to concurrency

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 7.6.0
    • 7.1.0
    • tools
    • Untriaged
    • 1
    • Unknown

    Description

      What happens
      In a forum post Hyun-Ju Vega noticed that we see a weird series of task statuses for the given logs:

       ❯ rg "status\"|cbbackupmgr backup|Received store event|Added task|Running task|Dispatching task|Set running task|Task dispatched"
      CBLogs.txt
      2:2022-09-12T10:00:05.198Z INFO (Dispatcher) Dispatching task {"clusterName": "self", "repositoryID": "PersistentStorage_06_09_2022", "task": "Hourly"}
      3:2022-09-12T10:00:05.678Z INFO (Dispatcher) Dispatching task {"targetNode": {"node_id":"5dfdb3ca7f84ab86904f08fac9d9eef3","host":"10.100.110.129","http_port":8097,"grpc_port":9124,"status":"","leader":false}}
      5:2022-09-12T10:00:05.811Z INFO (Worker) Added task to queue {"cluster": "self", "repository": "PersistentStorage_06_09_2022", "task": "Hourly"}
      6:2022-09-12T10:00:05.811Z INFO (Worker) Running task {"cluster": "self", "repositoryID": "PersistentStorage_06_09_2022", "state": "active", "taskName": "Hourly", "taskType": "BACKUP"}
      7:2022-09-12T10:00:05.812Z DEBUG (Leader Manager) Received store event {"eventType": 5}
      8:2022-09-12T10:00:05.812Z DEBUG (Event Handler) (Node Run) Task updates {"cluster": "self", "repository": "PersistentStorage_06_09_2022", "task": "Hourly", "status": "running"}
      10:2022-09-12T10:00:05.813Z INFO (Runner) Running command /opt/couchbase/bin/cbbackupmgr backup --cluster http://10.100.110.129:8091 --archive s3://<obscured-s3-bucket>/PersistentStorage --repo 0ef4be99-30f6-46b3-a1e3-8abf0af43ace --purge --auto-select-threads --json-progress 994e4338-8f08-4530-adb5-04880e8a3e5d --obj-region eu-west-1 --obj-staging-dir /cb_backup/PersistentStorage_06_09_2022-0ef4be99-30f6-46b3-a1e3-8abf0af43ace --obj-auth-by-instance-metadata
      12:2022-09-12T10:00:05.815Z INFO (Dispatcher) Set running task {"task": {"task_name":"Hourly","status":"waiting","start":"2022-09-12T10:00:05.230739377Z","error_code":0,"type":"BACKUP"}}
      13:2022-09-12T10:00:05.822Z INFO (Dispatcher) Task dispatched {"nodeID": "5dfdb3ca7f84ab86904f08fac9d9eef3"}
      14:2022-09-12T10:00:05.839Z DEBUG (Leader Manager) Received store event {"eventType": 5}
      15:2022-09-12T10:00:05.839Z DEBUG (Event Handler) (Node Run) Task updates {"cluster": "self", "repository": "PersistentStorage_06_09_2022", "task": "Hourly", "status": "waiting"}
      

      It appears the task status goes waiting->running->waiting, which should not happen.

      Explanation
      The interesting parts here are when we update node runs (eventType 5). Tasks come in to a worker in the Add method. Here we add the task to a channel (which the worker drains), take a lock to update some state and then call SetNodeRun. The channel is drained and tasks are Run. This is the main method in running tasks and we SetNodeRun through the call to updateWorkerCurrentTask.

      I think what is happening is Add and Run are on different coroutines. When we send over the channel the Run goroutine wakes up immediately and executes - managing to call updateWorkerCurrentTask before Add has called SetNodeRun.

      Fix
      We could check the task isn't already running before updating it.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            gilad.kalchheim Gilad Kalchheim
            Matt.Hall Matt Hall
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty