Couchbase Server / MB-60378

[CBM] REST request can hang indefinitely if we fail and then can't retrieve the cluster config

Details

    • Bug
    • Resolution: Unresolved
    • Major
    • Morpheus
    • 7.6.0
    • tools, tools-common
    • None
    • Untriaged
    • 0
    • No

    Description

      What is the problem?

      Gilad Kalchheim saw a test with large cbbackupmgr logs. They looked like:

      2024-01-15T03:06:55.967-08:00 (Plan) (Query) Transferring Query metadata for bucket 'default'
      2024-01-15T03:06:55.967-08:00 (REST) (Attempt 1) (GET) Dispatching request to 'http://172.23.136.138:8093/api/v1/bucket/default/backup'
      2024-01-15T03:07:00.777-08:00 ERRO: (REST) (Attempt 1) (GET) Failed to perform request to 'http://172.23.136.138:8093/api/v1/bucket/default/backup': Get "http://172.23.136.138:8093/api/v1/bucket/default/backup": dial tcp 172.23.136.138:8093: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:30
      2024-01-15T03:07:00.777-08:00 WARN: (REST) (Attempt 1) (GET) Request to endpoint '/api/v1/bucket/default/backup' failed due to error: failed to perform request: Get "http://172.23.136.138:8093/api/v1/bucket/default/backup": dial tcp 172.23.136.138:8093: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:28
      2024-01-15T03:07:00.778-08:00 ERRO: (REST) (Attempt 1) (GET) Failed to perform request to 'http://172.23.136.138:8091/pools': Get "http://172.23.136.138:8091/pools": dial tcp 172.23.136.138:8091: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:30
      2024-01-15T03:07:00.778-08:00 WARN: (REST) (CCP) Failed to update config using host '172.23.136.138': failed to check if node is valid: failed to execute request: Get "http://172.23.136.138:8091/pools": dial tcp 172.23.136.138:8091: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:28
      2024-01-15T03:07:00.778-08:00 WARN: (REST) Failed to update cluster config, will retry: exhausted cluster nodes -- logging.ToolsCommonLogger.Log() at tools_common.go:28
      2024-01-15T03:07:10.700-08:00 ERRO: (REST) (Attempt 1) (GET) Failed to perform request to 'http://172.23.136.138:8091/pools': Get "http://172.23.136.138:8091/pools": dial tcp 172.23.136.138:8091: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:30
      2024-01-15T03:07:10.701-08:00 WARN: (REST) (CCP) Failed to update config using host '172.23.136.138': failed to check if node is valid: failed to execute request: Get "http://172.23.136.138:8091/pools": dial tcp 172.23.136.138:8091: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:28
      2024-01-15T03:07:10.701-08:00 WARN: (REST) Failed to update cluster config, will retry: exhausted cluster nodes -- logging.ToolsCommonLogger.Log() at tools_common.go:28 

      with the last three log lines repeating forever. The problem is that in shouldRetryWithError we wait for a config update by calling waitUntilUpdated. If the config is never received and the context is never cancelled, this loops forever. The context here comes from Execute and doesn’t have a timeout; the REST timeouts are set at a lower level.
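      The hang described above can be reproduced in miniature. The following is only a sketch: waitForConfig is a hypothetical stand-in for waitUntilUpdated (the real function's signature and retry interval are not shown in this ticket), and the failing fetch callback simulates a cluster that never returns a config. With a context that has no deadline, the loop never exits.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNoConfig mimics the "exhausted cluster nodes" failure from the logs.
var errNoConfig = errors.New("exhausted cluster nodes")

// waitForConfig is a hypothetical stand-in for waitUntilUpdated. It retries
// fetch until it succeeds or ctx is cancelled. If neither ever happens, it
// loops forever.
func waitForConfig(ctx context.Context, fetch func() error, interval time.Duration) error {
	for {
		if err := fetch(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
			// Retry after the interval.
		}
	}
}

// stillBlockedAfter reports whether waitForConfig is still running after
// windowMs milliseconds when the config never arrives and the context has
// no timeout. The background goroutine is deliberately left running.
func stillBlockedAfter(windowMs int) bool {
	fetch := func() error { return errNoConfig } // the cluster is unreachable
	done := make(chan struct{})
	go func() {
		// context.Background() has no deadline, like the context from Execute.
		_ = waitForConfig(context.Background(), fetch, 10*time.Millisecond)
		close(done)
	}()
	select {
	case <-done:
		return false
	case <-time.After(time.Duration(windowMs) * time.Millisecond):
		return true
	}
}

func main() {
	fmt.Println("still blocked:", stillBlockedAfter(200)) // prints "still blocked: true"
}
```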

      What is the customer impact?
      It is unlikely that a customer would hit this. They would need to never be able to get a config, which would suggest that either the cluster is down or the machine doing the backup is network partitioned from the entire cluster. Even if one of these events happens, fixing this issue would only make the backup exit.

      For this reason I've not pulled this into Trinity.

      What is the fix?
      We could:

      • add the request timeout at a higher level so that it is passed to the should-retry function
      • have a new timeout for waiting for the config update
      • have cbbackupmgr specify a top-level timeout (we’d need to read the REST timeouts and multiply them by the number of retries, so this feels a bit hacky)
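      As a rough illustration of the second option, the wait can be given its own deadline with context.WithTimeout. This is a sketch under assumptions, not the tools-common implementation: waitForConfig and waitForConfigWithTimeout are hypothetical names, and the interval and timeout values are arbitrary. The first option works the same way, except that the deadline would already be on the context handed down from the caller.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNoConfig mimics the "exhausted cluster nodes" failure from the logs.
var errNoConfig = errors.New("exhausted cluster nodes")

// waitForConfig is a hypothetical stand-in for waitUntilUpdated: it retries
// fetch until it succeeds or ctx is cancelled.
func waitForConfig(ctx context.Context, fetch func() error, interval time.Duration) error {
	for {
		if err := fetch(); err == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(interval):
			// Retry after the interval.
		}
	}
}

// waitForConfigWithTimeout bounds the wait with a dedicated timeout, so the
// caller gets an error back instead of hanging when no config ever arrives.
func waitForConfigWithTimeout(parent context.Context, fetch func() error, interval, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	return waitForConfig(ctx, fetch, interval)
}

// timesOut runs the bounded wait against a cluster that never responds and
// reports whether it gave up with context.DeadlineExceeded.
func timesOut(timeoutMs int) bool {
	fetch := func() error { return errNoConfig }
	err := waitForConfigWithTimeout(
		context.Background(),
		fetch,
		5*time.Millisecond,
		time.Duration(timeoutMs)*time.Millisecond,
	)
	return errors.Is(err, context.DeadlineExceeded)
}

func main() {
	fmt.Println("gave up after deadline:", timesOut(50)) // prints "gave up after deadline: true"
}
```

      With a bound like this, the caller sees a deadline error that it can report before exiting, which matches the expected outcome stated above: the fix would only make the backup exit.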


          People

            owend Daniel Owen
            Matt.Hall Matt Hall