Details
- Type: Bug
- Resolution: Unresolved
- Priority: Major
- Affects Version: 7.6.0
- Fix Version: None
- Triage: Untriaged
Description
What is the problem?
Gilad Kalchheim saw a test with large cbbackupmgr logs. They looked like:
2024-01-15T03:06:55.967-08:00 (Plan) (Query) Transferring Query metadata for bucket 'default'
2024-01-15T03:06:55.967-08:00 (REST) (Attempt 1) (GET) Dispatching request to 'http://172.23.136.138:8093/api/v1/bucket/default/backup'
2024-01-15T03:07:00.777-08:00 ERRO: (REST) (Attempt 1) (GET) Failed to perform request to 'http://172.23.136.138:8093/api/v1/bucket/default/backup': Get "http://172.23.136.138:8093/api/v1/bucket/default/backup": dial tcp 172.23.136.138:8093: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:30
2024-01-15T03:07:00.777-08:00 WARN: (REST) (Attempt 1) (GET) Request to endpoint '/api/v1/bucket/default/backup' failed due to error: failed to perform request: Get "http://172.23.136.138:8093/api/v1/bucket/default/backup": dial tcp 172.23.136.138:8093: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:28
2024-01-15T03:07:00.778-08:00 ERRO: (REST) (Attempt 1) (GET) Failed to perform request to 'http://172.23.136.138:8091/pools': Get "http://172.23.136.138:8091/pools": dial tcp 172.23.136.138:8091: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:30
2024-01-15T03:07:00.778-08:00 WARN: (REST) (CCP) Failed to update config using host '172.23.136.138': failed to check if node is valid: failed to execute request: Get "http://172.23.136.138:8091/pools": dial tcp 172.23.136.138:8091: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:28
2024-01-15T03:07:00.778-08:00 WARN: (REST) Failed to update cluster config, will retry: exhausted cluster nodes -- logging.ToolsCommonLogger.Log() at tools_common.go:28
2024-01-15T03:07:10.700-08:00 ERRO: (REST) (Attempt 1) (GET) Failed to perform request to 'http://172.23.136.138:8091/pools': Get "http://172.23.136.138:8091/pools": dial tcp 172.23.136.138:8091: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:30
2024-01-15T03:07:10.701-08:00 WARN: (REST) (CCP) Failed to update config using host '172.23.136.138': failed to check if node is valid: failed to execute request: Get "http://172.23.136.138:8091/pools": dial tcp 172.23.136.138:8091: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:28
2024-01-15T03:07:10.701-08:00 WARN: (REST) Failed to update cluster config, will retry: exhausted cluster nodes -- logging.ToolsCommonLogger.Log() at tools_common.go:28
with the last few logs repeating forever. The problem is in shouldRetryWithError: we wait for a config update by calling waitUntilUpdated. If the config is never received and the context is never cancelled, this loops forever. The context here comes from Execute, and it doesn't have a timeout; the REST timeouts are set at a lower level.
What is the customer impact?
It should be unlikely that a customer hits this. They would need to never be able to get a config, which suggests either the cluster is down or the machine performing the backup is network partitioned from the entire cluster. Even if one of these events happens, fixing this issue would only make the backup exit rather than hang.
For this reason I've not pulled this into Trinity.
What is the fix?
We could:
- add the request timeout at a higher level so it is passed down to the should-retry function
- add a new timeout for waiting for the config update
- have cbbackupmgr specify a top-level timeout (we'd need to read the REST timeouts and multiply them by the number of retries, so this feels a bit hacky)