Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: Morpheus
Affects Version/s: 7.6.0
Component/s: tools, tools-common
Labels: None
Triage: Untriaged
Story Points: 0
Is this a Regression?: No

Description

What is the problem?

Gilad Kalchheim saw a test with large cbbackupmgr logs. They looked like:

2024-01-15T03:06:55.967-08:00 (Plan) (Query) Transferring Query metadata for bucket 'default'

2024-01-15T03:06:55.967-08:00 (REST) (Attempt 1) (GET) Dispatching request to 'http://172.23.136.138:8093/api/v1/bucket/default/backup'

2024-01-15T03:07:00.777-08:00 ERRO: (REST) (Attempt 1) (GET) Failed to perform request to 'http://172.23.136.138:8093/api/v1/bucket/default/backup': Get "http://172.23.136.138:8093/api/v1/bucket/default/backup": dial tcp 172.23.136.138:8093: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:30

2024-01-15T03:07:00.777-08:00 WARN: (REST) (Attempt 1) (GET) Request to endpoint '/api/v1/bucket/default/backup' failed due to error: failed to perform request: Get "http://172.23.136.138:8093/api/v1/bucket/default/backup": dial tcp 172.23.136.138:8093: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:28

2024-01-15T03:07:00.778-08:00 ERRO: (REST) (Attempt 1) (GET) Failed to perform request to 'http://172.23.136.138:8091/pools': Get "http://172.23.136.138:8091/pools": dial tcp 172.23.136.138:8091: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:30

2024-01-15T03:07:00.778-08:00 WARN: (REST) (CCP) Failed to update config using host '172.23.136.138': failed to check if node is valid: failed to execute request: Get "http://172.23.136.138:8091/pools": dial tcp 172.23.136.138:8091: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:28

2024-01-15T03:07:00.778-08:00 WARN: (REST) Failed to update cluster config, will retry: exhausted cluster nodes -- logging.ToolsCommonLogger.Log() at tools_common.go:28

2024-01-15T03:07:10.700-08:00 ERRO: (REST) (Attempt 1) (GET) Failed to perform request to 'http://172.23.136.138:8091/pools': Get "http://172.23.136.138:8091/pools": dial tcp 172.23.136.138:8091: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:30

2024-01-15T03:07:10.701-08:00 WARN: (REST) (CCP) Failed to update config using host '172.23.136.138': failed to check if node is valid: failed to execute request: Get "http://172.23.136.138:8091/pools": dial tcp 172.23.136.138:8091: connect: connection refused -- logging.ToolsCommonLogger.Log() at tools_common.go:28

2024-01-15T03:07:10.701-08:00 WARN: (REST) Failed to update cluster config, will retry: exhausted cluster nodes -- logging.ToolsCommonLogger.Log() at tools_common.go:28

with the last three log lines (the failed GET to /pools and the two config-update warnings) repeating forever. The problem is that shouldRetryWithError waits for a config update by calling waitUntilUpdated. If the config is never received and the context is never cancelled, this loops forever. The context here comes from Execute and has no timeout; the REST timeouts are set at a lower level.
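The failure mode can be sketched in a few lines of Go. This is not the tools-common code: fetchConfig, waitForConfig and stillWaitingAfter are hypothetical names, and fetchConfig is hard-wired to fail, as if every node refused the connection. It only shows that a wait loop whose sole exits are "config received" and "context done" cannot return when neither ever happens.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// errNoConfig stands in for the "exhausted cluster nodes" failure in the logs.
var errNoConfig = errors.New("exhausted cluster nodes")

// fetchConfig is a hypothetical stand-in for one attempt to refresh the
// cluster config. It always fails here.
func fetchConfig() error { return errNoConfig }

// waitForConfig retries fetchConfig until it succeeds or ctx is done.
// Those are the only two exits, so a ctx that is never cancelled and has
// no deadline means the loop never returns.
func waitForConfig(ctx context.Context, interval time.Duration) (attempts int, err error) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		attempts++
		if fetchConfig() == nil {
			return attempts, nil
		}
		select {
		case <-ctx.Done():
			return attempts, ctx.Err()
		case <-ticker.C:
			// Retry on the next tick.
		}
	}
}

// stillWaitingAfter reports whether waitForConfig, given a context without
// a deadline, is still running once d has elapsed.
func stillWaitingAfter(d time.Duration) bool {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel() // releases the goroutine once the demo is over

	done := make(chan struct{})
	go func() {
		waitForConfig(ctx, 5*time.Millisecond)
		close(done)
	}()

	select {
	case <-done:
		return false
	case <-time.After(d):
		return true
	}
}

func main() {
	fmt.Println("still waiting after 100ms:", stillWaitingAfter(100*time.Millisecond))
}
```

In this sketch the wait only ends because the demo cancels the context itself, which is what the context passed down from Execute never does.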

What is the customer impact?
It is unlikely that a customer would hit this. They would have to be permanently unable to get a config, which suggests that either the cluster is down or the machine running the backup is network-partitioned from the entire cluster. Even if one of these events happens, fixing this issue would only make the backup exit rather than hang.

For this reason I've not pulled this into Trinity.

What is the fix?
We could:

- add the request timeout at a higher level so that it is passed to the should-retry function (shouldRetryWithError)
- add a new timeout for waiting for the config update
- have cbbackupmgr specify a top-level timeout (we’d need to read the REST timeouts and multiply them by the number of retries, so this feels a bit hacky)
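As a rough illustration of the second and third options, the sketch below bounds the config wait with context.WithTimeout and computes a top-level budget from a per-request timeout and an attempt count. All names (fetchConfig, waitForConfigBounded, topLevelTimeout, demoBoundedWait) and all durations are assumptions made for the example, not the tools-common API or its real settings.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// fetchConfig is a hypothetical stand-in for one config refresh attempt.
// It always fails, as if the whole cluster were unreachable.
func fetchConfig() error { return errors.New("exhausted cluster nodes") }

// waitForConfigBounded retries fetchConfig but gives up after maxWait,
// even when the parent context carries no deadline of its own.
func waitForConfigBounded(parent context.Context, interval, maxWait time.Duration) error {
	ctx, cancel := context.WithTimeout(parent, maxWait)
	defer cancel()

	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		if fetchConfig() == nil {
			return nil
		}
		select {
		case <-ctx.Done():
			return fmt.Errorf("gave up waiting for cluster config: %w", ctx.Err())
		case <-ticker.C:
			// Retry on the next tick.
		}
	}
}

// topLevelTimeout sketches the third option: an overall budget derived
// from the per-request timeout and the number of attempts.
func topLevelTimeout(perRequest time.Duration, attempts int) time.Duration {
	return perRequest * time.Duration(attempts)
}

// demoBoundedWait runs the bounded wait against a parent context that has
// no deadline, mirroring the context described in this issue.
func demoBoundedWait() error {
	return waitForConfigBounded(context.Background(), 5*time.Millisecond, 50*time.Millisecond)
}

func main() {
	err := demoBoundedWait()
	fmt.Println(err)
	fmt.Println("deadline exceeded:", errors.Is(err, context.DeadlineExceeded))
	fmt.Println("top-level budget:", topLevelTimeout(30*time.Second, 3))
}
```

Wrapping the error with %w keeps context.DeadlineExceeded visible to callers, so a caller could tell a config-wait timeout apart from other failures. The first option would amount to deriving the same kind of deadline-bearing context higher up, before it reaches the retry logic.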

People

Assignee: Daniel Owen
Reporter: Matt Hall
Votes: 0
Watchers: 3

Dates

Created: 15/Jan/24 4:34 AM
Updated: 15/Jan/24 4:34 AM

Gerrit Reviews

There are no open Gerrit changes

[CBM] REST request can hang indefinitely if we fail and then can't retrieve the cluster config
