Description
What's the problem?
When using 'WaitUntilReady' to wait for the gocbcore agent to connect to the cluster, we are seeing an 'unambiguous timeout' error when, in reality, the server disconnected us because we were using the (unmerged in CC) 'backfill_order' control flag.
What do we expect to see?
When we get disconnected from the server, the error should be bubbled up to cbbackupmgr so that it can be handled correctly and returned to the user. I imagine this isn't the only case in which a timeout masks an error that has occurred behind the scenes.
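To make the point concrete, here is a minimal sketch of the connection wait as I understand it, written against the plain gocbcore v9 Agent for simplicity (the package name, function name and timeout handling are mine, not cbbackupmgr's actual code):

package sketch

import (
	"fmt"
	"time"

	gocbcore "github.com/couchbase/gocbcore/v9"
)

// waitForAgent blocks until the agent reports itself ready or the deadline
// expires. With the current behavior, a server-side disconnection (such as
// the backfill_order rejection described above) never reaches this code; the
// only error we ever observe is the timeout.
func waitForAgent(agent *gocbcore.Agent, timeout time.Duration) error {
	errCh := make(chan error, 1)

	_, err := agent.WaitUntilReady(
		time.Now().Add(timeout),
		gocbcore.WaitUntilReadyOptions{},
		func(_ *gocbcore.WaitUntilReadyResult, err error) {
			errCh <- err
		},
	)
	if err != nil {
		return err
	}

	// This is where the 'unambiguous timeout' surfaces, which cbbackupmgr can
	// only report as "operation has timed out", masking the real cause.
	if err := <-errCh; err != nil {
		return fmt.Errorf("failed to connect to cluster: %w", err)
	}

	return nil
}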
Steps to reproduce
Patrick Varley has commented a concise set of steps needed to reproduce this issue with cbbackupmgr on MB-39653, but to briefly recap:
1) Install CC build 2208 onto a CentOS 7 Vagrant VM
2) Configure a one-node cluster with only the data service
3) Create a bucket
4) Load some data into the bucket using cbworkloadgen
5) Run a backup
If we look in the memcached logs, we will see:
2020-05-29T18:17:45.505412+00:00 INFO 44: DCP connection opened successfully. PRODUCER, INCLUDE_XATTRS [ [::1]:57896 - [::1]:11210 (<ud>Administrator</ud>) ]
2020-05-29T18:17:45.505588+00:00 WARNING 44: (default) DCP (Producer) eq_dcpq:cbbackupmgr_2020-05-29T18:17:20Z_19653_0 - Invalid ctrl parameter 'sequential' for backfill_order
2020-05-29T18:17:45.505734+00:00 INFO 44: (No Engine) DCP (Producer) eq_dcpq:cbbackupmgr_2020-05-29T18:17:20Z_19653_0 - Removing connection [ [::1]:57896 - [::1]:11210 (<ud>Administrator</ud>) ]
However, cbbackupmgr will display:
/opt/couchbase/bin/cbbackupmgr backup -u Administrator -p password -c localhost -a backup -r MB-39653
Backing up to '2020-05-29T18_17_20.039976728Z'
Copying at 0B/s (about 0s remaining) - Transferring key value data for 'default' 0 items / 0B
[===============================================================================================================================================================================================================================================================================] 100.00%
Error backing up cluster: operation has timed out
Backed up bucket "default" failed
Mutations backed up: 0, Mutations failed to backup: 0
Deletions backed up: 0, Deletions failed to backup: 0
Skipped due to purge number or conflict resolution: Mutations: 0 Deletions: 0
I have since recalled a conversation with Charles Dixon in which he said: "The 1 thing that you guys might care about is we don’t return an error on connect anymore. We’ll keep trying to connect under the hood until the agent is closed." So I imagine the "issue" I've described above is actually the expected behavior. However, I don't think this is the correct way to handle connecting to the node: there will always be some errors that should be bubbled up to the user, and a blanket timeout error is not particularly helpful to them. I might be wrong, though, because the 'WaitUntilReady' callback does accept an error.
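For reference, this is roughly the information available on our side of that callback today (assuming gocbcore v9's WaitUntilReadyResult and ErrUnambiguousTimeout; handleWaitUntilReady is a hypothetical handler, not existing cbbackupmgr code):

package sketch

import (
	"errors"
	"log"

	gocbcore "github.com/couchbase/gocbcore/v9"
)

// handleWaitUntilReady illustrates that the callback already accepts an
// error, so in principle gocbcore could deliver the underlying cause of a
// dropped connection through it rather than letting the deadline expire.
func handleWaitUntilReady(_ *gocbcore.WaitUntilReadyResult, err error) {
	switch {
	case err == nil:
		log.Println("agent ready")
	case errors.Is(err, gocbcore.ErrUnambiguousTimeout):
		// All we currently learn is that the deadline expired; the real
		// reason (the server removing the DCP connection) is lost.
		log.Printf("timed out waiting for agent: %v", err)
	default:
		// Ideally a fatal server-side error would land here so that
		// cbbackupmgr could return something actionable to the user.
		log.Printf("failed to connect: %v", err)
	}
}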