Scheduled backups on the full cluster and on only_services as two tasks.
Observed that a few of the full-cluster backup tasks failed with the error "Stream request for vBucket 365 timed out after 3m0s":
"node_runs": [
{
"node_id": "2989c672ce3225a540a566ef716812d6",
"status": "failed",
"start": "2021-04-29T17:00:18.0714657-07:00",
"end": "2021-04-29T17:04:00.6665167-07:00",
"error": "exit status 1: failed to execute cluster operations: failed to execute bucket operation for bucket 'travel-sample': failed to transfer bucket data for bucket 'travel-sample': failed to transfer key value data: failed to transfer key value data: failed to open stream: failed to stream vBucket 365: client received unexpected error 'Stream request for vBucket 365 timed out after 3m0s'",
"progress": 0,
"stats": {
"error": "failed to execute cluster operations: failed to execute bucket operation for bucket 'travel-sample': failed to transfer bucket data for bucket 'travel-sample': failed to transfer key value data: failed to transfer key value data: failed to open stream: failed to stream vBucket 365: client received unexpected error 'Stream request for vBucket 365 timed out after 3m0s'",
"stats": {
"started_at": 1619740818323472100,
"buckets": {
"travel-sample": {
"estimated_total_items": 196499167,
"total_items": 17304970,
"total_vbuckets": 1024,
"vbuckets_complete": 915,
"bytes_received": 843303955,
"snapshot_markers_received": 915,
"snapshot_ends_received": 915,
"failover_logs_received": 915,
"mutations_received": 2306919,
"deletions_received": 5403716,
"started_at": 1619740852033783600,
"complete": false,
"errored": false
}
},
"complete": false
}
},
"error_code": 2
}
],
Note: Cluster has only one backup node (172.23.136.115)
Joe Mitchell Jones
added a comment - Closing now as unit tests are in place and passing.
Ashwin Govindarajulu, please feel free to reopen if this issue appears again.
Bryan McCoid
added a comment - The issue regarding the cbcollect can be fixed by using a more recent version. I've verified that it works in build 5076.
James Lee
added a comment - I believe Carlos Gonzalez Betancort had a power cut towards the end of the day, so I'm going to assign this to myself and get a patch up and ready to be reviewed.
James Lee
added a comment - Hi Carlos Gonzalez Betancort,
In theory the SDK should be refreshing its cluster config periodically for as long as it's connected to the cluster (via GCCCP), so we shouldn't need to force an update of the SDK's vBucket map; simply retrying the stream request should suffice for the SDK to send the request to the correct node in this case.
If this is in fact the issue we're hitting, then it should simply be a case of handling a timeout in our 'openStream' function and triggering a retry (for the benefit of others, our DCP client already supports retries, so it should just be a case of wiring up this use case).
Thanks,
James
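For illustration, a minimal Go sketch of the wiring described above: classify a stream-request timeout as retryable and hand the vBucket back to a retry queue with a small backoff, rather than tearing the whole transfer down. Every name in it (worker, openStream, requeue, ErrStreamRequestTimeout, maxStreamRetries) is a hypothetical stand-in, not the actual cbbackupmgr or gocbcore API.

package main

import (
	"errors"
	"fmt"
	"log"
	"time"
)

// ErrStreamRequestTimeout stands in for the timeout error the SDK surfaces.
var ErrStreamRequestTimeout = errors.New("stream request timed out")

const maxStreamRetries = 3

type vbStream struct {
	vbID    uint16
	retries int
}

// worker is a stand-in for the DCP async worker; the function fields let the
// sketch compile without the real client.
type worker struct {
	openStream func(vbID uint16) error // one attempt against whichever node the cached vBucket map points at
	requeue    func(s *vbStream)       // hands the vBucket back to the existing retry queue
}

// openStreamWithRetry performs one open attempt and, on timeout, requeues the
// vBucket with backoff instead of propagating a fatal error.
func (w *worker) openStreamWithRetry(s *vbStream) error {
	err := w.openStream(s.vbID)
	if err == nil {
		return nil
	}

	if errors.Is(err, ErrStreamRequestTimeout) && s.retries < maxStreamRetries {
		s.retries++
		// The backoff gives the SDK's periodic config poll (GCCCP) a chance
		// to pick up the post-rebalance vBucket map before the next attempt.
		time.Sleep(time.Duration(s.retries) * time.Second)
		w.requeue(s)
		return nil
	}

	return fmt.Errorf("failed to stream vBucket %d: %w", s.vbID, err)
}

func main() {
	attempts := 0
	w := &worker{
		openStream: func(vbID uint16) error {
			attempts++
			if attempts < 2 {
				return ErrStreamRequestTimeout // first attempt hits the stale node
			}
			return nil
		},
	}
	w.requeue = func(s *vbStream) {
		if err := w.openStreamWithRetry(s); err != nil {
			log.Fatal(err)
		}
	}

	if err := w.openStreamWithRetry(&vbStream{vbID: 716}); err != nil {
		log.Fatal(err)
	}
	fmt.Println("vBucket 716 streamed after", attempts, "attempts")
}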
Carlos Gonzalez Betancort (Inactive)
added a comment (edited) - I've been having a look at this. It seems like the error occurs three times, as pointed out by one of the images:
╰─$ cbm_timeline.py backup-0.log
Action | Version | #Line | Start | End | Status | Options
---------------------------------------------------------------------------------------------------------------------------------
config | 7.0.0-5060 | 2 | 2021-04-29T00:05:09.270000-0700 | 2021-04-29T00:05:09.273000-0700 | OK |
backup | 7.0.0-5060 | 6 | 2021-04-29T02:00:13.076000-0700 | 2021-04-29T02:02:58.568000-0700 | OK |
backup | 7.0.0-5060 | 2137 | 2021-04-29T03:00:13.405000-0700 | 2021-04-29T03:02:09.399000-0700 | OK |
backup | 7.0.0-5060 | 4268 | 2021-04-29T04:00:13.523000-0700 | 2021-04-29T04:00:50.818000-0700 | OK |
backup | 7.0.0-5060 | 6397 | 2021-04-29T05:00:13.376000-0700 | 2021-04-29T05:00:47.080000-0700 | OK |
backup | 7.0.0-5060 | 8527 | 2021-04-29T06:00:14.373000-0700 | 2021-04-29T06:00:47.814000-0700 | OK |
merge | 7.0.0-5060 | 10657 | 2021-04-29T06:30:14.580000-0700 | 2021-04-29T06:33:20.470000-0700 | OK |
backup | 7.0.0-5060 | 15921 | 2021-04-29T07:00:14.740000-0700 | 2021-04-29T07:00:50.407000-0700 | OK |
backup | 7.0.0-5060 | 18051 | 2021-04-29T08:00:15.049000-0700 | 2021-04-29T08:00:46.624000-0700 | OK |
backup | 7.0.0-5060 | 20181 | 2021-04-29T09:00:15.407000-0700 | 2021-04-29T09:00:47.752000-0700 | OK |
backup | 7.0.0-5060 | 22311 | 2021-04-29T10:00:15.711000-0700 | 2021-04-29T10:00:48.152000-0700 | OK |
backup | 7.0.0-5060 | 24441 | 2021-04-29T11:00:15.945000-0700 | 2021-04-29T11:00:50.744000-0700 | OK |
backup | 7.0.0-5060 | 26571 | 2021-04-29T12:00:16.194000-0700 | 2021-04-29T12:01:27.654000-0700 | OK |
merge | 7.0.0-5060 | 28701 | 2021-04-29T12:30:16.224000-0700 | 2021-04-29T12:35:30.546000-0700 | OK |
backup | 7.0.0-5060 | 36067 | 2021-04-29T12:38:03.441000-0700 | 2021-04-29T12:39:51.102000-0700 | OK |
backup | 7.0.0-5060 | 38198 | 2021-04-29T13:00:15.923000-0700 | 2021-04-29T13:03:57.207000-0700 | OK |
backup | 7.0.0-5060 | 40328 | 2021-04-29T14:00:17.168000-0700 | 2021-04-29T14:03:58.578000-0700 | Failed |
backup | 7.0.0-5060 | 42299 | 2021-04-29T15:00:17.478000-0700 | 2021-04-29T15:03:50.303000-0700 | OK |
backup | 7.0.0-5060 | 44429 | 2021-04-29T16:00:17.853000-0700 | 2021-04-29T16:03:58.232000-0700 | Failed |
backup | 7.0.0-5060 | 46219 | 2021-04-29T17:00:18.246000-0700 | 2021-04-29T17:03:59.162000-0700 | Failed |
backup | 7.0.0-5060 | 48131 | 2021-04-29T18:00:18.547000-0700 | 2021-04-29T18:05:05.065000-0700 | OK |
merge | 7.0.0-5060 | 50261 | 2021-04-29T18:30:18.681000-0700 | 2021-04-29T18:33:48.683000-0700 | OK |
backup | 7.0.0-5060 | 54474 | 2021-04-29T19:00:18.925000-0700 | 2021-04-29T19:02:10.066000-0700 | OK |
backup | 7.0.0-5060 | 56604 | 2021-04-29T20:00:19.159000-0700 | 2021-04-29T20:01:53.617000-0700 | OK |
The first failure is the 14:00 backup, which hit a timeout on vb:716. This would indicate that KV did not respond to our open-stream request within 3 minutes.
Looking at the cbbackupmgr logs, we can see that at 14:00:59 we opened the stream for vBucket 716:
backup-0.log
2021-04-29T14:00:59.270-07:00 (DCP) (travel-sample) (vb 716) Creating DCP stream | {"uuid":117585510587600,"start_seqno":348700,"end_seqno":448028,"snap_start":255410,"snap_end":348700,"retries":1}
2021-04-29T14:03:58.541-07:00 WARN: (DCP) (travel-sample) (vb 716) Received unexpected error 'Stream request for vBucket 716 timed out after 3m0s' on stream | {"uuid":117585510587600,"snap_start":255410,"snap_end":348700,"snap_complete":true,"last_seqno":348700,"retries":0} -- couchbase.(*DCPAsyncWorker).openStream() at dcp_async_worker.go:193
2021-04-29T14:03:58.541-07:00 WARN: (DCP) (travel-sample) (vb 716) Received an unexpected error whilst streaming, beginning teardown: failed to open stream: failed to stream vBucket 716: client received unexpected error 'Stream request for vBucket 716 timed out after 3m0s' -- couchbase.(*DCPAsyncWorker).handleDCPError() at dcp_async_worker.go:561
Looking at the logs, at that time the node ending in 107 should have held the active copy of vBucket 716:
memcached.log
╰─$ ag "vb:716" 107_logs/cbcollect_info__20210430-033954/memcached.log | ag "setState"
128591:2021-04-29T13:53:25.680858-07:00 INFO (travel-sample) VBucket::setState: transitioning vb:716 with high seqno:0 from:replica to:replica
136521:2021-04-29T14:00:55.530815-07:00 INFO (travel-sample) VBucket::setState: transitioning vb:716 with high seqno:448633 from:replica to:pending
136522:2021-04-29T14:00:55.537555-07:00 INFO (travel-sample) VBucket::setState: transitioning vb:716 with high seqno:448634 from:pending to:active
141777:2021-04-29T14:05:45.404593-07:00 INFO (travel-sample) VBucket::setState: transitioning vb:716 with high seqno:457500 from:active to:active meta:{"topology":[["ns_1@172.23.136.107","ns_1@172.23.136.106"]]}
...
Interestingly, it moved from pending to active just a couple of seconds before we started the stream. If we look in the same memcached log for the stream open for vb 716, we can see that it is not there:
ag "T14:.*cbbackupmgr.*\(vb:716\) Creating" 107_logs/cbcollect_info__20210430-033954/memcached.log
For reference, we can find some of the other vBucket stream opens that did work, e.g.:
ag "T14:.*cbbackupmgr.*\(vb:738\) Creating" 107_logs/cbcollect_info__20210430-033954/memcached.log
136617:2021-04-29T14:00:56.722060-07:00 INFO 8104: (travel-sample) DCP (Producer) eq_dcpq:cbbackupmgr_2021-04-29T14:00:51-07:00_96500_4 - (vb:738) Creating stream with start seqno 353928 and end seqno 452322; requested end seqno was 452322, collections-filter-size:61 sid:none
So the most likely scenario seems to be that we sent the request to the node that previously held the vBucket in the active state.
Looking at the logs, I can see that all of the failed backups occurred during a rebalance that took quite a while:
node 105 ns_server.info.log
[user:info,2021-04-29T12:43:38.139-07:00,ns_1@172.23.136.106:<0.4856.11>:ns_orchestrator:idle:773]Starting rebalance, KeepNodes = ['ns_1@172.23.136.114','ns_1@172.23.136.106',
'ns_1@172.23.136.107','ns_1@172.23.138.127',
'ns_1@172.23.136.108','ns_1@172.23.136.115',
'ns_1@172.23.136.113','ns_1@172.23.136.110',
'ns_1@172.23.136.105'], EjectNodes = ['ns_1@172.23.136.112'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 12b8ca27d5f5cab819aae6e81d8b4eaf
[rebalance:info,2021-04-29T12:43:38.142-07:00,ns_1@172.23.136.106:<0.27630.116>:ns_rebalancer:drop_old_2i_indexes:1274]Going to drop possible old 2i indexes on nodes ['ns_1@172.23.136.107']
....
'ns_1@172.23.136.110',
'ns_1@172.23.136.113',
'ns_1@172.23.136.114',
'ns_1@172.23.136.115',
'ns_1@172.23.138.127']
[user:info,2021-04-29T17:32:28.716-07:00,ns_1@172.23.136.106:<0.4856.11>:ns_orchestrator:log_rebalance_completion:1405]Rebalance completed successfully.
Rebalance Operation Id = 12b8ca27d5f5cab819aae6e81d8b4eaf
This rebalance ran between 12:43:38 and 17:32:28, so the most likely scenario is that at the time of the failures the vBuckets were being moved and we tried to connect to the old node, which failed.
It is also worth noting that the vBucket we failed to open was originally on node 172.23.136.112, which is the node being ejected. We can see this from the vBucket map in the rebalance message on node 105 (not reproduced here as it is massive): entry 716 is ['ns_1@172.23.136.112', 'ns_1@172.23.136.106'].
We can also see that the entry in the target vBucket map for vBucket 716 is ['ns_1@172.23.136.107', 'ns_1@172.23.136.106'], which matches what we see in the logs for node 107.
Furthermore, we can observe how vBucket 716 transitions between states around the time of the open stream:
[rebalance:info,2021-04-29T13:53:25.841-07:00,ns_1@172.23.136.106:<0.15649.129>:janitor_agent:bulk_set_vbucket_state:379]Doing bulk vbucket 716 state change
[{'ns_1@172.23.136.106',replica,undefined,'ns_1@172.23.136.112'},
{'ns_1@172.23.136.107',replica,passive,'ns_1@172.23.136.112'}]
[rebalance:info,2021-04-29T13:53:25.841-07:00,ns_1@172.23.136.106:<0.14703.129>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.106',
{update_vbucket_state,716,replica,undefined,
'ns_1@172.23.136.112'}}
[rebalance:info,2021-04-29T13:53:25.841-07:00,ns_1@172.23.136.106:<0.15876.129>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.107',
{update_vbucket_state,716,replica,passive,
'ns_1@172.23.136.112'}}
....
[rebalance:info,2021-04-29T13:56:48.454-07:00,ns_1@172.23.136.106:<0.1157.130>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.112',
{update_vbucket_state,716,active,paused,
undefined,
[['ns_1@172.23.136.112',
'ns_1@172.23.136.106']]}}
....
[rebalance:info,2021-04-29T13:58:55.934-07:00,ns_1@172.23.136.106:<0.11937.130>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.112',
{update_vbucket_state,716,active,paused,
undefined,
[['ns_1@172.23.136.112',
'ns_1@172.23.136.106'],
...
[rebalance:info,2021-04-29T14:00:55.701-07:00,ns_1@172.23.136.106:<0.23319.130>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.107',
{update_vbucket_state,716,active,undefined,
undefined,undefined}}
[rebalance:info,2021-04-29T14:00:55.704-07:00,ns_1@172.23.136.106:<0.23184.130>:janitor_agent:bulk_set_vbucket_state:379]Doing bulk vbucket 716 state change
[{'ns_1@172.23.136.106',replica,undefined,undefined}]
[rebalance:info,2021-04-29T14:00:55.705-07:00,ns_1@172.23.136.106:<0.23479.130>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.106',
{update_vbucket_state,716,replica,undefined,
undefined}}
....
[rebalance:info,2021-04-29T14:05:45.547-07:00,ns_1@172.23.136.106:<0.15240.131>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.107',
{update_vbucket_state,716,active,undefined,
undefined,
[['ns_1@172.23.136.107',
'ns_1@172.23.136.106']]}}
[rebalance:info,2021-04-29T14:05:45.560-07:00,ns_1@172.23.136.106:<0.15240.131>:janitor_agent:bulk_set_vbucket_state:379]Doing bulk vbucket 716 state change
[{'ns_1@172.23.136.106',replica,undefined,'ns_1@172.23.136.107'}]
[rebalance:info,2021-04-29T14:05:45.560-07:00,ns_1@172.23.136.106:<0.15727.131>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.106',
{update_vbucket_state,716,replica,undefined,
'ns_1@172.23.136.107'}}
Seeing how the vBucket state was still transitioning while we were opening the streams, my hypothesis is that we most likely connected to the old server, which did not respond at all, and that caused the timeout. (We cannot confirm this theory, as gocbcore does not log which node we tried to connect to and we don't have any logs for node 112; verifying it would require a TCP dump from the time it happened.) Assuming this is the case, the fix would be to retry opening the streams a couple of times after a timeout. Note that this requires gocbcore to refresh the cluster map before the next attempt, otherwise we would just get the same behaviour again.
James Lee, what are your thoughts? I think just retrying would probably work, but I don't remember if there is any way to force an update of the vBucket map in gocbcore.
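To make the "refresh the cluster map before the next try" requirement concrete, here is a hedged Go sketch that waits for the client's cached vBucket map to report a new owner for the vBucket before retrying, or gives up after a deadline. The vbucketOwner accessor is purely hypothetical; gocbcore's real API for reading or forcing a config refresh may look quite different. If the map never moves within the deadline, retrying immediately would just hit the same node and time out again, which matches the behaviour described above.

package main

import (
	"fmt"
	"time"
)

// vbucketOwner would return the node currently mapped as active for vbID
// according to the client's cached cluster config (hypothetical accessor).
type vbucketOwner func(vbID uint16) string

// waitForNewOwner polls the cached map until the active node for vbID changes
// from staleNode, giving up after the deadline. It returns the node to target
// on the retry and whether the map actually moved.
func waitForNewOwner(owner vbucketOwner, vbID uint16, staleNode string, deadline time.Duration) (string, bool) {
	timeout := time.After(deadline)
	tick := time.NewTicker(500 * time.Millisecond)
	defer tick.Stop()

	for {
		select {
		case <-timeout:
			return staleNode, false // map never moved; retrying now would hit the same node
		case <-tick.C:
			if node := owner(vbID); node != staleNode {
				return node, true
			}
		}
	}
}

func main() {
	// Simulated map: ownership of vb 716 moves from .112 to .107 after ~1s,
	// mirroring the transition seen in the memcached logs.
	start := time.Now()
	owner := func(vbID uint16) string {
		if time.Since(start) > time.Second {
			return "ns_1@172.23.136.107"
		}
		return "ns_1@172.23.136.112"
	}

	node, moved := waitForNewOwner(owner, 716, "ns_1@172.23.136.112", 5*time.Second)
	fmt.Printf("retry vb 716 against %s (map updated: %v)\n", node, moved)
}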
Carlos Gonzalez Betancort (Inactive)
added a comment - Some of the cbcollect information seems to be missing and it looks to be the same issue as MB-46003 . Will keep looking into the vbucket timeout issue now.