Couchbase Server / MB-46013

[Windows] cbbackupmgr failed due to "Stream request for vBucket 365 timed out after 3m0s"


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Cheshire-Cat
    • 7.0.0
    • Component: tools
    • Environment: Enterprise Edition 7.0.0 build 5060

    Description

      Build: 7.0.0 - 5060

      Steps:

      • Cluster with index, cbas, fts and eventing services actively working on bucket data
      • Constantly running rebalances (in/out/swap) of kv / n1ql:index / cbas nodes
      • Scheduled backup of the full cluster and of only_services as two separate tasks.

      Observed that a few cluster backup tasks failed with the reason "Stream request for vBucket 365 timed out after 3m0s":

      "node_runs": [
          {
            "node_id": "2989c672ce3225a540a566ef716812d6",
            "status": "failed",
            "start": "2021-04-29T17:00:18.0714657-07:00",
            "end": "2021-04-29T17:04:00.6665167-07:00",
            "error": "exit status 1: failed to execute cluster operations: failed to execute bucket operation for bucket 'travel-sample': failed to transfer bucket data for bucket 'travel-sample': failed to transfer key value data: failed to transfer key value data: failed to open stream: failed to stream vBucket 365: client received unexpected error 'Stream request for vBucket 365 timed out after 3m0s'",
            "progress": 0,
            "stats": {
              "error": "failed to execute cluster operations: failed to execute bucket operation for bucket 'travel-sample': failed to transfer bucket data for bucket 'travel-sample': failed to transfer key value data: failed to transfer key value data: failed to open stream: failed to stream vBucket 365: client received unexpected error 'Stream request for vBucket 365 timed out after 3m0s'",
              "stats": {
                "started_at": 1619740818323472100,
                "buckets": {
                  "travel-sample": {
                    "estimated_total_items": 196499167,
                    "total_items": 17304970,
                    "total_vbuckets": 1024,
                    "vbuckets_complete": 915,
                    "bytes_received": 843303955,
                    "snapshot_markers_received": 915,
                    "snapshot_ends_received": 915,
                    "failover_logs_received": 915,
                    "mutations_received": 2306919,
                    "deletions_received": 5403716,
                    "started_at": 1619740852033783600,
                    "complete": false,
                    "errored": false
                  }
                },
                "complete": false
              }
            },
            "error_code": 2
          }
        ],
            

      Note: Cluster has only one backup node (172.23.136.115)

      Logs from backup folder: https://cb-engineering.s3.amazonaws.com/cbbackup_failure/backup_logs.tar.gz

      Attachments

        1. backup_failures.png
          497 kB
          Ashwin Govindarajulu
        2. backup_repos.png
          550 kB
          Ashwin Govindarajulu
        3. backupmgr_error.png
          559 kB
          Ashwin Govindarajulu


          Activity

            Some of the cbcollect information seems to be missing and it looks to be the same issue as MB-46003. Will keep looking into the vbucket timeout issue now.

             carlos.gonzalez Carlos Gonzalez Betancort (Inactive) added a comment -

            I've been having a look at this. It seems like the error occurs thrice as pointed out by one of the images:

            ╰─$ cbm_timeline.py backup-0.log   
            Action    | Version     | #Line   | Start                              | End                                | Status    | Options                                                                                                                            
            ---------------------------------------------------------------------------------------------------------------------------------                                                                                                                            
            config    | 7.0.0-5060  | 2       | 2021-04-29T00:05:09.270000-0700    | 2021-04-29T00:05:09.273000-0700    | OK        |
            backup    | 7.0.0-5060  | 6       | 2021-04-29T02:00:13.076000-0700    | 2021-04-29T02:02:58.568000-0700    | OK        |
            backup    | 7.0.0-5060  | 2137    | 2021-04-29T03:00:13.405000-0700    | 2021-04-29T03:02:09.399000-0700    | OK        |
            backup    | 7.0.0-5060  | 4268    | 2021-04-29T04:00:13.523000-0700    | 2021-04-29T04:00:50.818000-0700    | OK        |
            backup    | 7.0.0-5060  | 6397    | 2021-04-29T05:00:13.376000-0700    | 2021-04-29T05:00:47.080000-0700    | OK        |
            backup    | 7.0.0-5060  | 8527    | 2021-04-29T06:00:14.373000-0700    | 2021-04-29T06:00:47.814000-0700    | OK        |
            merge     | 7.0.0-5060  | 10657   | 2021-04-29T06:30:14.580000-0700    | 2021-04-29T06:33:20.470000-0700    | OK        |
            backup    | 7.0.0-5060  | 15921   | 2021-04-29T07:00:14.740000-0700    | 2021-04-29T07:00:50.407000-0700    | OK        |
            backup    | 7.0.0-5060  | 18051   | 2021-04-29T08:00:15.049000-0700    | 2021-04-29T08:00:46.624000-0700    | OK        |
            backup    | 7.0.0-5060  | 20181   | 2021-04-29T09:00:15.407000-0700    | 2021-04-29T09:00:47.752000-0700    | OK        |
            backup    | 7.0.0-5060  | 22311   | 2021-04-29T10:00:15.711000-0700    | 2021-04-29T10:00:48.152000-0700    | OK        |
            backup    | 7.0.0-5060  | 24441   | 2021-04-29T11:00:15.945000-0700    | 2021-04-29T11:00:50.744000-0700    | OK        |
            backup    | 7.0.0-5060  | 26571   | 2021-04-29T12:00:16.194000-0700    | 2021-04-29T12:01:27.654000-0700    | OK        |
            merge     | 7.0.0-5060  | 28701   | 2021-04-29T12:30:16.224000-0700    | 2021-04-29T12:35:30.546000-0700    | OK        |
            backup    | 7.0.0-5060  | 36067   | 2021-04-29T12:38:03.441000-0700    | 2021-04-29T12:39:51.102000-0700    | OK        |
            backup    | 7.0.0-5060  | 38198   | 2021-04-29T13:00:15.923000-0700    | 2021-04-29T13:03:57.207000-0700    | OK        |
            backup    | 7.0.0-5060  | 40328   | 2021-04-29T14:00:17.168000-0700    | 2021-04-29T14:03:58.578000-0700    | Failed    |
            backup    | 7.0.0-5060  | 42299   | 2021-04-29T15:00:17.478000-0700    | 2021-04-29T15:03:50.303000-0700    | OK        |
            backup    | 7.0.0-5060  | 44429   | 2021-04-29T16:00:17.853000-0700    | 2021-04-29T16:03:58.232000-0700    | Failed    |
            backup    | 7.0.0-5060  | 46219   | 2021-04-29T17:00:18.246000-0700    | 2021-04-29T17:03:59.162000-0700    | Failed    |
            backup    | 7.0.0-5060  | 48131   | 2021-04-29T18:00:18.547000-0700    | 2021-04-29T18:05:05.065000-0700    | OK        |
            merge     | 7.0.0-5060  | 50261   | 2021-04-29T18:30:18.681000-0700    | 2021-04-29T18:33:48.683000-0700    | OK        |
            backup    | 7.0.0-5060  | 54474   | 2021-04-29T19:00:18.925000-0700    | 2021-04-29T19:02:10.066000-0700    | OK        |
            backup    | 7.0.0-5060  | 56604   | 2021-04-29T20:00:19.159000-0700    | 2021-04-29T20:01:53.617000-0700    | OK        |
            

             The first failure is at 14:00:59, which is a timeout on vb:716. This would indicate that KV did not respond to our open stream request within 3 minutes.

             Looking at the cbbackupmgr logs, we can see that at 14:00:59 we opened the stream for vBucket 716:

            backup-0.log

            2021-04-29T14:00:59.270-07:00 (DCP) (travel-sample) (vb 716) Creating DCP stream | {"uuid":117585510587600,"start_seqno":348700,"end_seqno":448028,"snap_start":255410,"snap_end":348700,"retries":1}                                                        
            2021-04-29T14:03:58.541-07:00 WARN: (DCP) (travel-sample) (vb 716) Received unexpected error 'Stream request for vBucket 716 timed out after 3m0s' on stream | {"uuid":117585510587600,"snap_start":255410,"snap_end":348700,"snap_complete":true,"last_seqno":348700,"retries":0} -- couchbase.(*DCPAsyncWorker).openStream() at dcp_async_worker.go:193
            2021-04-29T14:03:58.541-07:00 WARN: (DCP) (travel-sample) (vb 716) Received an unexpected error whilst streaming, beginning teardown: failed to open stream: failed to stream vBucket 716: client received unexpected error 'Stream request for vBucket 716 timed out after 3m0s' -- couchbase.(*DCPAsyncWorker).handleDCPError() at dcp_async_worker.go:561
            

             Looking at the logs, at that time the node ending in 107 should have had the active copy of vBucket 716:

            memcached.log

            ╰─$ ag "vb:716" 107_logs/cbcollect_info__20210430-033954/memcached.log | ag "setState"
            128591:2021-04-29T13:53:25.680858-07:00 INFO (travel-sample) VBucket::setState: transitioning vb:716 with high seqno:0 from:replica to:replica                                                                                                               
            136521:2021-04-29T14:00:55.530815-07:00 INFO (travel-sample) VBucket::setState: transitioning vb:716 with high seqno:448633 from:replica to:pending                                                                                                          
            136522:2021-04-29T14:00:55.537555-07:00 INFO (travel-sample) VBucket::setState: transitioning vb:716 with high seqno:448634 from:pending to:active                                                                                                           
            141777:2021-04-29T14:05:45.404593-07:00 INFO (travel-sample) VBucket::setState: transitioning vb:716 with high seqno:457500 from:active to:active meta:{"topology":[["ns_1@172.23.136.107","ns_1@172.23.136.106"]]}
            ...                                          
            

             Interestingly enough, it seems like the vBucket moved from pending to active just a couple of seconds before we started the stream. If we look in the same memcached logs for the stream open for vb 716, we can see that it is not there:

            ag "T14:.*cbbackupmgr.*\(vb:716\) Creating" 107_logs/cbcollect_info__20210430-033954/memcached.log    
            

             For reference, we can find some of the other vBucket stream opens that worked, e.g.:

            ag "T14:.*cbbackupmgr.*\(vb:738\) Creating" 107_logs/cbcollect_info__20210430-033954/memcached.log                                                                                                                                                    
            136617:2021-04-29T14:00:56.722060-07:00 INFO 8104: (travel-sample) DCP (Producer) eq_dcpq:cbbackupmgr_2021-04-29T14:00:51-07:00_96500_4 - (vb:738) Creating stream with start seqno 353928 and end seqno 452322; requested end seqno was 452322, collections-filter-size:61 sid:none
            

             So the most likely scenario seems to be that we sent the request to the previous node, which had the vBucket in the active state.

             Looking at the logs, I can see that all of the failed backups occurred during a rebalance that took quite a while:

            105 node ns_server.info.log

            [user:info,2021-04-29T12:43:38.139-07:00,ns_1@172.23.136.106:<0.4856.11>:ns_orchestrator:idle:773]Starting rebalance, KeepNodes = ['ns_1@172.23.136.114','ns_1@172.23.136.106',
                                             'ns_1@172.23.136.107','ns_1@172.23.138.127',
                                             'ns_1@172.23.136.108','ns_1@172.23.136.115',
                                             'ns_1@172.23.136.113','ns_1@172.23.136.110',
                                             'ns_1@172.23.136.105'], EjectNodes = ['ns_1@172.23.136.112'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 12b8ca27d5f5cab819aae6e81d8b4eaf
            [rebalance:info,2021-04-29T12:43:38.142-07:00,ns_1@172.23.136.106:<0.27630.116>:ns_rebalancer:drop_old_2i_indexes:1274]Going to drop possible old 2i indexes on nodes ['ns_1@172.23.136.107']
            ....
                                               'ns_1@172.23.136.110',
                                               'ns_1@172.23.136.113',
                                               'ns_1@172.23.136.114',
                                               'ns_1@172.23.136.115',
                                               'ns_1@172.23.138.127']
            [user:info,2021-04-29T17:32:28.716-07:00,ns_1@172.23.136.106:<0.4856.11>:ns_orchestrator:log_rebalance_completion:1405]Rebalance completed successfully.
            Rebalance Operation Id = 12b8ca27d5f5cab819aae6e81d8b4eaf
            

             This rebalance ran between 12:43:38 and 17:32:28, so the most likely scenario is that at the time of the failures the vBuckets were being moved and we tried to connect to the old node, which failed.

             It is also worth noting that the vBucket we failed to open was originally on node 172.23.136.112, which is the node being ejected. We can see this from the vBucket map in the rebalance message on node 105 (not reproduced here as it is massive): entry 716 is ['ns_1@172.23.136.112', 'ns_1@172.23.136.106'].

             We can also see that the entry in the target vBucket map for vBucket 716 is ['ns_1@172.23.136.107', 'ns_1@172.23.136.106'], which matches what we see in the logs for node 107.

             Furthermore, we can observe how vBucket 716 transitions between states at the time of the stream open:

            [rebalance:info,2021-04-29T13:53:25.841-07:00,ns_1@172.23.136.106:<0.15649.129>:janitor_agent:bulk_set_vbucket_state:379]Doing bulk vbucket 716 state change
            [{'ns_1@172.23.136.106',replica,undefined,'ns_1@172.23.136.112'},
             {'ns_1@172.23.136.107',replica,passive,'ns_1@172.23.136.112'}]
            [rebalance:info,2021-04-29T13:53:25.841-07:00,ns_1@172.23.136.106:<0.14703.129>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.106',
                                             {update_vbucket_state,716,replica,undefined,
                                              'ns_1@172.23.136.112'}}
            [rebalance:info,2021-04-29T13:53:25.841-07:00,ns_1@172.23.136.106:<0.15876.129>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.107',
                                             {update_vbucket_state,716,replica,passive,
                                              'ns_1@172.23.136.112'}}
             
            ....
            [rebalance:info,2021-04-29T13:56:48.454-07:00,ns_1@172.23.136.106:<0.1157.130>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.112',
                                             {update_vbucket_state,716,active,paused,
                                              undefined,
                                              [['ns_1@172.23.136.112',
                                                'ns_1@172.23.136.106']]}}
            ....
            [rebalance:info,2021-04-29T13:58:55.934-07:00,ns_1@172.23.136.106:<0.11937.130>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.112',
                                             {update_vbucket_state,716,active,paused,
                                              undefined,
                                              [['ns_1@172.23.136.112',
                                                'ns_1@172.23.136.106'],
            ...
            [rebalance:info,2021-04-29T14:00:55.701-07:00,ns_1@172.23.136.106:<0.23319.130>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.107',
                                             {update_vbucket_state,716,active,undefined,
                                              undefined,undefined}}
            [rebalance:info,2021-04-29T14:00:55.704-07:00,ns_1@172.23.136.106:<0.23184.130>:janitor_agent:bulk_set_vbucket_state:379]Doing bulk vbucket 716 state change
            [{'ns_1@172.23.136.106',replica,undefined,undefined}]
            [rebalance:info,2021-04-29T14:00:55.705-07:00,ns_1@172.23.136.106:<0.23479.130>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.106',
                                             {update_vbucket_state,716,replica,undefined,
                                              undefined}}
            ....
            [rebalance:info,2021-04-29T14:05:45.547-07:00,ns_1@172.23.136.106:<0.15240.131>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.107',
                                             {update_vbucket_state,716,active,undefined,
                                              undefined,
                                              [['ns_1@172.23.136.107',
                                                'ns_1@172.23.136.106']]}}
            [rebalance:info,2021-04-29T14:05:45.560-07:00,ns_1@172.23.136.106:<0.15240.131>:janitor_agent:bulk_set_vbucket_state:379]Doing bulk vbucket 716 state change
            [{'ns_1@172.23.136.106',replica,undefined,'ns_1@172.23.136.107'}]
            [rebalance:info,2021-04-29T14:05:45.560-07:00,ns_1@172.23.136.106:<0.15727.131>:janitor_agent:set_vbucket_state_inner:413]Doing vbucket 716 state change: {'ns_1@172.23.136.106',
                                             {update_vbucket_state,716,replica,undefined,
                                              'ns_1@172.23.136.107'}}
            

             Seeing how the vBucket state was still transitioning whilst we were opening the streams, my hypothesis is that we most likely connected to the old server, which did not respond at all, causing the timeout. (We cannot confirm this theory, as gocbcore does not log which node we tried to connect to and we don't have any logs for node 112; to verify it we would need a tcp dump from the time this happened.) Assuming this is the case, the fix would be to retry opening the streams a couple of times after timeouts. Note that this requires gocbcore to refresh the cluster map before the next try, as otherwise we would just get the same behaviour again.
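
             A minimal sketch of the kind of retry proposed above, assuming the timeout surfaces as a distinct error value. Every name in it (errStreamRequestTimeout, openVBStream, openStreamWithRetry) is hypothetical and is not the actual cbbackupmgr or gocbcore API:

             package main

             import (
                 "errors"
                 "fmt"
                 "time"
             )

             // errStreamRequestTimeout stands in for the timeout error surfaced by the DCP
             // client; the real error value/type is an assumption for this sketch.
             var errStreamRequestTimeout = errors.New("stream request timed out")

             // openVBStream is a placeholder for whatever function actually issues the DCP
             // stream request for a single vBucket.
             type openVBStream func(vbID uint16) error

             // openStreamWithRetry retries stream-request timeouts a limited number of
             // times, sleeping between attempts so that the underlying client has a chance
             // to pick up a refreshed cluster map and route the request to the new active
             // node.
             func openStreamWithRetry(open openVBStream, vbID uint16, maxRetries int, backoff time.Duration) error {
                 var err error
                 for attempt := 0; attempt <= maxRetries; attempt++ {
                     if err = open(vbID); err == nil {
                         return nil
                     }
                     // Only timeouts are worth retrying; any other error is still fatal.
                     if !errors.Is(err, errStreamRequestTimeout) {
                         return err
                     }
                     time.Sleep(backoff)
                 }
                 return fmt.Errorf("failed to stream vBucket %d after %d retries: %w", vbID, maxRetries, err)
             }

             func main() {
                 // Fake opener which times out twice and then succeeds, to exercise the retry path.
                 attempts := 0
                 open := func(vbID uint16) error {
                     attempts++
                     if attempts < 3 {
                         return errStreamRequestTimeout
                     }
                     return nil
                 }
                 if err := openStreamWithRetry(open, 716, 5, 10*time.Millisecond); err != nil {
                     fmt.Println("stream failed:", err)
                     return
                 }
                 fmt.Printf("stream for vb 716 opened after %d attempts\n", attempts)
             }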

             James Lee, what are your thoughts? I think just retrying would probably work, but I don't remember if there is any way to force an update of the vBucket map in gocbcore.

             carlos.gonzalez Carlos Gonzalez Betancort (Inactive) added a comment - edited
            james.lee James Lee added a comment -

            Hi Carlos Gonzalez Betancort,

             In theory, the SDK should be refreshing its cluster config periodically throughout the time that it's connected to the cluster (via GCCCP). So we shouldn't need to force an update of the SDK's vBucket map; simply retrying the stream request should suffice for the SDK to send the request to the correct node in this case.

             If this is in fact the issue we're hitting, then it should simply be a case of handling a timeout in our 'openStream' function and triggering a retry (for the benefit of others, our DCP client already supports retries, so it should just be a case of wiring up this use case).

            Thanks,
            James
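
             A rough sketch of the kind of wiring described above, i.e. classifying a stream-open timeout as retryable instead of fatal. The names (errStreamRequestTimeout, shouldRetryStreamOpen) are hypothetical and are not the actual DCP client code:

             package main

             import (
                 "errors"
                 "fmt"
             )

             // errStreamRequestTimeout stands in for the timeout error the DCP client
             // reports when KV does not answer a stream request in time (hypothetical).
             var errStreamRequestTimeout = errors.New("stream request timed out")

             // shouldRetryStreamOpen decides whether a failed stream open should be fed
             // back into the client's existing retry machinery rather than tearing the
             // worker down. Timeouts are retryable because the SDK periodically refreshes
             // its cluster config (GCCCP), so a later attempt can land on the new active
             // node; everything else remains fatal.
             func shouldRetryStreamOpen(err error, retries, maxRetries int) bool {
                 return errors.Is(err, errStreamRequestTimeout) && retries < maxRetries
             }

             func main() {
                 err := errStreamRequestTimeout
                 for retries := 0; retries < 4; retries++ {
                     fmt.Printf("retries=%d retry=%v\n", retries, shouldRetryStreamOpen(err, retries, 2))
                 }
             }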

            james.lee James Lee added a comment -

            I believe Carlos Gonzalez Betancort had a power cut towards the end of the day, so I'm going to assign this to myself and get a patch up and ready to be reviewed.

            bryan.mccoid Bryan McCoid added a comment -

            The issue regarding the cbcollect can be fixed by using a more recent version. I've verified that it works in build 5076. 


            Build couchbase-server-7.0.0-5095 contains backup commit d609ddd with commit message:
            MB-46013 Retry timeout errors when requesting DCP streams

             build-team Couchbase Build Team added a comment -

            Closing now as unit tests are in place and passing.

            Ashwin Govindarajulu, please feel free to reopen if this issue appears again.

             joe.mitchelljones Joe Mitchell Jones added a comment -
