Couchbase Server / MB-45970

[Windows] cbbackupmgr: Backup failed with reason "failed to transfer Analytics metadata"


Details

    Description

      Build: 7.0.0-5017

      Scenario:

Performing a scheduled backup on a multi-node cluster with a single bucket and other services running.

      Observation:

One of the scheduled backups failed with "exit status 1", reason "failed to get Analytics metadata":

      {
        "task_name": "backup_hourly",
        "status": "failed",
        "start": "2021-04-27T23:00:35.2044958-07:00",
        "end": "2021-04-27T23:00:36.6512701-07:00",
        "node_runs": [
          {
            "node_id": "877329ce98540f201072f8ff94769fe8",
            "status": "failed",
            "start": "2021-04-27T23:00:35.2544982-07:00",
            "end": "2021-04-27T23:00:36.6142778-07:00",
            "error": "exit status 1: failed to execute cluster operations: failed to transfer Analytics metadata: failed to get Analytics metadata: failed to get analytics metadata: failed to execute request: http client failed to dispatch/receive request/response: Get \"http://172.23.138.127:8095/api/v1/backup\": dial tcp 172.23.138.127:8095: connectex: No connection could be made because the target machine actively refused it.",
            "progress": 0,
            "stats": {
              "error": "failed to execute cluster operations: failed to transfer Analytics metadata: failed to get Analytics metadata: failed to get analytics metadata: failed to execute request: http client failed to dispatch/receive request/response: Get \"http://172.23.138.127:8095/api/v1/backup\": dial tcp 172.23.138.127:8095: connectex: No connection could be made because the target machine actively refused it.",
              "stats": {
                "started_at": 1619589635472537600,
                "complete": false
              }
            },
            "error_code": 2
          }
        ],
        "error": "exit status 1: failed to execute cluster operations: failed to transfer Analytics metadata: failed to get Analytics metadata: failed to get analytics metadata: failed to execute request: http client failed to dispatch/receive request/response: Get \"http://172.23.138.127:8095/api/v1/backup\": dial tcp 172.23.138.127:8095: connectex: No connection could be made because the target machine actively refused it.",
        "error_code": 2,
        "type": "BACKUP",
        "show": true
      }

Note: The consecutive backup at start time '2021-04-28T00:00:35-07:00' also failed for the same reason.


Attaching the full backup directory data (backups.tar.gz).

      Attachments

  1. active_backups.png (202 kB)
  2. backup_list.png (291 kB)
  3. backup_task_history.png (417 kB)
  4. backups.tar.gz (20.61 MB)
  5. image-2021-04-28-12-33-22-692.png (426 kB)


          Activity

            Removing tools as "actively refused it" means that the host sent a reset instead of an ack when you tried to connect. Either there is a firewall blocking the connection, or the process that is hosting the service is not listening on that port.

carlos.gonzalez Carlos Gonzalez Betancort (Inactive) added a comment
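For reference, "actively refused it" is the Windows (connectex) wording for a TCP RST received in response to the connection attempt; on other platforms the same condition surfaces as ECONNREFUSED. A minimal sketch of distinguishing a refused port (no listener, but host reachable) from a filtered or unreachable one (firewall dropping packets) with a plain TCP probe; the host and port below are taken from the error in this ticket, and the helper name is illustrative:

```python
import socket

def probe(host: str, port: int, timeout: float = 2.0) -> str:
    """Attempt a plain TCP connect and classify the outcome."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "listening"
    except ConnectionRefusedError:
        # RST in reply to SYN: the host is up but nothing is
        # listening on that port (Windows reports this as
        # "connectex: ... actively refused it").
        return "refused"
    except (socket.timeout, OSError):
        # No reply at all: host down, route missing, or a
        # firewall silently dropping the packets.
        return "filtered/unreachable"
```

A "refused" result for 172.23.138.127:8095 would point at the Analytics process not listening, rather than at a firewall, which is consistent with the diagnosis in this comment.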

Able to trigger an initial full backup as soon as the backup task is created, so a firewall issue seems unlikely here.

Below are the logs from that successful cbbackupmgr run:

            {
              "task_name": "BACKUP-6381bae2-4ab6-4dbb-bd00-f55d6fd99798",
              "status": "done",
              "start": "2021-04-27T22:00:53.7369956-07:00",
              "end": "2021-04-27T22:01:08.0478037-07:00",
              "node_runs": [
                {
                  "node_id": "877329ce98540f201072f8ff94769fe8",
                  "status": "done",
                  "start": "2021-04-27T22:00:53.7349927-07:00",
                  "end": "2021-04-27T22:01:08.0218087-07:00",
                  "progress": 100,
                  "stats": {
                    "id": "0e1002ff-9ed0-4891-bc64-7a12d8827110",
                    "current_transfer": 1,
                    "total_transfers": 1,
                    "transfers": [
                      {
                        "description": "Backing up to 2021-04-27T22_00_53.9510062-07_00",
                        "stats": {
                          "started_at": 1619586053911005400,
                          "finished_at": 1619586067745784300,
                          "buckets": {
                            "travel-sample": {
                              "estimated_total_items": 436024,
                              "total_items": 163309,
                              "total_vbuckets": 1024,
                              "vbuckets_complete": 1024,
                              "bytes_received": 68550474,
                              "snapshot_markers_received": 1024,
                              "snapshot_ends_received": 1024,
                              "failover_logs_received": 1024,
                              "mutations_received": 163308,
                              "deletions_received": 1,
                              "started_at": 1619586060758350600,
                              "finished_at": 1619586067713778700,
                              "complete": true,
                              "errored": false
                            }
                          },
                          "complete": true
                        },
                        "progress": 100,
                        "eta": "2021-04-27T22:01:07.7477841-07:00"
                      }
                    ],
                    "progress": 100,
                    "eta": "2021-04-27T22:01:07.7477841-07:00"
                  },
                  "error_code": 0
                }
              ],
              "error_code": 0,
              "type": "BACKUP",
              "show": true
            }

ashwin.govindarajulu Ashwin Govindarajulu added a comment

            CC Till Westmann

Looks like something went wrong on the Analytics node 172.23.138.127.

Rebalance is also failing with a similar error: '<<"wait for condition on http://172.23.138.127:9111/analytics/cluster timed out">>'

            Rebalance exited with reason {service_rebalance_failed,cbas,
            {worker_died, {'EXIT',<0.8464.87>,
            {rebalance_failed, {service_error,
            <<"wait for condition on http://172.23.138.127:9111/analytics/cluster timed out">>}}}}}.
            Rebalance Operation Id = 21f21dd8267e7427ba60e17f37eb3655
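The "wait for condition ... timed out" error follows the usual poll-until-deadline pattern: the orchestrator repeatedly checks the /analytics/cluster endpoint and gives up when the cluster never reaches the expected state before the deadline. A generic sketch of that pattern (the names and timings are illustrative, not the actual ns_server/cbas implementation):

```python
import time

def wait_for_condition(check, timeout_s: float, interval_s: float = 1.0) -> bool:
    """Poll check() until it returns True or the deadline passes.

    Raises TimeoutError if the condition never holds, which is the
    situation reported here: the Analytics cluster state never became
    ready before the rebalance deadline.
    """
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if check():
            return True
        time.sleep(interval_s)
    raise TimeoutError("wait for condition timed out")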

From ns_server.analytics_info.log:

2021-04-27T22:32:36.038-07:00 WARN CBAS.rebalance.Rebalance [Executor-31:ClusterController] enter Rebalance 1e876a1d016ce9b4ae8520375ce42aeb: {"id":"1e876a1d016ce9b4ae8520375ce42aeb","currentTopologyRev":null,"type":"topology-change-rebalance","keepNodes":[],"ejectNodes":[{"nodeId":"720abd8229e9bcad418153e53d6d79a3","priority":0,"opaque":{"cbas-version":"7.0.0-5017","cc-http-port":"9111","controller-id":"3","host":"172.23.138.127","num-iodevices":"2","starting-partition-id":"3","svc-http-port":"8095"}}]}
2021-04-27T22:32:36.038-07:00 INFO CBAS.rebalance.Rebalance [Executor-31:ClusterController] Removing all analytics nodes. Declaring rebalance complete
2021-04-27T22:32:36.038-07:00 INFO CBAS.rebalance.RebalanceProgress [Executor-31:ClusterController] rebalance completed
2021-04-27T22:32:36.039-07:00 WARN CBAS.rebalance.Rebalance [Executor-31:ClusterController] exit Rebalance 1e876a1d016ce9b4ae8520375ce42aeb
2021-04-27T22:32:36.039-07:00 INFO CBAS.rebalance.RebalanceProgress [Executor-30:ClusterController] dataset size fetcher interrupted
            2021-04-27T22:32:36.554-07:00 INFO CBAS.cbas instructing driver to shut down & eject...
            2021-04-27T22:32:36.554-07:00 INFO CBAS.cbas eject request rec'd; killing analytics driver
            2021-04-27T22:32:37.244-07:00 WARN CBAS.cbas analytics driver has exited w/ error exit status 33
            2021-04-27T22:32:37.244-07:00 INFO CBAS.cbas cleaning storage artifacts due to node removal...
            2021-04-27T22:32:37.325-07:00 INFO CBAS.cbas Cleaned up storage dir c:\Program Files\Couchbase\Server\var\lib\couchbase\data\@analytics\v_iodevice_0
            2021-04-27T22:32:37.365-07:00 INFO CBAS.cbas Cleaned up storage dir c:\Program Files\Couchbase\Server\var\lib\couchbase\data\@analytics\v_iodevice_1
            2021-04-27T22:32:37.365-07:00 INFO CBAS.cbas stopping coordinator
            2021-04-27T22:32:37.366-07:00 INFO CBAS.cbas Stopped clusterCompatibility monitor
            2021-04-27T22:32:37.368-07:00 INFO CBAS.cbas ignoring already ejected state...
            2021-04-27T22:32:37.372-07:00 INFO CBAS.cbas halting pending kill by ns_server...
            2021-04-27T22:32:37.395-07:00 INFO CBAS.cbas all analytics nodes have been ejected; clearing cbas dir from metakv
            2021-04-27T22:32:37.415-07:00 INFO CBAS.cbas got nil /cbas/topology; ignoring
            2021-04-27T22:32:41.596-07:00 INFO CBAS.cbas service is being re-added to the cluster; ensuring proper state
            2021-04-27T22:32:41.642-07:00 INFO CBAS.cbas Initialized starting partitionId to 0 for node 720abd8229e9bcad418153e53d6d79a3
            2021-04-27T22:32:41.642-07:00 INFO CBAS.cbas Initialized controllerId to 0 for node 720abd8229e9bcad418153e53d6d79a3
            2021-04-27T22:32:41.646-07:00 INFO CBAS.cbas no CC / Metadata node defined in metakv- will pick *this* node, 720abd8229e9bcad418153e53d6d79a3
            2021-04-27T22:32:41.649-07:00 INFO CBAS.cbas our node is within topology refresh 0749e68955426c2539546042a5db3870/27; ensuring driver is running...
            2021-04-27T22:35:08.850-07:00 INFO CBAS.cbas our node is within topology refresh a11fbe8274faa083553d1011e8aebc37/28; ensuring driver is running...
            2021-04-27T22:40:08.443-07:00 INFO CBAS.cbas failing rebalance due to wait for condition on http://172.23.138.127:9111/analytics/cluster timed out
            2021-04-27T22:40:08.443-07:00 INFO CBAS.cbas setting balanced state to balanced-65 for a11fbe8274faa083553d1011e8aebc37

            Latest cbcollect logs:
            https://cb-engineering.s3.amazonaws.com/cbas_gw_timeout/collectinfo-2021-04-28T125557-ns_1%40172.23.136.105.zip
            https://cb-engineering.s3.amazonaws.com/cbas_gw_timeout/collectinfo-2021-04-28T125557-ns_1%40172.23.136.106.zip
            https://cb-engineering.s3.amazonaws.com/cbas_gw_timeout/collectinfo-2021-04-28T125557-ns_1%40172.23.136.107.zip
            https://cb-engineering.s3.amazonaws.com/cbas_gw_timeout/collectinfo-2021-04-28T125557-ns_1%40172.23.136.108.zip
            https://cb-engineering.s3.amazonaws.com/cbas_gw_timeout/collectinfo-2021-04-28T125557-ns_1%40172.23.136.110.zip
            https://cb-engineering.s3.amazonaws.com/cbas_gw_timeout/collectinfo-2021-04-28T125557-ns_1%40172.23.136.112.zip
            https://cb-engineering.s3.amazonaws.com/cbas_gw_timeout/collectinfo-2021-04-28T125557-ns_1%40172.23.136.113.zip
            https://cb-engineering.s3.amazonaws.com/cbas_gw_timeout/collectinfo-2021-04-28T125557-ns_1%40172.23.136.114.zip
            https://cb-engineering.s3.amazonaws.com/cbas_gw_timeout/collectinfo-2021-04-28T125557-ns_1%40172.23.136.115.zip
            https://cb-engineering.s3.amazonaws.com/cbas_gw_timeout/collectinfo-2021-04-28T125557-ns_1%40172.23.138.127.zip

ashwin.govindarajulu Ashwin Govindarajulu added a comment
            michael.blow Michael Blow added a comment -

This looks to be a duplicate of MB-45869.


Resolving as a duplicate of MB-45869. The backup failed due to the same rebalance issue.

ali.alsuliman Ali Alsuliman added a comment

            Closing this ticket.

ashwin.govindarajulu Ashwin Govindarajulu added a comment

People

  Assignee: ashwin.govindarajulu Ashwin Govindarajulu
  Reporter: ashwin.govindarajulu Ashwin Govindarajulu
  Votes: 0
  Watchers: 6