  1. Couchbase Server
  2. MB-48215

[ARM] backup service related rebalance failure observed in 4 node sanity

Details

    • Untriaged
    • 1
    • No

    Description

      7.1.0-1190

      Test:
      ./testrunner -i test_sanity.ini -p use_hostnames=true,get-cbcollect-info=True -t ent_backup_restore.enterprise_backup_restore_test.EnterpriseBackupRestoreTest.test_backup_restore_sanity,items=1000,reset_services=True

      diag.log:

      2021-08-27T16:24:20.936Z, ns_orchestrator:0:critical:message(ns_1@172.31.30.96) - Rebalance exited with reason {service_rebalance_failed,backup,
                                    {agent_died,<30280.5746.5>,
                                     {linked_process_died,<30280.7613.5>,
                                      {'ns_1@ec2-34-221-216-74.us-west-2.compute.amazonaws.com',
                                       {{badmatch,
                                         {false,
                                          {topology,[],
                                           [<<"66dd46e037188b3db715618d7a28d33b">>,
                                            <<"7f1435314ca7bde12749503dd0f06fd2">>],
                                           true,[]},
                                          {topology,[],
                                           [<<"66dd46e037188b3db715618d7a28d33b">>],
                                           true,[]}}},
                                        [{service_agent,long_poll_worker_loop,5,
                                          [{file,"src/service_agent.erl"},
                                           {line,654}]},
                                         {proc_lib,init_p,3,
                                          [{file,"proc_lib.erl"},{line,234}]}]}}}}}.
      Rebalance Operation Id = 93f13a15e36925d125627dc2b3dcc372
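
The two topology tuples inside the badmatch term differ only in their node lists. The sketch below makes that difference explicit. It copies the node IDs from the trace above. The variable and function names are hypothetical, and it makes no claim about which tuple service_agent treats as expected and which as observed.

```python
# Node lists copied from the two {topology, ...} tuples in the badmatch term.
# Names here are illustrative only; they do not come from ns_server.
first_topology_nodes = [
    "66dd46e037188b3db715618d7a28d33b",
    "7f1435314ca7bde12749503dd0f06fd2",
]
second_topology_nodes = [
    "66dd46e037188b3db715618d7a28d33b",
]


def node_diff(a, b):
    """Return (only_in_a, only_in_b) as sorted lists of node IDs."""
    return sorted(set(a) - set(b)), sorted(set(b) - set(a))


if __name__ == "__main__":
    only_first, only_second = node_diff(first_topology_nodes, second_topology_nodes)
    print("only in first topology: ", only_first)
    print("only in second topology:", only_second)
```

The only ID present in one tuple and not the other is 7f1435314ca7bde12749503dd0f06fd2, which is the node listed under ejectNodes in the backup service log.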
      

      backup service log:

      2021-08-27T16:23:50.929Z INFO (Rebalance) Starting rebalance {"change": {"id":"3b7e8e31dd08bef5e2e8c8b6cfa2e32b","currentTopologyRev":null,"type":"topology-change-rebalance","keepNodes":[{"nodeInfo":{"nodeId":"66dd46e037188b3db715618d7a28d33b","priority":2,"opaque":{"grpc_port":9124,"host":"172.31.30.96","http_port":8097}},"recoveryType":"recovery-full"}],"ejectNodes":[{"nodeId":"7f1435314ca7bde12749503dd0f06fd2","priority":1,"opaque":{"grpc_port":9124,"host":"ec2-34-221-216-74.us-west-2.compute.amazonaws.com","http_port":8097}}]}}
      2021-08-27T16:23:50.931Z INFO (Rebalance) Got old leader {"leader": "66dd46e037188b3db715618d7a28d33b"}
      2021-08-27T16:23:50.933Z INFO (Rebalance) Got current nodes {"#nodes": 2}
      2021-08-27T16:23:50.933Z INFO (Rebalance) Setting self as leader
      2021-08-27T16:23:50.933Z INFO (Rebalance) Did the failover nodes will do eject node now {"eject nodes": [{"nodeId":"7f1435314ca7bde12749503dd0f06fd2","priority":1,"opaque":{"grpc_port":9124,"host":"ec2-34-221-216-74.us-west-2.compute.amazonaws.com","http_port":8097}}]}
      2021-08-27T16:23:50.934Z INFO (Rebalance) Removing node from service {"nodeID": "7f1435314ca7bde12749503dd0f06fd2"}
      2021-08-27T16:23:50.934Z DEBUG (Leader Manager) Received store event {"eventType": 2}
      2021-08-27T16:24:10.939Z DEBUG (Rebalance) Failed to establish connection with remove node {"nodeID": "7f1435314ca7bde12749503dd0f06fd2", "err": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.31.34.137:9124: i/o timeout\""}
      2021-08-27T16:24:20.938Z INFO (Service Manager) Cancel task {"id": "rebalance/3b7e8e31dd08bef5e2e8c8b6cfa2e32b"}
      2021-08-27T16:24:20.938Z INFO (Rebalance) Cancelling rebalance
      2021-08-27T16:24:31.940Z DEBUG (Rebalance) Failed to establish connection with remove node {"nodeID": "7f1435314ca7bde12749503dd0f06fd2", "err": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.31.34.137:9124: i/o timeout\""}
      2021-08-27T16:24:31.940Z ERROR (Rebalance) Could not confirm node was removed {"nodeID": "7f1435314ca7bde12749503dd0f06fd2", "err": "could not remove node '7f1435314ca7bde12749503dd0f06fd2': operation was cancelled"}
      2021-08-27T16:27:34.780Z INFO (Stats) Start repositories data size collection
      2021-08-27T16:27:34.781Z INFO (Stats) Stop repositories data size collection
      2021-08-27T16:32:34.783Z INFO (Stats) Start repositories data size collection
      2021-08-27T16:32:34.784Z INFO (Stats) Stop repositories data size collection
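
To make the timing in the backup service log easier to read, here is a minimal sketch that parses a few of the lines above and prints each event's offset from the "Removing node from service" entry. The line layout (timestamp, level, component in parentheses, message, optional trailing JSON) is an assumption inferred from this excerpt only, not a documented format. The function names are hypothetical.

```python
import json
import re
from datetime import datetime

# Assumed layout, inferred from the excerpt:
#   <timestamp>Z <LEVEL> (<Component>) <message> [<JSON payload>]
LINE_RE = re.compile(
    r"^(?P<ts>\S+Z)\s+(?P<level>[A-Z]+)\s+\((?P<component>[^)]+)\)\s+(?P<rest>.*)$"
)

# Lines copied verbatim from the backup service log in this report.
SAMPLE = r"""
2021-08-27T16:23:50.934Z INFO (Rebalance) Removing node from service {"nodeID": "7f1435314ca7bde12749503dd0f06fd2"}
2021-08-27T16:24:10.939Z DEBUG (Rebalance) Failed to establish connection with remove node {"nodeID": "7f1435314ca7bde12749503dd0f06fd2", "err": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing dial tcp 172.31.34.137:9124: i/o timeout\""}
2021-08-27T16:24:20.938Z INFO (Service Manager) Cancel task {"id": "rebalance/3b7e8e31dd08bef5e2e8c8b6cfa2e32b"}
2021-08-27T16:24:20.938Z INFO (Rebalance) Cancelling rebalance
2021-08-27T16:24:31.940Z ERROR (Rebalance) Could not confirm node was removed {"nodeID": "7f1435314ca7bde12749503dd0f06fd2", "err": "could not remove node '7f1435314ca7bde12749503dd0f06fd2': operation was cancelled"}
""".strip().splitlines()


def parse_line(line):
    """Split one log line into timestamp, level, component, message, payload."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y-%m-%dT%H:%M:%S.%fZ")
    rest = m.group("rest")
    message, payload = rest, None
    brace = rest.find("{")
    if brace != -1:
        try:
            payload = json.loads(rest[brace:])
            message = rest[:brace].rstrip()
        except json.JSONDecodeError:
            pass  # keep the whole remainder as the message
    return {
        "ts": ts,
        "level": m.group("level"),
        "component": m.group("component"),
        "message": message,
        "payload": payload,
    }


def timeline(lines):
    """Return (seconds since first parsed line, entry) pairs."""
    entries = [e for e in (parse_line(l) for l in lines) if e is not None]
    start = entries[0]["ts"]
    return [((e["ts"] - start).total_seconds(), e) for e in entries]


if __name__ == "__main__":
    for offset, e in timeline(SAMPLE):
        print(f"+{offset:7.3f}s {e['level']:<5} ({e['component']}) {e['message']}")
```

Read from the timestamps, the first dial failure is logged about 20 seconds after node removal starts, the cancel arrives about 30 seconds after it, and the final error about 41 seconds after it. The cancel time (16:24:20.938Z) is within a few milliseconds of the ns_orchestrator failure message in diag.log (16:24:20.936Z).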
      

            People

              arunkumar Arunkumar Senthilnathan (Inactive)