Couchbase Kubernetes / K8S-3559

[CAO-2.7.0-197] Operator fails over another node before the delta upgrade of the previous node is complete.

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version: 2.8.0
    • Affects Version: 2.7.0
    • Component: operator
    • Sprint: 19 - A Rock and a Hard Place
    • 1

    Description

      Find the details of the Delta Recovery Upgrade here (with complete logs):
      https://issues.couchbase.com/browse/K8S-3548?focusedId=784164&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-784164

       

      Bug / Issue

      The delta upgrade of the index-query node cb-example-0007 has started, and its rebalance has not yet completed.

      2024-07-04T15:46:17.675Z, menelaus_web_sup:1:info:web start ok(ns_1@cb-example-0007.cb-example.default.svc) - Couchbase Server has started on web port 8091 on node 'ns_1@cb-example-0007.cb-example.default.svc'. Version: "7.6.1-3200-enterprise".
      2024-07-04T15:46:39.757Z, ns_node_disco:4:info:node up(ns_1@cb-example-0003.cb-example.default.svc) - Node 'ns_1@cb-example-0003.cb-example.default.svc' saw that node 'ns_1@cb-example-0007.cb-example.default.svc' came up. Tags: [] (repeated 1 times, last seen 22.155195 secs ago)
      2024-07-04T15:46:48.278Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting rebalance, KeepNodes = ['ns_1@cb-example-0000.cb-example.default.svc',
                                       'ns_1@cb-example-0001.cb-example.default.svc',
                                       'ns_1@cb-example-0003.cb-example.default.svc',
                                       'ns_1@cb-example-0004.cb-example.default.svc',
                                       'ns_1@cb-example-0005.cb-example.default.svc',
                                       'ns_1@cb-example-0006.cb-example.default.svc',
                                       'ns_1@cb-example-0007.cb-example.default.svc',
                                       'ns_1@cb-example-0008.cb-example.default.svc',
                                       'ns_1@cb-example-0009.cb-example.default.svc',
                                       'ns_1@cb-example-0010.cb-example.default.svc'], EjectNodes = [], Failed over and being ejected nodes = []; Delta recovery nodes = ['ns_1@cb-example-0007.cb-example.default.svc'], Delta recovery buckets = all;; Operation Id = af7d1bec05eac9184c59f0dbfb0f2140

       

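      The rebalance fails about 72 seconds after it starts, with a 60000 ms timeout on the index service's ServiceAPI.StartTopologyChange call. About four seconds later, cb-example-0008 is failed over anyway.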

      2024-07-04T15:48:00.047Z, ns_orchestrator:0:critical:message(ns_1@cb-example-0006.cb-example.default.svc) - Rebalance exited with reason {service_rebalance_failed,index,
                                    {agent_died,<0.1788.0>,
                                     {linked_process_died,<0.29724.0>,
                                      {'ns_1@cb-example-0006.cb-example.default.svc',
                                       {timeout,
                                        {gen_server,call,
                                         [<0.2865.0>,
                                          {call,"ServiceAPI.StartTopologyChange",
                                           #Fun<json_rpc_connection.0.36915653>,
                                           #{timeout => 60000}},
                                          60000]}}}}}}.
      Rebalance Operation Id = af7d1bec05eac9184c59f0dbfb0f2140
      2024-07-04T15:48:04.390Z, failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting failing over ['ns_1@cb-example-0008.cb-example.default.svc']
      2024-07-04T15:48:04.390Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Starting failover of nodes ['ns_1@cb-example-0008.cb-example.default.svc'] AllowUnsafe = false Operation Id = 794728626a4e024557c3d166aef5a38b
      2024-07-04T15:48:06.506Z, auto_failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Could not automatically fail over nodes (['ns_1@cb-example-0006.cb-example.default.svc']). Failover is running.
      2024-07-04T15:48:07.505Z, auto_failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Could not automatically fail over node ('ns_1@cb-example-0006.cb-example.default.svc') due to operation being unsafe for service index. Failing over nodes cb-example-0006.cb-example.default.svc:9102(d2684d8551e1ab69a3d890890dda3252) would lose the following indexes/partitions: bucket4._default._default.primary_idx_bucket4_1 0 bucket4._default._default.primary_idx_bucket4_2 0
      2024-07-04T15:48:11.507Z, auto_failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Could not automatically fail over node ('ns_1@cb-example-0006.cb-example.default.svc') due to operation being unsafe for service index. Failing over nodes cb-example-0006.cb-example.default.svc:9102(d2684d8551e1ab69a3d890890dda3252) would lose the following indexes/partitions: bucket4._default._default.primary_idx_bucket4_2 0 bucket4._default._default.primary_idx_bucket4_1 0
      2024-07-04T15:49:00.131Z, failover:0:critical:message(ns_1@cb-example-0006.cb-example.default.svc) - Failed over ['ns_1@cb-example-0008.cb-example.default.svc']. Failover couldn't complete on some nodes:
      ['ns_1@cb-example-0008.cb-example.default.svc']
      2024-07-04T15:49:00.158Z, failover:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Deactivating failed over nodes ['ns_1@cb-example-0008.cb-example.default.svc']
      2024-07-04T15:49:00.323Z, ns_orchestrator:0:info:message(ns_1@cb-example-0006.cb-example.default.svc) - Failover completed successfully.
      Rebalance Operation Id = 794728626a4e024557c3d166aef5a38b
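
The ordering problem in the excerpts above (a failover of a different node starting while a delta recovery rebalance is still unfinished) can be checked mechanically against a collected event log. The following is a minimal Python sketch, not part of the Operator or of Couchbase Server. It assumes the log line shape seen in this ticket; the function names are hypothetical, and the "Rebalance completed successfully" prefix is an assumption about the server's success message, which does not appear in the excerpts here.

```python
import re
from datetime import datetime

# Line shape seen in this ticket:
#   <timestamp>, <module>:<code>:<level>:<type>(<node>) - <text>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z), "
    r"(?P<module>\w+):\d+:(?P<level>\w+):(?P<kind>[^(]+)"
    r"\((?P<node>[^)]+)\) - (?P<text>.*)$"
)
NODE_RE = re.compile(r"'(ns_1@[^']+)'")


def parse_entries(raw):
    """Group raw lines into entries; continuation lines join the previous entry."""
    entries = []
    for line in raw.splitlines():
        stripped = line.strip()
        m = LINE_RE.match(stripped)
        if m:
            ts = datetime.strptime(m["ts"], "%Y-%m-%dT%H:%M:%S.%fZ")
            entries.append({"ts": ts, "text": m["text"]})
        elif entries and stripped:
            entries[-1]["text"] += " " + stripped
    return entries


def failovers_during_delta_recovery(raw):
    """Return (timestamp, failed-over nodes, pending delta nodes) tuples.

    A node counts as pending from the rebalance that lists it under
    'Delta recovery nodes' until a rebalance is reported as successful.
    """
    pending = set()
    findings = []
    for entry in parse_entries(raw):
        text = entry["text"]
        if text.startswith("Starting rebalance"):
            m = re.search(r"Delta recovery nodes = \[(.*?)\]", text)
            if m:
                pending |= set(NODE_RE.findall(m.group(1)))
        elif text.startswith("Rebalance completed successfully"):
            # Assumed success message; it is not shown in this ticket.
            pending.clear()
        elif text.startswith("Starting failover of nodes"):
            m = re.search(r"Starting failover of nodes \[(.*?)\]", text)
            victims = set(NODE_RE.findall(m.group(1))) if m else set()
            others = victims - pending
            if pending and others:
                findings.append((entry["ts"], sorted(others), sorted(pending)))
    return findings
```

Run against the excerpts in this description, the sketch would be expected to report the failover of cb-example-0008 at 15:48:04.390Z while cb-example-0007 is still pending delta recovery.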

      After this, there is a series of rebalance failures:

       

      2024-07-04T15:50:00.055Z, ns_orchestrator:0:critical:message(ns_1@cb-example-0006.cb-example.default.svc) - Rebalance exited with reason {{badmatch,failed},
                                    [{ns_rebalancer,rebalance_body,7,
                                         [{file,"src/ns_rebalancer.erl"},
                                          {line,500}]},
                                     {async,'-async_init/4-fun-1-',3,
                                         [{file,"src/async.erl"},{line,199}]}]}.
      Rebalance Operation Id = 223c530b7bc62d61ee78e12ea0a8a460
      ...
      ...
      2024-07-04T15:52:00.066Z, ns_orchestrator:0:critical:message(ns_1@cb-example-0006.cb-example.default.svc) - Rebalance exited with reason {{badmatch,failed},
                                    [{ns_rebalancer,rebalance_body,7,
                                         [{file,"src/ns_rebalancer.erl"},
                                          {line,500}]},
                                     {async,'-async_init/4-fun-1-',3,
                                         [{file,"src/async.erl"},{line,199}]}]}.
      Rebalance Operation Id = 4a2e70a85ea938068221399caffd4a9a
      

       

      Eventually, a rebalance succeeds.

      cb-example-0008 is then failed over again, comes up on 7.6.1, and is rebalanced successfully.

       

      Logs

      2024-07-04A_PostDeltaNodeRestart_cbopinfo-20240705T020256+0530.tar.gz

            People

              justin.ashworth Justin Ashworth
              aryaan.bhaskar Aryaan Bhaskar