Couchbase Server / MB-50442

MultiNodeFailover: Failover failed with reason 'Failed to get failover info for bucket "default"'


Details

    Description

       

      Steps:

      • Five-node cluster

        +----------------+-------------+-----------------+-----------+----------+-----------------------+-------------------+
        | Node           | Services    | CPU_utilization | Mem_total | Mem_free | Swap_mem_used         | Active / Replica  |
        +----------------+-------------+-----------------+-----------+----------+-----------------------+-------------------+
        | 172.23.105.212 | kv          | 0               | 0.0 Byte  | 0.0 Byte | 0.0 Byte / 0.0 Byte   | 0 / 0             |
        | 172.23.105.244 | index       | 0.426599749059  | 3.91 GiB  | 3.29 GiB | 92.50 MiB / 3.50 GiB  | 0 / 0             |
        | 172.23.105.155 | kv          | 4.7800661073    | 3.91 GiB  | 2.98 GiB | 115.79 MiB / 3.50 GiB | 0 / 0             |
        | 172.23.105.213 | index, n1ql | 1.28237364848   | 3.91 GiB  | 3.17 GiB | 64.89 MiB / 3.50 GiB  | 0 / 0             |
        | 172.23.105.211 | kv          | 4.60642875221   | 3.91 GiB  | 3.20 GiB | 146.50 MiB / 3.50 GiB | 0 / 0             |
        +----------------+-------------+-----------------+-----------+----------+-----------------------+-------------------+

      • Couchbase bucket with replicas=2

        +---------+-----------+-----------------+----------+------------+-----+-------+-----------+-----------+-----------+-----+
        | Bucket  | Type      | Storage Backend | Replicas | Durability | TTL | Items | RAM Quota | RAM Used  | Disk Used | ARR |
        +---------+-----------+-----------------+----------+------------+-----+-------+-----------+-----------+-----------+-----+
        | default | couchbase | couchstore      | 2        | none       | 0   | 3     | 5.09 GiB  | 81.97 MiB | 28.36 MiB | 100 |
        +---------+-----------+-----------------+----------+------------+-----+-------+-----------+-----------+-----------+-----+

      • Auto Failover with maxCount=5 and timeout=60 seconds
      • Perform a swap rebalance

        +----------------+-------------+-----------------------+----------------+--------------+
        | Nodes          | Services    | Version               | CPU            | Status       |
        +----------------+-------------+-----------------------+----------------+--------------+
        | 172.23.105.212 | kv          | 7.1.0-2073-enterprise | 13.9653414883  | Cluster node |
        | 172.23.105.244 | index       | 7.1.0-2073-enterprise | 0.853842290306 | --- OUT ---> |
        | 172.23.105.155 | kv          | 7.1.0-2073-enterprise | 6.73957273652  | --- OUT ---> |
        | 172.23.105.213 | index, n1ql | 7.1.0-2073-enterprise | 4.83666751076  | Cluster node |
        | 172.23.105.211 | kv          | 7.1.0-2073-enterprise | 5.13274336283  | Cluster node |
        | 172.23.105.245 | kv          |                       |                | <--- IN ---  |
        | 172.23.100.22  | kv          |                       |                | <--- IN ---  |
        +----------------+-------------+-----------------------+----------------+--------------+

      • Introduce failures on the following nodes

        +----------------+----------+-------------+----------------+
        | Node           | Services | Node status | Failover type  |
        +----------------+----------+-------------+----------------+
        | 172.23.105.244 | index    | active      | stop_couchbase |
        | 172.23.105.211 | kv       | active      | stop_memcached |
        | 172.23.105.155 | kv       | active      | stop_memcached |
        +----------------+----------+-------------+----------------+

      Failover failed with the following reason:

      Failover exited with reason {failover_failed,"default",
      "Failed to get failover info for bucket \"default\": ['ns_1@172.23.100.22']"}.
      Rebalance Operation Id = 565989613b752ac3f171b34eea50c86b

       

       

      Attachments


        Activity

          artem Artem Stemkovski added a comment - - edited

          Here's what happened: when memcached was shut down on 172.23.105.211 and 172.23.105.155, the rebalance was still under way,

          so the attempts to establish replication streams with these two nodes from ns_1@172.23.100.22 hung waiting for a timeout.

          Then the rebalance was interrupted by the failover:

          2022-01-18T04:56:04.839-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.155) - Rebalance interrupted due to auto-failover of nodes ['ns_1@172.23.105.155',
                                                               'ns_1@172.23.105.211',
                                                               'ns_1@172.23.105.244'].
          

          The failover then tried to request vbucket states from ns_1@172.23.100.22 (the timeout of this operation is 10 seconds).

          Before any other commands can be executed on the janitor_agent, it must first clean up all the artifacts from the unfinished rebalance, which took far longer than 10 seconds because the DCP stream commands were hanging on a much larger timeout.

          [ns_server:debug,2022-01-18T04:56:04.844-08:00,ns_1@172.23.100.22:janitor_agent-default<0.6501.26>:janitor_agent:set_rebalance_mref:826]Changing rebalance pid from <18189.19387.0> to undefined for {'DOWN',
                                                                        #Ref<0.3475994529.757596164.168799>,
                                                                        process,
                                                                        <18189.19387.0>,
                                                                        shutdown}
          [ns_server:debug,2022-01-18T04:56:04.844-08:00,ns_1@172.23.100.22:janitor_agent-default<0.6501.26>:janitor_agent:set_rebalance_mref:830]Killing rebalance-related subprocess: <0.8066.26>
          [ns_server:debug,2022-01-18T04:56:04.844-08:00,ns_1@172.23.100.22:janitor_agent-default<0.6501.26>:janitor_agent:set_rebalance_mref:839]Killing apply_vbucket_states_worker: <0.6793.26>
           
          ........................
           
          [error_logger:error,2022-01-18T04:56:59.596-08:00,ns_1@172.23.100.22:<0.8065.26>:ale_error_logger_handler:do_log:101]
          =========================CRASH REPORT=========================
            crasher:
              initial call: dcp_proxy:init/1
              pid: <0.8065.26>
              registered_name: []
              exception error: no match of right hand side value {error,timeout}
                in function  mc_client_binary:cmd_vocal_recv/5 (src/mc_client_binary.erl, line 151)
                in call from mc_client_binary:auth/2 (src/mc_client_binary.erl, line 192)
                in call from dcp_proxy:connect/5 (src/dcp_proxy.erl, line 254)
                in call from dcp_proxy:maybe_connect/2 (src/dcp_proxy.erl, line 236)
                in call from dcp_producer_conn:init/2 (src/dcp_producer_conn.erl, line 25)
                in call from dcp_proxy:init/1 (src/dcp_proxy.erl, line 52)
                in call from gen_server:init_it/2 (gen_server.erl, line 423)
                in call from gen_server:init_it/6 (gen_server.erl, line 390)
              ancestors: ['dcp_replicator-default-ns_1@172.23.105.155',
                            'dcp_sup-default','single_bucket_kv_sup-default',
                            ns_bucket_sup,ns_bucket_worker_sup,ns_server_sup,
                            ns_server_nodes_sup,<0.5943.26>,ns_server_cluster_sup,
                            root_sup,<0.145.0>]
              message_queue_len: 0
              messages: []
              links: [<0.7958.26>,#Port<0.40686>]
              dictionary: []
              trap_exit: false
              status: running
              heap_size: 987
              stack_size: 29
              reductions: 1837
            neighbours:
           
           
          [ns_server:debug,2022-01-18T04:56:59.596-08:00,ns_1@172.23.100.22:janitor_agent-default<0.6501.26>:dcp_sup:nuke:114]Nuking DCP replicators for bucket "default":
          

          This caused janitor_agent:query_vbuckets to fail with a timeout, and the failover was aborted:

          [ns_server:error,2022-01-18T04:56:15.023-08:00,ns_1@172.23.105.155:<0.30983.0>:failover:failover_buckets:262]Caught failover exception: "Failed to get failover info for bucket \"default\": ['ns_1@172.23.100.22']"
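          The race above can be modeled as a small timing sketch: a single-threaded agent (standing in for the janitor_agent gen_server, which serves requests in mailbox order) must finish a cleanup step that blocks on hung DCP teardown before it can answer the failover's 10-second vbucket query. This is a toy Python model, not the ns_server code; all names and the scaled-down timeouts are illustrative.

```python
import queue
import threading
import time

class ToyAgent:
    """Serializes requests through one mailbox, like an Erlang gen_server."""
    def __init__(self):
        self._mailbox = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            fn, reply = self._mailbox.get()
            reply.put(fn())

    def cast(self, fn):
        # Fire-and-forget request; it still occupies the mailbox while running.
        self._mailbox.put((fn, queue.Queue()))

    def call(self, fn, timeout):
        # Synchronous request with a caller-side timeout.
        reply = queue.Queue()
        self._mailbox.put((fn, reply))
        try:
            return reply.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError('Failed to get failover info for bucket "default"')

def cleanup_unfinished_rebalance():
    # Stands in for killing the DCP replicators to the downed nodes:
    # those kills hang until a much larger network timeout fires.
    time.sleep(0.5)

agent = ToyAgent()
agent.cast(cleanup_unfinished_rebalance)  # rebalance interrupted -> cleanup runs first
try:
    # The failover's vbucket-state query, with its 10s budget scaled down.
    agent.call(lambda: {"active": 0, "replica": 0}, timeout=0.1)
    outcome = "failover proceeds"
except TimeoutError:
    outcome = "failover aborted"
print(outcome)  # -> failover aborted
```

          The point of the sketch is the ordering: because the mailbox is serviced one request at a time, the query's deadline expires while cleanup is still blocked, regardless of how cheap the query itself is.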
          


          build-team Couchbase Build Team added a comment

          Build couchbase-server-7.1.0-2141 contains ns_server commit 0908e40 with commit message:
          MB-50442 release control back to dcp_sup from newly created
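          Judging from that commit message, the fix presumably stops the janitor_agent from waiting on the hanging connection teardown itself, instead handing the new DCP connections back to their supervisor (dcp_sup) so the kill completes asynchronously. A self-contained toy sketch of that idea (illustrative Python, not the actual Erlang change): the single-threaded agent delegates the slow teardown to a background thread and stays free to answer the failover's query within its budget.

```python
import queue
import threading
import time

class ToyAgent:
    """Serializes requests through one mailbox, like an Erlang gen_server."""
    def __init__(self):
        self._mailbox = queue.Queue()
        threading.Thread(target=self._loop, daemon=True).start()

    def _loop(self):
        while True:
            fn, reply = self._mailbox.get()
            reply.put(fn())

    def cast(self, fn):
        self._mailbox.put((fn, queue.Queue()))

    def call(self, fn, timeout):
        reply = queue.Queue()
        self._mailbox.put((fn, reply))
        try:
            return reply.get(timeout=timeout)
        except queue.Empty:
            raise TimeoutError('Failed to get failover info for bucket "default"')

def slow_dcp_teardown():
    time.sleep(0.5)  # still hangs on the larger network timeout

def cleanup_unfinished_rebalance():
    # Post-fix behaviour: delegate the hanging teardown to a supervisor
    # (dcp_sup's role in ns_server) and return without waiting on it.
    threading.Thread(target=slow_dcp_teardown, daemon=True).start()

agent = ToyAgent()
agent.cast(cleanup_unfinished_rebalance)  # now returns almost immediately
states = agent.call(lambda: {"active": 0, "replica": 0}, timeout=0.1)
print("failover proceeds with", states)
```

          The hanging sockets still take their full timeout to die, but they no longer sit on the agent's request path, so the failover's query is served in time.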

          ashwin.govindarajulu Ashwin Govindarajulu added a comment

          Still a crash during "ns_couchdb_chronicle_dup" on 7.1.0-2123

           

          Failover exited with reason {failover_failed,"default",
          "Failed to get failover info for bucket \"default\": ['ns_1@172.23.136.172']"}.
          Rebalance Operation Id = 9bd6fb838315c2cff340f29978d5da77

          =========================CRASH REPORT=========================
             crasher:
               initial call: ns_couchdb_chronicle_dup:init/1
               pid: <0.296.0>
               registered_name: ns_couchdb_chronicle_dup
               exception error: no match of right hand side value {error,not_found}
                 in function  ns_bucket:fetch_snapshot/3 (src/ns_bucket.erl, line 171)
                 in call from chronicle_compat:'-get_snapshot_with_revision/2-fun-0-'/3 (src/chronicle_compat.erl, line 241)
                 in call from lists:foldl/3 (lists.erl, line 1267)
                 in call from ns_couchdb_chronicle_dup:handle_call/3 (src/ns_couchdb_chronicle_dup.erl, line 73)
                 in call from gen_server:try_handle_call/4 (gen_server.erl, line 721)
                 in call from gen_server:handle_msg/6 (gen_server.erl, line 750)
               ancestors: [ns_couchdb_config_sup,ns_couchdb_sup,<0.92.0>]
               message_queue_len: 0
               messages: []
               links: [<0.285.0>,<0.297.0>]
               dictionary: []
               trap_exit: true
               status: running
               heap_size: 10958
               stack_size: 29
               reductions: 31581
             neighbours:
          [ns_server:debug,2022-01-27T22:23:35.122-08:00,couchdb_ns_1@cb.local:<0.297.0>:ns_pubsub:do_subscribe_link_continue:150]Parent process of subscription {{'kv-events','ns_1@cb.local'},<0.296.0>} exited with reason
            {{badmatch,{error,not_found}},
             [{ns_bucket,fetch_snapshot,3,[{file,"src/ns_bucket.erl"},{line,171}]},
              {chronicle_compat,'-get_snapshot_with_revision/2-fun-0-',3,[{file,"src/chronicle_compat.erl"},{line,241}]},
              {lists,foldl,3,[{file,"lists.erl"},{line,1267}]},
              {ns_couchdb_chronicle_dup,handle_call,3,[{file,"src/ns_couchdb_chronicle_dup.erl"},{line,73}]},
              {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,721}]},
              {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,750}]},
              {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]}
          

           

          Snapshot:   

          http://supportal.couchbase.com/snapshot/fd953fc2ecc4c2fd309d702bdacf3f07::0

          cbcollect logs:
          https://cb-engineering.s3.amazonaws.com/failover_failing/collectinfo-2022-01-28T095550-ns_1%40172.23.105.155.zip
          https://cb-engineering.s3.amazonaws.com/failover_failing/collectinfo-2022-01-28T095550-ns_1%40172.23.105.211.zip
          https://cb-engineering.s3.amazonaws.com/failover_failing/collectinfo-2022-01-28T095550-ns_1%40172.23.105.212.zip
          https://cb-engineering.s3.amazonaws.com/failover_failing/collectinfo-2022-01-28T095550-ns_1%40172.23.105.213.zip
          https://cb-engineering.s3.amazonaws.com/failover_failing/collectinfo-2022-01-28T095550-ns_1%40172.23.136.109.zip
          https://cb-engineering.s3.amazonaws.com/failover_failing/collectinfo-2022-01-28T095550-ns_1%40172.23.136.165.zip
          https://cb-engineering.s3.amazonaws.com/failover_failing/collectinfo-2022-01-28T095550-ns_1%40172.23.136.172.zip
           

          artem Artem Stemkovski added a comment - - edited

          The crash in ns_couchdb_chronicle_dup:init is a known issue; it should be benign and unrelated to the failover failure. The failover bug is fixed in 7.1.0-2141, so we can expect failover to still fail on 7.1.0-2123.

          There's already a ticket open for the ns_couchdb_chronicle_dup crash: https://issues.couchbase.com/browse/MB-50003


          ashwin.govindarajulu Ashwin Govindarajulu added a comment

          Validated the fix using 7.1.0-2150.

          Closing this ticket.

          People

            Assignee: ashwin.govindarajulu Ashwin Govindarajulu
            Reporter: ashwin.govindarajulu Ashwin Govindarajulu