Couchbase Server / MB-44686

[Upgrade] Swap-rebalance fails with reason "wait_seqno_persisted_failed"


    Details

      Description

       

      Base build: 6.0.0-1693-enterprise

      Target build: 7.0.0-4476-enterprise

      Scenario:

      • 4 node cluster (6.0.0-1693)
      • Couchbase bucket with replica=1
      • Load bucket until 70% DGM
      • Start online swap_upgrade to build 7.0.0-4476, one node at a time (a rough sketch of a single swap step follows below)
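
      For reference, a single swap step in this scenario is roughly equivalent to the
      following sketch against the standard cluster REST endpoints (/controller/addNode,
      /pools/default, /controller/rebalance). The node addresses, credentials and the use
      of the Python requests library are placeholders for illustration, not taken from the
      test itself.

          # Hypothetical sketch of one online swap step: add the upgraded node,
          # then rebalance the matching old node out. Hosts/credentials are placeholders.
          import requests

          CLUSTER = "http://172.23.105.244:8091"   # any node already in the cluster
          AUTH = ("Administrator", "password")     # assumed admin credentials

          def swap_one_node(node_in, node_out):
              # 1. Add the upgraded node to the cluster with only the kv service.
              requests.post(f"{CLUSTER}/controller/addNode", auth=AUTH,
                            data={"hostname": node_in, "user": AUTH[0],
                                  "password": AUTH[1], "services": "kv"}).raise_for_status()

              # 2. Collect the otpNode names of every node the cluster now knows about.
              nodes = requests.get(f"{CLUSTER}/pools/default", auth=AUTH).json()["nodes"]
              known = [n["otpNode"] for n in nodes]
              eject = [n["otpNode"] for n in nodes if node_out in n["otpNode"]]

              # 3. Start the swap rebalance: keep everything except the node being replaced.
              requests.post(f"{CLUSTER}/controller/rebalance", auth=AUTH,
                            data={"knownNodes": ",".join(known),
                                  "ejectedNodes": ",".join(eject)}).raise_for_status()

          # Example: the move shown in the table below (.211 swapped in, .212 out).
          swap_one_node("172.23.105.211", "172.23.105.212")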

      Observation:

      Rebalance failed during the following swap-rebalance:

      +----------------+----------+-----------------------+---------------+--------------+
      | Nodes          | Services | Version               | CPU (%)       | Status       |
      +----------------+----------+-----------------------+---------------+--------------+
      | 172.23.105.212 | kv       | 6.0.0-1693-enterprise | 3.0303030303  | --- OUT ---> |
      | 172.23.105.244 | kv       | 7.0.0-4476-enterprise | 20.0251256281 | Cluster node |
      | 172.23.105.155 | kv       | 7.0.0-4476-enterprise | 24.6067985794 | Cluster node |
      | 172.23.105.213 | kv       | 6.0.0-1693-enterprise | 3.26633165829 | Cluster node |
      | 172.23.105.211 | None     |                       |               | <--- IN ---  |
      +----------------+----------+-----------------------+---------------+--------------+

      Rebalance failure logs:

      Rebalance exited with reason {mover_crashed,
      {unexpected_exit, {'EXIT',<0.21110.3>,
      {{wait_seqno_persisted_failed,"default",723, 26728,
      [{'ns_1@172.23.105.211', {'EXIT',
      {socket_closed, {gen_server,call,
      [{'janitor_agent-default', 'ns_1@172.23.105.211'},
      {if_rebalance,<0.8319.3>,
      {wait_seqno_persisted,723,26728}}, infinity]}}}}]},
      [{ns_single_vbucket_mover, '-wait_seqno_persisted_many/5-fun-2-',5, [{file,"src/ns_single_vbucket_mover.erl"}, {line,488}]},
      {proc_lib,init_p,3, [{file,"proc_lib.erl"},{line,234}]}]}}}}.
       
      Rebalance Operation Id = f7f0922241e2777b2eefdde991763996
       
      Worker <0.21022.3> (for action {move,{723,
      ['ns_1@172.23.105.212', 'ns_1@172.23.105.213'],
      ['ns_1@172.23.105.211', 'ns_1@172.23.105.213'],
      []}}) exited with reason {unexpected_exit, {'EXIT', <0.21110.3>,
      {{wait_seqno_persisted_failed, "default", 723,26728,
      [{'ns_1@172.23.105.211', {'EXIT',
      {socket_closed, {gen_server, call,
      [{'janitor_agent-default', 'ns_1@172.23.105.211'},
      {if_rebalance, <0.8319.3>,
      {wait_seqno_persisted, 723, 26728}}, infinity]}}}}]},
      [{ns_single_vbucket_mover,
      '-wait_seqno_persisted_many/5-fun-2-', 5, [{file, "src/ns_single_vbucket_mover.erl"}, {line, 488}]},
      {proc_lib, init_p,3, [{file, "proc_lib.erl"}, {line, 234}]}]}}}

      TAF test:

      upgrade.durability_upgrade.UpgradeTests:
          test_upgrade,nodes_init=4,replicas=1,update_nodes=kv,num_items=50000,upgrade_type=online_swap,initial_version=6.0.0-1693,upgrade_with_data_load=False,skip_buckets_handle=True,durability=MAJORITY,upgrade_version=7.0.0-4476,log_level=debug,active_resident_threshold=70

       

        Attachments


          Activity

          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          Fresh logs including the rebalance_out node (172.23.105.213):

          https://cb-engineering.s3.amazonaws.com/upgrade_failed_log_2/collectinfo-2021-03-02T094211-ns_1%40172.23.105.155.zip
          https://cb-engineering.s3.amazonaws.com/upgrade_failed_log_2/collectinfo-2021-03-02T094211-ns_1%40172.23.105.211.zip
          https://cb-engineering.s3.amazonaws.com/upgrade_failed_log_2/collectinfo-2021-03-02T094211-ns_1%40172.23.105.212.zip
          https://cb-engineering.s3.amazonaws.com/upgrade_failed_log_2/collectinfo-2021-03-02T094211-ns_1%40172.23.105.244.zip
          https://cb-engineering.s3.amazonaws.com/upgrade_failed_log_2/collectinfo-2021-03-02T094211-ns_1%40172.23.105.213.zip
          https://cb-engineering.s3.amazonaws.com/upgrade_failed_log_2/manual_log_copy%40172.23.105.213.zip
          jwalker Jim Walker added a comment - edited

          Looking at the latest logs here

          I think KV is the one that initiates the disconnect, but only after a bucket delete (on the 6.0.0 node).

          Here's the trail I followed.

          diag.log says

          2021-03-02T01:37:44.348-08:00, ns_vbucket_mover:0:critical:message(ns_1@172.23.105.244) - Worker <0.8401.5> (for action {move,{726,
                                               ['ns_1@172.23.105.211',
                                                'ns_1@172.23.105.213'],
                                               ['ns_1@172.23.105.211',
                                                'ns_1@172.23.105.212'],
                                               []}}) exited with reason {unexpected_exit,
                                                                         {'EXIT',
                                                                          <0.8461.5>,
                                                                          {{wait_seqno_persisted_failed,
                                                                            "default",
                                                                            726,26672,
                                                                            [{'ns_1@172.23.105.212',
                                                                              {'EXIT',
                                                                               {socket_closed,
          ...
          

          So I moved on to the .212 logs (7.0.0-4476): did KV initiate a disconnect? The memcached log has nothing indicating that memcached on .212 initiated a disconnect.

          However, .212's ns_server.debug.log does have something suggesting that .213 closed the socket:

          [ns_server:error,2021-03-02T01:37:44.338-08:00,ns_1@172.23.105.212:<0.2355.0>:dcp_proxy:handle_info:117]Socket #Port<0.273> was closed. Closing myself. State = {state,#Port<0.273>,
                                                                   {producer,
                                                                    "replication:ns_1@172.23.105.213->ns_1@172.23.105.212:default",
                                                                    'ns_1@172.23.105.213',
                                                                    "default"},
                                                                   undefined,<<>>,
                                                                   dcp_producer_conn,[],
                                                                   #Port<0.272>,
                                                                   <0.2354.0>,true}
          

          So over to .213 (6.0.0-1693): at 01:37:44 memcached is indeed disconnecting connections, but only because its buckets are being deleted, which is what tears down the DCP connections (and is what .212 logs above).

          2021-03-02T01:37:44.329934-08:00 INFO Received shutdown request
          2021-03-02T01:37:44.332929-08:00 INFO Initiating graceful shutdown.
          2021-03-02T01:37:44.332941-08:00 INFO Stop all buckets
          2021-03-02T01:37:44.332945-08:00 INFO Scheduling delete for bucket default
          2021-03-02T01:37:44.333198-08:00 INFO Waiting for delete of default to complete
          2021-03-02T01:37:44.333241-08:00 INFO <none> Delete bucket [default]. Notifying all registered ON_DELETE_BUCKET callbacks
          2021-03-02T01:37:44.333313-08:00 INFO (default) Shutting down all DCP connections in preparation for bucket deletion.
          2021-03-02T01:37:44.333328-08:00 INFO (default) Shutting down dcp connections!
          

          Back on .212, ns_server.debug.log continues with messages about the socket closing and also says:

          [ns_server:debug,2021-03-02T01:37:44.351-08:00,ns_1@172.23.105.212:janitor_agent-default<0.8114.0>:dcp_sup:nuke:117]Nuking DCP replicators for bucket "default":
          [{'ns_1@172.23.105.155',<0.2033.0>},
           {'ns_1@172.23.105.211',<0.2009.0>},
           {'ns_1@172.23.105.244',<0.1998.0>}]
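
          (For anyone re-tracing this: the ordering above can be reconstructed mechanically by
          pulling the quoted lines out of each node's cbcollect bundle and sorting them by
          timestamp. A rough sketch in Python, with placeholder bundle paths; diag.log,
          ns_server.debug.log and memcached.log are the usual file names inside the bundles.)

              # Rough sketch: line up the events quoted above across the three nodes.
              # The bundle directories below are placeholders for wherever the logs were unpacked.
              import re
              from pathlib import Path

              SOURCES = [
                  ("cbcollect_172.23.105.244/diag.log",            r"wait_seqno_persisted_failed"),
                  ("cbcollect_172.23.105.212/ns_server.debug.log", r"Socket .* was closed"),
                  ("cbcollect_172.23.105.213/memcached.log",       r"Shutting down (all DCP|dcp) connections"),
              ]
              TS = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d+")

              events = []
              for path, pattern in SOURCES:
                  for line in Path(path).read_text(errors="ignore").splitlines():
                      if re.search(pattern, line) and TS.search(line):
                          events.append((TS.search(line).group(), path, line.strip()))

              # All three nodes log in the same timezone here, so a lexical sort on the
              # timestamp string is enough to show the .213 bucket shutdown coming first.
              for ts, path, line in sorted(events):
                  print(ts, path, line[:100], sep="  |  ")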
          

          jwalker Jim Walker added a comment -

          Passing to ns_server, as I can't determine whether there's a KV fault here; the analysis suggests the connections are disconnected first because the bucket is being deleted.

          lynn.straus Lynn Straus added a comment -

          Per review in the Cheshire Cat Ops meeting and per email updates, all remaining majors/minors/trivials are being moved out of Cheshire Cat on March 10. The purpose is to allow teams to focus on Critical bugs for the remainder of the release. Teams can "pull back" deferred bugs as schedule and bandwidth allow.

          Changing target fix version to CheshireCat.next and adding "deferred-from-Cheshire-Cat" label.

          dfinlay Dave Finlay added a comment -

          The rebalance fails because some process decided to shut down Couchbase Server on node .213. I suspect it's some kind of mix-up in the test scripts.

          Here's the rebalance error:

          [error_logger:error,2021-03-02T01:37:44.347-08:00,ns_1@172.23.105.244:<0.8439.5>:ale_error_logger_handler:do_log:107]
          =========================CRASH REPORT=========================
            crasher:
              initial call: ns_single_vbucket_mover:'-wait_seqno_persisted_many/5-fun-2-'/0
              pid: <0.8439.5>
              registered_name: []
              exception error: {wait_seqno_persisted_failed,"default",471,26552,
                                [{'ns_1@172.23.105.212',
                                  {'EXIT',
                                   {socket_closed,
                                    {gen_server,call,
                                     [{'janitor_agent-default','ns_1@172.23.105.212'},
                                      {if_rebalance,<0.29059.4>,
                                       {wait_seqno_persisted,471,26552}},
                                      infinity]}}}}]}
          

          This tells us that the janitor agent died on .212. However, prior to this error, we see that there are other sockets getting closed:

          [error_logger:error,2021-03-02T01:37:44.337-08:00,ns_1@172.23.105.244:dcp_replicator-default-ns_1@172.23.105.213<0.2733.0>:ale_error_logger_handler:do_log:107]
          =========================CRASH REPORT=========================
            crasher:
              initial call: dcp_replicator:init/1
              pid: <0.2733.0>
              registered_name: 'dcp_replicator-default-ns_1@172.23.105.213'
              exception exit: socket_closed
                in function  gen_server:handle_common_reply/8 (gen_server.erl, line 751)
              ancestors: ['dcp_sup-default','single_bucket_kv_sup-default',
                            ns_bucket_sup,ns_bucket_worker_sup,ns_server_sup,
                            ns_server_nodes_sup,<0.1220.0>,ns_server_cluster_sup,
                            root_sup,<0.140.0>]
              message_queue_len: 1
              messages: [{'EXIT',<0.2734.0>,killed}]
          

          This tells us that the process responsible for replicating from .213 to .244 died due to a socket being summarily closed.

          On .212 we indeed see the janitor agent dying:

          [error_logger:error,2021-03-02T01:37:44.349-08:00,ns_1@172.23.105.212:janitor_agent_sup-default<0.1826.0>:ale_error_logger_handler:do_log:107]
          =========================SUPERVISOR REPORT=========================
              supervisor: {local,'janitor_agent_sup-default'}
              errorContext: child_terminated
              reason: socket_closed
              offender: [{pid,<0.1828.0>},
                         {id,janitor_agent},
                         {mfargs,{janitor_agent,start_link,["default"]}},
                         {restart_type,permanent},
                         {shutdown,brutal_kill},
                         {child_type,worker}]
          

          And prior to this error on .212, we see other processes dying due to sockets closing on connections against .213:

          [error_logger:error,2021-03-02T01:37:44.338-08:00,ns_1@172.23.105.212:<0.2355.0>:ale_error_logger_handler:do_log:107]
          =========================CRASH REPORT=========================
            crasher:
              initial call: dcp_proxy:init/1
              pid: <0.2355.0>
              registered_name: []
              exception exit: socket_closed
                in function  gen_server:handle_common_reply/8 (gen_server.erl, line 751)
              ancestors: ['dcp_replicator-default-ns_1@172.23.105.213',
                            'dcp_sup-default','single_bucket_kv_sup-default',
                            ns_bucket_sup,ns_bucket_worker_sup,ns_server_sup,
                            ns_server_nodes_sup,<0.1260.0>,ns_server_cluster_sup,
                            root_sup,<0.138.0>]
          

          Essentially, everything points to .213 being the root of the problem. On .213 in the babysitter log we see:

          [ns_server:info,2021-03-02T01:37:44.293-08:00,babysitter_of_ns_1@127.0.0.1:<0.252.0>:ns_babysitter_bootstrap:stop:35]69770: got shutdown request. Terminating.
          ...
          [ns_server:debug,2021-03-02T01:37:44.324-08:00,babysitter_of_ns_1@127.0.0.1:<0.106.0>:ns_port_server:terminate:195]Shutting down port memcached
          

          So, something showed up in the middle of the rebalance and decided to shut down .213. That is why the rebalance failed.
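
          (A quick way to confirm that only .213 received such a shutdown request is to scan each
          bundle's babysitter log for the message quoted above. A minimal sketch, assuming the
          bundles are unpacked side by side and use the usual cbcollect file layout:)

              # Which nodes' babysitter logs contain an external shutdown request?
              # The glob pattern and ns_server.babysitter.log file name are assumptions
              # about the unpacked cbcollect layout, not taken from this ticket.
              from pathlib import Path

              for log in sorted(Path(".").glob("*/ns_server.babysitter.log")):
                  hits = [l for l in log.read_text(errors="ignore").splitlines()
                          if "got shutdown request" in l]
                  print(f"{log.parent.name}: {hits[-1] if hits else 'no shutdown request logged'}")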

          ritam.sharma Ritam Sharma added a comment -

          Will reopen if the issue occurs again.

          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          Not able to reproduce the issue with different sets of VMs.

          Will reopen once this is reproduced.

          Balakumaran.Gopal Balakumaran Gopal added a comment -

          Closing this based on the above comment.


            People

            Assignee:
            ashwin.govindarajulu Ashwin Govindarajulu
            Reporter:
            ashwin.govindarajulu Ashwin Govindarajulu
            Votes:
            0
            Watchers:
            6


                Gerrit Reviews

                There are no open Gerrit changes
