Couchbase Server / MB-37070

Rebalance failed with reason 'mover_crashed' in 'couch_set_view_group,process_monitor_partition_update'


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version: 6.5.0
    • Fix Version: 7.0.0
    • Component: view-engine
    • Environment: Enterprise Edition 6.5.0 build 4874
    • Triage: Untriaged
    • Operating System: Centos 64-bit
    • Is this a Regression?: Unknown

    Description

      Build: 6.5.0-4874

      Scenario:

      • 4 node cluster, Couchbase bucket with 1K items loaded
      • Hard-failover 1 node, then add back the failed-over node
      • Rebalance the cluster (a repro sketch follows after this list)
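
      For reference, the same failover / add-back / rebalance sequence can be driven directly against the cluster-management REST API. The module below is only a minimal repro sketch: the host, credentials, otpNode value and known-nodes list are placeholder assumptions, and it relies on the standard /controller/failOver, /controller/setRecoveryType and /controller/rebalance endpoints.

      -module(mb37070_repro).
      -export([run/0]).

      %% All of these values are illustrative assumptions; substitute the real
      %% cluster address, credentials and otpNode names.
      -define(HOST, "http://172.23.105.205:8091").
      -define(USER_PASS, "Administrator:password").
      -define(NODE, "ns_1@172.23.105.206").
      -define(KNOWN_NODES, "ns_1@172.23.105.205,ns_1@172.23.105.206").  % list every cluster node here

      run() ->
          application:ensure_all_started(inets),
          %% 1. Hard failover of one node.
          post("/controller/failOver", [{"otpNode", ?NODE}]),
          %% 2. Add the node back immediately with full recovery; the < 2 s gap
          %%    between failover and add-back is what triggers the race.
          post("/controller/setRecoveryType",
               [{"otpNode", ?NODE}, {"recoveryType", "full"}]),
          %% 3. Rebalance the cluster.
          post("/controller/rebalance",
               [{"knownNodes", ?KNOWN_NODES}, {"ejectedNodes", ""}]),
          ok.

      post(Path, Params) ->
          Auth = {"Authorization", "Basic " ++ base64:encode_to_string(?USER_PASS)},
          Body = uri_string:compose_query(Params),
          {ok, {{_, Code, _}, _, _}} =
              httpc:request(post,
                            {?HOST ++ Path, [Auth],
                             "application/x-www-form-urlencoded", Body},
                            [], []),
          io:format("POST ~s -> ~p~n", [Path, Code]).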

      Observation:

      Rebalance fails with the following reason:

      Worker <0.10489.1> (for action {move,{1020,
      ['ns_1@172.23.105.205',undefined],
      ['ns_1@172.23.105.206', 'ns_1@172.23.105.205'],
      []}}) exited with reason {unexpected_exit,
      {'EXIT', <0.10650.1>,
      {{error,
      {badrpc,
      {'EXIT',
      {{{{badmatch, {error, dcp_conn_closed}},
      [{couch_set_view_group, process_monitor_partition_update, 4,
      [{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"}, {line, 3725}]},
      {couch_set_view_group, handle_call, 3,
      [{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"}, {line, 934}]},
      {gen_server, try_handle_call, 4,
      [{file, "gen_server.erl"}, {line, 636}]},
      {gen_server, handle_msg, 6,
      [{file, "gen_server.erl"}, {line, 665}]},
      {proc_lib, init_p_do_apply, 3,
      [{file, "proc_lib.erl"}, {line, 247}]}]},
      {gen_server, call,
      [<12939.514.0>,
      {monitor_partition_update, 1020,
      #Ref<12939.176846531.2340945921.119382>,
      <12939.564.0>}, infinity]}},
      {gen_server, call,
      ['capi_set_view_manager-default',
      {wait_index_updated, 1020}, infinity]}}}}},
      {gen_server, call,
      [{'janitor_agent-default', 'ns_1@172.23.105.206'},
      {if_rebalance, <0.10461.1>,
      {wait_index_updated, 1020}}, infinity]}}}}
       
      Rebalance exited with reason {mover_crashed,
      {unexpected_exit,
      {'EXIT',<0.10650.1>,
      {{error,
      {badrpc,
      {'EXIT',
      {{{{badmatch,{error,dcp_conn_closed}},
      [{couch_set_view_group, process_monitor_partition_update,4,
      [{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"}, {line,3725}]},
      {couch_set_view_group,handle_call,3,
      [{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"}, {line,934}]},
      {gen_server,try_handle_call,4, [{file,"gen_server.erl"},{line,636}]},
      {gen_server,handle_msg,6, [{file,"gen_server.erl"},{line,665}]},
      {proc_lib,init_p_do_apply,3, [{file,"proc_lib.erl"},{line,247}]}]},
      {gen_server,call,
      [<12939.514.0>,
      {monitor_partition_update,1020,
      #Ref<12939.176846531.2340945921.119382>,
      <12939.564.0>}, infinity]}},
      {gen_server,call,
      ['capi_set_view_manager-default',
      {wait_index_updated,1020}, infinity]}}}}},
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@172.23.105.206'},
      {if_rebalance,<0.10461.1>,
      {wait_index_updated,1020}}, infinity]}}}}}.
      Rebalance Operation Id = f8d81e60511090035968902b399f19b6
      

      Cluster state: see the attached screenshot (image-2019-11-27-12-06-50-288.png).

      Testrunner case:

      rebalance.rebalance_progress.RebalanceProgressTests.test_progress_add_back_after_failover,nodes_init=4,nodes_out=1,GROUP=P1,blob_generator=false

      Note: Reproducibility is not consistent

       

      Attachments

        1. image-2019-11-27-12-06-50-288.png
          113 kB
          Ashwin Govindarajulu
        2. test.log
          69 kB
          Ashwin Govindarajulu


          Activity

            ankit.prabhu Ankit Prabhu added a comment -
            Failover started at 21:49:10, and after the failover the test adds the same node back in quick succession.

            2019-11-26 21:49:10 | INFO | MainProcess | Cluster_Thread | [task._failover_nodes] Failing over 172.23.105.206:8091 with graceful=False
            2019-11-26 21:49:10 | INFO | MainProcess | Cluster_Thread | [rest_client.fail_over] fail_over node ns_1@172.23.105.206 successful
            2019-11-26 21:49:10 | INFO | MainProcess | Cluster_Thread | [task.execute] 0 seconds sleep after failover, for nodes to go pending....
            2019-11-26 21:49:10 | INFO | MainProcess | test_thread | [rest_client.add_back_node] add_back_node ns_1@172.23.105.206 successful
            

            memcached then started closing all the DCP client connections (the bucket on the failed-over node is being deleted):

            2019-11-26T21:49:10.589179-08:00 INFO 58: The connected bucket is being deleted.. closing connection [ 127.0.0.1:48774 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.589546-08:00 INFO 62: The connected bucket is being deleted.. closing connection [ 127.0.0.1:34045 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593703-08:00 INFO 43 Delete bucket [default]. Engine ready for shutdown
            2019-11-26T21:49:10.593721-08:00 INFO 43 Delete bucket [default]. Wait for 8 clients to disconnect
            2019-11-26T21:49:10.593763-08:00 INFO 48: The connected bucket is being deleted.. closing connection [ 127.0.0.1:45388 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593788-08:00 INFO 49: The connected bucket is being deleted.. closing connection [ 127.0.0.1:57697 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593804-08:00 INFO 47: The connected bucket is being deleted.. closing connection [ 127.0.0.1:33886 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593826-08:00 INFO 50: The connected bucket is being deleted.. closing connection [ 127.0.0.1:42139 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593858-08:00 INFO 53: The connected bucket is being deleted.. closing connection [ 172.23.105.174:50998 - 172.23.105.206:11210 (<ud>cbadminbucket</ud>) ]
            2019-11-26T21:49:10.694072-08:00 INFO 45: [ 127.0.0.1:37711 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ] Updated global cluster configuration. New revision: 568
            2019-11-26T21:49:10.694298-08:00 INFO Pushing new cluster config for bucket:[] revision:[568]
            2019-11-26T21:49:10.704402-08:00 INFO 45: [ 127.0.0.1:37711 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ] Updated global cluster configuration. New revision: 570
            2019-11-26T21:49:10.704482-08:00 INFO Pushing new cluster config for bucket:[] revision:[570]
            2019-11-26T21:49:11.024484-08:00 INFO 43 Delete bucket [default]. Shut down the bucket
            

            The view engine received a closed-connection error; it waits for some time and then tries to restart the connection:

            [couchdb:error,2019-11-26T21:49:11.024-08:00,couchdb_ns_1@cb.local:<0.449.0>:couch_log:error:44]dcp client (default, mapreduce_view: default _design/default_view (prod/replica)): dcp receive worker failed due to reason: closed. Restarting dcp receive worker...
            [couchdb:error,2019-11-26T21:49:11.024-08:00,couchdb_ns_1@cb.local:<0.439.0>:couch_log:error:44]dcp client (default, mapreduce_view: default _design/default_view (prod/main)): dcp receive worker failed due to reason: closed. Restarting dcp receive worker...
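
            As a rough illustration of that behaviour, the restart amounts to trapping the worker's exit, logging it, sleeping briefly and spawning a fresh worker on a new connection. This is a self-contained sketch with assumed names and delays, not the actual couch_dcp_client code:

            -module(dcp_restart_sketch).
            -export([start/1]).

            -define(RETRY_DELAY_MS, 1000).   % illustrative back-off

            %% ConnectFun() must return {ok, Socket} for a passive-mode TCP socket.
            start(ConnectFun) ->
                process_flag(trap_exit, true),
                loop(ConnectFun).

            loop(ConnectFun) ->
                %% The worker owns the socket; it exits with reason `closed` when
                %% the peer (memcached) tears the connection down.
                Worker = spawn_link(fun() ->
                                            {ok, Socket} = ConnectFun(),
                                            receive_worker(Socket)
                                    end),
                receive
                    {'EXIT', Worker, Reason} ->
                        error_logger:error_msg(
                          "dcp receive worker failed due to reason: ~p. "
                          "Restarting dcp receive worker...~n", [Reason]),
                        timer:sleep(?RETRY_DELAY_MS),   % wait a little, then reconnect
                        loop(ConnectFun)
                end.

            receive_worker(Socket) ->
                case gen_tcp:recv(Socket, 0) of
                    {ok, _Frame}    -> receive_worker(Socket);   % a real client would decode the DCP frame
                    {error, closed} -> exit(closed)
                end.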
            

            Since the node was added back, a new group server for that design document is created on node .206, and it opens a new DCP client with the same DCP connection name:

            [couchdb:info,2019-11-26T21:49:11.973-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]main set view group `_design/default_view`, set `default` (prod), doc_fields_unused
            [couchdb:info,2019-11-26T21:49:11.974-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]Flow control buffer size is 20971520 bytes
            [couchdb:info,2019-11-26T21:49:11.975-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, linked PID <0.515.0> stopped normally
            [couchdb:info,2019-11-26T21:49:12.414-08:00,couchdb_ns_1@cb.local:<0.529.0>:couch_log:info:41]Started undefined replica (prod) set view group `default`, group `_design/default_view`,  signature `81b924d6fb2889b3227487b72756326d', view count: 3
            [couchdb:info,2019-11-26T21:49:12.414-08:00,couchdb_ns_1@cb.local:<0.529.0>:couch_log:info:41]replica set view group `_design/default_view`, set `default` (prod), doc_fields_unused
            [couchdb:info,2019-11-26T21:49:12.415-08:00,couchdb_ns_1@cb.local:<0.529.0>:couch_log:info:41]Flow control buffer size is 20971520 bytes
            

            Now, when the old gen_server wakes up after a few seconds' delay, it tries to connect with the same DCP name, and memcached disconnects the existing connection with that name (in this case, the one belonging to the new group server):

            2019-11-26T21:49:13.037513-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/main) - Disconnecting existing Dcp Producer 0x7f94f8ede500 as it has the same name as a new connection 0x7f952c9cd3c0
            2019-11-26T21:49:13.037565-08:00 INFO 55: DCP connection opened successfully. PRODUCER, INCLUDE_XATTRS [ 127.0.0.1:58637 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:13.037589-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/replica) - Disconnecting existing Dcp Producer 0x7f952c9d00c0 as it has the same name as a new connection 0x7f952cb50500
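
            The name collision above happens because the DCP connection name is derived purely from the bucket, the design document and the group type, so the old group server and the newly started one generate exactly the same name. The helper below is only a sketch of that shape (the real builder lives in the view-engine source; the function name is an assumption):

            -module(dcp_name_sketch).
            -export([conn_name/3]).

            %% conn_name(<<"default">>, <<"_design/default_view">>, main) ->
            %%     <<"mapreduce_view: default _design/default_view (prod/main)">>
            %% i.e. the name seen in both the couchdb and memcached logs above.
            conn_name(SetName, DDocId, Category) ->
                iolist_to_binary(["mapreduce_view: ", SetName, " ", DDocId,
                                  " (prod/", atom_to_list(Category), ")"]).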
            

            Since the vbucket monitor is listening on the now-closed DCP client (the one created by the new group server), it fails with dcp_conn_closed. This issue happened because the failover and the add-back of the same node occurred within 2 seconds.
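
            The crash itself is the badmatch visible in the rebalance error: while handling monitor_partition_update, the group server pattern-matches on an {ok, ...} reply from its DCP client, but the closed client returns {error, dcp_conn_closed} instead. The module below is a self-contained sketch (assumed names, not the real couch_set_view_group code) that reproduces the same exception shape:

            -module(monitor_badmatch_sketch).
            -export([demo/0]).

            %% Stand-in for the DCP client call; it returns the error a closed
            %% connection produces in this bug.
            get_partition_seq(_DcpPid, _PartId) ->
                {error, dcp_conn_closed}.

            %% The monitor registration insists on {ok, Seq}; with a closed DCP
            %% client the match fails and the calling gen_server crashes with
            %% {badmatch, {error, dcp_conn_closed}}, which the rebalancer then
            %% reports as mover_crashed.
            monitor_partition_update(DcpPid, PartId) ->
                {ok, Seq} = get_partition_seq(DcpPid, PartId),
                {monitoring, PartId, Seq}.

            demo() ->
                %% returns {'EXIT', {{badmatch, {error, dcp_conn_closed}}, StackTrace}}
                catch monitor_partition_update(self(), 1020).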

            jeelan.poola Jeelan Poola added a comment -

            Deferring this to CC as the issue should be transient, and it is unlikely that failing over and adding back the same node within 2 seconds will happen in production environments. The fix seems simple and will be targeted for 6.5.1.


            build-team Couchbase Build Team added a comment -
            Build couchbase-server-7.0.0-4765 contains couchdb commit b47c47a with commit message:
            MB-37070 : Address corner case in toplogy change resulting in duplicate DCP clients


            abhishek.jindal Abhishek Jindal added a comment -
            With the fix above, the view engine won't attempt to recreate an unnecessary DCP client if it receives a vbucket_stream_not_found response corresponding to a select_bucket request. This takes care of the condition described in this ticket, where we ended up overwriting the existing DCP producer stream.
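
            In other words, the restart path now checks the select_bucket reply before re-creating the client. The following is a self-contained sketch with assumed names (not the actual commit) of that guard:

            -module(dcp_fix_sketch).
            -export([maybe_restart/1]).

            %% select_bucket/1 and restart_worker/1 are stand-ins for the real
            %% DCP-client calls; the atoms are purely illustrative.
            select_bucket(superseded_connection) -> {error, vbucket_stream_not_found};
            select_bucket(_Conn)                 -> ok.

            restart_worker(Conn) -> {restarted, Conn}.

            %% With the fix, a vbucket_stream_not_found reply to select_bucket means a
            %% newer group server already owns this connection name, so the old client
            %% skips the restart instead of opening a duplicate, same-named DCP
            %% producer that would knock out the live one.
            maybe_restart(Conn) ->
                case select_bucket(Conn) of
                    ok ->
                        restart_worker(Conn);
                    {error, vbucket_stream_not_found} ->
                        skip_restart
                end.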

            Ashwin Govindarajulu, although I have tested the hard failover followed by an immediate (within 2 seconds) full recovery scenario discussed in this ticket in my local environment, please feel free to test the same with this fix.

            Resolving for the time being.


            ashwin.govindarajulu Ashwin Govindarajulu added a comment -
            Not seeing this issue while testing with "Enterprise Edition 7.0.0 build 4907".

            Closing as verified.


            People

              Assignee: Ashwin Govindarajulu
              Reporter: Ashwin Govindarajulu
              Votes: 0
              Watchers: 12
