
MB-37070: Rebalance failed with reason 'mover_crashed' in 'couch_set_view_group,process_monitor_partition_update'


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version: 6.5.0
    • Fix Version: 7.0.0
    • Component: view-engine
    • Environment: Enterprise Edition 6.5.0 build 4874
    • Triage: Untriaged
    • Operating System: Centos 64-bit
    • Is this a Regression?: Unknown

    Description

      Build: 6.5.0-4874

      Scenario:

      • 4-node cluster, Couchbase bucket with 1K items loaded
      • Hard-failover 1 node, then add back the failed-over node
      • Rebalance the cluster (a REST-level sketch of these steps follows)
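
      For reference, these steps can be driven straight against the ns_server REST API. A minimal sketch follows; the credentials, the remaining otpNode names, and the use of setRecoveryType for the add-back are assumptions here (the test itself goes through testrunner's rest_client):

        import requests

        AUTH = ("Administrator", "password")      # assumption: admin credentials
        ORCH = "http://172.23.105.205:8091"       # any node that stays in the cluster
        FAILED_NODE = "ns_1@172.23.105.206"       # the node being hard-failed-over
        KNOWN_NODES = ",".join([
            "ns_1@172.23.105.205",
            "ns_1@172.23.105.206",
            # plus the otpNode names of the other two cluster nodes
        ])

        # 1. Hard failover (graceful=False in the test log)
        requests.post(ORCH + "/controller/failOver", auth=AUTH,
                      data={"otpNode": FAILED_NODE}).raise_for_status()

        # 2. Add the node back immediately with full recovery; the <2 s gap
        #    between this call and the failover is what triggers the race.
        requests.post(ORCH + "/controller/setRecoveryType", auth=AUTH,
                      data={"otpNode": FAILED_NODE,
                            "recoveryType": "full"}).raise_for_status()

        # 3. Rebalance the cluster
        requests.post(ORCH + "/controller/rebalance", auth=AUTH,
                      data={"knownNodes": KNOWN_NODES,
                            "ejectedNodes": ""}).raise_for_status()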

      Observation:

      Seeing a rebalance failure with the following reason:

      Worker <0.10489.1> (for action {move,{1020,
      ['ns_1@172.23.105.205',undefined],
      ['ns_1@172.23.105.206', 'ns_1@172.23.105.205'],
      []}}) exited with reason {unexpected_exit,
      {'EXIT', <0.10650.1>,
      {{error,
      {badrpc,
      {'EXIT',
      {{{{badmatch, {error, dcp_conn_closed}},
      [{couch_set_view_group, process_monitor_partition_update, 4,
      [{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"}, {line, 3725}]},
      {couch_set_view_group, handle_call, 3,
      [{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"}, {line, 934}]},
      {gen_server, try_handle_call, 4,
      [{file, "gen_server.erl"}, {line, 636}]},
      {gen_server, handle_msg, 6,
      [{file, "gen_server.erl"}, {line, 665}]},
      {proc_lib, init_p_do_apply, 3,
      [{file, "proc_lib.erl"}, {line, 247}]}]},
      {gen_server, call,
      [<12939.514.0>,
      {monitor_partition_update, 1020,
      #Ref<12939.176846531.2340945921.119382>,
      <12939.564.0>}, infinity]}},
      {gen_server, call,
      ['capi_set_view_manager-default',
      {wait_index_updated, 1020}, infinity]}}}}},
      {gen_server, call,
      [{'janitor_agent-default', 'ns_1@172.23.105.206'},
      {if_rebalance, <0.10461.1>,
      {wait_index_updated, 1020}}, infinity]}}}}
       
      Rebalance exited with reason {mover_crashed,
      {unexpected_exit,
      {'EXIT',<0.10650.1>,
      {{error,
      {badrpc,
      {'EXIT',
      {{{{badmatch,{error,dcp_conn_closed}},
      [{couch_set_view_group, process_monitor_partition_update,4,
      [{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"}, {line,3725}]},
      {couch_set_view_group,handle_call,3,
      [{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"}, {line,934}]},
      {gen_server,try_handle_call,4, [{file,"gen_server.erl"},{line,636}]},
      {gen_server,handle_msg,6, [{file,"gen_server.erl"},{line,665}]},
      {proc_lib,init_p_do_apply,3, [{file,"proc_lib.erl"},{line,247}]}]},
      {gen_server,call,
      [<12939.514.0>,
      {monitor_partition_update,1020,
      #Ref<12939.176846531.2340945921.119382>,
      <12939.564.0>}, infinity]}},
      {gen_server,call,
      ['capi_set_view_manager-default',
      {wait_index_updated,1020}, infinity]}}}}},
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@172.23.105.206'},
      {if_rebalance,<0.10461.1>,
      {wait_index_updated,1020}}, infinity]}}}}}.
      Rebalance Operation Id = f8d81e60511090035968902b399f19b6
      

      Cluster state:

      Testrunner case:

      rebalance.rebalance_progress.RebalanceProgressTests.test_progress_add_back_after_failover,nodes_init=4,nodes_out=1,GROUP=P1,blob_generator=false

      Note: the issue does not reproduce consistently.

       


          Activity

            ashwin.govindarajulu Ashwin Govindarajulu added a comment:

            Not seeing this issue while trying with "Enterprise Edition 7.0.0 build 4907".

            Closing as verified.


            abhishek.jindal Abhishek Jindal added a comment:

            With the fix above, the view engine will no longer attempt to recreate an unnecessary DCP client when it receives a vbucket_stream_not_found response to a select_bucket request. This takes care of the condition described in this ticket, where we ended up overwriting the existing DCP producer stream.
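
            For illustration only, the guard behaves roughly like the sketch below (the actual change is in couchdb's Erlang view-engine code; all names here are hypothetical):

              def on_dcp_error(error, request, restart_client):
                  # With the fix: a vbucket_stream_not_found response to a
                  # select_bucket request means another (newer) client already
                  # owns this connection name, so do NOT recreate the DCP
                  # client -- recreating it would overwrite the existing
                  # producer stream.
                  if error == "vbucket_stream_not_found" and request == "select_bucket":
                      return
                  restart_client()  # every other error keeps the old restart behaviour

              # e.g.:
              # on_dcp_error("vbucket_stream_not_found", "select_bucket",
              #              restart_client=lambda: None)   # no-op, client kept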

            Ashwin Govindarajulu: although I have tested the hard-failover followed by an immediate (within 2 seconds) full-recovery scenario discussed in this ticket on my local setup, please feel free to test the same with this fix.

            Resolving for the time being.


            build-team Couchbase Build Team added a comment:

            Build couchbase-server-7.0.0-4765 contains couchdb commit b47c47a with commit message:
            MB-37070: Address corner case in topology change resulting in duplicate DCP clients

            jeelan.poola Jeelan Poola added a comment:

            Deferring this to CC, as the issue should be transient and it is unlikely that failing over and adding back the same node within 2 seconds will happen in production environments. The fix seems simple and will be targeted for 6.5.1.


            ankit.prabhu Ankit Prabhu added a comment:

            Failover started at 21:49:10, and the test adds back the same node in quick succession:

            2019-11-26 21:49:10 | INFO | MainProcess | Cluster_Thread | [task._failover_nodes] Failing over 172.23.105.206:8091 with graceful=False
            2019-11-26 21:49:10 | INFO | MainProcess | Cluster_Thread | [rest_client.fail_over] fail_over node ns_1@172.23.105.206 successful
            2019-11-26 21:49:10 | INFO | MainProcess | Cluster_Thread | [task.execute] 0 seconds sleep after failover, for nodes to go pending....
            2019-11-26 21:49:10 | INFO | MainProcess | test_thread | [rest_client.add_back_node] add_back_node ns_1@172.23.105.206 successful
            

            memcached started removing all of the DCP clients:

            2019-11-26T21:49:10.589179-08:00 INFO 58: The connected bucket is being deleted.. closing connection [ 127.0.0.1:48774 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.589546-08:00 INFO 62: The connected bucket is being deleted.. closing connection [ 127.0.0.1:34045 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593703-08:00 INFO 43 Delete bucket [default]. Engine ready for shutdown
            2019-11-26T21:49:10.593721-08:00 INFO 43 Delete bucket [default]. Wait for 8 clients to disconnect
            2019-11-26T21:49:10.593763-08:00 INFO 48: The connected bucket is being deleted.. closing connection [ 127.0.0.1:45388 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593788-08:00 INFO 49: The connected bucket is being deleted.. closing connection [ 127.0.0.1:57697 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593804-08:00 INFO 47: The connected bucket is being deleted.. closing connection [ 127.0.0.1:33886 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593826-08:00 INFO 50: The connected bucket is being deleted.. closing connection [ 127.0.0.1:42139 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593858-08:00 INFO 53: The connected bucket is being deleted.. closing connection [ 172.23.105.174:50998 - 172.23.105.206:11210 (<ud>cbadminbucket</ud>) ]
            2019-11-26T21:49:10.694072-08:00 INFO 45: [ 127.0.0.1:37711 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ] Updated global cluster configuration. New revision: 568
            2019-11-26T21:49:10.694298-08:00 INFO Pushing new cluster config for bucket:[] revision:[568]
            2019-11-26T21:49:10.704402-08:00 INFO 45: [ 127.0.0.1:37711 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ] Updated global cluster configuration. New revision: 570
            2019-11-26T21:49:10.704482-08:00 INFO Pushing new cluster config for bucket:[] revision:[570]
            2019-11-26T21:49:11.024484-08:00 INFO 43 Delete bucket [default]. Shut down the bucket
            

            The view engine received the closed-connection error; it waits for some time and then tries to restart the connection:

            [couchdb:error,2019-11-26T21:49:11.024-08:00,couchdb_ns_1@cb.local:<0.449.0>:couch_log:error:44]dcp client (default, mapreduce_view: default _design/default_view (prod/replica)): dcp receive worker failed due to reason: closed. Restarting dcp receive worker...
            [couchdb:error,2019-11-26T21:49:11.024-08:00,couchdb_ns_1@cb.local:<0.439.0>:couch_log:error:44]dcp client (default, mapreduce_view: default _design/default_view (prod/main)): dcp receive worker failed due to reason: closed. Restarting dcp receive worker...
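
            Roughly, the receive worker behaves like the loop below (a sketch; the names and the retry delay are illustrative, the real worker lives in couchdb's Erlang DCP client code). The important detail is that it always reconnects under the same fixed connection name:

              import socket
              import time

              def dcp_receive_worker(host, port, conn_name, retry_delay_s=2.0):
                  while True:
                      sock = socket.create_connection((host, port))
                      try:
                          # (sketch) a DCP OPEN would be sent here, always
                          # carrying the same `conn_name`
                          while sock.recv(4096):   # consume until the peer closes
                              pass
                      finally:
                          sock.close()
                      time.sleep(retry_delay_s)    # back off, then restart the worker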
            

            Since the node was added back, a new group server for that ddoc is created on node .206, and it opens a new DCP client with the same DCP connection name:

            [couchdb:info,2019-11-26T21:49:11.973-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]main set view group `_design/default_view`, set `default` (prod), doc_fields_unused
            [couchdb:info,2019-11-26T21:49:11.974-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]Flow control buffer size is 20971520 bytes
            [couchdb:info,2019-11-26T21:49:11.975-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, linked PID <0.515.0> stopped normally
            [couchdb:info,2019-11-26T21:49:12.414-08:00,couchdb_ns_1@cb.local:<0.529.0>:couch_log:info:41]Started undefined replica (prod) set view group `default`, group `_design/default_view`,  signature `81b924d6fb2889b3227487b72756326d', view count: 3
            [couchdb:info,2019-11-26T21:49:12.414-08:00,couchdb_ns_1@cb.local:<0.529.0>:couch_log:info:41]replica set view group `_design/default_view`, set `default` (prod), doc_fields_unused
            [couchdb:info,2019-11-26T21:49:12.415-08:00,couchdb_ns_1@cb.local:<0.529.0>:couch_log:info:41]Flow control buffer size is 20971520 bytes
            

            Now, when the old gen_server wakes up after a few seconds' delay, it tries to connect with the same DCP name, and memcached disconnects the existing connection with that name, which in this case belongs to the new group server:

            2019-11-26T21:49:13.037513-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/main) - Disconnecting existing Dcp Producer 0x7f94f8ede500 as it has the same name as a new connection 0x7f952c9cd3c0
            2019-11-26T21:49:13.037565-08:00 INFO 55: DCP connection opened successfully. PRODUCER, INCLUDE_XATTRS [ 127.0.0.1:58637 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:13.037589-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/replica) - Disconnecting existing Dcp Producer 0x7f952c9d00c0 as it has the same name as a new connection 0x7f952cb50500
            

            Since the vbucket monitor is listening on the DCP client that just got closed (the one owned by the newly created group server), it fails with dcp_conn_closed. This issue happened because the failover and the add-back of the same node occurred within 2 seconds.
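
            The name-collision policy visible in the memcached log can be modelled with a toy sketch (illustrative only, not memcached code): opening a producer under an already-registered name drops the existing connection, which is how the old gen_server's delayed reconnect kills the new group server's stream:

              class Conn:
                  def __init__(self, owner):
                      self.owner, self.open = owner, True

              producers = {}  # connection name -> live producer connection

              def open_producer(name, owner):
                  existing = producers.get(name)
                  if existing is not None:
                      existing.open = False  # "Disconnecting existing Dcp Producer ..."
                  producers[name] = Conn(owner)
                  return producers[name]

              NAME = "eq_dcpq:mapreduce_view: default _design/default_view (prod/main)"
              fresh = open_producer(NAME, owner="new group server")  # created on add-back
              stale = open_producer(NAME, owner="old gen_server")    # delayed reconnect
              assert fresh.open is False  # the monitor on `fresh` sees dcp_conn_closed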


            People

              Assignee: ashwin.govindarajulu Ashwin Govindarajulu
              Reporter: ashwin.govindarajulu Ashwin Govindarajulu
              Votes: 0
              Watchers: 12
