  Couchbase Server / MB-37070

Rebalance failed with reason 'mover_crashed' in 'couch_set_view_group,process_monitor_partition_update'


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version: 6.5.0
    • Fix Version: 7.0.0
    • Component: view-engine
    • Environment: Enterprise Edition 6.5.0 build 4874
    • Triage: Untriaged
    • Operating System: Centos 64-bit
    • Is this a Regression?: Unknown

    Description

      Build: 6.5.0-4874

      Scenario:

      • 4 node cluster, Couchbase bucket with 1K items loaded
      • Hard-failover 1 node, then add back the failed-over node
      • Rebalance the cluster

      Observation:

      Rebalance fails with the following reason:

      Worker <0.10489.1> (for action {move,{1020,
      ['ns_1@172.23.105.205',undefined],
      ['ns_1@172.23.105.206', 'ns_1@172.23.105.205'],
      []}}) exited with reason {unexpected_exit,
      {'EXIT', <0.10650.1>,
      {{error,
      {badrpc,
      {'EXIT',
      {{{{badmatch, {error, dcp_conn_closed}},
      [{couch_set_view_group, process_monitor_partition_update, 4,
      [{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"}, {line, 3725}]},
      {couch_set_view_group, handle_call, 3,
      [{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"}, {line, 934}]},
      {gen_server, try_handle_call, 4,
      [{file, "gen_server.erl"}, {line, 636}]},
      {gen_server, handle_msg, 6,
      [{file, "gen_server.erl"}, {line, 665}]},
      {proc_lib, init_p_do_apply, 3,
      [{file, "proc_lib.erl"}, {line, 247}]}]},
      {gen_server, call,
      [<12939.514.0>,
      {monitor_partition_update, 1020,
      #Ref<12939.176846531.2340945921.119382>,
      <12939.564.0>}, infinity]}},
      {gen_server, call,
      ['capi_set_view_manager-default',
      {wait_index_updated, 1020}, infinity]}}}}},
      {gen_server, call,
      [{'janitor_agent-default', 'ns_1@172.23.105.206'},
      {if_rebalance, <0.10461.1>,
      {wait_index_updated, 1020}}, infinity]}}}}
       
      Rebalance exited with reason {mover_crashed,
      {unexpected_exit,
      {'EXIT',<0.10650.1>,
      {{error,
      {badrpc,
      {'EXIT',
      {{{{badmatch,{error,dcp_conn_closed}},
      [{couch_set_view_group, process_monitor_partition_update,4,
      [{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"}, {line,3725}]},
      {couch_set_view_group,handle_call,3,
      [{file, "/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"}, {line,934}]},
      {gen_server,try_handle_call,4, [{file,"gen_server.erl"},{line,636}]},
      {gen_server,handle_msg,6, [{file,"gen_server.erl"},{line,665}]},
      {proc_lib,init_p_do_apply,3, [{file,"proc_lib.erl"},{line,247}]}]},
      {gen_server,call,
      [<12939.514.0>,
      {monitor_partition_update,1020,
      #Ref<12939.176846531.2340945921.119382>,
      <12939.564.0>}, infinity]}},
      {gen_server,call,
      ['capi_set_view_manager-default',
      {wait_index_updated,1020}, infinity]}}}}},
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@172.23.105.206'},
      {if_rebalance,<0.10461.1>,
      {wait_index_updated,1020}}, infinity]}}}}}.
      Rebalance Operation Id = f8d81e60511090035968902b399f19b6
      

      Cluster state:

      Testrunner case:

      rebalance.rebalance_progress.RebalanceProgressTests.test_progress_add_back_after_failover,nodes_init=4,nodes_out=1,GROUP=P1,blob_generator=false

      Note: Reproducibility is not consistent

       

          Activity

            ashwin.govindarajulu Ashwin Govindarajulu created issue -
            dfinlay Dave Finlay made changes -
            Priority: Major [ 3 ] → Critical [ 2 ]
            dfinlay Dave Finlay added a comment -

            The rebalance terminates because it looks like a DCP connection between KV and Views is closed on node .206.

            Prior to the rebalance .206 is failed over:

            2019-11-26T21:49:10.455-08:00, failover:0:info:message(ns_1@172.23.105.155) - Starting failing over ['ns_1@172.23.105.206']
            2019-11-26T21:49:10.455-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.155) - Starting failover of nodes ['ns_1@172.23.105.206']. Operation Id = de1a6d32db9156b48985f3bbb7446905
            2019-11-26T21:49:10.588-08:00, ns_memcached:0:info:message(ns_1@172.23.105.206) - Shutting down bucket "default" on 'ns_1@172.23.105.206' for deletion
            2019-11-26T21:49:10.669-08:00, failover:0:info:message(ns_1@172.23.105.155) - Failed over ['ns_1@172.23.105.206']: ok
            2019-11-26T21:49:10.681-08:00, failover:0:info:message(ns_1@172.23.105.155) - Deactivating failed over nodes ['ns_1@172.23.105.206']
            2019-11-26T21:49:10.686-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.155) - Failover completed successfully.
            

            Rebalance begins. .206 is added back with full recovery:

            2019-11-26T21:49:11.711-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.155) - Starting rebalance, KeepNodes = ['ns_1@172.23.105.155','ns_1@172.23.105.206',
                                             'ns_1@172.23.105.159','ns_1@172.23.105.205'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = f8d81e60511090035968902b399f19b6
            2019-11-26T21:49:11.812-08:00, ns_storage_conf:0:info:message(ns_1@172.23.105.206) - Deleting old data files of bucket "default"
            2019-11-26T21:49:11.880-08:00, ns_rebalancer:0:info:message(ns_1@172.23.105.155) - Started rebalancing bucket default
            2019-11-26T21:49:12.965-08:00, ns_vbucket_mover:0:info:message(ns_1@172.23.105.155) - Bucket "default" rebalance does not seem to be swap rebalance
            2019-11-26T21:49:15.936-08:00, ns_vbucket_mover:0:critical:message(ns_1@172.23.105.155) - Worker <0.10489.1> (for action {move,{1020,
                                                  ['ns_1@172.23.105.205',undefined],
                                                  ['ns_1@172.23.105.206',
                                                   'ns_1@172.23.105.205'],
                                                  []}}) exited with reason {unexpected_exit,
                                                                            {'EXIT',
                                                                             <0.10650.1>,
                                                                             {{error,
                                                                               {badrpc,
                                                                                {'EXIT',
                                                                                 {{{{badmatch,
                                                                                     {error,
                                                                                     dcp_conn_closed}},
                                                                                    [{couch_set_view_group,
                                                                                      process_monitor_partition_update,
            ...
            

            In the couchdb log we see that the DCP connection gets closed on the views side:

            [couchdb:info,2019-11-26T21:49:15.929-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, signature `81b924d6fb2889b3227487b72756326d`, term\
            inating with reason: <ud>{{badmatch,{error,dcp_conn_closed}},
                 [{couch_set_view_group,process_monitor_partition_update,4,
                                        [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},
                                         {line,3725}]},
                  {couch_set_view_group,handle_call,3,
                                        [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},
                                         {line,934}]},
                  {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,636}]},
                  {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,665}]},
                  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}</ud>
            

            This causes rebalance to fail as ns_server is waiting for the view index to be updated on vbucket 1020.

            In the memcached logs I see several of these traces where views connections get closed due to a name conflict:

            2019-11-26T21:49:13.037513-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/main) - Disconnecting existing Dcp Producer 0x7f94f8ede500 as it has the same name as a new connection 0x7f952c9cd3c0
            ...
            2019-11-26T21:49:13.037589-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/replica) - Disconnecting existing Dcp Producer 0x7f952c9d00c0 as it has the same name as a new connection 0x7f952cb50500
            ...
            2019-11-26T21:49:16.930186-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/replica) - Disconnecting existing Dcp Producer 0x7f952c9cd500 as it has the same name as a new connection 0x7f94f8ee1200
            

            But none of these are at the exact same instant that we see the errors in the couchdb log. I also see some of the following:

            2019-11-26T21:49:13.044232-08:00 WARNING 49: ship_dcp_log - step returned 10 - closing connection [ 127.0.0.1:46712 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:13.927625-08:00 WARNING 51: ship_dcp_log - step returned 10 - closing connection [ 127.0.0.1:37184 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            

            Note that these latter are unrelated to the views connection getting closed. "Step returned 10" appears to mean that the connection isn't bound to a bucket, which seems odd for a DCP connection. I'd be interested in the KV view on this.

            Ankit Prabhu, Jeelan Poola: can you look into this on the views side, see if you can extract more information from the logs, and add it to the ticket?

            Daniel Owen: once the Views folks have taken a look can you have someone from KV investigate too?

            If some more ns_server insight is required let me know.

            dfinlay Dave Finlay made changes -
            Component/s couchbase-bucket [ 10173 ]
            Component/s view-engine [ 10060 ]
            dfinlay Dave Finlay made changes -
            Assignee Dave Finlay [ dfinlay ] Ankit Prabhu [ ankit.prabhu ]
            drigby Dave Rigby added a comment -

            Thanks Dave Finlay.

            Recall that the view engine reconnects every 10s (it doesn't really "stream" DCP like other services due to the legacy of evolving from a disk-based approach), so closes / reconnects are expected to some degree. Additionally, given these constant stream open/closes, KV-Engine only logs DCP messages from view engine connections at WARN level or greater.
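            As an aside for readers unfamiliar with that polling pattern, a minimal sketch of a gen_server that periodically tears down and re-opens its DCP connection might look like the following (illustrative only; the 10-second interval is taken from the comment above, and the module name and open/close helpers are assumptions, not the actual couch_set_view / couch_dcp code):

            %% Illustrative sketch of a periodic reconnect loop. The module name and
            %% the open_dcp_connection/close_dcp_connection helpers are assumptions.
            -module(dcp_poller).
            -behaviour(gen_server).

            -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

            -define(RECONNECT_INTERVAL, 10000). % 10 seconds, as described above

            init(Args) ->
                Conn = open_dcp_connection(Args),
                erlang:send_after(?RECONNECT_INTERVAL, self(), reconnect),
                {ok, #{args => Args, conn => Conn}}.

            handle_info(reconnect, #{args := Args, conn := Old} = State) ->
                %% Close the previous connection and open a fresh one under the same
                %% connection name; KV sees this as a routine close/reopen cycle.
                ok = close_dcp_connection(Old),
                New = open_dcp_connection(Args),
                erlang:send_after(?RECONNECT_INTERVAL, self(), reconnect),
                {noreply, State#{conn := New}}.

            handle_call(_Request, _From, State) -> {reply, ok, State}.
            handle_cast(_Msg, State) -> {noreply, State}.

            %% Stubs so the sketch is self-contained.
            open_dcp_connection(_Args) -> make_ref().
            close_dcp_connection(_Conn) -> ok.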

            jeelan.poola Jeelan Poola added a comment -

            Currently busy with a CBSE. Shall look into this as soon as possible and revert.

            dfinlay Dave Finlay made changes -
            Component/s ns_server [ 10019 ]

            ben.huddleston Ben Huddleston added a comment -

            KV Investigation

            Dave Finlay, the message below means that step (of the DCP connection) returned Disconnect, which is the case when we disconnect a Producer because a new one was created with the same name.

            2019-11-26T21:49:13.927625-08:00 WARNING 51: ship_dcp_log - step returned 10 - closing connection [ 127.0.0.1:37184 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            

            From some digging with Dave Rigby we made the following observations about views connections:

            1. They don't HELO
            2. They authenticate with the name ns_server
            3. They appear in mcstat connections without an agent_name

            Looking at node .206: some interesting logs from ns_couchdb.log. Unfortunately we don't have port numbers, so we can't match up exactly with the memcached logs. At 21:49:15.07 we see a timeout after 2 seconds, which would take us back to 21:49:13.07.

            [couchdb:error,2019-11-26T21:49:13.927-08:00,couchdb_ns_1@cb.local:<0.522.0>:couch_log:error:44]dcp client (default, mapreduce_view: default _design/default_view (prod/main)): dcp receive worker failed due to reason: closed. Restarting dcp receive worker...
            [couchdb:error,2019-11-26T21:49:14.928-08:00,couchdb_ns_1@cb.local:<0.537.0>:couch_log:error:44]dcp client (default, mapreduce_view: default _design/default_view (prod/replica)): dcp receive worker failed due to reason: closed. Restarting dcp receive worker...
            [couchdb:error,2019-11-26T21:49:15.071-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:error:44]dcp client (<0.522.0>): dcp all seqs timed out after 2.0 seconds. Waiting...
            [couchdb:info,2019-11-26T21:49:15.929-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, signature `81b924d6fb2889b3227487b72756326d`, terminating with reason: <ud>{{badmatch,{error,dcp_conn_closed}},
                 [{couch_set_view_group,process_monitor_partition_update,4,
                                        [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},
                                         {line,3725}]},
                  {couch_set_view_group,handle_call,3,
                                        [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},
                                         {line,934}]},
                  {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,636}]},
                  {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,665}]},
                  {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}</ud>
            

            In the memcached logs we have the following interesting entries. It looks like our connection is being closed by the other side, which (since we didn't receive a HELO) is views. This caused us to remove our old connection.

            2019-11-26T21:49:13.036886-08:00 INFO 54: Client 127.0.0.1:53114 authenticated as <ud>@ns_server</ud>
            2019-11-26T21:49:13.037589-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/replica) - Disconnecting existing Dcp Producer 0x7f952c9d00c0 as it has the same name as a new connection 0x7f952cb50500
            2019-11-26T21:49:13.037620-08:00 INFO 54: DCP connection opened successfully. PRODUCER, INCLUDE_XATTRS [ 127.0.0.1:53114 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:13.038017-08:00 INFO 54: Failed to send data; peer closed the connection
            2019-11-26T21:49:13.038020-08:00 INFO 54 - Hard error, closing connection
            

            Another interesting connection at the same time.

            2019-11-26T21:49:13.036833-08:00 INFO 55: Client 127.0.0.1:58637 authenticated as <ud>@ns_server</ud>
            2019-11-26T21:49:13.037513-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/main) - Disconnecting existing Dcp Producer 0x7f94f8ede500 as it has the same name as a new connection 0x7f952c9cd3c0
            2019-11-26T21:49:13.037565-08:00 INFO 55: DCP connection opened successfully. PRODUCER, INCLUDE_XATTRS [ 127.0.0.1:58637 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:13.037837-08:00 INFO 55 Closing connection [ 127.0.0.1:58637 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ] due to read error: Connection reset by peer
            

            We have other connection-closed errors in our logs for node .206 too. This one corresponds to a similar event in the ns_couchdb log, but there are others that do not.

            2019-11-26T21:46:32.098574-08:00 INFO 62: Failed to send data; peer closed the connection
            2019-11-26T21:46:32.098589-08:00 INFO 62 - Hard error, closing connection
            

            Also found this message in ns_server.views.log, which I think correlates with the rebalance failure.

            [views:debug,2019-11-26T21:49:15.971-08:00,couchdb_ns_1@cb.local:capi_set_view_manager-default<0.582.0>:capi_set_view_manager:maybe_define_group:299]
            couch_set_view:define_group(mapreduce_view, <<"default">>, <<"_design/default_view">>, [{set_view_params,1024,[],[],true}]) raised
            throw:view_already_defined
            

            To conclude: views connection management is a bit non-ideal, it looks like they are closing connections to us improperly, and from these logs it doesn't look like KV did anything wrong.

            drigby Dave Rigby made changes -
            Component/s couchbase-bucket [ 10173 ]

            ankit.prabhu Ankit Prabhu added a comment -

            Failover started at 21:49:10, and after the failover the test adds back the same node in quick succession.

            2019-11-26 21:49:10 | INFO | MainProcess | Cluster_Thread | [task._failover_nodes] Failing over 172.23.105.206:8091 with graceful=False
            2019-11-26 21:49:10 | INFO | MainProcess | Cluster_Thread | [rest_client.fail_over] fail_over node ns_1@172.23.105.206 successful
            2019-11-26 21:49:10 | INFO | MainProcess | Cluster_Thread | [task.execute] 0 seconds sleep after failover, for nodes to go pending....
            2019-11-26 21:49:10 | INFO | MainProcess | test_thread | [rest_client.add_back_node] add_back_node ns_1@172.23.105.206 successful
            

            memcached started disconnecting all the DCP clients.

            2019-11-26T21:49:10.589179-08:00 INFO 58: The connected bucket is being deleted.. closing connection [ 127.0.0.1:48774 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.589546-08:00 INFO 62: The connected bucket is being deleted.. closing connection [ 127.0.0.1:34045 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593703-08:00 INFO 43 Delete bucket [default]. Engine ready for shutdown
            2019-11-26T21:49:10.593721-08:00 INFO 43 Delete bucket [default]. Wait for 8 clients to disconnect
            2019-11-26T21:49:10.593763-08:00 INFO 48: The connected bucket is being deleted.. closing connection [ 127.0.0.1:45388 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593788-08:00 INFO 49: The connected bucket is being deleted.. closing connection [ 127.0.0.1:57697 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593804-08:00 INFO 47: The connected bucket is being deleted.. closing connection [ 127.0.0.1:33886 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593826-08:00 INFO 50: The connected bucket is being deleted.. closing connection [ 127.0.0.1:42139 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:10.593858-08:00 INFO 53: The connected bucket is being deleted.. closing connection [ 172.23.105.174:50998 - 172.23.105.206:11210 (<ud>cbadminbucket</ud>) ]
            2019-11-26T21:49:10.694072-08:00 INFO 45: [ 127.0.0.1:37711 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ] Updated global cluster configuration. New revision: 568
            2019-11-26T21:49:10.694298-08:00 INFO Pushing new cluster config for bucket:[] revision:[568]
            2019-11-26T21:49:10.704402-08:00 INFO 45: [ 127.0.0.1:37711 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ] Updated global cluster configuration. New revision: 570
            2019-11-26T21:49:10.704482-08:00 INFO Pushing new cluster config for bucket:[] revision:[570]
            2019-11-26T21:49:11.024484-08:00 INFO 43 Delete bucket [default]. Shut down the bucket
            

            The view engine received a closed-connection error; it will wait for some time and try to restart the connection.

            [couchdb:error,2019-11-26T21:49:11.024-08:00,couchdb_ns_1@cb.local:<0.449.0>:couch_log:error:44]dcp client (default, mapreduce_view: default _design/default_view (prod/replica)): dcp receive worker failed due to reason: closed. Restarting dcp receive worker...
            [couchdb:error,2019-11-26T21:49:11.024-08:00,couchdb_ns_1@cb.local:<0.439.0>:couch_log:error:44]dcp client (default, mapreduce_view: default _design/default_view (prod/main)): dcp receive worker failed due to reason: closed. Restarting dcp receive worker...
            

            Since we added back the node, a new group server for that ddoc is created on node .206, and it opens a new DCP client with the same DCP name.

            [couchdb:info,2019-11-26T21:49:11.973-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]main set view group `_design/default_view`, set `default` (prod), doc_fields_unused
            [couchdb:info,2019-11-26T21:49:11.974-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]Flow control buffer size is 20971520 bytes
            [couchdb:info,2019-11-26T21:49:11.975-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, linked PID <0.515.0> stopped normally
            [couchdb:info,2019-11-26T21:49:12.414-08:00,couchdb_ns_1@cb.local:<0.529.0>:couch_log:info:41]Started undefined replica (prod) set view group `default`, group `_design/default_view`,  signature `81b924d6fb2889b3227487b72756326d', view count: 3
            [couchdb:info,2019-11-26T21:49:12.414-08:00,couchdb_ns_1@cb.local:<0.529.0>:couch_log:info:41]replica set view group `_design/default_view`, set `default` (prod), doc_fields_unused
            [couchdb:info,2019-11-26T21:49:12.415-08:00,couchdb_ns_1@cb.local:<0.529.0>:couch_log:info:41]Flow control buffer size is 20971520 bytes
            

            Now, when the old gen_server wakes up after a few seconds' delay, it tries to connect with the same DCP name, and memcached disconnects the existing connection (in this case, the one belonging to the new group server).

            2019-11-26T21:49:13.037513-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/main) - Disconnecting existing Dcp Producer 0x7f94f8ede500 as it has the same name as a new connection 0x7f952c9cd3c0
            2019-11-26T21:49:13.037565-08:00 INFO 55: DCP connection opened successfully. PRODUCER, INCLUDE_XATTRS [ 127.0.0.1:58637 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
            2019-11-26T21:49:13.037589-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/replica) - Disconnecting existing Dcp Producer 0x7f952c9d00c0 as it has the same name as a new connection 0x7f952cb50500
            

            Since the vbucket monitor is listening on the now-closed DCP client (the one belonging to the newly created group server), it fails with dcp_conn_closed. This issue happened because the failover and the add-back of the same node happened within 2 seconds.
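            To make the collision concrete: the DCP connection name is derived only from the bucket, the design document and the group type, so the old (pre-failover) group server and the newly created one compute exactly the same name, and whichever connects second evicts the other's producer inside memcached. A rough sketch of that determinism (illustrative only; the module and function below are assumptions, not the actual couch_dcp_client code, though the formatted name mirrors the one in the memcached log above):

            %% Illustrative only: two view-group processes building the same DCP
            %% connection name. The module and name/3 function are assumptions.
            -module(dcp_conn_name).
            -export([name/3, demo/0]).

            %% e.g. name(<<"default">>, <<"_design/default_view">>, main) ->
            %%   <<"mapreduce_view: default _design/default_view (prod/main)">>
            name(Bucket, DDocId, Type) ->
                iolist_to_binary(io_lib:format("mapreduce_view: ~s ~s (prod/~s)",
                                               [Bucket, DDocId, Type])).

            demo() ->
                %% Old group server, still winding down after the failover ...
                Old = name(<<"default">>, <<"_design/default_view">>, main),
                %% ... and the new group server created on add-back: same name, so the
                %% later connection causes memcached to drop the earlier producer
                %% ("Disconnecting existing Dcp Producer ... same name ...").
                New = name(<<"default">>, <<"_design/default_view">>, main),
                Old =:= New.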

            jeelan.poola Jeelan Poola added a comment -

            Deferring this to CC (Cheshire-Cat) as the issue should be transient, and it is unlikely that failing over and adding back the same node within 2 seconds will happen in production environments. The fix seems simple and will be targeted for 6.5.1.

            jeelan.poola Jeelan Poola made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]
            Fix Version/s Mad-Hatter [ 15037 ]
            ritam.sharma Ritam Sharma made changes -
            Remote Link This issue links to "Page (Couchbase, Inc. Wiki)" [ 18946 ]
            jeelan.poola Jeelan Poola made changes -
            Assignee Ankit Prabhu [ ankit.prabhu ] Abhishek Jindal [ abhishek.jindal ]

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-4765 contains couchdb commit b47c47a with commit message:
            MB-37070 : Address corner case in toplogy change resulting in duplicate DCP clients


            abhishek.jindal Abhishek Jindal added a comment -

            With the fix ^, the view engine won't attempt to recreate an unnecessary DCP client if it receives a vbucket_stream_not_found response corresponding to a select_bucket request. This takes care of the condition described in this case, where we ended up overwriting the existing DCP producer stream.
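            A rough sketch of the shape of such a guard (illustrative only; the real change is couchdb commit b47c47a, and the function, message and helper names below are assumptions, not the actual code):

            %% Illustrative sketch: skip recreating the DCP client when select_bucket
            %% reports vbucket_stream_not_found. All names here are assumptions.
            -module(dcp_client_guard).
            -export([maybe_restart_dcp_client/2]).

            maybe_restart_dcp_client(Bucket, ConnName) ->
                case select_bucket(Bucket) of
                    ok ->
                        {ok, start_dcp_client(Bucket, ConnName)};
                    {error, vbucket_stream_not_found} ->
                        %% The bucket is not (yet) usable on this node, e.g. it was just
                        %% failed over; do not recreate the client, otherwise the
                        %% duplicate connection name would evict the healthy producer.
                        {error, not_restarted};
                    {error, Reason} ->
                        {error, Reason}
                end.

            %% Stubs so the sketch is self-contained.
            select_bucket(_Bucket) -> ok.
            start_dcp_client(_Bucket, ConnName) -> {dcp_client, ConnName}.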

            Ashwin Govindarajulu: although I have tested locally the hard failover followed by an immediate (within 2 seconds) full recovery scenario discussed in this case, please feel free to test the same with this fix.

            Resolving for the time being.

            abhishek.jindal Abhishek Jindal made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            mihir.kamdar Mihir Kamdar (Inactive) made changes -
            Assignee Abhishek Jindal [ abhishek.jindal ] Ashwin Govindarajulu [ ashwin.govindarajulu ]

            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Not seeing this issue while trying with "Enterprise Edition 7.0.0 build 4907".

            Closing as verified.

            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s 7.0.0 [ 17233 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]

            People

              Assignee: Ashwin Govindarajulu
              Reporter: Ashwin Govindarajulu
              Votes: 0
              Watchers: 12
