Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version: 6.5.0
- Environment: Enterprise Edition 6.5.0 build 4874
- Triage: Untriaged
- Operating System: Centos 64-bit
- Is this a Regression?: Unknown
Description
Build: 6.5.0-4874
Scenario:
- 4 node cluster, Couchbase bucket with 1K items loaded
- Hard-failover 1 node, then add back the failed-over node
- Rebalance the cluster
Observation:
Seeing a rebalance failure with the following reason:
Worker <0.10489.1> (for action {move,{1020,
                                      ['ns_1@172.23.105.205',undefined],
                                      ['ns_1@172.23.105.206','ns_1@172.23.105.205'],
                                      []}}) exited with reason
  {unexpected_exit,
   {'EXIT',<0.10650.1>,
    {{error,
      {badrpc,
       {'EXIT',
        {{{{badmatch,{error,dcp_conn_closed}},
           [{couch_set_view_group,process_monitor_partition_update,4,
             [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},{line,3725}]},
            {couch_set_view_group,handle_call,3,
             [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},{line,934}]},
            {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,636}]},
            {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,665}]},
            {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},
          {gen_server,call,
           [<12939.514.0>,
            {monitor_partition_update,1020,
             #Ref<12939.176846531.2340945921.119382>,
             <12939.564.0>},
            infinity]}},
         {gen_server,call,
          ['capi_set_view_manager-default',
           {wait_index_updated,1020},
           infinity]}}}}},
     {gen_server,call,
      [{'janitor_agent-default','ns_1@172.23.105.206'},
       {if_rebalance,<0.10461.1>,{wait_index_updated,1020}},
       infinity]}}}}
Rebalance exited with reason
  {mover_crashed,
   {unexpected_exit,
    {'EXIT',<0.10650.1>,
     {{error,
       {badrpc,
        {'EXIT',
         {{{{badmatch,{error,dcp_conn_closed}},
            [{couch_set_view_group,process_monitor_partition_update,4,
              [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},{line,3725}]},
             {couch_set_view_group,handle_call,3,
              [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},{line,934}]},
             {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,636}]},
             {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,665}]},
             {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},
           {gen_server,call,
            [<12939.514.0>,
             {monitor_partition_update,1020,
              #Ref<12939.176846531.2340945921.119382>,
              <12939.564.0>},
             infinity]}},
          {gen_server,call,
           ['capi_set_view_manager-default',
            {wait_index_updated,1020},
            infinity]}}}}},
      {gen_server,call,
       [{'janitor_agent-default','ns_1@172.23.105.206'},
        {if_rebalance,<0.10461.1>,{wait_index_updated,1020}},
        infinity]}}}}}.
Rebalance Operation Id = f8d81e60511090035968902b399f19b6
Cluster state:
Testrunner case:
rebalance.rebalance_progress.RebalanceProgressTests.test_progress_add_back_after_failover,nodes_init=4,nodes_out=1,GROUP=P1,blob_generator=false
Note: Reproducibility is not consistent
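The failing sequence can be modeled as the REST calls the test driver issues. This is a hypothetical sketch: the endpoint names (`/controller/failOver`, `/controller/setRecoveryType`, `/controller/rebalance`) and parameters are assumptions based on Couchbase's public REST API, not taken from this ticket, and should be verified against the release; the log's `add_back_node` is modeled here as setting the recovery type.

```python
def repro_steps(otp_node, known_nodes):
    """Return the (path, form-params) REST calls that reproduce the race:
    hard failover, immediate add-back, then rebalance. Endpoint names are
    assumptions from the public Couchbase REST API, not from this ticket."""
    return [
        # Step 1: hard failover of the node
        ("/controller/failOver", {"otpNode": otp_node}),
        # Step 2: add the node back with no delay -- this is the <2s
        # window that triggers the bug
        ("/controller/setRecoveryType",
         {"otpNode": otp_node, "recoveryType": "full"}),
        # Step 3: rebalance the cluster
        ("/controller/rebalance",
         {"knownNodes": ",".join(known_nodes), "ejectedNodes": ""}),
    ]

steps = repro_steps("ns_1@172.23.105.206",
                    ["ns_1@172.23.105.205", "ns_1@172.23.105.206"])
assert [path for path, _ in steps] == [
    "/controller/failOver",
    "/controller/setRecoveryType",
    "/controller/rebalance",
]
```

Inserting any sleep between steps 1 and 2 widens the window and makes the failure much less likely, which matches the inconsistent reproducibility noted above.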
Failover started at 21:49:10, and immediately after the failover the test adds back the same node.
2019-11-26 21:49:10 | INFO | MainProcess | Cluster_Thread | [task._failover_nodes] Failing over 172.23.105.206:8091 with graceful=False
2019-11-26 21:49:10 | INFO | MainProcess | Cluster_Thread | [rest_client.fail_over] fail_over node ns_1@172.23.105.206 successful
2019-11-26 21:49:10 | INFO | MainProcess | Cluster_Thread | [task.execute] 0 seconds sleep after failover, for nodes to go pending....
2019-11-26 21:49:10 | INFO | MainProcess | test_thread | [rest_client.add_back_node] add_back_node ns_1@172.23.105.206 successful
memcached then starts disconnecting all the DCP clients:
2019-11-26T21:49:10.589179-08:00 INFO 58: The connected bucket is being deleted.. closing connection [ 127.0.0.1:48774 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
2019-11-26T21:49:10.589546-08:00 INFO 62: The connected bucket is being deleted.. closing connection [ 127.0.0.1:34045 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
2019-11-26T21:49:10.593703-08:00 INFO 43 Delete bucket [default]. Engine ready for shutdown
2019-11-26T21:49:10.593721-08:00 INFO 43 Delete bucket [default]. Wait for 8 clients to disconnect
2019-11-26T21:49:10.593763-08:00 INFO 48: The connected bucket is being deleted.. closing connection [ 127.0.0.1:45388 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
2019-11-26T21:49:10.593788-08:00 INFO 49: The connected bucket is being deleted.. closing connection [ 127.0.0.1:57697 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
2019-11-26T21:49:10.593804-08:00 INFO 47: The connected bucket is being deleted.. closing connection [ 127.0.0.1:33886 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
2019-11-26T21:49:10.593826-08:00 INFO 50: The connected bucket is being deleted.. closing connection [ 127.0.0.1:42139 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
2019-11-26T21:49:10.593858-08:00 INFO 53: The connected bucket is being deleted.. closing connection [ 172.23.105.174:50998 - 172.23.105.206:11210 (<ud>cbadminbucket</ud>) ]
2019-11-26T21:49:10.694072-08:00 INFO 45: [ 127.0.0.1:37711 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ] Updated global cluster configuration. New revision: 568
2019-11-26T21:49:10.694298-08:00 INFO Pushing new cluster config for bucket:[] revision:[568]
2019-11-26T21:49:10.704402-08:00 INFO 45: [ 127.0.0.1:37711 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ] Updated global cluster configuration. New revision: 570
2019-11-26T21:49:10.704482-08:00 INFO Pushing new cluster config for bucket:[] revision:[570]
2019-11-26T21:49:11.024484-08:00 INFO 43 Delete bucket [default]. Shut down the bucket
The view engine receives a closed-connection error; it waits for some time and then tries to restart the connection:
[couchdb:error,2019-11-26T21:49:11.024-08:00,couchdb_ns_1@cb.local:<0.449.0>:couch_log:error:44]dcp client (default, mapreduce_view: default _design/default_view (prod/replica)): dcp receive worker failed due to reason: closed. Restarting dcp receive worker...
[couchdb:error,2019-11-26T21:49:11.024-08:00,couchdb_ns_1@cb.local:<0.439.0>:couch_log:error:44]dcp client (default, mapreduce_view: default _design/default_view (prod/main)): dcp receive worker failed due to reason: closed. Restarting dcp receive worker...
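The "Restarting dcp receive worker..." behaviour in those lines can be sketched as a retry loop. This is a minimal Python model: the real worker is Erlang code inside couchdb's DCP client, and the delay, restart limit, and function names here are made up for illustration.

```python
import time

def run_receive_worker(connect, max_restarts=5, delay=2.0, sleep=time.sleep):
    """Keep a DCP receive worker alive: when the socket closes, wait
    briefly and reconnect under the SAME connection name. `connect`
    returns an object whose .recv_loop() raises ConnectionError when
    the peer closes the socket. All names/values are hypothetical."""
    restarts = 0
    while True:
        conn = connect()
        try:
            conn.recv_loop()
            return  # clean shutdown
        except ConnectionError:
            restarts += 1
            if restarts > max_restarts:
                raise
            # During this sleep, another process may open a NEW
            # connection with the same name -- the race in this bug.
            sleep(delay)

# Demo: the first two "connections" drop, the third stays up.
attempts = []
class FakeConn:
    def recv_loop(self):
        if len(attempts) < 3:
            raise ConnectionError("socket closed by peer")

def fake_connect():
    attempts.append(1)
    return FakeConn()

run_receive_worker(fake_connect, sleep=lambda _: None)
assert len(attempts) == 3
```

The key property for this bug is that the retry reuses the old connection name after a delay, rather than checking whether someone else now owns that name.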
Because the node was added back, a new group server for that design document is created on node .206, and it opens a new DCP client with the same DCP connection name:
[couchdb:info,2019-11-26T21:49:11.973-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]main set view group `_design/default_view`, set `default` (prod), doc_fields_unused
[couchdb:info,2019-11-26T21:49:11.974-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]Flow control buffer size is 20971520 bytes
[couchdb:info,2019-11-26T21:49:11.975-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, linked PID <0.515.0> stopped normally
[couchdb:info,2019-11-26T21:49:12.414-08:00,couchdb_ns_1@cb.local:<0.529.0>:couch_log:info:41]Started undefined replica (prod) set view group `default`, group `_design/default_view`, signature `81b924d6fb2889b3227487b72756326d', view count: 3
[couchdb:info,2019-11-26T21:49:12.414-08:00,couchdb_ns_1@cb.local:<0.529.0>:couch_log:info:41]replica set view group `_design/default_view`, set `default` (prod), doc_fields_unused
[couchdb:info,2019-11-26T21:49:12.415-08:00,couchdb_ns_1@cb.local:<0.529.0>:couch_log:info:41]Flow control buffer size is 20971520 bytes
Now, when the old gen_server wakes up after its delay of a few seconds, it tries to reconnect with the same DCP name, and memcached disconnects the existing connection with that name (in this case, the one belonging to the new group server):
2019-11-26T21:49:13.037513-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/main) - Disconnecting existing Dcp Producer 0x7f94f8ede500 as it has the same name as a new connection 0x7f952c9cd3c0
2019-11-26T21:49:13.037565-08:00 INFO 55: DCP connection opened successfully. PRODUCER, INCLUDE_XATTRS [ 127.0.0.1:58637 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
2019-11-26T21:49:13.037589-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/replica) - Disconnecting existing Dcp Producer 0x7f952c9d00c0 as it has the same name as a new connection 0x7f952cb50500
Since the vbucket monitor is listening on the now-closed DCP client (the one created by the new group server), it fails with dcp_conn_closed. This issue occurred because the failover and the add-back of the same node happened within 2 seconds.
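The disconnect-on-duplicate-name rule memcached applies above can be modeled with a toy registry. This is a hypothetical Python sketch of the rule, not kv_engine's actual code; class and method names are invented for illustration.

```python
class DcpConnection:
    """Toy model of one DCP producer connection."""
    def __init__(self, name):
        self.name = name
        self.closed = False

class DcpRegistry:
    """Models memcached's rule: opening a producer whose name already
    exists forcibly disconnects the old connection with that name."""
    def __init__(self):
        self._by_name = {}

    def open_producer(self, name):
        old = self._by_name.get(name)
        if old is not None:
            # "Disconnecting existing Dcp Producer ... as it has the
            # same name as a new connection"
            old.closed = True
        conn = DcpConnection(name)
        self._by_name[name] = conn
        return conn

# Timeline from the bug: the add-back's new group server connects first;
# then the OLD gen_server retries with the same dcp name and thereby
# kills the connection the vbucket monitor is watching.
registry = DcpRegistry()
name = "mapreduce_view: default _design/default_view (prod/main)"
new_group_conn = registry.open_producer(name)   # new group server on .206
old_retry_conn = registry.open_producer(name)   # old gen_server's retry
assert new_group_conn.closed      # monitor's connection is now closed
assert not old_retry_conn.closed  # -> monitor sees dcp_conn_closed
```

Under this model, the failure disappears if either connection used a distinct name (e.g. one incorporating an incarnation id), which is one way such duplicate-name races are typically avoided.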