Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Fixed
- Affects Version/s: 6.5.0
- Environment: Enterprise Edition 6.5.0 build 4874
- Triage: Untriaged
- Operating System: Centos 64-bit
- Is this a Regression?: Unknown
Description
Build: 6.5.0-4874
Scenario:
- 4 node cluster, Couchbase bucket with 1K items loaded
- Hard-failover 1 node then add back the failover node
- Rebalance the cluster
Observation:
Seeing a rebalance failure with the following reason:
Worker <0.10489.1> (for action {move,{1020,
                                      ['ns_1@172.23.105.205',undefined],
                                      ['ns_1@172.23.105.206','ns_1@172.23.105.205'],
                                      []}}) exited with reason
  {unexpected_exit,
   {'EXIT',<0.10650.1>,
    {{error,
      {badrpc,
       {'EXIT',
        {{{{badmatch,{error,dcp_conn_closed}},
           [{couch_set_view_group,process_monitor_partition_update,4,
             [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},{line,3725}]},
            {couch_set_view_group,handle_call,3,
             [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},{line,934}]},
            {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,636}]},
            {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,665}]},
            {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},
          {gen_server,call,
           [<12939.514.0>,
            {monitor_partition_update,1020,
             #Ref<12939.176846531.2340945921.119382>,
             <12939.564.0>},infinity]}},
         {gen_server,call,
          ['capi_set_view_manager-default',
           {wait_index_updated,1020},infinity]}}}}},
     {gen_server,call,
      [{'janitor_agent-default','ns_1@172.23.105.206'},
       {if_rebalance,<0.10461.1>,
        {wait_index_updated,1020}},infinity]}}}}

Rebalance exited with reason
  {mover_crashed,
   {unexpected_exit,
    {'EXIT',<0.10650.1>,
     {{error,
       {badrpc,
        {'EXIT',
         {{{{badmatch,{error,dcp_conn_closed}},
            [{couch_set_view_group,process_monitor_partition_update,4,
              [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},{line,3725}]},
             {couch_set_view_group,handle_call,3,
              [{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},{line,934}]},
             {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,636}]},
             {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,665}]},
             {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]},
           {gen_server,call,
            [<12939.514.0>,
             {monitor_partition_update,1020,
              #Ref<12939.176846531.2340945921.119382>,
              <12939.564.0>},infinity]}},
          {gen_server,call,
           ['capi_set_view_manager-default',
            {wait_index_updated,1020},infinity]}}}}},
      {gen_server,call,
       [{'janitor_agent-default','ns_1@172.23.105.206'},
        {if_rebalance,<0.10461.1>,
         {wait_index_updated,1020}},infinity]}}}}}.

Rebalance Operation Id = f8d81e60511090035968902b399f19b6
Cluster state:
Testrunner case:
rebalance.rebalance_progress.RebalanceProgressTests.test_progress_add_back_after_failover,nodes_init=4,nodes_out=1,GROUP=P1,blob_generator=false
Note: Reproducibility is not consistent
Attachments
Issue Links
- mentioned in
Activity
Field | Original Value | New Value
---|---|---
Priority | Major [ 3 ] | Critical [ 2 ]
Component/s | couchbase-bucket [ 10173 ] |
Component/s | view-engine [ 10060 ] |
Assignee | Dave Finlay [ dfinlay ] | Ankit Prabhu [ ankit.prabhu ]
Component/s | ns_server [ 10019 ] |
Component/s | couchbase-bucket [ 10173 ] |
Fix Version/s | Cheshire-Cat [ 15915 ] |
Fix Version/s | Mad-Hatter [ 15037 ] |
Remote Link | This issue links to "Page (Couchbase, Inc. Wiki)" [ 18946 ] |
Assignee | Ankit Prabhu [ ankit.prabhu ] | Abhishek Jindal [ abhishek.jindal ]
Resolution | | Fixed [ 1 ]
Status | Open [ 1 ] | Resolved [ 5 ]
Assignee | Abhishek Jindal [ abhishek.jindal ] | Ashwin Govindarajulu [ ashwin.govindarajulu ]
Status | Resolved [ 5 ] | Closed [ 6 ]
Fix Version/s | 7.0.0 [ 17233 ] |
Fix Version/s | Cheshire-Cat [ 15915 ] |
The rebalance terminates because it looks like a DCP connection between KV and Views is closed on node .206.
Prior to the rebalance .206 is failed over:
2019-11-26T21:49:10.455-08:00, failover:0:info:message(ns_1@172.23.105.155) - Starting failing over ['ns_1@172.23.105.206']
2019-11-26T21:49:10.455-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.155) - Starting failover of nodes ['ns_1@172.23.105.206']. Operation Id = de1a6d32db9156b48985f3bbb7446905
2019-11-26T21:49:10.588-08:00, ns_memcached:0:info:message(ns_1@172.23.105.206) - Shutting down bucket "default" on 'ns_1@172.23.105.206' for deletion
2019-11-26T21:49:10.669-08:00, failover:0:info:message(ns_1@172.23.105.155) - Failed over ['ns_1@172.23.105.206']: ok
2019-11-26T21:49:10.681-08:00, failover:0:info:message(ns_1@172.23.105.155) - Deactivating failed over nodes ['ns_1@172.23.105.206']
2019-11-26T21:49:10.686-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.155) - Failover completed successfully.
Rebalance begins. .206 is added back with full recovery:
2019-11-26T21:49:11.711-08:00, ns_orchestrator:0:info:message(ns_1@172.23.105.155) - Starting rebalance, KeepNodes = ['ns_1@172.23.105.155','ns_1@172.23.105.206',
'ns_1@172.23.105.159','ns_1@172.23.105.205'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = f8d81e60511090035968902b399f19b6
2019-11-26T21:49:11.812-08:00, ns_storage_conf:0:info:message(ns_1@172.23.105.206) - Deleting old data files of bucket "default"
2019-11-26T21:49:11.880-08:00, ns_rebalancer:0:info:message(ns_1@172.23.105.155) - Started rebalancing bucket default
2019-11-26T21:49:12.965-08:00, ns_vbucket_mover:0:info:message(ns_1@172.23.105.155) - Bucket "default" rebalance does not seem to be swap rebalance
2019-11-26T21:49:15.936-08:00, ns_vbucket_mover:0:critical:message(ns_1@172.23.105.155) - Worker <0.10489.1> (for action {move,{1020,
['ns_1@172.23.105.205',undefined],
['ns_1@172.23.105.206',
'ns_1@172.23.105.205'],
[]}}) exited with reason {unexpected_exit,
{'EXIT',
<0.10650.1>,
{{error,
{badrpc,
{'EXIT',
{{{{badmatch,
{error,
dcp_conn_closed}},
[{couch_set_view_group,
process_monitor_partition_update,
...
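For readability, the move action in the trace follows the shape {move, {VBucket, OldChain, NewChain, Quirks}}, where each chain is [active, replica, ...]. A small illustrative sketch (not Couchbase source; the helper name is mine) that unpacks the failing move:

```python
# Illustrative sketch: decode the move action from the crash above.
# Shape assumed from the trace: {move, {VBucket, OldChain, NewChain, Quirks}};
# each chain lists [active, replica, ...], None meaning the copy is missing.

def describe_move(vbucket, old_chain, new_chain):
    """Summarize a vbucket move between replication chains."""
    old_active, new_active = old_chain[0], new_chain[0]
    lines = [f"vbucket {vbucket}: active {old_active} -> {new_active}"]
    # Replicas that must be built from scratch on the new chain:
    for node in new_chain[1:]:
        if node is not None and node not in old_chain:
            lines.append(f"  new replica on {node}")
    return lines

# The failing move: vbucket 1020 had its active copy on .205 with no replica
# (the copy on .206 was lost in the failover); the new chain makes .206
# active again with .205 demoted to replica.
summary = describe_move(
    1020,
    ["ns_1@172.23.105.205", None],
    ["ns_1@172.23.105.206", "ns_1@172.23.105.205"],
)
```

So this move is rebuilding vbucket 1020 on the freshly re-added .206, which is exactly where the view index wait then fails.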
In the couchdb log we see the DCP connection gets closed on the view:
[couchdb:info,2019-11-26T21:49:15.929-08:00,couchdb_ns_1@cb.local:<0.514.0>:couch_log:info:41]Set view `default`, main (prod) group `_design/default_view`, signature `81b924d6fb2889b3227487b72756326d`, terminating with reason: <ud>{{badmatch,{error,dcp_conn_closed}},
[{couch_set_view_group,process_monitor_partition_update,4,
[{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},
{line,3725}]},
{couch_set_view_group,handle_call,3,
[{file,"/home/couchbase/jenkins/workspace/couchbase-server-unix/couchdb/src/couch_set_view/src/couch_set_view_group.erl"},
{line,934}]},
{gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,636}]},
{gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,665}]},
{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}</ud>
This causes the rebalance to fail, as ns_server is blocked waiting for the view index on vbucket 1020 to be updated.
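To make the failure path concrete, here is a toy model (not Couchbase code; all names are mine) of the blocking call chain visible in the trace: janitor_agent -> capi_set_view_manager -> couch_set_view_group. Each layer is a gen_server:call with infinity timeout, so when the view group hits the badmatch on the closed DCP connection, the exit propagates all the way back to the vbucket mover worker:

```python
# Toy model of the call chain from the stack trace. Each function stands in
# for a gen_server that blocks on the layer below it; a crash at the bottom
# surfaces at the top as the mover's {unexpected_exit, ...}.

class DcpConnClosed(Exception):
    """Stand-in for the {badmatch, {error, dcp_conn_closed}} crash."""

def monitor_partition_update(dcp_alive):
    # couch_set_view_group analogue: it expects the DCP stream to stay
    # open while the index catches up on the vbucket.
    if not dcp_alive:
        raise DcpConnClosed("badmatch: {error, dcp_conn_closed}")
    return "updated"

def wait_index_updated(dcp_alive):
    # capi_set_view_manager analogue: blocking call into the view group.
    return monitor_partition_update(dcp_alive)

def janitor_agent_wait(dcp_alive):
    # janitor_agent analogue: the rebalance mover blocks here; any crash
    # below fails the whole vbucket move.
    try:
        return ("ok", wait_index_updated(dcp_alive))
    except DcpConnClosed as exc:
        return ("mover_crashed", str(exc))

outcome = janitor_agent_wait(dcp_alive=False)
```

The key point the model shows: nothing in the middle layers handles the closed connection, so one dropped DCP stream on .206 is enough to abort the rebalance.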
In the memcached logs I see several traces where view connections get closed due to a name conflict:
2019-11-26T21:49:13.037513-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/main) - Disconnecting existing Dcp Producer 0x7f94f8ede500 as it has the same name as a new connection 0x7f952c9cd3c0
...
2019-11-26T21:49:13.037589-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/replica) - Disconnecting existing Dcp Producer 0x7f952c9d00c0 as it has the same name as a new connection 0x7f952cb50500
...
2019-11-26T21:49:16.930186-08:00 INFO (default) DCP (Producer) eq_dcpq:mapreduce_view: default _design/default_view (prod/replica) - Disconnecting existing Dcp Producer 0x7f952c9cd500 as it has the same name as a new connection 0x7f94f8ee1200
However, none of these occur at the exact instant we see the errors in the couchdb log. I also see some of the following:
2019-11-26T21:49:13.044232-08:00 WARNING 49: ship_dcp_log - step returned 10 - closing connection [ 127.0.0.1:46712 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
2019-11-26T21:49:13.927625-08:00 WARNING 51: ship_dcp_log - step returned 10 - closing connection [ 127.0.0.1:37184 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
Note that these latter messages appear unrelated to the views connection being closed. "Step returned 10" appears to mean that the connection isn't bound to a bucket, which seems odd for a DCP connection. I'd be interested in the KV team's view on this.
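The name-conflict disconnects above suggest a plausible race. A hedged sketch of the assumed rule (this is my reading of the log lines, not ep-engine source): DCP producer names are unique per bucket, so opening a new producer with an existing name forcibly disconnects the old one. If the view engine's pre-failover connection and its new one overlap, the side still using the old connection sees dcp_conn_closed:

```python
# Assumed-behavior sketch of per-bucket DCP producer naming: a new
# connection with an existing name evicts the old one, matching
# "Disconnecting existing Dcp Producer ... as it has the same name as a
# new connection" in the memcached log.

class DcpProducerRegistry:
    def __init__(self):
        self._by_name = {}      # producer name -> connection id
        self.disconnected = []  # connection ids forcibly closed

    def open_producer(self, name, conn_id):
        old = self._by_name.get(name)
        if old is not None:
            # Evict the existing producer with the same name; its owner
            # will subsequently observe a closed connection.
            self.disconnected.append(old)
        self._by_name[name] = conn_id
        return conn_id

reg = DcpProducerRegistry()
view_name = "eq_dcpq:mapreduce_view: default _design/default_view (prod/main)"
reg.open_producer(view_name, "0x7f94f8ede500")  # older view connection
reg.open_producer(view_name, "0x7f952c9cd3c0")  # new connection, same name
```

If this reading is right, the fix direction would be either uniquifying the view engine's connection names across failover/add-back, or tolerating dcp_conn_closed in process_monitor_partition_update.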
Ankit Prabhu, Jeelan Poola: can you look into this on the views side and see if you can extract more information from the logs and if you can add it to the ticket?
Daniel Owen: once the Views folks have taken a look can you have someone from KV investigate too?
If some more ns_server insight is required let me know.