Details
- Bug
- Resolution: Duplicate
- Critical
- None
- 7.2.0
- 7.2.0-5242-enterprise, debian
- Triaged
- Centos 64-bit
- 0
- Unknown
Description
Build: 7.2.0-5342
Steps:
- Cluster setup
+----------------+-----------------+-----------+-----------+---------------------+
| Node           | CPU_utilization | Mem_total | Mem_free  | Swap_mem_used       |
+----------------+-----------------+-----------+-----------+---------------------+
| 172.23.105.190 | 0.382517401923  | 11.74 GiB | 11.06 GiB | 0.0 Byte / 4.10 GiB |
| 172.23.105.62  | 0               | 11.74 GiB | 11.05 GiB | 0.0 Byte / 0.0 Byte |
| 172.23.105.217 | 1.29807206251   | 11.74 GiB | 11.06 GiB | 0.0 Byte / 4.10 GiB |
| 172.23.100.43  | 1.82507740759   | 11.74 GiB | 10.96 GiB | 0.0 Byte / 4.10 GiB |
+----------------+-----------------+-----------+-----------+---------------------+

+---------+-----------+-----------------+----------+-----------+
| Bucket  | Type      | Storage Backend | Replicas | RAM Quota |
+---------+-----------+-----------------+----------+-----------+
| bucket1 | couchbase | couchstore      | 1        | 0.0 Byte  |
| bucket2 | couchbase | magma           | 1        | 3.91 GiB  |
| default | couchbase | magma           | 1        | 0.0 Byte  |
+---------+-----------+-----------------+----------+-----------+
- Loading initial data + historical data (updates to existing data)
- Start dedupe load
- Rebalance in 1 node and out 2 nodes
+----------------+---------------+--------------+-----------------------+
| Nodes | CPU | Status | Membership / Recovery |
+----------------+---------------+--------------+-----------------------+
| 172.23.105.190 | 59.2672495907 | --- OUT ---> | active / none |
| 172.23.105.254 | None | Cluster node | inactiveAdded / none |
| 172.23.105.62 | 76.1617125751 | --- OUT ---> | active / none |
| 172.23.105.217 | 89.0557203779 | Cluster node | active / none |
| 172.23.100.43 | 52.8147070305 | Cluster node | active / none |
+----------------+---------------+--------------+-----------------------+
Observation:
Seeing the following rebalance failure; in addition, node .43's memcached log contains the following ERROR line:
172.23.100.43: Found ' ERROR ' logs - ['2023-03-11T09:35:49.917861-08:00 ERROR 10671: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.43->ns_1@172.23.105.254:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":95,"opcode":"DCP_SYSTEM_EVENT","status":"Invalid arguments"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.100.43->ns_1@172.23.105.254:default, vb:233, state:in-memory\n']
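The memcached error above embeds the unexpected DCP response as a JSON blob inside the log line. A minimal sketch for pulling that blob out for triage (the `parse_unexpected_response` helper and the truncated sample line are illustrative, not part of any Couchbase tooling):

```python
import json
import re

# Sample taken (truncated) from the memcached ERROR line in this ticket.
line = ('2023-03-11T09:35:49.917861-08:00 ERROR 10671: (default) DCP (Producer) '
        'eq_dcpq:replication:ns_1@172.23.100.43->ns_1@172.23.105.254:default - '
        'DcpProducer::handleResponse disconnecting, received unexpected response:'
        '{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,'
        '"magic":"ClientResponse","opaque":95,"opcode":"DCP_SYSTEM_EVENT",'
        '"status":"Invalid arguments"} for stream:...')

def parse_unexpected_response(log_line):
    """Extract the embedded JSON response object from a DcpProducer error line."""
    # The blob has no nested braces, so a non-greedy match up to the first '}' works.
    match = re.search(r'received unexpected response:(\{.*?\})', log_line)
    return json.loads(match.group(1)) if match else None

resp = parse_unexpected_response(line)
print(resp["opcode"], resp["status"])  # DCP_SYSTEM_EVENT Invalid arguments
```

Here the producer on .43 disconnected because the newly added node .254 answered a DCP_SYSTEM_EVENT with "Invalid arguments" on vb:233.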
Rebalance Id: 03492b4db91cca8f1995b58990724aab
Crash message:
{u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.', u'type': u'rebalance', u'masterRequestTimedOut': False, u'statusId': u'0d676e61004841ed40acfea20fb98d70', u'subtype': u'rebalance', u'statusIsStale': False, u'lastReportURI': u'/logs/rebalanceReport?reportID=b6662b363119f0cd83d8fd22799d1818', u'status': u'notRunning'} - rebalance failed
{u'code': 0, u'module': u'menelaus_web_alerts_srv', u'type': u'info', u'node': u'ns_1@172.23.105.254', u'tstamp': 1678556151109L, u'shortText': u'message', u'serverTime': u'2023-03-11T09:35:51.109Z', u'text': u'Warning: On bucket "default" mutation history is greater than 90% of history retention size for at least 21/1024 vbuckets. Please ensure that the history retention size is sufficiently large, in order for the mutation history to be retained for the history retention time.'}
{u'code': 0, u'module': u'menelaus_web_alerts_srv', u'type': u'warning', u'node': u'ns_1@172.23.105.254', u'tstamp': 1678556151108L, u'shortText': u'message', u'serverTime': u'2023-03-11T09:35:51.108Z', u'text': u'The following vbuckets have mutation history size above the warning threshold: ["vb_1023","vb_767","vb_766","vb_765","vb_763","vb_759","vb_425","vb_424","vb_422","vb_253","vb_250","vb_249","vb_248","vb_244","vb_243","vb_242","vb_239","vb_238","vb_237","vb_235","vb_233"]'}
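For context on the history-retention alert, a hedged back-of-envelope calculation using the values from this test's config (assuming the alert compares each vbucket's history size against an even per-vbucket share of `bucket_history_retention_bytes`; that per-vbucket split is my assumption, not a documented formula):

```python
# Values taken from the TAF test parameters in this ticket.
bucket_history_retention_bytes = 750_000_000   # bucket_history_retention_bytes
num_vbuckets = 1024                            # default vbucket count
warning_fraction = 0.90                        # "greater than 90%" in the alert

# Assumption: budget is divided evenly across vbuckets.
per_vb_budget = bucket_history_retention_bytes / num_vbuckets
warning_threshold = per_vb_budget * warning_fraction
print(f"per-vb budget: {per_vb_budget / 1024:.1f} KiB, "
      f"warning above: {warning_threshold / 1024:.1f} KiB")
```

Under that assumption each vbucket gets only ~715 KiB of history budget, so with dedupe-heavy updates the 21 listed vbuckets plausibly crossed the 90% mark quickly.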
{u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.100.43', u'tstamp': 1678556150329L, u'shortText': u'message', u'serverTime': u'2023-03-11T09:35:50.329Z', u'text': u'Rebalance exited with reason
{mover_crashed, {unexpected_exit,{\'EXIT\',<0.13809.15>,
  {{bulk_set_vbucket_state_failed,[
    {\'ns_1@172.23.105.254\',{\'EXIT\',{{{{{child_interrupted,{\'EXIT\',<26286.25020.3>,socket_closed}},
      [{dcp_replicator,spawn_and_wait,1,[{file,"src/dcp_replicator.erl"},{line,358}]},
       {dcp_replicator,handle_call,3,[{file,"src/dcp_replicator.erl"},{line,146}]},
       {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,721}]},
       {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,750}]},
       {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]},
      {gen_server,call,[<26286.25018.3>,{setup_replication,[230,233,234,235,237,238,239,241,242,243,244,248,249,250,253,255,510]},infinity]}},
     {gen_server,call,[\'replication_manager-default\',{change_vbucket_replication,230,\'ns_1@172.23.100.43\'},infinity]}},
    {gen_server,call,[{\'janitor_agent-default\',\'ns_1@172.23.105.254\'},
      {if_rebalance,<0.7790.15>,{update_vbucket_state,230,replica,undefined,\'ns_1@172.23.100.43\'}},infinity]}}}}]},
   [{janitor_agent,bulk_set_vbucket_state,4,[{file,"src/janitor_agent.erl"},{line,372}]},
    {proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,211}]}]}}}}.
Operation Id = 03492b4db91cca8f1995b58990724aab'}
{u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'critical', u'node': u'ns_1@172.23.100.43', u'tstamp': 1678556150279L, u'shortText': u'message', u'serverTime': u'2023-03-11T09:35:50.279Z', u'text': u'Worker <0.13802.15>
(for action {move,{230,[\'ns_1@172.23.100.43\',\'ns_1@172.23.105.62\'],[\'ns_1@172.23.100.43\',\'ns_1@172.23.105.254\'],[]}})
exited with reason {unexpected_exit,
  {\'EXIT\',<0.13809.15>,
   {{bulk_set_vbucket_state_failed,[{\'ns_1@172.23.105.254\',{\'EXIT\',{
     {{{{child_interrupted,{\'EXIT\',<26286.25020.3>,socket_closed}},
       [{dcp_replicator,spawn_and_wait,1,[{file,"src/dcp_replicator.erl"},{line,358}]},
        {dcp_replicator,handle_call,3,[{file,"src/dcp_replicator.erl"},{line,146}]},
        {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,721}]},
        {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,750}]},
        {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]},
       {gen_server,call,[<26286.25018.3>,{setup_replication,[230,233,234,235,237,238,239,241,242,243,244,248,249,250,253,255,510]},infinity]}},
      {gen_server,call,[\'replication_manager-default\',{change_vbucket_replication,230,\'ns_1@172.23.100.43\'},infinity]}},
     {gen_server,call,[{\'janitor_agent-default\',\'ns_1@172.23.105.254\'},{if_rebalance,<0.7790.15>,{update_vbucket_state,230,replica,undefined,\'ns_1@172.23.100.43\'}},infinity]}}}}]},
    [{janitor_agent,bulk_set_vbucket_state,4,[{file,"src/janitor_agent.erl"},{line,372}]},{proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,211}]}]}}}'}
{u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'info', u'node': u'ns_1@172.23.100.43', u'tstamp': 1678556139248L, u'shortText': u'message', u'serverTime': u'2023-03-11T09:35:39.248Z', u'text': u'Bucket "default" rebalance does not seem to be swap rebalance'}
{u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.105.254', u'tstamp': 1678556137162L, u'shortText': u'message', u'serverTime': u'2023-03-11T09:35:37.162Z', u'text': u'Bucket "default" loaded on node \'ns_1@172.23.105.254\' in 0 seconds.'}
{u'code': 0, u'module': u'ns_rebalancer', u'type': u'info', u'node': u'ns_1@172.23.100.43', u'tstamp': 1678556137064L, u'shortText': u'message', u'serverTime': u'2023-03-11T09:35:37.064Z', u'text': u'Started rebalancing bucket default'}
{u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.105.190', u'tstamp': 1678556137030L, u'shortText': u'message', u'serverTime': u'2023-03-11T09:35:37.030Z', u'text': u'Shutting down bucket "bucket2" on \'ns_1@172.23.105.190\' for deletion'}
{u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.105.62', u'tstamp': 1678556137018L, u'shortText': u'message', u'serverTime': u'2023-03-11T09:35:37.018Z', u'text': u'Shutting down bucket "bucket2" on \'ns_1@172.23.105.62\' for deletion'}
{u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'info', u'node': u'ns_1@172.23.100.43', u'tstamp': 1678556092548L, u'shortText': u'message', u'serverTime': u'2023-03-11T09:34:52.548Z', u'text': u'Bucket "bucket2" rebalance does not seem to be swap rebalance'}

Rebalance Failed: {u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.', u'type': u'rebalance', u'masterRequestTimedOut': False, u'statusId': u'0d676e61004841ed40acfea20fb98d70', u'subtype': u'rebalance', u'statusIsStale': False, u'lastReportURI': u'/logs/rebalanceReport?reportID=b6662b363119f0cd83d8fd22799d1818', u'status': u'notRunning'} - rebalance failed
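The failure payload above can be classified programmatically when automating reruns. A minimal sketch, assuming this dict shape is what the test harness receives for a finished rebalance (the `rebalance_failed` helper is illustrative; the sample dict is copied from this ticket, and `lastReportURI` is where the detailed rebalance report lives):

```python
# Sample status payload copied from this ticket.
failed_status = {
    'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.',
    'type': 'rebalance',
    'masterRequestTimedOut': False,
    'statusId': '0d676e61004841ed40acfea20fb98d70',
    'subtype': 'rebalance',
    'statusIsStale': False,
    'lastReportURI': '/logs/rebalanceReport?reportID=b6662b363119f0cd83d8fd22799d1818',
    'status': 'notRunning',
}

def rebalance_failed(status):
    """A rebalance that is no longer running but carries an errorMessage has failed."""
    return status.get('status') == 'notRunning' and bool(status.get('errorMessage'))

print(rebalance_failed(failed_status))  # True
```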
TAF test:
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.123746.ini GROUP=rebalance_crud_on_collections,rerun=False,disk_optimized_thread_settings=True,get-cbcollect-info=True,autoCompactionDefined=true,dedupe_update_itrs=10000,upgrade_version=7.2.0-5242 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_rebalance_in_out,nodes_init=4,nodes_in=1,nodes_out=2,bucket_spec=magma_dgm.10_percent_dgm.4_node_1_replica_magma_512,doc_size=512,randomize_value=True,data_load_spec=volume_test_load_with_CRUD_on_collections,data_load_stage=during,skip_validations=False,default_history_retention_for_collections=false,bucket_history_retention_seconds=86400,bucket_history_retention_bytes=750000000,GROUP=rebalance_in_out;rebalance_crud_on_collections'
Issue Links
- duplicates:
  - MB-55930 CDC: Rebalance failed with reason 'dcp_wait_for_data_move_failed::ns_single_vbucket_mover' (Closed)