MB-55929: CDC: Rebalance failed with reason 'bulk_set_vbucket_state_failed'

Details

    • Type: Bug
    • Resolution: Duplicate
    • Priority: Critical
    • Fix Version: None
    • Affects Version: 7.2.0
    • Component: couchbase-bucket
    • Environment: 7.2.0-5242-enterprise, debian

    Description

      Build: 7.2.0-5242

      Steps:

      • Cluster setup (a bucket-creation sketch with the test's history-retention settings follows this steps list)

        +----------------+---------+-----------+-----------+---------------------+
        | Node           | CPU (%) | Mem total | Mem free  | Swap used / total   |
        +----------------+---------+-----------+-----------+---------------------+
        | 172.23.105.190 | 0.38    | 11.74 GiB | 11.06 GiB | 0.0 Byte / 4.10 GiB |
        | 172.23.105.62  | 0.00    | 11.74 GiB | 11.05 GiB | 0.0 Byte / 0.0 Byte |
        | 172.23.105.217 | 1.30    | 11.74 GiB | 11.06 GiB | 0.0 Byte / 4.10 GiB |
        | 172.23.100.43  | 1.83    | 11.74 GiB | 10.96 GiB | 0.0 Byte / 4.10 GiB |
        +----------------+---------+-----------+-----------+---------------------+

        +---------+-----------+-----------------+----------+-----------+
        | Bucket  | Type      | Storage Backend | Replicas | RAM Quota |
        +---------+-----------+-----------------+----------+-----------+
        | bucket1 | couchbase | couchstore      | 1        | 0.0 Byte  |
        | bucket2 | couchbase | magma           | 1        | 3.91 GiB  |
        | default | couchbase | magma           | 1        | 0.0 Byte  |
        +---------+-----------+-----------------+----------+-----------+

      • Load initial data + historical data (updates to existing data)
      • Start dedupe load
      • While the load is running, rebalance in 1 node and out 2 nodes

        +----------------+---------+--------------+-----------------------+
        | Nodes          | CPU (%) | Status       | Membership / Recovery |
        +----------------+---------+--------------+-----------------------+
        | 172.23.105.190 | 59.27   | --- OUT ---> | active / none         |
        | 172.23.105.254 | None    | Cluster node | inactiveAdded / none  |
        | 172.23.105.62  | 76.16   | --- OUT ---> | active / none         |
        | 172.23.105.217 | 89.06   | Cluster node | active / none         |
        | 172.23.100.43  | 52.81   | Cluster node | active / none         |
        +----------------+---------+--------------+-----------------------+
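
      As referenced in the cluster-setup step, the test's CDC settings (bucket_history_retention_seconds=86400, bucket_history_retention_bytes=750000000, default_history_retention_for_collections=false) map onto the bucket REST parameters. A minimal sketch of creating a matching magma bucket follows; the host, credentials, and RAM quota are placeholders, not the test's actual values:

        # Minimal sketch: create a magma bucket with CDC history retention
        # matching the test parameters. Host, credentials, and ramQuotaMB are
        # placeholders; the history* fields are the 7.2 bucket REST parameters.
        import requests

        resp = requests.post(
            "http://172.23.105.217:8091/pools/default/buckets",
            auth=("Administrator", "password"),
            data={
                "name": "default",
                "bucketType": "couchbase",
                "storageBackend": "magma",
                "ramQuotaMB": 1024,
                "replicaNumber": 1,
                "historyRetentionSeconds": 86400,
                "historyRetentionBytes": 750000000,
                "historyRetentionCollectionDefault": "false",
            },
        )
        resp.raise_for_status()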
        

      Observation:

      Rebalance failed as shown below, and node 172.23.100.43's memcached log has the following error line:

      172.23.100.43: Found ' ERROR ' logs - ['2023-03-11T09:35:49.917861-08:00 ERROR 10671: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.100.43->ns_1@172.23.105.254:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":95,"opcode":"DCP_SYSTEM_EVENT","status":"Invalid arguments"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.100.43->ns_1@172.23.105.254:default, vb:233, state:in-memory\n']
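
      (The "Found ' ERROR ' logs" line above is the test framework scanning each node's memcached log; a minimal standalone equivalent is sketched below, with the log path being an assumption about the installation layout, not the framework's actual code:)

        # Minimal equivalent of the framework's ERROR-line scan above.
        # The log directory is an assumption about the installation layout.
        import glob

        LOG_GLOB = "/opt/couchbase/var/lib/couchbase/logs/memcached.log*"

        for path in glob.glob(LOG_GLOB):
            with open(path, errors="replace") as f:
                for line in f:
                    if " ERROR " in line:
                        print(f"{path}: {line.rstrip()}")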

      Rebalance Id: 03492b4db91cca8f1995b58990724aab

      Crash message:

      {u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.', u'type': u'rebalance',
       u'masterRequestTimedOut': False, u'statusId': u'0d676e61004841ed40acfea20fb98d70', u'subtype': u'rebalance', u'statusIsStale': False,
       u'lastReportURI': u'/logs/rebalanceReport?reportID=b6662b363119f0cd83d8fd22799d1818', u'status': u'notRunning'} - rebalance failed
      {u'code': 0, u'module': u'menelaus_web_alerts_srv', u'type': u'info', u'node': u'ns_1@172.23.105.254', u'tstamp': 1678556151109L,
       u'shortText': u'message', u'serverTime': u'2023-03-11T09:35:51.109Z',
       u'text': u'Warning: On bucket "default" mutation history is greater than 90% of history retention size for at least 21/1024 vbuckets.
                  Please ensure that the history retention size is sufficiently large, in order for the mutation history to be retained for the history retention time.'}
      {u'code': 0, u'module': u'menelaus_web_alerts_srv', u'type': u'warning', u'node': u'ns_1@172.23.105.254',
       u'tstamp': 1678556151108L, u'shortText': u'message', u'serverTime': u'2023-03-11T09:35:51.108Z', u'text': u'The following vbuckets have mutation history size above the warning threshold: ["vb_1023","vb_767","vb_766","vb_765","vb_763","vb_759","vb_425","vb_424","vb_422","vb_253","vb_250","vb_249","vb_248","vb_244","vb_243","vb_242","vb_239","vb_238","vb_237","vb_235","vb_233"]'}
      {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.100.43', u'tstamp': 1678556150329L, u'shortText': u'message',
       u'serverTime': u'2023-03-11T09:35:50.329Z',
       u'text': u'Rebalance exited with reason
           {mover_crashed, {unexpected_exit,{\'EXIT\',<0.13809.15>,
              {{bulk_set_vbucket_state_failed,[
                  {\'ns_1@172.23.105.254\',{\'EXIT\',{{{{{child_interrupted,{\'EXIT\',<26286.25020.3>,socket_closed}},
                      [{dcp_replicator,spawn_and_wait,1,[{file,"src/dcp_replicator.erl"},{line,358}]},
                       {dcp_replicator,handle_call,3,[{file,"src/dcp_replicator.erl"},{line,146}]},
                       {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,721}]},
                       {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,750}]},
                       {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]},
                      {gen_server,call,[<26286.25018.3>,{setup_replication,[230,233,234,235,237,238,239,241,242,243,244,248,249,250,253,255,510]},infinity]}},
                      {gen_server,call,[\'replication_manager-default\',{change_vbucket_replication,230,\'ns_1@172.23.100.43\'},infinity]}},
                      {gen_server,call,[{\'janitor_agent-default\',\'ns_1@172.23.105.254\'},
                                        {if_rebalance,<0.7790.15>,{update_vbucket_state,230,replica,undefined,\'ns_1@172.23.100.43\'}},infinity]}}}}]},
                  [{janitor_agent,bulk_set_vbucket_state,4,[{file,"src/janitor_agent.erl"},{line,372}]},
                   {proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,211}]}]}}}}.
            Operation Id = 03492b4db91cca8f1995b58990724aab'}
      {u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'critical', u'node': u'ns_1@172.23.100.43', u'tstamp': 1678556150279L, u'shortText': u'message',
       u'serverTime': u'2023-03-11T09:35:50.279Z',
       u'text': u'Worker <0.13802.15>
           (for action {move,{230,[\'ns_1@172.23.100.43\',\'ns_1@172.23.105.62\'],[\'ns_1@172.23.100.43\',\'ns_1@172.23.105.254\'],[]}})
            exited with reason {unexpected_exit,
              {\'EXIT\',<0.13809.15>,
                  {{bulk_set_vbucket_state_failed,[{\'ns_1@172.23.105.254\',{\'EXIT\',{
                      {{{{child_interrupted,{\'EXIT\',<26286.25020.3>,socket_closed}},
                          [{dcp_replicator,spawn_and_wait,1,[{file,"src/dcp_replicator.erl"},{line,358}]},
                           {dcp_replicator,handle_call,3,[{file,"src/dcp_replicator.erl"},{line,146}]},
                           {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,721}]},
                           {gen_server,handle_msg,6,[{file,"gen_server.erl"},{line,750}]},
                           {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,226}]}]},
                      {gen_server,call,[<26286.25018.3>,{setup_replication,[230,233,234,235,237,238,239,241,242,243,244,248,249,250,253,255,510]},infinity]}},
                      {gen_server,call,[\'replication_manager-default\',{change_vbucket_replication,230,\'ns_1@172.23.100.43\'},infinity]}},
                      {gen_server,call,[{\'janitor_agent-default\',\'ns_1@172.23.105.254\'},{if_rebalance,<0.7790.15>,{update_vbucket_state,230,replica,undefined,\'ns_1@172.23.100.43\'}},infinity]}}}}]},
                  [{janitor_agent,bulk_set_vbucket_state,4,[{file,"src/janitor_agent.erl"},{line,372}]},{proc_lib,init_p,3,[{file,"proc_lib.erl"},{line,211}]}]}}}'}
      {u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'info', u'node': u'ns_1@172.23.100.43', u'tstamp': 1678556139248L, u'shortText': u'message',
       u'serverTime': u'2023-03-11T09:35:39.248Z', u'text': u'Bucket "default" rebalance does not seem to be swap rebalance'}
      {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.105.254', u'tstamp': 1678556137162L, u'shortText': u'message',
       u'serverTime': u'2023-03-11T09:35:37.162Z', u'text': u'Bucket "default" loaded on node \'ns_1@172.23.105.254\' in 0 seconds.'}
      {u'code': 0, u'module': u'ns_rebalancer', u'type': u'info', u'node': u'ns_1@172.23.100.43', u'tstamp': 1678556137064L, u'shortText': u'message',
       u'serverTime': u'2023-03-11T09:35:37.064Z', u'text': u'Started rebalancing bucket default'}
      {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.105.190', u'tstamp': 1678556137030L, u'shortText': u'message',
       u'serverTime': u'2023-03-11T09:35:37.030Z', u'text': u'Shutting down bucket "bucket2" on \'ns_1@172.23.105.190\' for deletion'}
      {u'code': 0, u'module': u'ns_memcached', u'type': u'info', u'node': u'ns_1@172.23.105.62', u'tstamp': 1678556137018L, u'shortText': u'message',
       u'serverTime': u'2023-03-11T09:35:37.018Z', u'text': u'Shutting down bucket "bucket2" on \'ns_1@172.23.105.62\' for deletion'}
      {u'code': 0, u'module': u'ns_vbucket_mover', u'type': u'info', u'node': u'ns_1@172.23.100.43', u'tstamp': 1678556092548L, u'shortText': u'message',
       u'serverTime': u'2023-03-11T09:34:52.548Z', u'text': u'Bucket "bucket2" rebalance does not seem to be swap rebalance'}
      Rebalance Failed: {u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try again.', u'type': u'rebalance', u'masterRequestTimedOut': False, u'statusId': u'0d676e61004841ed40acfea20fb98d70', u'subtype': u'rebalance', u'statusIsStale': False, u'lastReportURI': u'/logs/rebalanceReport?reportID=b6662b363119f0cd83d8fd22799d1818', u'status': u'notRunning'} - rebalance failed
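
      For context on the mutation-history warnings above: assuming the alert compares each vbucket's history size against an even per-vbucket share of the bucket-level retention size (an assumption, not verified against ns_server source), the warning threshold for this configuration works out to well under 1 MiB per vbucket, which a sustained dedupe/update load can reach quickly:

        # Back-of-the-envelope per-vbucket threshold for the warning above.
        # Assumes an even per-vbucket split of the bucket-level setting; that
        # split is an assumption, not taken from ns_server source.
        retention_bytes = 750_000_000   # bucket_history_retention_bytes
        num_vbuckets = 1024

        per_vb_share = retention_bytes / num_vbuckets   # ~715.3 KiB
        warn_threshold = 0.90 * per_vb_share            # ~643.7 KiB

        print(f"per-vbucket share : {per_vb_share / 1024:.1f} KiB")
        print(f"90% warn threshold: {warn_threshold / 1024:.1f} KiB")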

      TAF test:

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.123746.ini GROUP=rebalance_crud_on_collections,rerun=False,disk_optimized_thread_settings=True,get-cbcollect-info=True,autoCompactionDefined=true,dedupe_update_itrs=10000,upgrade_version=7.2.0-5242 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_rebalance_in_out,nodes_init=4,nodes_in=1,nodes_out=2,bucket_spec=magma_dgm.10_percent_dgm.4_node_1_replica_magma_512,doc_size=512,randomize_value=True,data_load_spec=volume_test_load_with_CRUD_on_collections,data_load_stage=during,skip_validations=False,default_history_retention_for_collections=false,bucket_history_retention_seconds=86400,bucket_history_retention_bytes=750000000,GROUP=rebalance_in_out;rebalance_crud_on_collections'
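
      To retry just the failing rebalance step outside the TAF harness, the same in/out move can be driven over REST; a minimal sketch follows (assumes node .254 was already added via server-add; otpNode names are taken from the membership table above, host and credentials are placeholders):

        # Minimal sketch of the "in 1 node, out 2 nodes" rebalance over REST.
        # Assumes 172.23.105.254 was already added to the cluster; host and
        # credentials are placeholders.
        import requests

        known = [
            "ns_1@172.23.105.190", "ns_1@172.23.105.254", "ns_1@172.23.105.62",
            "ns_1@172.23.105.217", "ns_1@172.23.100.43",
        ]
        eject = ["ns_1@172.23.105.190", "ns_1@172.23.105.62"]

        resp = requests.post(
            "http://172.23.105.217:8091/controller/rebalance",
            auth=("Administrator", "password"),
            data={"knownNodes": ",".join(known), "ejectedNodes": ",".join(eject)},
        )
        resp.raise_for_status()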

People

    Assignee: Ashwin Govindarajulu
    Reporter: Ashwin Govindarajulu