  1. Couchbase Server
  2. MB-7290

Rebalance-in operation failed twice with "bulk_set_vbucket_state" failing with heavy front end load on an XDCR set up and with system in DGM (~65% resident ratio)

    Details

      Description

      At the time of the rebalance failure:

      + 5 nodes rebalance in on each cluster
      Cluster setup: c1:c2::10:10
      biXDCR_bucket: c1 <---> c2
      uniXDCR_src: c1 ---> c2 :uniXDCR_dest
      Front end loads on c1 and c2 for biXDCR_bucket, and on c1 for uniXDCR_src.
      c1: http://ec2-177-71-230-72.sa-east-1.compute.amazonaws.com:8091/
      c2: http://ec2-175-41-186-167.ap-southeast-1.compute.amazonaws.com:8091/

      On c1, the rebalance operation failed with this reason in the UI logs:

      Rebalance exited with reason {{bulk_set_vbucket_state_failed,
      [{'ns_1@ec2-177-71-170-44.sa-east-1.compute.amazonaws.com',
      {'EXIT',
      {{timeout,
      {gen_server,call,
      ['ns_memcached-biXDCR_bucket',
      {set_vbucket,544,replica},
      180000]}},
      {gen_server,call,
      [{'janitor_agent-biXDCR_bucket',
      'ns_1@ec2-177-71-170-44.sa-east-1.compute.amazonaws.com'},
      {if_rebalance,<0.10136.88>,
      {update_vbucket_state,544,replica,
      undefined,undefined}},
      infinity]}}}}]},
      [{janitor_agent,bulk_set_vbucket_state,4},
      {ns_vbucket_mover,
      update_replication_post_move,3},
      {ns_vbucket_mover,handle_info,2},
      {gen_server,handle_msg,5},
      {proc_lib,init_p_do_apply,3}]}

      The second time, rebalance failed with the following UI log message:

      Rebalance exited with reason {{timeout,
      {gen_server,call,
      ['ns_memcached-biXDCR_bucket',
      {set_vbucket,849,active},
      180000]}},
      {gen_server,call,
      [{'janitor_agent-biXDCR_bucket',
      'ns_1@ec2-177-71-230-72.sa-east-1.compute.amazonaws.com'},
      {if_rebalance,<0.21090.114>,
      {update_vbucket_state,849,active,paused,
      undefined}},
      infinity]}}

      After some time had passed, the third rebalance attempt completed successfully.

      I will attach the diags collected from one of the c1 nodes shortly.
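
      Both failure reasons above wrap the same inner call: a set_vbucket request to ns_memcached that timed out after 180000 ms. For triage, that call can be pulled out of the logged reason mechanically. The following is a minimal sketch in Python, assuming the reason text is pasted verbatim from the UI log; the helper name timed_out_set_vbucket_calls and the regular expression are illustrative only and are not part of any Couchbase tool.

```python
import re

# Matches the ns_memcached call that timed out inside a rebalance exit reason:
#   ['ns_memcached-<bucket>', {set_vbucket,<id>,<state>}, <timeout_ms>]
# Whitespace between the elements is tolerated, since log text is often re-wrapped.
PATTERN = re.compile(
    r"\['ns_memcached-(?P<bucket>[^']+)',\s*"
    r"\{set_vbucket,(?P<vbucket>\d+),(?P<state>\w+)\}\s*,\s*"
    r"(?P<timeout_ms>\d+)\]"
)


def timed_out_set_vbucket_calls(reason):
    """Return one (bucket, vbucket id, target state, timeout in ms) tuple per match."""
    return [
        (m["bucket"], int(m["vbucket"]), m["state"], int(m["timeout_ms"]))
        for m in PATTERN.finditer(reason)
    ]


# Opening part of the second failure reason, copied from the report above.
SECOND_FAILURE = """Rebalance exited with reason {{timeout,
{gen_server,call,
['ns_memcached-biXDCR_bucket',
{set_vbucket,849,active},
180000]}},"""

print(timed_out_set_vbucket_calls(SECOND_FAILURE))
```

      Run against the first reason, the same helper would report vbucket 544 with target state replica, which makes it easy to see that both failures are timeouts of the same operation on biXDCR_bucket.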


          Activity

          abhinav Abhinav Dangeti created issue -
          abhinav Abhinav Dangeti made changes -
          Field Original Value New Value
          Fix Version/s 2.0.1 [ 10399 ]
          Component/s ns_server [ 10019 ]
          abhinav Abhinav Dangeti made changes -
          Summary
            Original Value: Rebalance-in operation failed twice with heavy front end load on an XDCR set up and with system in DGM
            New Value: Rebalance-in operation failed twice with "bulk_set_vbucket_state" failing with heavy front end load on an XDCR set up and with system in DGM
          abhinav Abhinav Dangeti made changes -
          Summary
            Original Value: Rebalance-in operation failed twice with "bulk_set_vbucket_state" failing with heavy front end load on an XDCR set up and with system in DGM
            New Value: Rebalance-in operation failed twice with "bulk_set_vbucket_state" failing with heavy front end load on an XDCR set up and with system in DGM (~65% resident ratio)
          Description
            Original Value: the description shown under Details above, without its opening line.
            New Value: the same text with "At the time of the rebalance failure:" added as the opening line.
          junyi Junyi Xie (Inactive) made changes -
          Assignee Junyi Xie [ junyi ] Abhinav Dangeti [ abhinav ]
          farshid Farshid Ghods (Inactive) made changes -
          Fix Version/s 2.0.1 [ 10399 ]
          Fix Version/s 2.0 [ 10114 ]
          farshid Farshid Ghods (Inactive) made changes -
          Assignee Abhinav Dangeti [ abhinav ] Aleksey Kondratenko [ alkondratenko ]
          alkondratenko Aleksey Kondratenko (Inactive) made changes -
          Assignee Aleksey Kondratenko [ alkondratenko ] Farshid Ghods [ farshid ]
          farshid Farshid Ghods (Inactive) made changes -
          Assignee Farshid Ghods [ farshid ] Chiyoung Seo [ chiyoung ]
          kzeller kzeller made changes -
          Comment [ Added to RN: Under a heavy load of write operations on two clusters and both bi-directional and uni-directional replications occurring via XDCR, Couchbase Server 2.0 may fail during rebalance. ]
          junyi Junyi Xie (Inactive) made changes -
          Component/s couchbase-bucket [ 10173 ]
          Component/s cross-datacenter-replication [ 10136 ]
          farshid Farshid Ghods (Inactive) made changes -
          Fix Version/s 2.1 [ 10414 ]
          Fix Version/s 2.0.1 [ 10399 ]
          mikew Mike Wiederhold made changes -
          Sprint Status Current Sprint
          chiyoung Chiyoung Seo made changes -
          Assignee Chiyoung Seo [ chiyoung ] Mike Wiederhold [ mikew ]
          chiyoung Chiyoung Seo made changes -
          Planned End (re-schedule end date based on new assignee)
          mikew Mike Wiederhold made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Cannot Reproduce [ 5 ]
          wayne Wayne Siu made changes -
          Link This issue relates to MB-9636 [ MB-9636 ]
          mikew Mike Wiederhold made changes -
          Status Resolved [ 5 ] Closed [ 6 ]

            People

            • Assignee:
              mikew Mike Wiederhold
              Reporter:
              abhinav Abhinav Dangeti