  1. Couchbase Server
  2. MB-7290

Rebalance-in operation failed twice, with "bulk_set_vbucket_state" failing, under heavy front-end load on an XDCR setup with the system in DGM (~65% resident ratio)

    Details

      Description

      At the time of the rebalance failure:

      Rebalancing in 5 nodes on each cluster
      Cluster setup: c1:c2::10:10
      biXDCR_bucket: c1 <---> c2
      uniXDCR_src: c1 ---> c2 :uniXDCR_dest
      Front-end load on c1 and c2 for biXDCR_bucket, and on c1 for uniXDCR_src.
      c1: http://ec2-177-71-230-72.sa-east-1.compute.amazonaws.com:8091/
      c2: http://ec2-175-41-186-167.ap-southeast-1.compute.amazonaws.com:8091/

      On c1, the rebalance operation failed with this reason in the UI logs:

      Rebalance exited with reason {{bulk_set_vbucket_state_failed,
       [{'ns_1@ec2-177-71-170-44.sa-east-1.compute.amazonaws.com',
         {'EXIT',
          {{timeout,
            {gen_server,call,
             ['ns_memcached-biXDCR_bucket',
              {set_vbucket,544,replica},
              180000]}},
           {gen_server,call,
            [{'janitor_agent-biXDCR_bucket',
              'ns_1@ec2-177-71-170-44.sa-east-1.compute.amazonaws.com'},
             {if_rebalance,<0.10136.88>,
              {update_vbucket_state,544,replica,
               undefined,undefined}},
             infinity]}}}}]},
       [{janitor_agent,bulk_set_vbucket_state,4},
        {ns_vbucket_mover,update_replication_post_move,3},
        {ns_vbucket_mover,handle_info,2},
        {gen_server,handle_msg,5},
        {proc_lib,init_p_do_apply,3}]}

      The second time, rebalance failed with the following UI log message:

      Rebalance exited with reason {{timeout,
        {gen_server,call,
         ['ns_memcached-biXDCR_bucket',
          {set_vbucket,849,active},
          180000]}},
       {gen_server,call,
        [{'janitor_agent-biXDCR_bucket',
          'ns_1@ec2-177-71-230-72.sa-east-1.compute.amazonaws.com'},
         {if_rebalance,<0.21090.114>,
          {update_vbucket_state,849,active,paused,
           undefined}},
         infinity]}}
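
Both failure reasons above wrap the same inner call: a set_vbucket request to 'ns_memcached-biXDCR_bucket' that hit the 180000 ms gen_server call timeout. For triage, the details can be pulled out of the pasted UI log text with a small script. This is only a sketch: the function name summarize_timeouts and the regular expression are illustrative assumptions based on the two messages in this ticket, not a Couchbase tool or a documented log format.

```python
import re

# Matches the inner memcached call that timed out, e.g.
#   ['ns_memcached-biXDCR_bucket', {set_vbucket,544,replica}, 180000]
# Whitespace (including newlines) between tokens is tolerated, because
# the UI log wraps the Erlang term across several lines.
TIMEOUT_CALL = re.compile(
    r"\['ns_memcached-(?P<bucket>[^']+)'\s*,\s*"
    r"\{set_vbucket\s*,\s*(?P<vbucket>\d+)\s*,\s*(?P<state>\w+)\}\s*,\s*"
    r"(?P<timeout_ms>\d+)\s*\]"
)


def summarize_timeouts(reason_text):
    """Return one dict per timed-out set_vbucket call found in the text."""
    return [
        {
            "bucket": m.group("bucket"),
            "vbucket": int(m.group("vbucket")),
            "state": m.group("state"),
            "timeout_s": int(m.group("timeout_ms")) / 1000,
        }
        for m in TIMEOUT_CALL.finditer(reason_text)
    ]


# Fragments copied from the two failure reasons in this ticket.
first = """{gen_server,call,
 ['ns_memcached-biXDCR_bucket',
  {set_vbucket,544,replica},
  180000]}"""
second = "['ns_memcached-biXDCR_bucket', {set_vbucket,849,active}, 180000]"

for summary in summarize_timeouts(first + "\n" + second):
    print(summary)
```

Run against both messages, this reports vbucket 544 (replica) and vbucket 849 (active), each with a 180 second timeout on biXDCR_bucket.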

      After giving it some time, the third rebalance did complete successfully.

      Will attach the grabbed diags from one of the nodes at C1 in a bit.


          Activity

          mikew Mike Wiederhold added a comment -

          This issue is 5 months old. Please open a new issue against the latest build if you see this issue again.

          chiyoung Chiyoung Seo added a comment -

          For bug distribution within the engine team.

          farshid Farshid Ghods (Inactive) added a comment -

          Deferring to 2.1 per bug scrub meeting (Dipti & Farshid, December 7th).

          junyi Junyi Xie (Inactive) added a comment -

          This has nothing to do with the XDCR core code; remove xdcr from the component.

          kzeller kzeller added a comment -

          Added to RN:

          Under a heavy load of write operations on two clusters and both
          bi-directional and uni-directional replications occurring
          via XDCR, Couchbase Server 2.0 may fail during rebalance.


            People

            • Assignee: Mike Wiederhold (mikew)
            • Reporter: Abhinav Dangeti (abhinav)