  Couchbase Server / MB-7382

Rebalance froze when a node was failed over and added back (observed mem used > high water mark for bucket)

    Details

      Description

      • 2-node cluster
      • 2 buckets
      • Bucket 'bkt' had a very high percentage of sets in its front-end load.
      • Failed over ec2-54-252-25-132.ap-southeast-2.compute.amazonaws.com, added it back, and rebalanced.
      • Rebalance froze at around 98%.
      • Stopped the front-end loads; the disk write queue drained.
      • Mem used on both nodes was greater than the high water mark (see the sketch at the end of this description for checking this against the attached cbstats).
      • Restarted Couchbase Server, waited for warmup to complete, and retried the rebalance; it remained frozen at 50%.
      • Rebooted the nodes, waited for warmup to complete, and retried the rebalance; it remained frozen at 50%.

      Cluster diags:
      1 https://s3.amazonaws.com/bugdb/MB-7382/ec2-54-252-25-132.ap-southeast-2.compute.amazonaws.com-8091-diag.txt.gz

      2 https://s3.amazonaws.com/bugdb/MB-7382/ec2-54-252-20-171.ap-southeast-2.compute.amazonaws.com-8091-diag.txt.gz

      Attached the cbstats 'all' and 'raw memory' output for both nodes.
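      A quick way to sanity-check the attached cbstats dumps is sketched below. This is a hypothetical helper written for this ticket; it assumes the dumps are plain "key: value" lines and use the usual ep-engine stat names (mem_used, ep_mem_high_wat, ep_max_size), so adjust the names if the attached files differ.

      #!/usr/bin/env python
      """Compare mem_used against the high water mark and bucket quota
      in a saved cbstats dump (sketch for MB-7382)."""
      import sys

      def parse_stats(path):
          """Return a dict of stat name -> raw string value from 'key: value' lines."""
          stats = {}
          with open(path) as f:
              for line in f:
                  if ':' not in line:
                      continue
                  key, _, value = line.partition(':')
                  stats[key.strip()] = value.strip()
          return stats

      def main(path):
          stats = parse_stats(path)
          mem_used = int(stats['mem_used'])         # current memory usage, bytes
          high_wat = int(stats['ep_mem_high_wat'])  # high water mark, bytes
          quota = int(stats['ep_max_size'])         # bucket quota, bytes

          print("mem_used        : %d (%.1f%% of quota)" % (mem_used, 100.0 * mem_used / quota))
          print("ep_mem_high_wat : %d" % high_wat)
          if mem_used > high_wat:
              print("WARNING: mem_used is above the high water mark")

      if __name__ == '__main__':
          main(sys.argv[1])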


        Activity

        Abhinav Dangeti created issue -
        Abhinav Dangeti made changes -
        Description: added the "2 buckets" item to the repro steps; the rest of the description was unchanged.
        Trond Norbye made changes -
        Component/s couchbase-bucket [ 10173 ]
        Component/s bucket-engine [ 10010 ]
        Chiyoung Seo added a comment -

        The load rate from the clients was too high, which caused the cluster to be heavily overloaded during the rebalance. There were large backlogs in the replication queues, which pushed memory usage for the bucket "bkt" above 90% of its quota. When memory usage exceeds 90% of the bucket quota, replication and vbucket takeover stop.

        If the cluster is not set up with enough capacity, we can run into these rebalance issues. Please set up the cluster with enough capacity.
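        A simplified sketch of the back-off behaviour described above, assuming the 90%-of-quota rule; it only illustrates the threshold check and is not ep-engine's actual implementation (the names below are made up for the example):

        # Replication / vbucket takeover backs off once memory usage crosses a
        # fixed fraction of the bucket quota (90% in the comment above).
        REPLICATION_THROTTLE_THRESHOLD = 0.90

        def should_pause_replication(mem_used, bucket_quota,
                                     threshold=REPLICATION_THROTTLE_THRESHOLD):
            """Return True when replication should back off (values in bytes)."""
            return mem_used > threshold * bucket_quota

        # Round-number example: a 1 GiB bucket quota with 950 MiB resident
        # (about 93% of quota).
        quota = 1024 * 1024 * 1024
        used = 950 * 1024 * 1024
        print(should_pause_replication(used, quota))  # True -> takeover stalls

        In that state the usual remedies are to raise the bucket quota, add nodes, or throttle the front-end load until mem_used falls back below the threshold.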

        Chiyoung Seo made changes -
        Assignee Chiyoung Seo [ chiyoung ] Abhinav Dangeti [ abhinav ]
        Chiyoung Seo added a comment -

        Rebalance tests with two nodes aren't a good fit for system tests. All of our customers use at least a three-node cluster.

        Abhinav Dangeti added a comment -

        This was not part of a system test; this was the cluster where I was checking the deleted items' status. I just tried failing over and adding back one of the nodes.

        Abhinav Dangeti made changes -
        Description: replaced the live cluster links with the cluster diag links.
        Abhinav Dangeti made changes -
        Description: dropped the live host URLs from the cluster diag entries, keeping only the S3 links.
        Farshid Ghods (Inactive) made changes -
        Fix Version/s 2.0.1 [ 10399 ]
        Fix Version/s 2.0 [ 10114 ]
        Abhinav Dangeti made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]
        Mike Wiederhold made changes -
        Sprint Status Current Sprint
        Abhinav Dangeti added a comment -

        Closing for now; will reopen if needed or if this is seen again.

        Abhinav Dangeti made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            Abhinav Dangeti
            Reporter:
            Abhinav Dangeti
          • Votes:
            0
            Watchers:
            1


              Gerrit Reviews

              There are no open Gerrit changes