Couchbase Server / MB-31551

CBAS rebalance stuck on 99.67% for more than 2 hrs

Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version: 6.0.0
    • Fix Version: 6.0.0
    • Component: analytics
    • Environment: Enterprise Edition 6.0.0 build 1673

    Description

      9-node cluster: 6 KV and 3 CBAS

      CentOS 7, 8-core VMs

      Note: I am assuming the Analytics rebalance is stuck based on the rebalance % displayed on the UI. It shows 99% for Analytics and 100% for the data nodes. Refer to the attached screenshot.

      Attachments

        Activity

          Murtadha Hubail added a comment -

          Hi Tanzeem Ahmed,
          I looked at the logs, and the rebalance was actually progressing, albeit extremely slowly, due to IO congestion. When we were diagnosing the issue before the test was declared a failure, the rebalance was at the step of rebalancing the data. However, if you check the attached logs, you will see that the rebalance has since moved on to the next step of creating the secondary indexes.

          I believe the IO congestion is caused by the number of partitions on each Analytics node. In this setup, each node has 8 partitions. If those VMs have spinning disks, I recommend reducing the number of partitions to 2. Similar issues were reported before, and reducing the number of partitions per node helped. Another option, which I don't recommend, would be to increase the timeout before declaring the test a failure.
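          For context on the suggested fix: the number of Analytics partitions on a node corresponds to the number of Analytics disk paths configured when the node is initialized. A minimal sketch of initializing a node with 2 paths (and therefore 2 partitions) via couchbase-cli; the host, credentials, and mount points below are placeholders, not values from this ticket:

          ```shell
          # Sketch only: host, credentials, and paths are hypothetical.
          # Each --node-init-analytics-path becomes one Analytics partition,
          # so listing 2 paths yields 2 partitions instead of 8.
          couchbase-cli node-init \
            --cluster 127.0.0.1:8091 \
            --username Administrator \
            --password password \
            --node-init-analytics-path /data/cbas0 \
            --node-init-analytics-path /data/cbas1
          ```

          Analytics disk paths can only be set before the node joins the cluster (or after removing and re-adding it), so this would typically be run during node setup. On spinning disks, fewer partitions means fewer concurrent IO streams competing for the same spindle, which is why reducing the count helps.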

          Mihir Kamdar (Inactive) added a comment -

          Tanzeem Ahmed, please try the test with 2 partitions and reopen if needed. I am closing this one as per Murtadha's comments.

          People

            Assignee: tanzeem.ahmed Tanzeem Ahmed (Inactive)
            Reporter: tanzeem.ahmed Tanzeem Ahmed (Inactive)
            Votes: 0
            Watchers: 4
