Couchbase Server / MB-31551

CBAS rebalance stuck on 99.67% for more than 2 hrs


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 6.0.0
    • Fix Version/s: 6.0.0
    • Component/s: analytics
    • Environment: Enterprise Edition 6.0.0 build 1673

    Description

      9-node cluster, 6 KV and 3 CBAS

      CentOS 7, 8-core VMs

      Note: I am assuming the Analytics rebalance is stuck based on the rebalance % displayed in the UI. It shows 99% for Analytics and 100% for the Data nodes. Refer to the attached screenshot.
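
      For cross-checking the UI figure, the same progress numbers can be read from the cluster manager's task endpoint. A minimal sketch follows; the host, credentials, and the exact response fields are assumptions on my part, not taken from this ticket.

      import requests

      # Hypothetical cluster address and credentials for illustration only.
      CLUSTER = "http://127.0.0.1:8091"
      AUTH = ("Administrator", "password")

      def print_rebalance_progress():
          """Print the rebalance progress reported by the cluster manager.

          GET /pools/default/tasks is the endpoint behind the UI's progress bar;
          the breakdown below assumes the 6.0-era response shape (type/status/
          progress plus an optional perNode map), so adjust the field names if
          your build reports them differently.
          """
          tasks = requests.get(f"{CLUSTER}/pools/default/tasks", auth=AUTH).json()
          for task in tasks:
              if task.get("type") != "rebalance":
                  continue
              print("status:", task.get("status"))
              if "progress" in task:
                  print("overall: %.2f%%" % task["progress"])
              for node, detail in task.get("perNode", {}).items():
                  print("  %s: %.2f%%" % (node, detail.get("progress", 0.0)))

      print_rebalance_progress()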

       

      Attachments

        Screen Shot 2018-10-05 at 11.18.29 PM.png

        Activity

          tanzeem.ahmed Tanzeem Ahmed (Inactive) created issue -
          tanzeem.ahmed Tanzeem Ahmed (Inactive) made changes -
          Attachment: Screen Shot 2018-10-05 at 11.18.29 PM.png [ 60142 ]
          tanzeem.ahmed Tanzeem Ahmed (Inactive) made changes -
          Description: updated (added the note that the Analytics rebalance appears stuck based on the rebalance % shown in the UI)
          tanzeem.ahmed Tanzeem Ahmed (Inactive) made changes -
          Description: updated (added the reference to the attached screenshot)
          ingenthr Matt Ingenthron made changes -
          Summary: CBAS rebalance struck on 99.67% for more than 2 hrs → CBAS rebalance stuck on 99.67% for more than 2 hrs
          murtadha.hubail Murtadha Hubail made changes -
          Assignee: Till Westmann [ till ] → Murtadha Hubail [ murtadha.hubail ]
          murtadha.hubail Murtadha Hubail made changes -
          Fix Version/s: Alice [ 15048 ]
          murtadha.hubail Murtadha Hubail made changes -
          Sprint: CX Sprint 122 [ 663 ]
          murtadha.hubail Murtadha Hubail made changes -
          Rank: Ranked higher
          murtadha.hubail Murtadha Hubail made changes -
          Status: Open [ 1 ] → In Progress [ 3 ]

          murtadha.hubail Murtadha Hubail added a comment -

          Hi Tanzeem Ahmed,
          I looked at the logs, and the rebalance was actually progressing, but extremely slowly due to IO congestion. When we were diagnosing the issue, before the test was declared a failure, the rebalance was at the step of rebalancing the data. However, if you check the attached logs, you will see that the rebalance moved on to the next step of creating the secondary indexes.

          I believe the IO congestion is caused by the number of partitions on each Analytics node. In this setup, each node has 8 partitions. If those VMs have spinning disks, I recommend reducing the number of partitions to 2. Similar issues were reported before, and reducing the number of partitions per node helped. Another option, which I don't recommend, would be to increase the timeout before declaring the test a failure.
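
          For reference, the number of Analytics partitions on a node corresponds to the number of Analytics disk paths configured when the node is initialized, so going from 8 to 2 means re-initializing each CBAS node with two paths. Below is a minimal sketch of doing that over the node-settings REST endpoint; the host, credentials, paths, and in particular the cbas_path form parameter are assumptions on my part rather than something confirmed in this ticket (couchbase-cli node-init exposes the equivalent setting).

          import requests

          # Hypothetical node address and credentials for illustration only.
          NODE = "http://cbas-node-1:8091"
          AUTH = ("Administrator", "password")

          def set_analytics_paths(paths):
              """Set the Analytics disk paths on a node before it (re)joins the cluster.

              Each configured path becomes one Analytics partition on that node, so
              passing two paths gives two partitions. The 'cbas_path' form parameter
              is an assumption about the 6.0-era node-settings endpoint; verify it
              against your build.
              """
              resp = requests.post(
                  f"{NODE}/nodes/self/controller/settings",
                  auth=AUTH,
                  data=[("cbas_path", p) for p in paths],  # repeated key, one entry per path
              )
              resp.raise_for_status()

          # Example: two partitions per CBAS node instead of eight.
          set_analytics_paths(["/data/cbas1", "/data/cbas2"])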
          murtadha.hubail Murtadha Hubail made changes -
          Assignee: Murtadha Hubail [ murtadha.hubail ] → Tanzeem Ahmed [ tanzeem.ahmed ]
          Resolution: Won't Fix [ 2 ]
          Status: In Progress [ 3 ] → Resolved [ 5 ]

          mihir.kamdar Mihir Kamdar (Inactive) added a comment -

          Tanzeem Ahmed, please try out the test with 2 partitions and reopen if needed. I am closing this one as per Murtadha's comments.
          mihir.kamdar Mihir Kamdar (Inactive) made changes -
          Status: Resolved [ 5 ] → Closed [ 6 ]

          People

            Assignee: Tanzeem Ahmed (Inactive)
            Reporter: Tanzeem Ahmed (Inactive)
            Votes: 0
            Watchers: 4
