  1. Couchbase Server
  2. MB-58244

KV rebalance hung

Details

    • Bug
    • Resolution: Duplicate
    • Critical
    • None
    • 7.2.1
    • couchbase-bucket

    Description

      Test Steps:

      1. 15 scopes and 60 collections across 3 buckets
        • Each bucket -> 10M data; 1 index on each collection -> 60 indexes (each index ~400k docs)
      2. Sleep for 15 mins
      3. Run FTS queries and FTS Flex queries in random fashion for 2 hrs
      4. Kill CBFT
      5. Sleep for 15 mins
      6. Scale out to 5 nodes
      7. Create 10 more FTS indexes with the following config on default scope and default index (500k docs per index):
        • 1 Index | 1 Replica | 5 Partitions
        • 3 Indexes | 2 Replicas | 6 Partitions
        • 6 Indexes | 0 Replicas | 4 Partitions
      8. Sleep for 15 mins
      9. Run FTS queries and FTS Flex queries in random fashion for 30 mins again
      10. Kill CBFT
      11. Sleep for 15 mins
      12. Rebalance/Scale in to 4 nodes
      13. Create 10 more FTS indexes with the following config:
        • 1 Index | 1 Replica | 5 Partitions
        • 3 Indexes | 2 Replicas | 6 Partitions
        • 6 Indexes | 0 Replicas | 4 Partitions
      14. Run FTS queries and FTS Flex queries in random fashion for 30 mins again
      15. Kill memcached
      16. Sleep for 15 mins
      17. Scale in back to 3 nodes
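
      As a rough sizing aid, the index batch created in steps 7 and 13 can be tallied as below. This is only a sketch: it assumes each replica holds a full copy of every partition of its index, and the names `IndexGroup`, `BATCH`, `total_indexes` and `total_partition_copies` are hypothetical helpers, not part of any Couchbase API.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class IndexGroup:
    """One line of the index config: N indexes | R replicas | P partitions."""
    indexes: int
    replicas: int
    partitions: int

    def partition_copies(self) -> int:
        # Assumption: every replica is a full copy of all partitions,
        # so each index contributes partitions * (1 active + replicas).
        return self.indexes * self.partitions * (1 + self.replicas)


# The batch described in the test steps (values taken from the report).
BATCH = [
    IndexGroup(indexes=1, replicas=1, partitions=5),
    IndexGroup(indexes=3, replicas=2, partitions=6),
    IndexGroup(indexes=6, replicas=0, partitions=4),
]


def total_indexes(groups: Iterable[IndexGroup]) -> int:
    return sum(g.indexes for g in groups)


def total_partition_copies(groups: Iterable[IndexGroup]) -> int:
    return sum(g.partition_copies() for g in groups)


if __name__ == "__main__":
    print("indexes:", total_indexes(BATCH))
    print("partition copies:", total_partition_copies(BATCH))
```

      Under that assumption, each batch of 10 indexes adds 10 + 54 + 24 = 88 partition copies that the FTS nodes have to host.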

      Test Logs: http://qe-jenkins1.sc.couchbase.com/job/cp-cli-fts-system-test/7/console


      KV rebalance appears to have hung.
      I suspect a sizing issue, since the default1 bucket drops to 0% RR, but even fewer docs cause the same behavior, as filed in MB-58014.

      On this bucket alone, the RAM used is 3.6GB/4GB, about 90% of the allocated memory. I am not sure why, because all buckets have the same size and amount of data, but this hints at an undersized cluster.
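
      The memory figure above works out as follows. This is a minimal sketch of the arithmetic only; `quota_used_percent` and `is_near_quota` are hypothetical helper names, and the 90% threshold is simply the figure quoted in this report, not a Couchbase default.

```python
def quota_used_percent(mem_used_gb: float, quota_gb: float) -> float:
    """Return memory used as a percentage of the bucket's allocated RAM."""
    if quota_gb <= 0:
        raise ValueError("quota_gb must be positive")
    return 100.0 * mem_used_gb / quota_gb


def is_near_quota(mem_used_gb: float, quota_gb: float,
                  threshold_percent: float = 90.0) -> bool:
    # threshold_percent is an assumption taken from the report's wording.
    return quota_used_percent(mem_used_gb, quota_gb) >= threshold_percent


if __name__ == "__main__":
    # Values reported for the default1 bucket: 3.6GB used out of 4GB.
    print(quota_used_percent(3.6, 4.0))
    print(is_near_quota(3.6, 4.0))
```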

      I am also seeing a node get failed over during this process:

      Failed over ['ns_1@svc-dqs-node-004.lie3v0iv5ulitlp.sandbox.nonprod-project-avengers.com']: ok
      failover 000
      ns_1@svc-dqs-node-001.lie3v0iv5ulitlp.sandbox.nonprod-project-avengers.com
      8:13:29 PM 10 Aug, 2023

      Please check whether this is actually a sizing problem or whether something else went wrong.
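
      To separate a slow rebalance from a hung one, the reported progress can be checked for movement over a time window. The sketch below is hypothetical: `is_stalled` is not a Couchbase function, and the `(timestamp, progress)` samples are assumed to come from whatever monitoring the test harness already collects.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, progress in percent)


def is_stalled(samples: Sequence[Sample], window_s: float) -> bool:
    """Return True if progress has not increased over the last window_s seconds.

    samples must be ordered by timestamp. Returns False when the rebalance
    has completed or when there is not yet a full window of history.
    """
    if not samples:
        return False
    last_t, last_p = samples[-1]
    if last_p >= 100.0:
        return False  # finished, not hung

    # Progress of the most recent sample that is at least window_s old.
    cutoff = last_t - window_s
    baseline: Optional[float] = None
    for t, p in samples:
        if t <= cutoff:
            baseline = p
        else:
            break
    if baseline is None:
        return False  # not enough history to decide
    return last_p <= baseline


if __name__ == "__main__":
    # Illustrative samples only, not data from this test run.
    flat = [(0, 10.0), (300, 42.0), (600, 42.0), (900, 42.0)]
    print(is_stalled(flat, window_s=600))
```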

       

          People

            sarthak.dua Sarthak Dua
            sarthak.dua Sarthak Dua