Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-46209

FTS rebalance-in appears to be hung

    XMLWordPrintable

Details

    Description

      Steps to Reproduce:
      1. Create a init cluster with 8 kv nodes and 10 fts nodes (total of 18 node init cluster)

      +----------------+----------+-----------------------+---------------+--------------+
      | Nodes          | Services | Version               | CPU           | Status       |
      +----------------+----------+-----------------------+---------------+--------------+
      | 172.23.105.175 | kv       | 7.0.0-5127-enterprise | 1.11375297209 | Cluster node |
      | 172.23.106.233 | ['kv']   |                       |               | <--- IN ---  |
      | 172.23.106.236 | ['kv']   |                       |               | <--- IN ---  |
      | 172.23.106.238 | ['kv']   |                       |               | <--- IN ---  |
      | 172.23.106.250 | ['kv']   |                       |               | <--- IN ---  |
      | 172.23.106.251 | ['kv']   |                       |               | <--- IN ---  |
      | 172.23.121.74  | ['kv']   |                       |               | <--- IN ---  |
      | 172.23.121.78  | ['kv']   |                       |               | <--- IN ---  |
      | 172.23.107.43  | ['fts']  |                       |               | <--- IN ---  |
      | 172.23.107.58  | ['fts']  |                       |               | <--- IN ---  |
      | 172.23.107.44  | ['fts']  |                       |               | <--- IN ---  |
      | 172.23.107.45  | ['fts']  |                       |               | <--- IN ---  |
      | 172.23.107.54  | ['fts']  |                       |               | <--- IN ---  |
      | 172.23.107.47  | ['fts']  |                       |               | <--- IN ---  |
      | 172.23.107.78  | ['fts']  |                       |               | <--- IN ---  |
      | 172.23.107.84  | ['fts']  |                       |               | <--- IN ---  |
      | 172.23.107.85  | ['fts']  |                       |               | <--- IN ---  |
      | 172.23.107.88  | ['fts']  |                       |               | <--- IN ---  |
      +----------------+----------+-----------------------+---------------+--------------+

      2. Create 15 Couchbase buckets, each bucket has 66 scopes, each scope has 1 collections. Each collection has 100 items

      +-------------------------------+-----------+----------+------------+-----+-------+------------+-----------+-----------+
      | Bucket                        | Type      | Replicas | Durability | TTL | Items | RAM Quota  | RAM Used  | Disk Used |
      +-------------------------------+-----------+----------+------------+-----+-------+------------+-----------+-----------+
      | 0SOXi1m-59-524000             | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 140282736 | 94626452  |
      | 3UIaV1M59btKMZAuKqa-59-531000 | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 140497120 | 96884102  |
      | 3_wu-59-516000                | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 140269200 | 92234044  |
      | 4bbj_BagLC0c-59-435000        | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 140523024 | 98157575  |
      | 8RqcdNTBrSlthPcW-59-404000    | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 140438736 | 94876811  |
      | Euje6l8r-59-472000            | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 140507056 | 99141383  |
      | LVPOp_6A-59-491000            | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 140478208 | 99772034  |
      | N6-59-462000                  | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 140304688 | 91587435  |
      | OIKAR18m1TE4vnkl-59-450000    | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 140315248 | 95236594  |
      | SKx8L-59-421000               | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 140357600 | 97739345  |
      | g2YWv-59-500000               | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 140354816 | 92845165  |
      | m36Ir0JPz2yTE5XN--59-384000   | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 140372880 | 95261388  |
      | msUJk-59-538000               | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 140431504 | 99952196  |
      | r-Jwfz_CSeo5MYe%B-59-481000   | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 137281232 | 98446038  |
      | x3is_DkH8-59-508000           | couchbase | 1        | none       | 0   | 6600  | 2516582400 | 140351760 | 93582118  |
      +-------------------------------+-----------+----------+------------+-----+-------+------------+-----------+-----------+
      

      3. Create 100 fts indexes each with partitons=1 on 100 collections using rest api

      2021-05-09 22:53:36,375 | test  | INFO    | MainThread | [Collections:create_fts_indexes:335] Creating 100 fts indexes 

      the last Rest api calls return at 

      2021-05-09 22:56:45,744 | test | INFO | MainThread | [Collections:test_volume_taf:643] 

      4. Do some crud operations on documents
      5. Rebalance-in a fts node (172.23.107.107) (so almost 4 minutes from the last rest call to create fts index)

      2021-05-09 22:57:04,427 | test  | INFO    | pool-2-thread-2 | [table_view:display:72] Rebalance Overview

      This rebalance got hung at 42% for a long time .

      Hardware and RAM quota
      RAM quota for FTS: 22 GiB
      Each node CPUs: 8
      Each node RAM: ~24GiB

      CPU, RAM usage, FTS indexes building progress (is at 100%) - all look healthy 
      So not sure of why the rebalance got hung
      Screenshots and logs attached

       

      Attachments

        1. indexes.png
          indexes.png
          433 kB
        2. Screen Shot 2021-05-10 at 2.37.04 PM.png
          Screen Shot 2021-05-10 at 2.37.04 PM.png
          29 kB
        3. servers.png
          servers.png
          444 kB
        4. task.json
          11 kB

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited

            From the logs, it roughly takes around 2 min to process an index movement (even when no partition movement is involved) and

            that extrapolates to 2 X 100 = 200 min and 3 hrs looks legit here.   (partition movement also won't take much time as this is only a ~100 docs here per index)

            Where that 2 min time goes into even when no partition movement is involved for many indexes?

            => Rebalance orchestrator still needs to evaluate the potential partition movement of every index.

            => Spin up a single supply and `N ` mover goroutines for all the nodes present in the cluster.  (12 of them in this case.)

            => Ceremoniously close all these mover/supply routines upon finding no/after all partition movements.

            So, it looks like there is no scope for further rebalance speed up optimizations here, and would expect the same time to be taken even in older builds too.

             

            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited From the logs, it roughly takes around 2 min to process an index movement (even when no partition movement is involved) and that extrapolates to 2 X 100 = 200 min and 3 hrs looks legit here.   (partition movement also won't take much time as this is only a ~100 docs here per index) Where that 2 min time goes into even when no partition movement is involved for many indexes? => Rebalance orchestrator still needs to evaluate the potential partition movement of every index. => Spin up a single supply and `N ` mover goroutines for all the nodes present in the cluster.  (12 of them in this case.) => Ceremoniously close all these mover/supply routines upon finding no/after all partition movements. So, it looks like there is no scope for further rebalance speed up optimizations here, and would expect the same time to be taken even in older builds too.  

            Thanks, Sreekanth.  I tried this test on an older build: 7.0.0-5121 (ns_server changes for purging metakv tombstones went in the build 5126), and like you expected the rebalance took the same time (~3hrs) as it did on 5127 build.
            Attaching snapshot link of logs for build 5121 for your ref: https://supportal.couchbase.com/snapshot/362ab5f76f9bf03b78c4949272358a96::0 

            From the POV of our script, since the rebalance-progress(42.105%) from /pools/default/tasks endpoint did not change for almost 3 hours, we flag it as "hung" as otherwise the progress number would have kept increasing. (Not sure if that is expected, perhaps a separate question for ns-server). So will modify the test to wait for a longer time for rebalance to finish depending on the number of FTS indexes we have in the cluster.

            sumedh.basarkod Sumedh Basarkod (Inactive) added a comment - - edited Thanks, Sreekanth.  I tried this test on an older build: 7.0.0-5121 (ns_server changes for purging metakv tombstones went in the build 5126), and like you expected the rebalance took the same time (~3hrs) as it did on 5127 build. Attaching snapshot link of logs for build 5121 for your ref: https://supportal.couchbase.com/snapshot/362ab5f76f9bf03b78c4949272358a96::0   From the POV of our script, since the rebalance-progress(42.105%) from /pools/default/tasks endpoint did not change for almost 3 hours, we flag it as "hung" as otherwise the progress number would have kept increasing. (Not sure if that is expected, perhaps a separate question for ns-server). So will modify the test to wait for a longer time for rebalance to finish depending on the number of FTS indexes we have in the cluster.

            Sumedh Basarkod, Is it possible to get the logs from the rebalance IN node alone in the above test?

            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - Sumedh Basarkod , Is it possible to get the logs from the rebalance IN node alone in the above test?

            Sure.  172.23.107.91 is the fts node that was rebalanced-in. Logs from this node (of build 5121):

            [s3://cb-customers-secure/mb-46209_build_5121/2021-05-11/collectinfo-2021-05-11t070546-ns_1@172.23.107.91.zip]

            sumedh.basarkod Sumedh Basarkod (Inactive) added a comment - Sure.  172.23.107.91 is the fts node that was rebalanced-in. Logs from this node (of build 5121): [s3://cb-customers-secure/mb-46209_build_5121/2021-05-11/collectinfo-2021-05-11t070546-ns_1@172.23.107.91.zip]
            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited

            Created another ticket for tracking the sub-optimum/unbalanced partition allocations here MB-46239

            Closing this ticket - as rebalance `hung` is no longer an issue here, and rebalance works as expected after closer inspection of the logs.

            Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited Created another ticket for tracking the sub-optimum/unbalanced partition allocations here  MB-46239 .  Closing this ticket - as rebalance `hung` is no longer an issue here, and rebalance works as expected after closer inspection of the logs.

            People

              sumedh.basarkod Sumedh Basarkod (Inactive)
              sumedh.basarkod Sumedh Basarkod (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty