Couchbase Server - MB-32068

Non-linear increase in rebalance time when the number of buckets is increased


Details

    Description

      Build 5.5.2-3733
      While doing multi-bucket tests, we observed that rebalance time increases non-linearly with respect to the number of buckets.
      Test setup:
      Number of nodes: initially 3, after rebalance 4
      Number of documents: 3 million (total documents across all buckets)
      Key size: ~40 bytes
      Average document size: 1,024 bytes
      Gets/sec, Sets/sec: 500, 500 (load applied to all buckets)
      Number of replicas: 1

      Test procedure:
      - Load documents into buckets
      - Start incremental load
      - After 15-20 minutes, rebalance in a KV node
      - After rebalance, continue the load for 15-20 minutes

      Observations: (rebalance timing results were attached as an image and are not reproduced here.)

      Logs are present at the end of the console output on the Jenkins job.


        Activity

          mahesh.mandhare Mahesh Mandhare (Inactive) created issue -
          drigby Dave Rigby made changes -
          Field Original Value New Value
          Issue Type Bug [ 1 ] Improvement [ 4 ]
          drigby Dave Rigby added a comment -

          Changed from Bug to Improvement: there is no current requirement for rebalance time to scale linearly above 10 buckets.


          shivani.gupta Shivani Gupta added a comment -

          Poonam Dhavale: FYI, this is the ticket for tracking the non-linear increase in rebalance time during the bucket scaling tests. Should I assign it to you?
          poonam Poonam Dhavale made changes -
          Assignee Dave Rigby [ drigby ] Poonam Dhavale [ poonam ]

           

          poonam Poonam Dhavale added a comment -

          Will take a look.
          poonam Poonam Dhavale made changes -
          Assignee Poonam Dhavale [ poonam ] Mahesh Mandhare [ mahesh.mandhare ]

           

          poonam Poonam Dhavale added a comment -

          Hi Mahesh,

          Logs for the 50 and 70 bucket tests are not available at the end of their console output. Can you please provide those?

          In the meantime, I will take a look at the 10 and 30 bucket logs.

          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

          For the 50 and 70 bucket tests, cbcollect timed out. Let me know if the 30 bucket test logs are not sufficient to debug the issue; I will rerun these tests and collect the logs manually.
          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Poonam Dhavale [ poonam ]

           

          poonam Poonam Dhavale added a comment -

          I have debugged the 30 bucket logs and have a theory on why rebalance is taking longer, but it is more of a hunch.

          The 50 and 70 bucket logs are needed to check whether the same pattern exists there. If it does, it will increase confidence in the theory.
          poonam Poonam Dhavale made changes -
          Assignee Poonam Dhavale [ poonam ] Mahesh Mandhare [ mahesh.mandhare ]
          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment - edited

          Here are the logs for the 50 bucket rebalance test rerun at http://perf.jenkins.couchbase.com/job/arke-multi-bucket/147

          https://s3.amazonaws.com/bugdb/jira/multi_bucket_50_reb/collectinfo-2018-11-29T114851-ns_1%40172.23.97.12.zip
          https://s3.amazonaws.com/bugdb/jira/multi_bucket_50_reb/collectinfo-2018-11-29T114851-ns_1%40172.23.97.13.zip
          https://s3.amazonaws.com/bugdb/jira/multi_bucket_50_reb/collectinfo-2018-11-29T114851-ns_1%40172.23.97.14.zip
          https://s3.amazonaws.com/bugdb/jira/multi_bucket_50_reb/collectinfo-2018-11-29T114851-ns_1%40172.23.97.15.zip

          For 70 buckets it takes around 6 hours to run the test; I will update the logs by next week.
          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Poonam Dhavale [ poonam ]

           

          poonam Poonam Dhavale added a comment -

          I have analyzed the 50 bucket logs and see a similar issue as in the 30 bucket logs.

          I have created a 5.5.2 toy build and fired BR_30bucket.test and BR_50bucket.test on the arke cluster.

          The toy build does not have a fix, but it will help isolate the problem.

           

          poonam Poonam Dhavale added a comment -

          I ran the multi-bucket perf tests on a couple of toy builds. I will post my findings tomorrow.

          Mahesh, in the meantime, can you please run the 30 bucket test on nodes with 48 vCPUs?

          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment - edited

          Poonam Dhavale, on the arke cluster the data nodes are 24-core machines.

          I tried using the themis cluster to run on 48 cores, and the rebalance time for 30 buckets looks similar.

          Here are the runs on the themis cluster:

          with 48 cores: http://perf.jenkins.couchbase.com/job/arke-multi-bucket/154

          with 24 cores: http://perf.jenkins.couchbase.com/job/arke-multi-bucket/155
          poonam Poonam Dhavale added a comment - - edited

           
          Thanks, Mahesh, for conducting the tests on 48-core nodes.

          Here is the root cause:

          • At the end of each vBucket move, the vBucket map is updated in ns_config.
          • Then ns_config_rep:ensure_config_seen_by_nodes() is called to make sure all nodes have seen this config update. The config for all buckets is stored under one "buckets" key in ns_config. So, even though we are just updating one chain in the vBucket map for one bucket, the config for all buckets is sent to the other nodes and synced/merged.
          • As the # of buckets increases, this ns_config sync step takes slightly longer. E.g., with 10 buckets, this step takes less than 50 ms; with 30 buckets, it takes ~100-200 ms.
          • This time adds up, and the rebalance of a bucket takes slightly longer. E.g., with 10 buckets, rebalance of each bucket takes ~20 s; with 30 buckets, ~50 s; and with 50 buckets, ~90 s.
          • So, rebalance of 30 buckets takes (30 * 50 s) = ~25 minutes, and so on.
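          The scaling above can be sketched with a tiny model (illustrative only; the per-bucket times are the rough figures quoted in this comment, not ns_server measurements):

```python
# Illustrative model, not Couchbase code: buckets are rebalanced one at a
# time, so total rebalance time is roughly num_buckets * per_bucket_time.
# The per-bucket time itself grows with the bucket count (config sync
# overhead), which is what makes the total non-linear.
def total_rebalance_minutes(num_buckets, per_bucket_seconds):
    """Total rebalance time when buckets are moved sequentially."""
    return num_buckets * per_bucket_seconds / 60.0

# Observed per-bucket times from the comment: ~20 s at 10 buckets,
# ~50 s at 30 buckets, ~90 s at 50 buckets.
observed = {10: 20, 30: 50, 50: 90}
for n, secs in observed.items():
    print(f"{n} buckets -> ~{total_rebalance_minutes(n, secs):.1f} min")
# The 30-bucket case reproduces the "(30 * 50 s) = ~25 minutes" figure.
```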

           
          Normally, we would not notice the time taken for config sync because other vBucket move steps, such as backfill/persistence, are generally the long poles. But in these tests there is so little data in the buckets that config transfer/sync takes longer than data movement.
           
          Here are the low-level details:

          • The orchestrator spawns an ns_vbucket_mover process to rebalance a bucket.
          • ns_vbucket_mover in turn spawns ns_single_vbucket_mover processes; each process moves one vBucket.
          • Once the vBucket move is complete, the child ns_single_vbucket_mover sends a "move_done" message to the parent ns_vbucket_mover process.
          • When the parent process receives the "move_done" message, it updates the vBucket map in ns_config. Then it schedules more vBucket moves.
          • In the logs, it was seen that the actual vBucket move completes very quickly (~100 ms) and the child sends the "move_done" message to the parent.
          • But on a 30 bucket cluster, it sometimes took as long as 800 ms for the parent to handle the move_done message. On a 50 bucket cluster, it sometimes took as long as 2 s.
          • Since rebalance is performed one bucket at a time, it was not clear initially why a higher # of buckets should cause slow handling of move_done by the parent ns_vbucket_mover process.
          • One possibility was Erlang scheduling. With each bucket creation, several new processes are created in the ns_server/Erlang VM, but the # of schedulers remains the same. To rule out Erlang scheduling as the factor, I ran the 30 bucket test with a toy build and also asked QA to run the test on 48-core systems. But these did not have any effect on the rebalance time.
          • So, I created another toy build with additional instrumentation.
          • From this it was found that ns_vbucket_mover starts processing the move_done message right away, but the processing takes slightly longer due to the ns_config sync reason mentioned earlier.
          • And since the parent process handles the done messages serially, the timing adds up.
          • E.g., say the moves for 10 vBuckets complete more or less at the same time. Each child process sends a done message to the parent. The parent process starts processing each done message serially. If it takes around 100 ms to process each done message, then it will take ~1 s before it processes the done message for the 10th vBucket.
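          That serial drain can be sketched as a toy queueing model (not ns_server code; the 100 ms figure and process behavior are from the bullets above):

```python
# Toy model of the serial "move_done" handling described above: if N child
# movers finish at roughly the same time and the parent processes each done
# message in `handle_ms` milliseconds, the i-th message is fully handled
# only after (i + 1) * handle_ms milliseconds of queueing plus processing.
def done_message_latencies(n_moves, handle_ms):
    """Completion time (ms) of each serially handled move_done message."""
    return [(i + 1) * handle_ms for i in range(n_moves)]

latencies = done_message_latencies(10, 100)
print(latencies[0], latencies[-1])  # first done at 100 ms, 10th at 1000 ms
```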

           
           
           
           
           

          dfinlay Dave Finlay added a comment -

          Thanks Poonam!

          Essentially, there is no bug here (though there might be an improvement). As the number of buckets increases, the following things happen:

          1. the size of the config increases, so the length of time to sync the config after each vBucket move increases
          2. the amount of data per vBucket decreases, so the proportion of time spent managing config for each vBucket increases

          So we should expect that as the bucket count increases with constant total data size, things will get worse.

          In this test the total data size is 3 GB (which is tiny). In the 50 bucket case, this means about 60 docs per vBucket, or 60 KB of data per vBucket. If the total cluster size were 300 GB we would see closer to linear scaling.
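          The per-vBucket figure follows from the test setup (a quick arithmetic check using the numbers in the ticket description):

```python
# Sanity check of the "about 60 docs per vBucket" figure: 3M documents
# spread over 50 buckets of 1024 vBuckets each, at ~1 KiB per document.
total_docs, buckets, vbuckets_per_bucket = 3_000_000, 50, 1024
docs_per_vbucket = total_docs / buckets / vbuckets_per_bucket
print(round(docs_per_vbucket))  # ~59 documents per vBucket ("about 60")
# At ~1 KiB per document, that is also roughly 59 KiB of data per vBucket.
```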


           

          poonam Poonam Dhavale added a comment -

          • I ran QA's 50 bucket perf test with num_vbuckets set to 256. Rebalance completed in 6.3 minutes. The same test with 1024 vBuckets takes ~81-84 minutes.
            • The total # of items in the buckets is the same in both tests.
            • Rebalance time with 1024 vBuckets is almost 14x that of the run with 256 vBuckets. One would have expected it to be around ~4x, given that the # of vBuckets is 4x.
          • I also ran the 30 bucket test with num_vbuckets set to 512. Rebalance completed in 7.4 minutes. The same test with 1024 vBuckets takes ~26 minutes.
          • Comparison of the 50 bucket test with 1024 vs 256 vBuckets: all metrics look better with 256 vBuckets.
            http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=arke_basic_552-3733_process_05c5&snapshot=arke_basic_552-3733_process_12a8&label=1024vb&label=256vb
            In the above metrics, couch_docs_data_size with 1024 vBuckets is almost twice that with 256 vBuckets. I will investigate why this is.
          • Reason for conducting the tests with a lower vBucket count:
            • As mentioned earlier, as the # of buckets increases, the config sync performed at the end of each vBucket move takes slightly longer, which then results in a longer rebalance time.
            • This config sync includes the payload for all buckets, and each bucket config contains the 1024-entry vBucket map.
            • So, I wondered what the effect of reducing the # of vBuckets would be for the 50 and 30 bucket tests. We know that rebalance will be fast because of the low vBucket count, but I also wanted to check the impact on the config sync.
            • It was seen that the config sync is also relatively fast (most likely because of the smaller vBucket map in each bucket config).
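          The "almost 14x" comparison above, computed from the quoted rebalance times (in minutes):

```python
# Ratio check for the reported rebalance times. In the 50 bucket test the
# vBucket count changed only 4x (256 -> 1024), yet time grew ~13-14x.
t50_1024, t50_256 = 84.0, 6.3  # minutes (using the ~84 minute upper bound)
print(round(t50_1024 / t50_256, 1))  # ~13.3x

# In the 30 bucket test the vBucket count changed 2x (512 -> 1024).
t30_1024, t30_512 = 26.0, 7.4  # minutes
print(round(t30_1024 / t30_512, 1))  # ~3.5x
```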
          poonam Poonam Dhavale added a comment -

          Links to the tests:
          50 buckets, 256 vBuckets: http://perf.jenkins.couchbase.com/job/arke-multi-bucket/168/
          30 buckets, 512 vBuckets: http://perf.jenkins.couchbase.com/job/arke-multi-bucket/169/
          shivani.gupta Shivani Gupta made changes -
          Labels Performance Performance bucket-scalability
          poonam Poonam Dhavale added a comment - - edited

           
          Also ran a couple more tests:

          70 buckets, 128 vBuckets: rebalance took 4.6 minutes (compared to 246 minutes with 1024 vBuckets). http://perf.jenkins.couchbase.com/job/arke-multi-bucket/171/

          30 buckets, 256 vBuckets: rebalance took 2.6 minutes. http://perf.jenkins.couchbase.com/job/arke-multi-bucket/170/
          drigby Dave Rigby made changes -
          Rank Ranked lower
          shivani.gupta Shivani Gupta made changes -
          Labels Performance bucket-scalability Performance bucket-scalability high-bucket-density
          lynn.straus Lynn Straus made changes -
          Fix Version/s Mad-Hatter [ 15037 ]
          lynn.straus Lynn Straus added a comment -

          Setting the initial fix version to Mad Hatter so that investigation occurs in the MH timeframe. Please update the fix version once investigation completes.

          drigby Dave Rigby made changes -
          Component/s couchbase-bucket [ 10173 ]
          dfinlay Dave Finlay made changes -
          Fix Version/s Cheshire-Cat [ 15915 ]
          Fix Version/s Mad-Hatter [ 15037 ]
          poonam Poonam Dhavale made changes -
          Assignee Poonam Dhavale [ poonam ] Ajit Yagaty [ ajit.yagaty ]
          wayne Wayne Siu made changes -
          Link This issue is triggering CBPS-754 [ CBPS-754 ]
          dfinlay Dave Finlay made changes -
          Assignee Ajit Yagaty [X] [ ajit.yagaty ] Dave Finlay [ dfinlay ]
          dfinlay Dave Finlay made changes -
          Assignee Dave Finlay [ dfinlay ] Artem Stemkovski [ artem ]
          lynn.straus Lynn Straus made changes -
          Fix Version/s CheshireCat.Next [ 16908 ]
          Fix Version/s Cheshire-Cat [ 15915 ]
          Labels Performance bucket-scalability high-bucket-density Performance bucket-scalability deferred-from-Cheshire-Cat high-bucket-density
          dfinlay Dave Finlay made changes -
          Fix Version/s Cheshire-Cat [ 15915 ]
          Fix Version/s CheshireCat.Next [ 16908 ]
          dfinlay Dave Finlay made changes -
          Resolution Fixed [ 1 ]
          Status Open [ 1 ] Resolved [ 5 ]
          wayne Wayne Siu made changes -
          Assignee Artem Stemkovski [ artem ] Bo-Chun Wang [ bo-chun.wang ]
          bo-chun.wang Bo-Chun Wang added a comment - - edited

          I have finished runs with different numbers of buckets. Each node has 24 CPU cores and 64 GB memory. The memory quota is 45 GB per node.

          Build 7.0.0-4554 (CPU utilization measured on 172.23.99.157):

          10 buckets: rebalance 1.3 min; CPU 23% during rebalance, 8% without
            http://perf.jenkins.couchbase.com/job/themis-multibucket-kv-rebalance/6/
            http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_700-4554_rebalance_7842
          30 buckets: rebalance 3.9 min; CPU 28% during rebalance, 10% without
            http://perf.jenkins.couchbase.com/job/themis-multibucket-kv-rebalance/7/
            http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_700-4554_rebalance_a9d0
          50 buckets: rebalance 7.8 min; CPU 30% during rebalance, 13% without
            http://perf.jenkins.couchbase.com/job/themis-multibucket-kv-rebalance/8/
            http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_700-4554_rebalance_4559
          70 buckets: rebalance 17.5 min; CPU 30% during rebalance, 18% without
            http://perf.jenkins.couchbase.com/job/themis-multibucket-kv-rebalance/9/
            http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_700-4554_rebalance_e24c
          100 buckets: rebalance 50.0 min; CPU 30% during rebalance, 20% without
            http://perf.jenkins.couchbase.com/job/themis-multibucket-kv-rebalance/10/
            http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_700-4554_rebalance_b15d

          Build 7.0.0-4797 (CPU utilization measured on 172.23.99.157):

          10 buckets: rebalance 1.3 min; CPU 23% during rebalance, 8% without
            http://perf.jenkins.couchbase.com/job/themis-multibucket-kv-rebalance/11/
            http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_700-4797_rebalance_d30e
          30 buckets: rebalance 3.8 min; CPU 29% during rebalance, 10% without
            http://perf.jenkins.couchbase.com/job/themis-multibucket-kv-rebalance/12/
            http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_700-4797_rebalance_0d87
          50 buckets: rebalance 7.5 min; CPU 32% during rebalance, 15% without
            http://perf.jenkins.couchbase.com/job/themis-multibucket-kv-rebalance/13/
            http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_700-4797_rebalance_5209
          70 buckets: rebalance 16.0 min; CPU 30% during rebalance, 18% without
            http://perf.jenkins.couchbase.com/job/themis-multibucket-kv-rebalance/14/
            http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_700-4797_rebalance_c912
          100 buckets: rebalance 45.7 min; CPU 30% during rebalance, 20% without
            http://perf.jenkins.couchbase.com/job/themis-multibucket-kv-rebalance/15/
            http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=themis_700-4797_rebalance_c545

          bo-chun.wang Bo-Chun Wang added a comment -

          I am closing this issue, as the current performance results show significant improvement.
          bo-chun.wang Bo-Chun Wang made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          lynn.straus Lynn Straus made changes -
          Fix Version/s 7.0.0 [ 17233 ]
          lynn.straus Lynn Straus made changes -
          Fix Version/s Cheshire-Cat [ 15915 ]

          People

            bo-chun.wang Bo-Chun Wang
            mahesh.mandhare Mahesh Mandhare (Inactive)