  Couchbase Server / MB-32645

[high-bucket] High CPU utilisation during kv rebalance


Details

    Description

      Build 6.0.0-1693

      As discussed in the high bucket density sync-up meeting, logging this issue for investigation.
      Observed CPU utilisation spikes of up to 80% on the 24-core orchestrator machine during KV rebalance with 30 buckets present in the cluster.

      CPU utilisation graph during rebalance:

      cbmonitor link- http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=arke_basic_600-1693_run_kv_rebalance_dd46

      Logs: 

      KV node- https://s3.amazonaws.com/bugdb/jira/index_reb_multibucket/collectinfo-2019-01-08T151840-ns_1%40172.23.97.12.zip
      KV node- https://s3.amazonaws.com/bugdb/jira/index_reb_multibucket/collectinfo-2019-01-08T151840-ns_1%40172.23.97.13.zip
      KV node- https://s3.amazonaws.com/bugdb/jira/index_reb_multibucket/collectinfo-2019-01-08T151840-ns_1%40172.23.97.14.zip

      Attachments

        1. 1_bucket_ns_server.png (496 kB)
        2. 1_bucket.png (667 kB)
        3. 10_bucket_ns_server.png (472 kB)
        4. 10_bucket.png (411 kB)
        5. 30_buckets_kv_cpu_util.png (51 kB)
        6. 30_buckets_ns_server.png (559 kB)
        7. 30_buckets.png (592 kB)
        8. 5_buckets_ns_server.png (610 kB)
        9. 5_buckets.png (509 kB)
        10. 6.6.2_CPU.png (497 kB)
        11. 7.0.0_CPU.png (282 kB)
        12. 8cores_24cores.png (411 kB)
        13. eventing.png (532 kB)
        14. fts.png (372 kB)
        15. image-2019-01-15-12-11-15-521.png (640 kB)
        16. index_query_15.png (465 kB)
        17. index_query_19.png (655 kB)
        18. index_query_20.png (554 kB)
        19. kv_12.png (588 kB)
        20. kv_13.png (621 kB)
        21. kv_14.png (577 kB)
        22. new_30_bucket_ns_server.png (610 kB)
        23. new_30_bucket.png (640 kB)
        24. oc_and_cbas.png (463 kB)
        25. orchestrator_vs_fts_only_node.png (611 kB)
        26. Screen Shot 2019-01-15 at 09.23.26.png (90 kB)

        Activity

          mahesh.mandhare Mahesh Mandhare (Inactive) created issue -
          drigby Dave Rigby made changes -
          Field Original Value New Value
          Attachment Screen Shot 2019-01-15 at 09.23.26.png [ 63313 ]
          drigby Dave Rigby added a comment -

          Same comments as per MB-32642 - Could you give a bit more background on this issue? You've marked it as a bug, but this just sounds like an observation that CPU goes up during a rebalance.

          As such, that doesn't sound like a bug to me (possibly an improvement?) - unless it's a regression from some previous build.

          Note also that the memcached %CPU on the node in question (97.12) is pretty flat during the rebalance:

          As such, it's not clear to me this is actually a KV-Engine issue.

          If you think this is a bug, please update it to include the expected behaviour and the actual behaviour. If not, then change it to an improvement, making clear what you think should be improved.

          drigby Dave Rigby made changes -
          Assignee Dave Rigby [ drigby ] Mahesh Mandhare [ mahesh.mandhare ]
          drigby Dave Rigby made changes -
          Component/s ns_server [ 10019 ]
          Component/s couchbase-bucket [ 10173 ]
          drigby Dave Rigby made changes -
          Component/s couchbase-bucket [ 10173 ]
          Component/s ns_server [ 10019 ]
          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Component/s ns_server [ 10019 ]
          Component/s couchbase-bucket [ 10173 ]

          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

          Assigning it to Poonam Dhavale.

          As seen in the cbmonitor link, on the orchestrator node (172.23.97.12) we see CPU utilisation spikes above 50%.

          beam.smp CPU usage on the orchestrator also spikes while the rebalance is in progress.

          Discussed this with Shivani Gupta and Dave Finlay; they suggested logging the issue to investigate whether this is expected.

          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Poonam Dhavale [ poonam ]
          lynn.straus Lynn Straus made changes -
          Fix Version/s Mad-Hatter [ 15037 ]
          lynn.straus Lynn Straus added a comment -

          Setting the initial fix version to Mad Hatter so that investigation occurs in the MH timeframe. Please update the fix version once the investigation completes.


          poonam Poonam Dhavale added a comment -

          Hi Mahesh,

          The links to the logs are not working.

          poonam Poonam Dhavale made changes -
          Assignee Poonam Dhavale [ poonam ] Mahesh Mandhare [ mahesh.mandhare ]

          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

          Poonam Dhavale, looks like they got archived; I will upload new logs when I run the test next time.

          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Poonam Dhavale [ poonam ]

          poonam Poonam Dhavale added a comment -

          Hi Mahesh,

          The links to the new logs are also not working.

          poonam Poonam Dhavale made changes -
          Assignee Poonam Dhavale [ poonam ] Mahesh Mandhare [ mahesh.mandhare ]
          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Poonam Dhavale [ poonam ]

           

          poonam Poonam Dhavale added a comment -

          Hi Mahesh Mandhare, the logs are not accessible. Please keep them around for longer.

          poonam Poonam Dhavale made changes -
          Assignee Poonam Dhavale [ poonam ] Mahesh Mandhare [ mahesh.mandhare ]
          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

          Logged ticket CBIT-15167 to increase log retention, but it is yet to be processed. I had local copies of the logs and have uploaded them again at:

          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-10T103209-ns_1@172.23.96.20.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-10T103209-ns_1@172.23.97.14.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-10T103209-ns_1@172.23.97.20.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-10T103209-ns_1@172.23.96.23.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-10T103209-ns_1@172.23.97.15.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-10T103209-ns_1@172.23.97.12.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-10T103209-ns_1@172.23.97.177.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-10T103209-ns_1@172.23.97.13.zip
          https://s3.amazonaws.com/bugdb/jira/hbd_3/collectinfo-2019-04-10T103209-ns_1@172.23.97.19.zip
          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Poonam Dhavale [ poonam ]
          poonam Poonam Dhavale added a comment - - edited

          • Node 96.23 is the orchestrator during the rebalances in the latest logs. It is a cbas-only node.
          • The overall cpu_utilization_rate and ns_server/cpu_utilization on the orchestrator/cbas node are similar to those on the fts-only node (96.20).
            • This indicates that the extra duties performed by the orchestrator during rebalance are not increasing the CPU utilization (at least not significantly).
          • The ns_server CPU utilization on these two nodes stays more or less within a certain band, but once in a while it spikes. The FTS node shows more spikes than the orchestrator/cbas node.
            • ns_server on the FTS-only node does not have much work to do during the KV phase of rebalance other than synchronizing ns_config.
            • I checked the ns_server log on the FTS node around the times the CPU utilization spikes, but it does not show any activity other than synchronization of ns_config.
            • I also checked whether there is any correlation between the cpu_utilization spikes on the FTS node and the time taken to synchronize the config (as displayed in the "Fully synchronized config …" message), i.e. whether ns_server CPU utilization spikes when the node takes longer to synchronize the config. I did not find any such correlation.
          • The ns_server CPU utilization is higher on the KV nodes during KV rebalance than on the orchestrator/cbas and other non-KV nodes. This is expected.
          • It is also expected that ns_server CPU utilization on the KV nodes will be higher with more buckets.
            • The higher the bucket count, the more processes are running on the system, the more stats are being collected, and so on.

          Mahesh, if possible, can you please run the 3 -> 3 swap rebalance test on this exact cluster configuration with the following numbers of buckets:

          • 1 bucket
          • 5 buckets
          • 10 buckets
          • 30 buckets

          I would like to compare ns_server's cpu_utilization on the KV nodes as the number of buckets grows.

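          For illustration only (not from this ticket): a minimal sketch of how a 3 -> 3 swap rebalance like the one requested above could be driven through ns_server's REST API. The actual tests are run through perfrunner; the host addresses and credentials below are placeholders.

            # Illustrative sketch only - the real tests are driven by perfrunner.
            # Adds the incoming KV nodes, then rebalances the outgoing ones out.
            import requests

            AUTH = ("Administrator", "password")   # placeholder credentials
            CLUSTER = "http://172.23.97.12:8091"   # any node in the cluster

            def swap_rebalance(nodes_in, nodes_out):
                # 1. Add the new nodes to the cluster (KV service only).
                for host in nodes_in:
                    requests.post(f"{CLUSTER}/controller/addNode", auth=AUTH,
                                  data={"hostname": host, "user": AUTH[0],
                                        "password": AUTH[1], "services": "kv"}
                                  ).raise_for_status()

                # 2. Start a rebalance that ejects the old nodes.
                nodes = requests.get(f"{CLUSTER}/pools/default", auth=AUTH).json()["nodes"]
                known = [n["otpNode"] for n in nodes]
                ejected = [n for n in known if any(h in n for h in nodes_out)]
                requests.post(f"{CLUSTER}/controller/rebalance", auth=AUTH,
                              data={"knownNodes": ",".join(known),
                                    "ejectedNodes": ",".join(ejected)}
                              ).raise_for_status()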
          poonam Poonam Dhavale made changes -
          Assignee Poonam Dhavale [ poonam ] Mahesh Mandhare [ mahesh.mandhare ]
          poonam Poonam Dhavale made changes -
          Attachment orchestrator_vs_fts_only_node.png [ 67371 ]
          Attachment fts.png [ 67372 ]
          Attachment index_query_20.png [ 67373 ]
          Attachment index_query_19.png [ 67374 ]
          Attachment eventing.png [ 67375 ]
          Attachment index_query_15.png [ 67376 ]
          Attachment kv_14.png [ 67377 ]
          Attachment kv_13.png [ 67378 ]
          Attachment kv_12.png [ 67379 ]
          Attachment oc_and_cbas.png [ 67380 ]
          mahesh.mandhare Mahesh Mandhare (Inactive) added a comment -

          Build 6.5.0-3274

          Here are the job details for the 3 -> 3 swap rebalance test on this exact cluster configuration with the following numbers of buckets; logs are present at the end of the console output:

          1 bucket
          Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/296/
          cbmonitor link- http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=arke_basic_650-3274_run_kv_rebalance_3de8

          5 buckets
          Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/302
          cbmonitor link- http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=arke_basic_650-3274_run_kv_rebalance_d6b7

          10 buckets
          Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/298/
          cbmonitor link- http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=arke_basic_650-3274_run_kv_rebalance_2b4a

          30 buckets
          Job- http://perf.jenkins.couchbase.com/job/arke-multi-bucket/304/
          cbmonitor link- http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=arke_basic_650-3274_run_kv_rebalance_8f1a

          Logs:
          https://s3.amazonaws.com/bugdb/jira/hbd-cpu-usage1/collectinfo-2019-05-30T072722-ns_1%40172.23.97.12.zip
          https://s3.amazonaws.com/bugdb/jira/hbd-cpu-usage1/collectinfo-2019-05-30T072722-ns_1%40172.23.97.13.zip
          https://s3.amazonaws.com/bugdb/jira/hbd-cpu-usage1/collectinfo-2019-05-30T072722-ns_1%40172.23.97.14.zip
          mahesh.mandhare Mahesh Mandhare (Inactive) made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Poonam Dhavale [ poonam ]
          poonam Poonam Dhavale made changes -
          Attachment 30_buckets_ns_server.png [ 68545 ]
          Attachment 30_buckets.png [ 68546 ]
          Attachment 10_bucket_ns_server.png [ 68547 ]
          Attachment 10_bucket.png [ 68548 ]
          Attachment 5_buckets_ns_server.png [ 68549 ]
          Attachment 5_buckets.png [ 68550 ]
          Attachment 1_bucket_ns_server.png [ 68551 ]
          Attachment 1_bucket.png [ 68552 ]
          poonam Poonam Dhavale made changes -
          Attachment new_30_bucket.png [ 68556 ]
          Attachment new_30_bucket_ns_server.png [ 68557 ]
          poonam Poonam Dhavale added a comment - - edited

          I wanted to analyze ns_server/cpu_utilization on the KV nodes with an increasing number of buckets – 1, 5, 10 and 30.

          There are two aspects:

          • What is the ns_server/cpu_utilization on the KV nodes when rebalance is not running? Does it increase linearly with the number of buckets?
            • Based on the analysis below, it appears that ns_server/cpu_utilization increases more or less linearly with the number of buckets.
          • What is the ns_server/cpu_utilization on the KV nodes when rebalance is running?
            • We expect to see high CPU utilization during rebalance, but it should not increase linearly with the number of buckets because rebalance is done only one bucket at a time.
            • Based on the analysis below, ns_server/cpu_utilization on the KV nodes does not appear to increase linearly with the number of buckets.
            • However, there are occasionally significant spikes in the 30-bucket case which should be investigated.

          I think this ticket can be moved to CC, but I will let Ajit & DaveF decide.

          QA folks, please keep the logs around.

          Here is my analysis; I have attached the screenshots.

          KV node 12 is the orchestrator.

          3 buckets (1 + 2 eventing):

          • When rebalance is not running: ns_server/cpu_utilization on orchestrator node 12 is ~80% and on the other KV nodes is ~20%.
          • When rebalance is running: ns_server/cpu_utilization on all KV nodes spikes to ~500%.

          7 buckets (5 + 2 eventing):

          • When rebalance is not running: ns_server/cpu_utilization on orchestrator node 12 is ~80%, on node 14 is ~70% and on node 13 is ~40%.
          • When rebalance is running: ns_server/cpu_utilization on all KV nodes spikes to ~500-600%.

          12 buckets (10 + 2 eventing):

          • When rebalance is not running: ns_server/cpu_utilization on orchestrator node 12 is ~140%, on node 14 is ~120% and on node 13 is ~100%.
          • When rebalance is running: ns_server/cpu_utilization on orchestrator node 12 spikes to ~750%; on the other KV nodes it is ~600%.

          32 buckets (30 + 2 eventing):

          • When rebalance is not running: ns_server/cpu_utilization on orchestrator node 12 is ~500-600%; on the other 2 KV nodes it is ~300-400%.
          • When rebalance is running: ns_server/cpu_utilization on all KV nodes is ~600% and occasionally spikes to ~1K-2K%.

          poonam Poonam Dhavale made changes -
          Assignee Poonam Dhavale [ poonam ] Ajit Yagaty [ ajit.yagaty ]

          ajit.yagaty Ajit Yagaty [X] (Inactive) added a comment -

          Mahesh Mandhare - Can you please re-run the test on the latest MH build?

          ajit.yagaty Ajit Yagaty [X] (Inactive) made changes -
          Assignee Ajit Yagaty [ ajit.yagaty ] Mahesh Mandhare [ mahesh.mandhare ]
          lynn.straus Lynn Straus made changes -
          Fix Version/s Cheshire-Cat [ 15915 ]
          Fix Version/s Mad-Hatter [ 15037 ]
          Labels high-bucket-density deferred-from-Mad-Hatter high-bucket-density
          raju Raju Suravarjjala made changes -
          Assignee Mahesh Mandhare [ mahesh.mandhare ] Wayne Siu [ wayne ]
          wayne Wayne Siu made changes -
          Labels deferred-from-Mad-Hatter high-bucket-density deferred-from-Mad-Hatter high-bucket-density performance
          wayne Wayne Siu made changes -
          Issue Type Bug [ 1 ] Task [ 3 ]

          meni.hillel Meni Hillel (Inactive) added a comment -

          Dave Finlay, are we at the point of readiness in CC that we can rerun this test?

          dfinlay Dave Finlay added a comment -

          Yes, I think so, Meni. It would be good to get a rerun of the 30 bucket test and see how things have changed in CC.

          Wayne Siu: I think we run the 30 bucket test periodically, correct?

          wayne Wayne Siu added a comment -

          Meni Hillel

          Yes, we are planning to rerun the 30-buckets tests this month (March).

          meni.hillel Meni Hillel (Inactive) made changes -
          Component/s performance [ 10222 ]
          wayne Wayne Siu made changes -
          Summary High CPU utilisation during kv rebalance [high-bucket] High CPU utilisation during kv rebalance
          bo-chun.wang Bo-Chun Wang made changes -
          Attachment 30_buckets_kv_cpu_util.png [ 132051 ]
          bo-chun.wang Bo-Chun Wang added a comment - - edited

          We finished a 30-bucket run on build 7.0.0-4678.

          Job: http://perf.jenkins.couchbase.com/job/themis_multibucket/68/

          cbmonitor links:

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=source_cluster_700-4678_run_kv_rebalance_455d
          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=destination_bucket_700-4678_run_kv_rebalance_b576

          Logs:

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-68/172.23.96.19.zip
          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-68/172.23.96.20.zip
          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-68/172.23.96.23.zip
          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-68/172.23.97.15.zip
          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-68/172.23.97.177.zip
          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-68/172.23.99.157.zip
          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-68/172.23.99.158.zip
          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-68/172.23.99.159.zip
          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-68/172.23.99.160.zip
          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-68/172.23.99.161.zip

          When rebalance is not running, the CPU utilization is about 10%. The CPU utilization increases to 30% during rebalance.

          meni.hillel Meni Hillel (Inactive) added a comment -

          Bo-Chun Wang - Thanks for running the test. It seems we got good results. Assuming this is satisfactory, we can now close this task.

          meni.hillel Meni Hillel (Inactive) made changes -
          Resolution Done [ 6 ]
          Status Open [ 1 ] Closed [ 6 ]
          meni.hillel Meni Hillel (Inactive) made changes -
          Assignee Wayne Siu [ wayne ] Meni Hillel [ JIRAUSER25407 ]
          Resolution Done [ 6 ]
          Status Closed [ 6 ] Reopened [ 4 ]

          meni.hillel Meni Hillel (Inactive) added a comment -

          Want to review the results a bit more closely.

          dfinlay Dave Finlay added a comment -

          Thanks Bo-Chun. Few questions:

          I see that the test was as follows:

          • 1 Analytics node
          • 2 Indexing & Query nodes, colocated
          • 1 FTS node
          • 1 Eventing node
          • 4 KV nodes

          Some of the nodes have 48 cores, some 24.

          Can you remind me how much data you loaded in each bucket?
          Was this the same test with similar servers (I'm thinking in particular in terms of CPU) that was run earlier?
          Do we have any of the logs / cbmonitor graphs from the prior runs to allow us to compare the two runs?

          dfinlay Dave Finlay made changes -
          Assignee Meni Hillel [ JIRAUSER25407 ] Bo-Chun Wang [ bo-chun.wang ]
          bo-chun.wang Bo-Chun Wang added a comment - - edited

          Dave Finlay
          We used a different cluster to do this run. Therefore, we don't have results from old builds on the same cluster. The cbmonitor links from the runs Mahesh did are gone. I will re-run the same test with 6.6.2-9556 on this cluster.

          The KV nodes have 24 cores, and the other nodes have 48 cores.

          We loaded 1M x 1KB docs in each bucket.

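          For context on the "1M x 1KB docs in each bucket" workload, a minimal illustrative sketch (not from this ticket) of loading a comparable data set with the Couchbase Python SDK; the real load is generated by perfrunner, and the host, credentials and bucket name below are placeholders.

            # Illustrative sketch only - the real workload is generated by perfrunner.
            # Upserts num_docs documents of roughly 1 KB each into one bucket.
            from couchbase.auth import PasswordAuthenticator
            from couchbase.cluster import Cluster
            from couchbase.options import ClusterOptions

            def load_bucket(bucket_name, num_docs=1_000_000, doc_size=1024):
                cluster = Cluster("couchbase://172.23.97.12",   # placeholder host
                                  ClusterOptions(PasswordAuthenticator("Administrator", "password")))
                collection = cluster.bucket(bucket_name).default_collection()
                payload = "x" * (doc_size - 100)                # ~1 KB value body
                for i in range(num_docs):
                    collection.upsert(f"doc-{i:07d}", {"id": i, "payload": payload})

            load_bucket("bucket-1")                             # placeholder bucket name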
          dfinlay Dave Finlay added a comment -

          Bo-Chun

          That is great, thank you. The results already look pretty decent in absolute terms, but it'll be much more useful to compare against 6.6.x.

          -dave

          bo-chun.wang Bo-Chun Wang made changes -
          Attachment 7.0.0_CPU.png [ 132322 ]
          bo-chun.wang Bo-Chun Wang made changes -
          Attachment 6.6.2_CPU.png [ 132323 ]
          bo-chun.wang Bo-Chun Wang added a comment - - edited

          I finished a run with build 6.6.2-9556. Compared to 6.6.2, the run on 7.0.0 has lower CPU utilization, and the rebalance time is lower, too. 

          Job: http://perf.jenkins.couchbase.com/job/themis_multibucket/70/ 

          cbmonitor link: 

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=source_cluster_662-9556_run_kv_rebalance_7e53

          Logs:

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-70/172.23.96.19.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-70/172.23.96.20.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-70/172.23.96.23.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-70/172.23.97.15.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-70/172.23.97.177.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-70/172.23.99.158.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-70/172.23.99.159.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-70/172.23.99.160.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-70/172.23.99.161.zip

           

          7.0.0

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=source_cluster_700-4678_run_kv_rebalance_455d#2bdebde23ffe4f80a3fd554f4918b6ea

          6.6.2

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=source_cluster_662-9556_run_kv_rebalance_7e53#2bdebde23ffe4f80a3fd554f4918b6ea

           

          dfinlay Dave Finlay added a comment -

          Thanks Bo-Chun - this is great. NS-server memory is lower too. I think we can definitely close this ticket.

          One question: can you remind me how to plot both of these cbmonitor reports on the same graphs?

          bo-chun.wang Bo-Chun Wang added a comment - - edited

          If you connect two snapshots with "&", it will plot both reports on the same graphs:

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=source_cluster_662-9556_run_kv_rebalance_7e53&snapshot=source_cluster_700-4678_run_kv_rebalance_455d
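          A small convenience sketch (not from this ticket) of building such comparison URLs programmatically, following the repeated snapshot/label query-parameter pattern used in the links on this ticket; the snapshot names are simply the ones already posted above.

            # Illustrative sketch only: compose a cbmonitor comparison URL by
            # repeating the "snapshot" (and optional "label") query parameters.
            from urllib.parse import urlencode

            CBMONITOR = "http://cbmonitor.sc.couchbase.com/reports/html/"

            def comparison_url(snapshots):
                """snapshots: list of (snapshot_name, label_or_None) tuples."""
                params = []
                for name, label in snapshots:
                    params.append(("snapshot", name))
                    if label:
                        params.append(("label", label))
                return CBMONITOR + "?" + urlencode(params)

            print(comparison_url([
                ("source_cluster_662-9556_run_kv_rebalance_7e53", None),
                ("source_cluster_700-4678_run_kv_rebalance_455d", None),
            ]))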
          dfinlay Dave Finlay made changes -
          Resolution Fixed [ 1 ]
          Status Reopened [ 4 ] Resolved [ 5 ]

          shivani.gupta Shivani Gupta added a comment -

          This looks promising.

          Two questions:

          • We gave rough guidance in 6.5 to set aside 0.4 core as overhead for each bucket (without any frontend workload). This meant that for 30 buckets to be operationally stable (rebalance included), they should set aside 12 cores as overhead. Does this guidance need to change now that we are seeing lower CPU consumption?
          • Can the guidance on the number of buckets be considered for upward revision? I think the answer to this is no, because rebalance times will still grow exponentially with the number of buckets.

          shivani.gupta Shivani Gupta added a comment -

          Bo-Chun Wang, we discussed the findings of this ticket to see if we can revise down the guidance on per-bucket CPU overhead (based on 6.5 tests it was set to 0.4 core per bucket).

          There is one more test we would like to have results for. Can you please run the same 30 bucket test on 7.0 with the CPU for Data Service nodes limited to 8 cores? I believe the data service nodes are 24 core machines, but can you limit them to 8 cores only? Don't change anything on the Index/Query nodes. Thanks much for running these tests.

          cc Dave Finlay Wayne Siu

          shivani.gupta Shivani Gupta added a comment - - edited

          Bo-Chun Wang, pinging you again on this. Is this something you can do?

          There is one more test we would like to have results for. Can you please run the same 30 bucket test on 7.0 with the CPU for Data Service nodes limited to 8 cores? I believe the data service nodes are 24 core machines, but can you limit them to 8 cores only? Don't change anything on the Index/Query nodes. Thanks much for running these tests.

          bo-chun.wang Bo-Chun Wang added a comment -

          Shivani Gupta

          It's possible. However, to limit CPU for the data service nodes without touching other nodes, I have to do some settings manually. We are running 6.6.2 and 7.0 weekly runs right now, so the cluster is busy. I will do it later this week after we finish the weekly runs.

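          The ticket does not record how the data-service nodes were limited to 8 cores; one way such a restriction can be approximated on Linux is to pin the data-service processes to a subset of CPUs, as in this illustrative sketch (requires root and the third-party psutil package; the process names are assumptions).

            # Illustrative sketch only: pin Couchbase data-service processes to CPUs 0-7
            # so the node behaves roughly like an 8-core machine.
            import psutil

            ALLOWED_CPUS = list(range(8))                   # cores 0-7 only
            DATA_SERVICE_PROCS = {"memcached", "beam.smp"}  # assumed process names

            def limit_data_service_cpus():
                for proc in psutil.process_iter(["name"]):
                    if proc.info["name"] in DATA_SERVICE_PROCS:
                        try:
                            proc.cpu_affinity(ALLOWED_CPUS)  # restrict scheduling
                            print(f"pinned pid={proc.pid} ({proc.info['name']})")
                        except (psutil.AccessDenied, psutil.NoSuchProcess):
                            pass                             # permission issues / process races

            if __name__ == "__main__":
                limit_data_service_cpus()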

          Thanks Bo-Chun Wang.

          bo-chun.wang Bo-Chun Wang made changes -
          Attachment 8cores_24cores.png [ 135753 ]

          bo-chun.wang Bo-Chun Wang added a comment -

          Shivani Gupta

          I have finished a 30-bucket run with 7.0.0-4678. The number of CPU cores on data service nodes is limited to 8 cores.

          Job: http://perf.jenkins.couchbase.com/job/themis_multibucket/83/

          Logs:

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-83/172.23.96.15.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-83/172.23.96.19.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-83/172.23.96.20.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-83/172.23.96.23.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-83/172.23.97.177.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-83/172.23.99.157.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-83/172.23.99.158.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-83/172.23.99.159.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-83/172.23.99.160.zip

          https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis_multibucket-83/172.23.99.161.zip

          cbmonitor link: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=source_cluster_700-4678_run_kv_rebalance_1df8

           

          Comparison between 8 cores and 24 cores:

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=source_cluster_700-4678_run_kv_rebalance_455d&label=24cores&snapshot=source_cluster_700-4678_run_kv_rebalance_1df8&label=8cores

           


          Thanks Bo-Chun Wang, this is very helpful.

          lynn.straus Lynn Straus made changes -
          Fix Version/s 7.0.0 [ 17233 ]
          lynn.straus Lynn Straus made changes -
          Fix Version/s Cheshire-Cat [ 15915 ]

          People

            Assignee: bo-chun.wang Bo-Chun Wang
            Reporter: mahesh.mandhare Mahesh Mandhare (Inactive)
            Votes: 0
            Watchers: 9


              Gerrit Reviews

                There are no open Gerrit changes
