Couchbase Server
MB-45987

Windows rebalance-out/swap tests have higher rebalance time in 7.0 compared to 6.6.2


    Description

      Compared to 6.6.2, the Windows rebalance tests have higher rebalance times in 7.0.

      http://showfast.sc.couchbase.com/#/timeline/Windows/reb/kv/DGM

      Rebalance-swap (min), 3 -> 3, 150M x 1KB, 15K ops/sec (90/10 R/W), 10% cache miss rate

       

      cbmonitor comparison: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_662-9588_rebalance_67cc&snapshot=zeus_700-5017_rebalance_d419

       

      Rebalance-out (min), 4 -> 3, 150M x 1KB, 15K ops/sec (90/10 R/W), 10% cache miss rate

      cbmonitor comparison: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_662-9588_rebalance_d519&snapshot=zeus_700-5017_rebalance_f2cf
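
      As a rough back-of-the-envelope check on what these workload parameters imply (plain arithmetic on the numbers above, not tied to the test harness), a short Python sketch:

        # Back-of-the-envelope numbers for the rebalance workloads above.
        items = 150_000_000        # 150M documents
        value_size = 1024          # 1 KB per document
        ops_per_sec = 15_000       # total front-end load
        read_ratio, write_ratio = 0.90, 0.10
        cache_miss_rate = 0.10

        dataset_bytes = items * value_size
        print(f"raw dataset size ~ {dataset_bytes / 1024**3:.1f} GiB")   # ~143 GiB

        print(f"reads/sec  ~ {ops_per_sec * read_ratio:,.0f}")    # 13,500
        print(f"writes/sec ~ {ops_per_sec * write_ratio:,.0f}")   # 1,500

        # 10% cache miss rate on reads -> expected disk fetches per second
        print(f"disk fetches/sec ~ {ops_per_sec * read_ratio * cache_miss_rate:,.0f}")  # 1,350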
       


        Activity

          ben.huddleston Ben Huddleston added a comment

          Thanks for re-running the tests, Bo-Chun Wang.

          Looks like there might be a slight regression of ~10% in the rebalance-in tests here. We only appear to have one run of each of these builds, and the 5017+ runs look about the same, so I'll assume there isn't a regression there. I graphed all of the runs - http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_662-9588_rebalance_9645&snapshot=zeus_700-5017_rebalance_8fd3&snapshot=zeus_700-5071_rebalance_21ff&snapshot=zeus_700-5127_rebalance_a7b6. There are a few interesting stats here.

          RSS looks to be a fair bit higher in the 6.6.2 run. This holds true for all of the original nodes.

          The RSS profile is different for the incoming (in) node in 6.6.2.

          Swap usage is higher in 6.6.2. This is probably related to the RSS and isn't particularly interesting on a Windows machine.

          CPU usage is higher on the 6.6.2 run.

          Replica items remaining looks interesting; it appears to be higher on the 6.6.2 runs.

          The worst run had notably lower replica residency.

          The best run (6.6.2) had notably higher active residency.

          There's a rough relationship between mem_used and the time taken.

          And finally, we can see the bug fix that builds replicas in the replica state rather than the pending state.

          Suspect at the moment that either:

          1. We're seeing a relationship between mem_used/residency ratio and rebalance time (see the sketch after this list). Lower residency (on actives) would likely negatively affect rebalance times.
          2. The changing of replica building from pending to replica state is the cause.
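
          To check the first suspicion, one approach is to line up memory usage and residency against the measured rebalance time for each run. A minimal sketch, assuming the standard /pools/default/buckets/<bucket>/stats endpoint and the usual KV stat names (mem_used, vb_active_resident_items_ratio); the host, bucket and credentials below are placeholders:

            import requests

            # Pull the latest memory usage and residency ratios for a bucket so they
            # can be lined up against the measured rebalance time for that run.
            # Assumes the standard bucket stats endpoint and stat names; the host,
            # bucket and credentials are placeholders.
            def residency_snapshot(host, bucket, auth):
                url = f"http://{host}:8091/pools/default/buckets/{bucket}/stats"
                samples = requests.get(url, auth=auth, timeout=30).json()["op"]["samples"]
                return {
                    "mem_used_mb": samples["mem_used"][-1] / 1024**2,
                    "active_resident_pct": samples["vb_active_resident_items_ratio"][-1],
                    "replica_resident_pct": samples["vb_replica_resident_items_ratio"][-1],
                }

            print(residency_snapshot("zeus-srv-01.perf.couchbase.com", "bucket-1",
                                     ("Administrator", "password")))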

          I've scheduled a couple of re-runs for 6.6.2 and 7.0.0-5017 to see if there's still some variance in the test (probably due to memory used/residency). I'll also schedule some runs of the builds before and after we changed the replica building state.

          ben.huddleston Ben Huddleston added a comment - - edited

          Runs of 700-4846 and 700-4847 took 42.3 and 45.0 seconds respectively. All runs graphed together can be found here - http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_662-9588_rebalance_9645&snapshot=zeus_700-5017_rebalance_8fd3&snapshot=zeus_700-5071_rebalance_21ff&snapshot=zeus_700-5127_rebalance_a7b6&snapshot=zeus_700-4846_rebalance_47ff&snapshot=zeus_700-4847_rebalance_f13f

          These latest two runs took a bit longer than the newer 7.0.0 runs, which may be test variance or may be due to other changes (such as the one that went into build 5001 to correctly block ops during takeover). I still have a re-run of 6.6.2 scheduled; the first failed for some unknown reason.

          Will take a look through some of the sets of logs and see if there is anything interesting in there.

          ben.huddleston Ben Huddleston added a comment - - edited

          On further inspection, I think something has changed in perfrunner or on the cluster since I started running these tests again.

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=zeus_662-9588_rebalance_9645&snapshot=zeus_700-5017_rebalance_8fd3&snapshot=zeus_700-5017_rebalance_e482&snapshot=zeus_700-4846_rebalance_47ff&snapshot=zeus_700-4847_rebalance_f13f

          This set of graphs contains the old run on 6.6.2 and two runs on 7.0.0-5017 (one done on the 7th of May and one re-run on the 11th triggered by me). It also includes two runs of intermediate builds triggered on the 11th. I wanted to graph just the two 5017 runs together, but cbmonitor returns a 500 Server Error that, for some reason, doesn't occur when I graph all of these together.

          There are two particularly interesting graphs here:

          1) Rebalance progress - shows a massively different rebalance profile on the 5017 runs and very different profiles with the intermediate builds (see the sampling sketch after this list)

          2) CPU usage - shows a huge increase in the runs triggered on the 11th.
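
          For context, a rebalance progress profile like the one graphed above can be sampled by polling the cluster's task list. A minimal sketch, assuming the standard /pools/default/tasks response shape (a "rebalance" task with a "progress" field while running); host and credentials are placeholders:

            import time
            import requests

            # Poll the task list and record (timestamp, progress) pairs while a
            # rebalance is running - essentially the raw data behind a rebalance
            # progress graph. Host and credentials are placeholders.
            def sample_rebalance_progress(host, auth, interval=5):
                url = f"http://{host}:8091/pools/default/tasks"
                samples = []
                while True:
                    tasks = requests.get(url, auth=auth, timeout=30).json()
                    reb = next((t for t in tasks if t.get("type") == "rebalance"), None)
                    if not reb or reb.get("status") != "running":
                        break
                    samples.append((time.time(), reb.get("progress", 0.0)))
                    time.sleep(interval)
                return samples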

          Given that two of these runs are for the same build, I suspect something has changed in perfrunner or on the cluster. It's hard to make meaningful progress with this issue if we cannot consistently reproduce results.

          Bo-Chun Wang, could you please investigate why the runs I triggered on the 11th for 5017 and the intermediate builds have hugely increased CPU usage (i.e. is there some change to perfrunner or the cluster)?

          7th May 5017 - http://perf.jenkins.couchbase.com/job/zeus/6654/
          11th May 5017 - http://perf.jenkins.couchbase.com/job/zeus/6694/
          11th May 4846 - http://perf.jenkins.couchbase.com/job/zeus/6696/
          11th May 4847 - http://perf.jenkins.couchbase.com/job/zeus/6697/

          Additionally, is this in any way related to a failed re-run of the 6.6.2 build? (http://perf.jenkins.couchbase.com/job/zeus/6695/)

          bo-chun.wang Bo-Chun Wang added a comment - - edited

          Ben Huddleston

          Since my change that added compaction before rebalance, there have been no rebalance-related changes to perfrunner or the cluster.
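
          For reference, a minimal sketch of how a bucket compaction can be kicked off ahead of a rebalance, assuming the standard compactBucket REST endpoint; the host, bucket and credentials are placeholders, and this is not the actual perfrunner change:

            import requests

            # Trigger bucket compaction before starting the rebalance. Assumes the
            # standard /controller/compactBucket endpoint; host, bucket and
            # credentials are placeholders. Not the actual perfrunner change.
            def compact_bucket(host, bucket, auth):
                url = (f"http://{host}:8091/pools/default/buckets/"
                       f"{bucket}/controller/compactBucket")
                requests.post(url, auth=auth, timeout=30).raise_for_status()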

           

          I think the run with the 6.6.2 build failed because of a connection error. It doesn't look like the high CPU usage is related to this run.

          http://perf.jenkins.couchbase.com/job/zeus/6695/

          2021-05-11T09:16:28 [WARNING] Bad response: http://zeus-srv-01.perf.couchbase.com:8091/pools/default/tasks

          2021-05-11T09:16:30 [WARNING] Connection error: http://zeus-srv-01.perf.couchbase.com:8091/pools/default/buckets
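
          For illustration, a small sketch of the kind of polling helper that produces warnings like these, with a bad HTTP response and a dropped connection logged separately; this is a hypothetical illustration, not perfrunner's actual code:

            import time
            import logging
            import requests

            logger = logging.getLogger(__name__)

            # Hypothetical polling helper that logs warnings similar to the ones
            # above: one for a non-200 response, one for a connection error.
            # Not perfrunner's actual implementation.
            def get_with_warnings(url, auth, retries=3, delay=2):
                for _ in range(retries):
                    try:
                        resp = requests.get(url, auth=auth, timeout=30)
                        if resp.status_code == 200:
                            return resp.json()
                        logger.warning("Bad response: %s", url)
                    except requests.ConnectionError:
                        logger.warning("Connection error: %s", url)
                    time.sleep(delay)
                return None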

           

          bo-chun.wang Bo-Chun Wang added a comment -

          For the rebalance-in test, the difference between 6.6.2 and 7.0 is about 10-15%. Given that we are seeing 10% run-to-run variation, it's difficult for us to conclude that it's a regression. Therefore, I am closing this issue. We will keep an eye on the test and open a new issue if needed.
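
          As a rough illustration of that reasoning (the times below are placeholders, not measured values from these runs):

            # A 10-15% difference is hard to call a regression when run-to-run
            # variation is itself about 10%. Placeholder times, not measured values.
            baseline_min = 40.0    # hypothetical 6.6.2 rebalance time (minutes)
            candidate_min = 45.0   # hypothetical 7.0 rebalance time (minutes)
            run_to_run_variation = 0.10

            diff = (candidate_min - baseline_min) / baseline_min
            print(f"observed difference:   {diff:.1%}")                  # 12.5%
            print(f"run-to-run variation:  {run_to_run_variation:.0%}")  # 10%
            # A difference of this size sits close to the noise band, so on its own
            # it is not enough to conclude a regression; more runs would be needed.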

