Couchbase Server / MB-39320

The rebalance time for the same build is dramatically different



    Description

      In the rebalance-swap + durability majority tests, we observed that the run time varies a lot. For instance, the following two tests ran on the same build (7.0.0-1834) but have very different run times. The difference comes from how long the rebalance took to finish (see the duration calculation sketched at the end of this description). The rebalance time should not be dramatically different for the same build. We have seen the same issue on other builds.

       

      Build: 7.0.0-1834

      http://perf.jenkins.couchbase.com/job/titan-durability/67/

      run time: 2 hr 27 min

      2020-04-19T18:30:47 [INFO] Starting rebalance

      2020-04-19T20:41:19 [INFO] Rebalance completed

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-67/172.23.96.100.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-67/172.23.96.101.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-67/172.23.96.102.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-67/172.23.96.103.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-67/172.23.96.104.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-67/172.23.96.105.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-67/172.23.96.106.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-67/172.23.96.107.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-67/172.23.96.108.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-67/172.23.96.109.zip

       

      http://perf.jenkins.couchbase.com/job/titan-durability/65/

      run time: 1 hr 33 min

      2020-04-19T04:00:05 [INFO] Starting rebalance

      2020-04-19T05:16:18 [INFO] Rebalance completed

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-65/172.23.96.100.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-65/172.23.96.101.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-65/172.23.96.102.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-65/172.23.96.103.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-65/172.23.96.104.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-65/172.23.96.105.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-65/172.23.96.106.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-65/172.23.96.107.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-65/172.23.96.108.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-titan-durability-65/172.23.96.109.zip
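
      For reference, a minimal sketch (in Python, assuming the timestamp format shown in the log lines above) of how the rebalance durations can be derived from the "Starting rebalance" / "Rebalance completed" log entries:

      from datetime import datetime

      # Timestamps copied from the [INFO] log lines quoted above.
      RUNS = {
          "titan-durability/67": ("2020-04-19T18:30:47", "2020-04-19T20:41:19"),
          "titan-durability/65": ("2020-04-19T04:00:05", "2020-04-19T05:16:18"),
      }

      FMT = "%Y-%m-%dT%H:%M:%S"

      for job, (start, end) in RUNS.items():
          # Rebalance duration = "Rebalance completed" time minus "Starting rebalance" time.
          duration = datetime.strptime(end, FMT) - datetime.strptime(start, FMT)
          print(f"{job}: rebalance took {duration}")

      # titan-durability/67: rebalance took 2:10:32
      # titan-durability/65: rebalance took 1:16:13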

       

      Attachments

        1. 1_5Hr_run_activeMovesDone.png (251 kB)
        2. 1_5hr_run.png (56 kB)
        3. 2Hr_run_activeMovesDone.png (255 kB)
        4. 2hr_run.png (76 kB)
        5. ep_dcp_replica_items_remaining.png (475 kB)
        6. memcahced_rss.png (286 kB)
        7. reb_progr_4078.png (104 kB)

        Activity

          dfinlay Dave Finlay added a comment -

          Abhi - can you look into this one?

          bo-chun.wang Bo-Chun Wang added a comment - - edited

          The test is on showfast.

          SET Latency during Rebalance-swap, 4 -> 4, 20M x 512B, Unlimited Ops (0/100 R/W), Durability Majority

          http://showfast.sc.couchbase.com/#/timeline/Linux/reb/kv/Non-DGM

          Before 7.0.0-1834, the run time was about 60-70 minutes. 7.0.0-1834 is the first build where we saw the run time exceed 2 hours.

          The comparison of the two runs:

          http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_700-1834_rebalance_0c45&snapshot=titan_700-1834_rebalance_f896


          Abhijeeth.Nuthan Abhijeeth Nuthan added a comment -

          In terms of rebalance scheduling, the two runs are the same: active moves are performed first, followed by replica moves from the highest vbucket number to the lowest. Note that there is a difference in the vbucket maps, but I suspect that is not the issue here, because in a swap rebalance all moves go to the swapped-in node, and the number and order of moves from a particular source node are the same in both runs for all vbuckets.
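
          To make that ordering concrete, a toy sketch in Python (illustrative only, not ns_server's actual scheduler; applying the high-to-low vbucket order to the active moves as well is an assumption here):

          # Toy illustration of the move ordering described above: all active
          # vbucket moves are scheduled first, then replica moves, ordered from
          # the highest vbucket number down to the lowest.
          def order_moves(active_vbuckets, replica_vbuckets):
              active_first = [("active", vb) for vb in sorted(active_vbuckets, reverse=True)]
              replicas_after = [("replica", vb) for vb in sorted(replica_vbuckets, reverse=True)]
              return active_first + replicas_after

          print(order_moves([0, 1, 2, 3], [0, 1, 2, 3]))
          # [('active', 3), ('active', 2), ('active', 1), ('active', 0),
          #  ('replica', 3), ('replica', 2), ('replica', 1), ('replica', 0)]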

          Until the active moves complete, the two runs take roughly the same time. However, during the middle part of the replica moves there is a flattening in the ~2.5 hr run compared to the ~1.5 hr run.

          A blown-up image from around the start of the period marked by the red box is attached.

          It looks like both the backfill phase (light color) and the tail phase (post-backfill, where we set the dual topology and perhaps do takeover if active) see increased times.

          After the red-box period in the ~2.5 hr run, the graphs look pretty much the same as in the ~1.5 hr run. During the phase marked by the red box in the ~2.5 hr run, there is increased ep_dcp_replica_items_remaining, as can be seen in http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_700-1834_rebalance_0c45&snapshot=titan_700-1834_rebalance_f896.
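
          A rough way to watch that stat live on a cluster (a sketch only: it assumes the ns_server bucket-stats REST endpoint on port 8091 and that the metric is exposed under this key in the samples; bucket name and credentials below are placeholders):

          import time
          import requests  # third-party: pip install requests

          HOST = "http://172.23.96.100:8091"    # any cluster node
          BUCKET = "bucket-1"                    # hypothetical bucket name
          AUTH = ("Administrator", "password")   # placeholder credentials

          def replica_items_remaining():
              # Bucket stats endpoint; returns recent samples per stat name.
              url = f"{HOST}/pools/default/buckets/{BUCKET}/stats"
              samples = requests.get(url, auth=AUTH).json()["op"]["samples"]
              # Latest sample of the DCP replica items-remaining stat, if present.
              return samples.get("ep_dcp_replica_items_remaining", [None])[-1]

          while True:
              print(time.strftime("%H:%M:%S"), replica_items_remaining())
              time.sleep(10)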

          Based on a cursory look, it seems to me that we are hitting some kind of lag in the processing of incoming items. Can someone from the KV team have a look?

          paolo.cocchi Paolo Cocchi added a comment - - edited

          For some reason this one has been off the radar for a while.
          The current state (latest run on build 7.0.0-4078, http://perf.jenkins.couchbase.com/job/titan-durability/217/) shows that the rebalance time has dropped dramatically compared to the older builds.

          Full comparison at http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_700-1834_rebalance_0c45&snapshot=titan_700-1834_rebalance_f896&snapshot=titan_700-4078_rebalance_a07a.
          Maybe we just want to keep an eye on whether the new numbers are consistent across future runs.

          Abhijeeth Nuthan, Bo-Chun Wang: I see that for this test showfast shows only "latency_set"; maybe we want to add a "rebalance_progress" view too? Thanks.


          bo-chun.wang Bo-Chun Wang added a comment -

          I checked the last 7 runs on 7.0 and confirmed that the run time is very similar across them. I agree that we should just keep an eye on new runs. If the run time remains stable, we can close this issue.

          dfinlay Dave Finlay added a comment -

          Bo-Chun Wang: have you seen further issues or do you think this issue can be resolved at this time?

          bo-chun.wang Bo-Chun Wang added a comment -

          Dave Finlay

          I checked the last 10 weekly runs, and I don't see the issue again. We can mark the issue resolved and close it.

          wayne Wayne Siu added a comment -

          Closing the ticket as the issue is not seen in recent runs.


          People

            Assignee: Bo-Chun Wang
            Reporter: Bo-Chun Wang
