  1. Couchbase Server
  2. MB-51472

20% throughput drop in DGM cbbackupmgr restore tests vs earlier Neo builds

Details

    • Bug
    • Resolution: Not a Bug
    • Critical
    • None
    • 7.1.0
    • couchbase-bucket
    • Untriaged
    • 1
    • Yes

    Description

      We are seeing a 20% drop in throughput in the following DGM cbbackupmgr restore tests:

      http://showfast.sc.couchbase.com/#/timeline/Linux/tools/restore/Rift#tools_restore_400M_rift_restore-rift_thr_EE_leto

      http://showfast.sc.couchbase.com/#/timeline/Linux/tools/restore/SQLite#tools_restore_400M_sqlite_restore-sqlite_thr_EE_leto

       

      The issue seems to originate in the KV engine. Bisecting builds identifies 7.1.0-2463 as the first build that shows the regression:

      Run (2462)
      Jenkins: http://perf.jenkins.couchbase.com/job/leto-dev/23/

      Throughput: 500

      cbcollect logs:
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-dev-23/leto-srv-01.perf.couchbase.com.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-dev-23/leto-srv-02.perf.couchbase.com.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-dev-23/leto-srv-03.perf.couchbase.com.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-dev-23/leto-srv-04.perf.couchbase.com.zip

      cbbackupmgr logs:
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-dev-23/cbbackupmgr-collectinfo-backup-2022-03-16T135726.zip

       

      Run (2463)
      Jenkins: http://perf.jenkins.couchbase.com/job/leto/23526/

      Throughput: 397

      cbcollect logs:
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-23526/leto-srv-01.perf.couchbase.com.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-23526/leto-srv-02.perf.couchbase.com.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-23526/leto-srv-03.perf.couchbase.com.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-23526/leto-srv-04.perf.couchbase.com.zip

      cbbackupmgr logs:
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-23526/cbbackupmgr-collectinfo-backup-2022-03-15T193345.zip
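
      As a quick check of the headline figure, the relative drop between the two runs above can be computed from the reported throughput values (500 for build 2462, 397 for build 2463). This is only a minimal sketch; the helper name relative_drop is illustrative and not part of any Couchbase tooling.

```python
def relative_drop(before: float, after: float) -> float:
    """Return the fractional drop from `before` to `after` (0.2 means 20%)."""
    if before <= 0:
        raise ValueError("baseline throughput must be positive")
    return (before - after) / before


# Throughput figures as reported for the two runs in this ticket.
baseline_2462 = 500.0
regressed_2463 = 397.0

drop = relative_drop(baseline_2462, regressed_2463)
print(f"Drop from build 2462 to build 2463: {drop:.1%}")
```

      With these inputs the drop comes out slightly above 20%, which is consistent with the figure in the title.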

       

      The cbmonitor comparison shows TempOOMs in the 2463 build which coincide with a loss of ops/sec and thus restore throughput:

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=leto_710-2462_restore_a613&snapshot=leto_710-2463_restore_abf4

       

      The only commit in build 2463 is the following KV engine commit:

      Commit: e63420d2e7e5fb16945ab1b5616f60b82deb2afd in build: couchbase-server-7.1.0-2463
      Revert "MB-49469: Introduce max_checkpoints_hard_limit_multiplier"

       
      This is somewhat puzzling, because this commit is what seemed to resolve MB-51329. Since the commit reverts "MB-49469: Introduce max_checkpoints_hard_limit_multiplier", the original change appears to have been necessary for the pre-regression restore throughput of ~500. We would therefore expect a throughput of ~400 (the regressed value) before build 2396, which is the build that introduced the original change. However, test runs before build 2396 consistently show a throughput of ~450.

      I will try to narrow down where the increase from ~450 to ~500 came from, as it may help us get to the bottom of this.
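
      One way to narrow this down is the same build bisection used to find 2463, applied to the range between a build known to give ~450 and a build known to give ~500. The sketch below is hypothetical: first_build_at_or_above and the measure callback are illustrative names, the threshold of 475 is simply the midpoint of the two observed levels, and the simulated step location is arbitrary test data, not a result. In practice, measure would trigger a real restore run for the given build, and it assumes a single step change within the range.

```python
from typing import Callable


def first_build_at_or_above(
    low_build: int,
    high_build: int,
    measure: Callable[[int], float],
    threshold: float,
) -> int:
    """Bisect for the first build whose throughput is >= threshold.

    Assumes measure(low_build) < threshold <= measure(high_build) and that
    throughput steps up exactly once somewhere in between.
    """
    if measure(low_build) >= threshold:
        raise ValueError("low_build is already at or above the threshold")
    if measure(high_build) < threshold:
        raise ValueError("high_build is still below the threshold")

    lo, hi = low_build, high_build  # invariant: lo is below, hi is at/above
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if measure(mid) >= threshold:
            hi = mid
        else:
            lo = mid
    return hi


def simulated_measure(build: int) -> float:
    # ARBITRARY stand-in for a real restore run: pretends the step from
    # ~450 to ~500 happens at build 2430. This is not measured data.
    return 500.0 if build >= 2430 else 450.0


found = first_build_at_or_above(2395, 2462, simulated_measure, threshold=475.0)
print(f"First build at or above threshold (simulated): {found}")
```

      Because real runs are noisy, repeating each measurement or widening the margin around the threshold may be needed before trusting the result.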

            People

              Daniel.nagy Daniel Nagy
              Daniel.nagy Daniel Nagy