Details
- Bug
- Resolution: Not a Bug
- Critical
- None
- 7.1.0
- Untriaged
- 1
- Yes
Description
We are seeing a 20% drop in throughput in the following DGM cbbackupmgr restore tests:
The issue seems to originate in the KV engine. Bisecting builds points to 7.1.0-2463 as the first affected build:
Run (2462)
Jenkins: http://perf.jenkins.couchbase.com/job/leto-dev/23/
Throughput: 500
cbcollect logs:
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-dev-23/leto-srv-01.perf.couchbase.com.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-dev-23/leto-srv-02.perf.couchbase.com.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-dev-23/leto-srv-03.perf.couchbase.com.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-dev-23/leto-srv-04.perf.couchbase.com.zip
cbbackupmgr logs:
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-dev-23/cbbackupmgr-collectinfo-backup-2022-03-16T135726.zip
Run (2463)
Jenkins: http://perf.jenkins.couchbase.com/job/leto/23526/
Throughput: 397
cbcollect logs:
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-23526/leto-srv-01.perf.couchbase.com.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-23526/leto-srv-02.perf.couchbase.com.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-23526/leto-srv-03.perf.couchbase.com.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-23526/leto-srv-04.perf.couchbase.com.zip
cbbackupmgr logs:
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-leto-23526/cbbackupmgr-collectinfo-backup-2022-03-15T193345.zip
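For reference, the relative drop between the two runs above can be checked with a few lines of arithmetic. This is only a sketch using the two throughput figures quoted in this ticket (in whatever unit the test reports); the variable names are illustrative.

```python
# Throughput figures as reported in this ticket for the two runs.
baseline = 500    # build 7.1.0-2462
regressed = 397   # build 7.1.0-2463

# Relative drop of the regressed run against the baseline, in percent.
drop_pct = (baseline - regressed) / baseline * 100
print(f"Throughput drop: {drop_pct:.1f}%")
```

This comes out at roughly 20.6%, consistent with the ~20% drop stated in the description.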
The cbmonitor comparison shows TempOOMs in the 2463 build which coincide with a loss of ops/sec and thus restore throughput:
The only commit in build 2463 is the following KV engine commit:
Commit: e63420d2e7e5fb16945ab1b5616f60b82deb2afd in build: couchbase-server-7.1.0-2463
Revert "MB-49469: Introduce max_checkpoints_hard_limit_multiplier"
This is somewhat puzzling, because this revert is what seemed to resolve MB-51329. If reverting "Introduce max_checkpoints_hard_limit_multiplier" caused the drop, then that change was presumably necessary to reach the pre-regression restore throughput of ~500. We would therefore expect throughput of ~400 (the regressed value) before build 2396, the build that introduced it. Test runs before 2396, however, show a throughput that is consistently ~450:
I will try to narrow down where the increase from ~450 to ~500 came from, as it may help us get to the bottom of this.
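One way to narrow this down is a bisection over build numbers, sketched below. It assumes the throughput changes in a single step somewhere between a build known to be at ~450 and one known to be at ~500. `measure` is a hypothetical callback standing in for "run the restore test on this build and return its throughput"; the stub used here returns made-up values purely to exercise the search and is not real test data.

```python
from typing import Callable


def first_build_at_or_above(
    lo: int,
    hi: int,
    measure: Callable[[int], float],
    threshold: float,
) -> int:
    """Return the smallest build in (lo, hi] whose throughput is >= threshold.

    Assumes measure(lo) < threshold <= measure(hi) and that throughput
    changes in a single step between lo and hi.
    """
    if not (measure(lo) < threshold <= measure(hi)):
        raise ValueError("lo must be below the threshold and hi at or above it")
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if measure(mid) >= threshold:
            hi = mid   # the increase happened at or before mid
        else:
            lo = mid   # the increase happened after mid
    return hi


# Hypothetical stand-in for a real test run: pretends the increase
# landed in build 2400. This value is invented for illustration only.
HYPOTHETICAL_STEP_BUILD = 2400


def fake_measure(build: int) -> float:
    return 500.0 if build >= HYPOTHETICAL_STEP_BUILD else 450.0


# Search between a build before 2396 (~450) and build 2462 (~500),
# using the midpoint of the two levels as the threshold.
found = first_build_at_or_above(2395, 2462, fake_measure, threshold=475.0)
print(f"First build at or above threshold: {found}")
```

With about 67 candidate builds, this needs roughly six or seven test runs in addition to the two endpoint checks. In practice `measure` would need to average several runs per build, since run-to-run noise could otherwise mislead the search.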
Issue Links
- relates to: MB-51329 ~20-50% throughput drop and OOM in YCSB uniform distribution tests (Closed)