Details
Type: Bug
Resolution: Fixed
Priority: Critical
Affects Version: 5.5.0
Environment:
Cluster: hebe_kv
OS: CentOS 7
CPU: E5-2680 v3 (48 vCPU)
Memory: 64GB
Disk: Samsung Pro 850
Triage: Untriaged
Is this a Regression?: Yes
Description
Test env and scenario:
3 nodes, 1 replica
20M items in the bucket, 1M ops/sec (50/50 R/W) ongoing
Despite a similar replication rate, the replication queue on 5.5.0-1970 grows much faster,
which leads to overall performance degradation in low-memory (DGM) scenarios.
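Whether the nodes actually reached a DGM state can be sanity-checked with cbstats; the stats below are standard ep-engine stats, while the host and bucket name are placeholders, not values taken from this ticket:
# Check how far below 100% residency the active vbuckets are, and where memory
# sits relative to the ep-engine watermarks (host/bucket are placeholders):
cbstats 172.23.100.204:11210 -b bucket-1 all | grep -E 'vb_active_perc_mem_resident|mem_used|ep_mem_low_wat|ep_mem_high_wat'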
Changes in 5.5.0-1970:
[+] 4fa4905 [6/6]: Limit #checkpoint items flushed in a single batch (MB-26021)
https://github.com/couchbase/kv_engine/commit/4fa490526120424e82227b431ec0bb84b487ed37
[+] 90c76d4 [5/6]: Set max_checkpoints=100 & chk_max_items=10000 (MB-26021)
https://github.com/couchbase/kv_engine/commit/90c76d4f0d99ef68ff5adb2fb667a4e20383a728
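For anyone experimenting with these settings, the checkpoint parameters touched by 90c76d4 can also be overridden at runtime with cbepctl; the host and bucket below are placeholders, and cbepctl changes do not survive a node restart:
# Runtime override of the checkpoint parameters changed by 90c76d4
# (values shown are the new 5.5.0-1970 defaults; host/bucket are placeholders):
cbepctl 172.23.100.204:11210 -b bucket-1 set checkpoint_param max_checkpoints 100
cbepctl 172.23.100.204:11210 -b bucket-1 set checkpoint_param chk_max_items 10000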
Server logs:
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-tmp-32/172.23.100.204.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-tmp-32/172.23.100.205.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-hebe-tmp-32/172.23.100.206.zip
5.5.0-1969 versus 5.5.0-1970, replication queue:
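The same trend can be sampled from the command line; the per-stream items_remaining readings of the replication DCP connections are assumed here to be a usable proxy for the replication queue shown above (host/bucket are placeholders):
# Sample the backlog of the replication DCP streams on each node
# (assumes the per-stream items_remaining stat; host/bucket are placeholders):
cbstats 172.23.100.204:11210 -b bucket-1 dcp | grep items_remaining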
Also, a similar comparison using pillowfight test results:
Logs from the 2-node pillowfight test (a sample invocation is sketched below the links):
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-7547/172.23.133.13.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-7547/172.23.133.14.zip
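The exact arguments of the ares pillowfight job are not captured in this ticket; the run below is only a hypothetical cbc-pillowfight invocation with a comparable 50/50 read/write mix, and every value in it (connection string, bucket, item count, threads, document size) is an assumption:
# Hypothetical 50/50 R/W pillowfight load; all values are assumptions, not the
# parameters of the actual jenkins-ares-7547 run.
#   -I: number of distinct items      -r: percentage of write (set) operations
#   -t: client threads                -B: per-thread batch size
#   -m/-M: min/max document size (bytes)
cbc-pillowfight -U couchbase://172.23.133.13/bucket-1 -I 20000000 -r 50 -t 16 -B 100 -m 1024 -M 1024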