Details
- Bug
- Resolution: Fixed
- Major
- None
- Security Level: Public
- 8
Description
The affected tests are the E2E and SGReplicate Multicluster (Blackholepuller/Newdocpusher) tests: http://showfast.sc.couchbase.com/#/timeline/Linux/syncgateway/sgreplicate/Multi-cluster
Tested on SG 3.2.0-242 and 4.0.0-3. The SG version does not affect throughput; only the server version causes the regression.
In summary: with 7.6.0, SG uses more CPU, memory and goroutines, yet spends more time waiting. CPU usage is also higher for Beam.smp and Indexer, and "base.(*Collection).WriteUpdateWithXattr" takes considerably more CPU time.
With 7.2.2, server memcached CPU usage and SG heapinuse are higher, while the average queue size stays close to 0.
Looking at these tests:
- SG 3.2.0-242, CB 7.6.0-1887: https://perf.jenkins.couchbase.com/job/sg_hebe_sgreplicate_multicluster/2732/consoleFull. Artifacts: https://perf.jenkins.couchbase.com/job/sg_hebe_sgreplicate_multicluster/2732/artifact/. Throughput: 68,767 docs pushed/sec
- SG 3.2.0-242, CB 7.2.2-6401: https://perf.jenkins.couchbase.com/job/sg_hebe_sgreplicate_multicluster/2719/consoleFull. Artifacts: https://perf.jenkins.couchbase.com/job/sg_hebe_sgreplicate_multicluster/2719/artifact/. Throughput: 83,010 docs pushed/sec
- Cbmonitor comparison: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_320-242_run_bp_test_333f&label=7.6.0&snapshot=hebe_320-242_run_bp_test_bc6b&label=7.2.2
- Differences between the runs, on the SG side:
- CPU utilization is up by almost 30% in the test with 7.6.0: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_320-242_run_bp_test_333f&label=7.6.0&snapshot=hebe_320-242_run_bp_test_bc6b&label=7.2.2#034ba0a7d72a19ba5b487943020edc6c
- SG memory usage is slightly higher for 7.6.0: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_320-242_run_bp_test_333f&label=7.6.0&snapshot=hebe_320-242_run_bp_test_bc6b&label=7.2.2#ea232aa23dc9bc563c7feb6f02df365d
- The number of goroutines and the goroutines high watermark are both up by about 10% for 7.6.0: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_320-242_run_bp_test_333f&label=7.6.0&snapshot=hebe_320-242_run_bp_test_bc6b&label=7.2.2#2acbcb0023d0e2f7e48f1b2ba7077262
- Heapalloc and heapinuse are slightly higher for 7.2.2, but heapidle is higher for 7.6.0: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_320-242_run_bp_test_333f&label=7.6.0&snapshot=hebe_320-242_run_bp_test_bc6b&label=7.2.2#8d9cc6ea168d85997db4d06b82dc85d9
- Pausetotalns is about 4 times higher for 7.6.0: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_320-242_run_bp_test_333f&label=7.6.0&snapshot=hebe_320-242_run_bp_test_bc6b&label=7.2.2#1c48bfbf6aead72445f25cae23d87e3a
- On the server side:
- Beam.smp RSS is higher by about 10% for 7.2.2: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_320-242_run_bp_test_333f&label=7.6.0&snapshot=hebe_320-242_run_bp_test_bc6b&label=7.2.2#f50f2c78d4dcb8696d90ffe49958d25b, but Beam.smp CPU is higher for 7.6.0: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_320-242_run_bp_test_333f&label=7.6.0&snapshot=hebe_320-242_run_bp_test_bc6b&label=7.2.2#fdf2ee304164bba7f03f0e591ce3961d
- Something similar seems to be happening for Indexer, where CPU is higher for 7.6.0: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_320-242_run_bp_test_333f&label=7.6.0&snapshot=hebe_320-242_run_bp_test_bc6b&label=7.2.2#dd4c380f69814327c80a1dafb5a9e4c4
- For memcached, CPU usage is higher in 7.2.2: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_320-242_run_bp_test_333f&label=7.6.0&snapshot=hebe_320-242_run_bp_test_bc6b&label=7.2.2#3ce41eff4117be874f32415e764956f6
- Another interesting datapoint is data_avgqusz, which is close to 0 at all points for 7.2.2, but is consistently higher for 7.6.0: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=hebe_320-242_run_bp_test_333f&label=7.6.0&snapshot=hebe_320-242_run_bp_test_bc6b&label=7.2.2#47265a3b63dbdc6f140385c6911739fd
- I also tried looking at 2 sg_cpu profiles:
- 7.2.2: https://perf.jenkins.couchbase.com/job/sg_hebe_sgreplicate_multicluster/2719/artifact/172.23.100.205_syncgateway_sg_cpu_231209024220_1c6b69.pprof
- 7.6.0: https://perf.jenkins.couchbase.com/job/sg_hebe_sgreplicate_multicluster/2732/artifact/172.23.100.205_syncgateway_sg_cpu_231213113212_835b76.pprof
- The main difference seems to be that CPU time for "go-blip.(*Message).asyncRead.func1" is higher for 7.6.0 (35.82%) than for 7.2.2 (21.73%). This increase seems to come mainly from "base.(*Collection).WriteUpdateWithXattr", which takes 26.80% of CPU time for 7.6.0, compared to 15.87% for 7.2.2.
- Comparing 2 sg_mutex profiles:
- 7.2.2: https://perf.jenkins.couchbase.com/job/sg_hebe_sgreplicate_multicluster/2719/artifact/172.23.100.205_syncgateway_sg_mutex_231209024119_0c97d8.pprof
- 7.6.0: https://perf.jenkins.couchbase.com/job/sg_hebe_sgreplicate_multicluster/2732/artifact/172.23.100.205_syncgateway_sg_mutex_231213113110_00b70e.pprof
- "cbgt.(*Manager).JanitorLoop" accounts for a much larger share for 7.6.0 (25.59%) than for 7.2.2 (1.47%). Similarly, "gocbcore.(*memdClient).run.func2" is higher for 7.6.0 (11.93%) than for 7.2.2 (1.14%).
- I also looked at sg_block, sg_heap and goroutines profiles, but there weren't any big differences between the profiles.
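To put the numbers above side by side, here is a minimal Python sketch that computes the relative throughput change and the percentage-point differences in profile share between the two runs. All figures are copied from this description; the variable and function names (`throughput`, `profile_share`, `relative_change`) are made up for illustration. Note that the profile values are shares of each profile's own total, so the differences are in percentage points and do not translate directly into absolute CPU or wait time.

```python
# Figures copied from the two runs described above (SG 3.2.0-242 in both).
throughput = {"7.2.2": 83_010, "7.6.0": 68_767}  # docs pushed/sec

# Share of each profile's total (%), as quoted from the sg_cpu and sg_mutex profiles.
profile_share = {
    "cpu: go-blip.(*Message).asyncRead.func1": {"7.2.2": 21.73, "7.6.0": 35.82},
    "cpu: base.(*Collection).WriteUpdateWithXattr": {"7.2.2": 15.87, "7.6.0": 26.80},
    "mutex: cbgt.(*Manager).JanitorLoop": {"7.2.2": 1.47, "7.6.0": 25.59},
    "mutex: gocbcore.(*memdClient).run.func2": {"7.2.2": 1.14, "7.6.0": 11.93},
}


def relative_change(old: float, new: float) -> float:
    """Relative change from old to new, in percent."""
    return (new - old) / old * 100.0


# Throughput change going from server 7.2.2 to 7.6.0.
throughput_change = relative_change(throughput["7.2.2"], throughput["7.6.0"])
print(f"throughput 7.2.2 -> 7.6.0: {throughput_change:+.1f}%")

# Percentage-point difference in profile share (7.6.0 minus 7.2.2).
share_delta = {
    name: round(share["7.6.0"] - share["7.2.2"], 2)
    for name, share in profile_share.items()
}
for name, delta in share_delta.items():
    print(f"{name}: {delta:+.2f} percentage points")
```

With these inputs, the throughput with 7.6.0 comes out roughly 17% below the 7.2.2 run.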
I ran a couple more tests to collect server cbcollect_info. They are linked below:
- 7.2.2: https://perf.jenkins.couchbase.com/job/sg_hebe_sgreplicate_multicluster/2735/console
- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-sg_hebe_sgreplicate_multicluster-2735/172.23.100.190.zip
- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-sg_hebe_sgreplicate_multicluster-2735/172.23.100.191.zip
- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-sg_hebe_sgreplicate_multicluster-2735/172.23.100.192.zip
- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-sg_hebe_sgreplicate_multicluster-2735/172.23.100.193.zip
- 7.6.0: https://perf.jenkins.couchbase.com/job/sg_hebe_sgreplicate_multicluster/2736/console
- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-sg_hebe_sgreplicate_multicluster-2736/172.23.100.190.zip
- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-sg_hebe_sgreplicate_multicluster-2736/172.23.100.191.zip
- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-sg_hebe_sgreplicate_multicluster-2736/172.23.100.192.zip
- https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-sg_hebe_sgreplicate_multicluster-2736/172.23.100.193.zip
Issue Links
- is triggering: MB-60525 Frontend throughput regression caused by connection_manager_interval=0.1 (Closed)