Logs - https://supportal.couchbase.com/snapshot/25a28b542bd29f28159f19a7d5635fd3%3A%3A0
We've got a 2-node cluster again; in this run node 25 ended up with memory usage stuck above the high water mark.

Whilst the test did at some point start using swap, it ended up in a livelocked state with no swap activity (si/so at 0). vmstat from node 25:
[root@cen-sa34 ~]# vmstat -w 1
procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
r b swpd free buff cache si so bi bo in cs us sy id wa st
3 0 147096 2531052 0 1673056 0 0 2398 3106 7 10 13 5 79 3 0
2 0 147096 2536880 0 1673080 0 0 0 0 5298 28510 4 1 95 0 0
3 0 147096 2533416 0 1673080 0 0 0 48 7506 29642 4 1 95 0 0
4 0 147096 2527632 0 1673168 0 0 0 24 6141 31915 4 1 95 0 0
4 0 147096 2529232 0 1673168 0 0 0 76 5819 31291 4 1 95 0 0
4 0 147096 2532092 0 1673196 0 0 0 244 6225 27959 4 1 95 0 0
2 0 147096 2538680 0 1673236 0 0 0 92 5856 31254 4 1 95 0 0
4 0 147096 2537208 0 1672492 0 0 0 857 7133 31722 4 1 95 0 0
5 0 147096 2532876 0 1673260 0 0 0 24 6718 31717 4 1 95 0 0
4 0 147096 2529096 0 1673260 0 0 0 87 7977 36804 4 2 94 0 0
Observed in the UI that node 25 has 350k DCP backoffs per second. Node 25 also has a 0 disk write queue and 0 replication queue. Node 26 has a 0 disk write queue and a ~21k replication queue. The replication queue stat is a touch misleading: it only accounts for the in-memory queue. We can see below that the actual outstanding number of items is much larger (500M+).
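One rough way to total that backlog up is to sum the per-stream items_remaining stats from cbstats dcp rather than relying on the UI replication queue figure. A minimal sketch in Python (the host, credentials and exact stat key names are assumptions; it only relies on cbstats printing "key: value" lines):

# Sketch: sum every per-stream items_remaining stat from `cbstats dcp` output.
# Host, credentials and the precise stat key names are assumptions here.
import subprocess

def outstanding_dcp_items(host="127.0.0.1:11210", bucket="bucket-1",
                          user="Administrator", password="password"):
    out = subprocess.run(
        ["cbstats", host, "-u", user, "-p", password, "-b", bucket, "dcp"],
        capture_output=True, text=True, check=True).stdout
    total = 0
    for line in out.splitlines():
        key, _, value = line.rpartition(":")   # stat keys themselves contain ':'
        if key.strip().endswith("items_remaining"):
            total += int(value)
    return total

print(outstanding_dcp_items())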

Node 26 isn't getting any replica items as node 25 is above the HWM.
2020-03-04T01:24:49.659863-08:00 INFO (bucket-1) DCP backfilling task temporarily suspended because the current memory usage is too high
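For reference, the check behind that log message is essentially a memory guard on the backfill task: backfills snooze while mem_used sits above a fraction of the bucket quota. A simplified sketch of the idea (the backfill_mem_threshold name and the 96% value are illustrative assumptions, not taken from these logs):

def backfill_should_snooze(mem_used, max_size, backfill_mem_threshold=0.96):
    # Pause DCP backfills while memory usage is above the threshold fraction
    # of the bucket quota; resume once it drops back below.
    # backfill_mem_threshold=0.96 is an assumed, illustrative default.
    return mem_used > backfill_mem_threshold * max_size

# With node 25 stuck above the high water mark this stays True, so the
# backfill task keeps logging "temporarily suspended" and node 26 never
# receives the outstanding replica items.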
Spotted that the test was run with the config parameter "cursor_dropping_checkpoint_mem_upper_mark" set to 100. This effectively prevents us from dropping cursors based on checkpoint memory usage; we will still drop cursors when overall memory usage goes above 95%.
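So there are two independent cursor-dropping triggers and this config only disables one of them. A simplified sketch of the decision, treating both marks as percentages of the bucket quota (an assumption about the exact semantics, not the real implementation):

def should_drop_cursors(mem_used, checkpoint_mem, max_size,
                        cursor_dropping_upper_mark=95,
                        cursor_dropping_checkpoint_mem_upper_mark=100):
    # Trigger 1: overall memory usage above the upper mark (95% of quota here).
    over_total_mem = mem_used > (cursor_dropping_upper_mark / 100.0) * max_size
    # Trigger 2: checkpoint memory above its own upper mark. At 100 this can
    # only fire once checkpoints alone consume the entire quota, so in this
    # test only the 95% mem_used trigger is effectively left.
    over_ckpt_mem = checkpoint_mem > (
        cursor_dropping_checkpoint_mem_upper_mark / 100.0) * max_size
    return over_total_mem or over_ckpt_mem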
What is happening here is that we are dropping cursors when we go above 95% mem_used, forcing streams to backfill from disk. At first this is fine as the disk backfills are relatively small. The disk backfills quickly get larger though, and we end up in a state where we have to send a large disk snapshot/checkpoint to the replica. This causes an issue on node 25 as we end up with a large amount of replica checkpoint memory overhead. The overhead memory is the keyIndex map, which is used for de-duplication of items on the active node and for sanity checks on the replica node. We added these sanity checks for SyncWrites, but it might be worth revisiting them (at least for disk checkpoints) as they have left this cluster in an unrecoverable state. I'll investigate this further. See also MB-35889.
ep_checkpoint_memory: 27491511930
ep_checkpoint_memory_overhead: 27491358336
ep_checkpoint_memory_unreferenced: 0
ep_cursor_dropping_checkpoint_mem_lower_mark: 30
ep_cursor_dropping_checkpoint_mem_upper_mark: 100
vb_active_checkpoint_memory: 106368
vb_active_checkpoint_memory_overhead: 74752
vb_active_checkpoint_memory_unreferenced: 0
vb_pending_checkpoint_memory: 0
vb_pending_checkpoint_memory_overhead: 0
vb_pending_checkpoint_memory_unreferenced: 0
vb_replica_checkpoint_memory: 27491405562
vb_replica_checkpoint_memory_overhead: 27491283584
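Nearly all of that ~27.5GB of replica checkpoint memory is overhead, i.e. keyIndex, rather than the queued items themselves. A back-of-the-envelope estimate of the per-item cost, assuming the 500M+ outstanding items mentioned above are roughly what is sitting in those replica checkpoints:

# Rough per-item keyIndex cost in the replica checkpoints.
overhead_bytes = 27_491_283_584      # vb_replica_checkpoint_memory_overhead
outstanding_items = 500_000_000      # the "500M+" backlog quoted above
print(overhead_bytes / outstanding_items)   # ~55 bytes of overhead per item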
We could also argue that this is a sizing/test issue, though. Whilst memcached/kv_engine is capable of ingesting items faster than it can replicate them, it cannot sustain that for the extended periods this test attempts. This relates to MB-36370.

Suggested that this test might work better with cursor dropping disabled entirely. The cluster might still end up stuck if memory usage climbs above the threshold at which replication backs off, but if not, this should put backpressure on the clients and reduce the ops/s.
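That backpressure would come from the server returning temporary failures once it starts rejecting mutations for memory, with well-behaved clients backing off and retrying, which caps the effective ops/s. A minimal sketch of that client-side behaviour (a generic, hypothetical client, not a specific SDK API):

import time

def set_with_backoff(client, key, value, max_retries=10, base_delay=0.01):
    # 'client' is a hypothetical KV client whose set() returns False on a
    # temporary (out-of-memory style) failure rather than raising.
    delay = base_delay
    for _ in range(max_retries):
        if client.set(key, value):
            return True
        time.sleep(delay)            # back off, lowering the offered ops/s
        delay = min(delay * 2, 1.0)  # exponential backoff, capped at 1s
    return False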