Couchbase Server / MB-38012

1% DGM Test: Write ops/s dropped from 230k/s to 400/s due to swapping


Details

    • Untriaged
    • Unknown
    • Magma: Jan 20 - Feb 2

    Description

      A 2 node cluster running the 1% DGM test abruptly dropped its ops rate from 240k writes/sec to 400/s when the active resident ratio was at 27%. The test was in the initial loading phase of 12B docs (460M loaded). Memory used by magma is around 3GB per node.

       Logs:
      https://uploads.couchbase.com/magma/collectinfo-2020-02-21T092448-ns_1%40172.23.97.25.zip
      https://uploads.couchbase.com/magma/collectinfo-2020-02-21T092448-ns_1%40172.23.97.26.zip



          Activity

            ben.huddleston Ben Huddleston added a comment (edited)

            Logs - https://supportal.couchbase.com/snapshot/25a28b542bd29f28159f19a7d5635fd3%3A%3A0

            We've got a 2 node cluster again; in this run, node 25 ended up with mem usage stuck above the high water mark.

            Whilst the test did at some point start using swap, it ended up in a livelock state with 0 swap usage. vmstat output from node 25:

            [root@cen-sa34 ~]# vmstat -w 1
            procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
             r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st
             3  0       147096      2531052            0      1673056    0    0  2398  3106    7   10  13   5  79   3   0
             2  0       147096      2536880            0      1673080    0    0     0     0 5298 28510   4   1  95   0   0
             3  0       147096      2533416            0      1673080    0    0     0    48 7506 29642   4   1  95   0   0
             4  0       147096      2527632            0      1673168    0    0     0    24 6141 31915   4   1  95   0   0
             4  0       147096      2529232            0      1673168    0    0     0    76 5819 31291   4   1  95   0   0
             4  0       147096      2532092            0      1673196    0    0     0   244 6225 27959   4   1  95   0   0
             2  0       147096      2538680            0      1673236    0    0     0    92 5856 31254   4   1  95   0   0
             4  0       147096      2537208            0      1672492    0    0     0   857 7133 31722   4   1  95   0   0
             5  0       147096      2532876            0      1673260    0    0     0    24 6718 31717   4   1  95   0   0
             4  0       147096      2529096            0      1673260    0    0     0    87 7977 36804   4   2  94   0   0

            Observed in the UI that node 25 has 350k DCP backoffs per second. Node 25 also has a 0 disk write queue and a 0 replication queue. Node 26 has a 0 disk write queue and a ~21k replication queue. The replication queue stat is a touch misleading, as it only accounts for the in-memory queue; we can see below that the actual outstanding number of items is much larger (500M+).
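
            A quick way to cross-check these figures outside the UI is cbstats against the data port (a sketch; the credentials are placeholders and exact stat names vary by release):

            cbstats 172.23.97.25:11210 -u Administrator -p password -b bucket-1 dcp | grep -iE 'backoff|items_remaining'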

            Node 26 isn't getting any replica items as node 25 is above the HWM.

            2020-03-04T01:24:49.659863-08:00 INFO (bucket-1) DCP backfilling task temporarily suspended because the current memory usage is too high
            

            Spotted that the test was run with the config parameter "cursor_dropping_checkpoint_mem_upper_mark" set to 100. This prevents us from dropping cursors based on checkpoint memory usage. We will still cursor drop when over 95% memory usage.
            What is happening here is that we are dropping cursors when we go above 95% mem_used, forcing streams to backfill from disk. At first this is fine as the disk backfills are relatively small. The disk backfills quickly get larger though, and we end up in a state where we have to send a large disk snapshot/checkpoint to the replica. This causes an issue on node 25 as we end up with a large amount of replica checkpoint memory overhead. The overhead memory is the keyIndex map, which is used for de-duplication of items on the active node and for sanity checks on the replica node. We added these sanity checks for SyncWrites, but it might be worth revisiting them (at least for disk checkpoints) as this has left the cluster in an unrecoverable state. I'll investigate this further. See also MB-35889.
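
            The checkpoint stats from node 25 bear this out. For reference, a filter along these lines over stats.log in the cbcollect (same layout as the grep further down) should pull the same figures:

            grep -E "checkpoint_memory|cursor_dropping" stats.log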

            ep_checkpoint_memory:                                  27491511930
             ep_checkpoint_memory_overhead:                         27491358336
             ep_checkpoint_memory_unreferenced:                     0
             ep_cursor_dropping_checkpoint_mem_lower_mark:          30
             ep_cursor_dropping_checkpoint_mem_upper_mark:          100
             vb_active_checkpoint_memory:                           106368
             vb_active_checkpoint_memory_overhead:                  74752
             vb_active_checkpoint_memory_unreferenced:              0
             vb_pending_checkpoint_memory:                          0
             vb_pending_checkpoint_memory_overhead:                 0
             vb_pending_checkpoint_memory_unreferenced:             0
             vb_replica_checkpoint_memory:                          27491405562
             vb_replica_checkpoint_memory_overhead:                 27491283584
            

            We could also argue that this is a sizing/test issue though. Whilst memcached/kv_engine is capable of ingesting items faster than we can replicate them, it cannot do so for the sustained periods that this test attempts. This relates to MB-36370.

            Suggested that this test might work better with cursor dropping disabled entirely. The cluster might end up in a stuck state if our memory usage gets above the backoff for replication, but if not then this should put backpressure on the clients to reduce the ops/s.


            ben.huddleston Ben Huddleston added a comment

            Discovered that our checkpoint memory overhead accounting is not quite correct. In the graph below we are tracking ~27GB of memory usage as checkpoint memory overhead, but this doesn't tally up with the overall bucket quota. This tracking does not include the memory used for the keys of the keyIndex map when they have been allocated on the heap (i.e. keys over 16 bytes). In this test they look to fall in the 48 byte size class.

            We have 27GB tracked checkpoint memory overhead and 300M items in checkpoints (all expelled).

            EMEAUKMAC01:cbcollect_info_ns_1@172.23.97.25_20200304-094130 benhuddleston$ grep -E "num_checkpoint_items" stats.log | awk '{sum+=$2} END {print sum}'
            306795137
            

            If we look at the jemalloc bins we can see that we have ~300M allocations in each of the 48 and 80 byte bins. We are failing to track a substantial amount of memory in this stat (14+GB).

            bins:           size ind    allocated      nmalloc (#/sec)      ndalloc (#/sec)    nrequests   (#/sec)  nshards      curregs     curslabs  nonfull_slabs regs pgs   util       nfills (#/sec)     nflushes (#/sec)       nslabs     nreslabs (#/sec)      n_lock_ops (#/sec)       n_waiting (#/sec)      n_spin_acq (#/sec)  n_owner_switch (#/sec)   total_wait_ns   (#/sec)     max_wait_ns  max_n_thds
                               8   0      1926648     37355684     670     37114853     666   4785767274     85962        1       240831          471              2  512   1  0.998     34095458     612     25375131     455          506     34000638     610        59921306    1076            2205       0          103886       1        43881071     788       665999965     11962         7999999          29
                              16   1       358736    894023238   16058    894000817   16058  18882685395    339171        1        22421          217            189  256   1  0.403    141763627    2546     42108314     756        13604    158128281    2840       184349647    3311           21857       0          583891      10        76549539    1374      3057999822     54927        11999999          36
                              32   2    387728320   6528711472  117268   6516594962  117051 293136306865   5265322        1     12116510       300351         281186  128   1  0.315    244389628    4389     99747117    1791      1060328   3102906741   55734       346429320    6222        17918594     321         7891174     141       169292408    3040    757713957567  13610079        46999997          45
                              48   3  14800610496  16305665661  292882  15997319609  287344  53515270042    961242        1    308346052      1204648           4656  256   3  0.999    271659621    4879    166924684    2998      2369791   7993264885  143575       442582105    7949        45681278     820        14907929     267       269387825    4838   2720663846225  48868640        33999998          71
                              64   4     42451456   1869308009   33576   1868644705   33564  29098736660    522672        1       663304        10845           1489   64   1  0.955    127773395    2295     57620784    1034        33693   1453084193   26100       185900894    3339         3134399      56         3034303      54       104610733    1879    116309993702   2089163        38999998          64
                              80   5  24596145680   3655260945   65655   3347809124   60133   7842356661    140864        1    307451821      1200985             28  256   5  0.999    188892937    3392     45798976     822      2313219    963237264   17301       238567540    4285         5936948     106         2935421      52        74723924    1342    205235988554   3686454        21999999          28
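
            As a rough cross-check of that 14+GB figure (my arithmetic, not from the logs): one heap-allocated key per checkpoint item at 48 bytes each comes to ~14.7GB, which lines up with the ~14.8GB allocated in the 48 byte bin above:

            echo $((306795137 * 48))
            14726166576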
            

            Raised MB-38173 for this.


            ben.huddleston Ben Huddleston added a comment

            Will continue the investigation into the checkpoint memory overhead as part of MB-35889. That ticket already attempted to address the same issues, but the code had to be reverted as it caused issues with sync writes.

            Recommend re-running this test either with cursor dropping disabled entirely (cursor_dropping_upper_mark=100), with a lower mutation_mem_threshold to try to get the clients to back off, or with the client throughput reduced to just below the throughput of in-memory DCP replication.
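
            A sketch of applying those settings with cbepctl, assuming the build exposes them as flush_param settings (the host, credentials and the 90 value are placeholders):

            cbepctl 172.23.97.25:11210 -u Administrator -p password -b bucket-1 set flush_param cursor_dropping_upper_mark 100
            cbepctl 172.23.97.25:11210 -u Administrator -p password -b bucket-1 set flush_param mutation_mem_threshold 90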

            ben.huddleston Ben Huddleston added a comment (edited)

            Apologies for the slow updates on this ticket. We re-ran the test with a patch that allows erasing the key indexes of disk checkpoints, so that DCP can backfill from disk without locking up the cluster, as a stopgap until we have better control over checkpoint memory usage. This test still ran into swap somewhere in the access phase. I don't have the logs; one node wouldn't cbcollect as it was too heavily in swap. I went through all of our memory accounting statistics to try to make sense of what was going wrong, and something wasn't quite adding up: memory usage in checkpoints was negligible, memory usage in the hashtable was high (~30GB), and magma reported ~22GB across all shards. I also spotted that the UI was reporting numbers different from cbstats for mem_used and the watermarks, which threw me off for a bit; not sure why that was the case. Sarath Lakshman started running the test again with jemalloc profiling so that we can investigate.

            I tried a "repro" of this on a small scale: created a bucket with a 500MB quota and loaded enough docs to force the residency ratio down to 0%. cbstats showed negligible memory usage in checkpoints and the hash table. Magma reported ~30MB of memory usage, which seemed reasonable, but a jemalloc profile showed that magma's actual memory usage was much higher (~355MB). Sarath Lakshman is investigating this further, as we need magma to stay within its given memory limits or we risk going into swap/an OOM kill.
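
            For reference, one generic way to capture such a profile with stock jemalloc (a sketch; it assumes the bundled jemalloc was built with profiling support and that memcached lives at /opt/couchbase/bin/memcached; the actual runs may have used different tooling):

            export MALLOC_CONF="prof:true,prof_active:true,lg_prof_sample:19,lg_prof_interval:30,prof_prefix:/tmp/jeprof"
            # restart memcached under this environment, run the workload, then render the periodic dumps
            jeprof --svg /opt/couchbase/bin/memcached /tmp/jeprof.*.heap > jemalloc.svg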

            jemalloc.svg

            sarath Sarath Lakshman added a comment (edited)

            We have resolved the issue causing the filesystem memory pressure (by reducing the number of files used by magma). We still have to verify the fix by rerunning the test.

            CBSS-411 is opened to investigate magma memory accounting stats.


            People

              bo-chun.wang Bo-Chun Wang
              sarath Sarath Lakshman
