Couchbase Server / MB-38012

1% DGM Test: Write ops/s dropped from 230k/s to 400/s due to swapping


Details

    • Untriaged
    • Unknown
    • Magma: Jan 20 - Feb 2

    Description

      A 2-node cluster running the 1% DGM test abruptly dropped its ops rate from 240k writes/sec to 400/s when the active resident ratio was at 27%. The test was in the initial loading phase of 12B docs (460M loaded so far). Memory used by magma is around 3GB per node.

       Logs:
      https://uploads.couchbase.com/magma/collectinfo-2020-02-21T092448-ns_1%40172.23.97.25.zip
      https://uploads.couchbase.com/magma/collectinfo-2020-02-21T092448-ns_1%40172.23.97.26.zip

      Attachments

        Issue Links


          Activity

            sarath Sarath Lakshman created issue -
            sarath Sarath Lakshman made changes -
            ben.huddleston Ben Huddleston made changes -
            Attachment gdb.txt [ 84936 ]
            drigby Dave Rigby added a comment -

            Sarath Lakshman Is this a new test or a regression? If a regression, what's the last known good build for this test?

            sarath Sarath Lakshman added a comment - - edited

            This test runs with replica=0

            Previously we ran this test with replica=1 and it passed. With a replica, the ops rate is usually capped because the intra-replication queue builds up, so the data load runs at a much slower pace. The last time we ran without a replica was with a pre-Mad-Hatter build and it succeeded.

            owend Daniel Owen made changes -
            Epic Link MB-30659 [ 88207 ]

            sarath Sarath Lakshman added a comment -

            Looks like this problem is very easy to reproduce. The cluster ran into the issue after I reran it.

            owend Daniel Owen made changes -
            Sprint KV Sprint 2020-February [ 979 ]
            ben.huddleston Ben Huddleston made changes -
            ben.huddleston Ben Huddleston made changes -
            ben.huddleston Ben Huddleston made changes -
            ben.huddleston Ben Huddleston made changes -
            ben.huddleston Ben Huddleston made changes -
            ben.huddleston Ben Huddleston made changes -
            ben.huddleston Ben Huddleston added a comment - - edited

            A little more context: the test is run with two nodes.

            On node 25 we do not go above the HWM and do not get any temp OOM errors.

            On node 26 we have an odd spike in memory which we never recover from, causing temp OOM errors. This spike puts us above the (default) 93% threshold below which we accept new writes.

            Not sure yet whether this is the cause, but we have a spike in disk queue items immediately preceding it.

            Normally we would just recover memory by ejecting more items, but the pager appears to have stopped running.

            We do still see some ejections though, which seems very odd. Perhaps the pager is running incredibly slowly?
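
            For reference, here is a minimal standalone sketch (not the actual kv_engine code) of the kind of admission check behind those temp OOM errors, assuming the 93% figure above corresponds to the default mutation memory threshold; the helper names and values are illustrative only.

            #include <cstdint>
            #include <iostream>

            enum class Status { Success, TempFail };

            // Front-end mutations get a temporary failure once memory usage crosses
            // the mutation threshold, on the expectation that the item pager will
            // free memory and the client will retry.
            Status admitMutation(uint64_t memUsed,
                                 uint64_t bucketQuota,
                                 double mutationMemThreshold = 0.93) {
                if (static_cast<double>(memUsed) >
                    static_cast<double>(bucketQuota) * mutationMemThreshold) {
                    return Status::TempFail;
                }
                return Status::Success;
            }

            int main() {
                const uint64_t GiB = 1024ULL * 1024 * 1024;
                const uint64_t quota = 70 * GiB; // roughly the quota in this test
                // 66GiB used is ~94% of quota, i.e. above the 93% threshold.
                std::cout << (admitMutation(66 * GiB, quota) == Status::TempFail
                                      ? "temp OOM\n"
                                      : "accepted\n");
            }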

            ben.huddleston Ben Huddleston added a comment - - edited

            From cbstats of node 26. The ItemPager task just creates and schedules the ItemPagerVisitor task. In this instance, the number of runs means the number of runs of any ItemPagerVisitor task, not the number of times this specific task has run. Commenting on this because it uses a pausable task framework but does not allow pausing between vBuckets (if you dig far enough into the code). This visitor has a massive runtime, which looks to correspond to the period of slow item paging and slow op/s. We know that it is currently on vBucket 972, but unfortunately we don't know whether this particular vBucket is slow or whether they all are.

            Tasks     Writer Reader AuxIO  NonIO  Total      Uptime: 1:18:22
            Running   0      0      0      1      1
            All       129    128    1      13     271
             
              TID  Pri  St  Bucket       SleepFor    Runtime  TotalRun     #Runs  Type    Name                              Addr                Descr.
             
                1    1  S   bucket-1      0:04.72    0:00.00   0:40.91  31478989  NonIO   ItemPager                         0x00007fbe07758450  Paging out items.
                2    1  S   bucket-1     41:39.63    0:01.86   0:01.86         1  NonIO   ExpiredItemPager                  0x00007fbe077584f0  Paging expired items.
                3  211  S   bucket-1      0:00.42    0:00.00   0:00.32      4645  NonIO   HashtableResizerTask              0x00007fbe077e1900  Adjusting hash table sizes.
                4    6  S   bucket-1      0:04.05    0:00.00   0:15.77    547886  NonIO   ClosedUnrefCheckpointRemoverTask  0x00007fbcc4de4a10  Removing closed unreferenced checkpoints from memory
                5    1  S   bucket-1     -0:00.19    0:00.00   0:03.74    185160  NonIO   DurabilityTimeoutTask             0x00007fbe077e1820  DurabilityTimeoutTask
                6    1  S   bucket-1          inf    0:00.00   0:00.00         0  NonIO   DurabilityCompletionTask          0x00007fbd457a3290  DurabilityCompletionTask
                7   10  S   bucket-1      0:02.89    0:00.00   0:00.00       929  NonIO   WorkLoadMonitor                   0x00007fbe077e1890  Monitoring a workload pattern
                8    7  S   bucket-1      0:02.07    0:00.03  02:54.28       447  NonIO   DefragmenterTask                  0x00007fbcc4de4a90  Memory defragmenter
                9    7  S   bucket-1      0:00.03    0:00.00   0:00.50     18576  NonIO   ItemCompressorTask                0x00007fbcc4de4b10  Item Compressor
               10    7  S   bucket-1          inf    0:00.00   0:00.00         0  NonIO   ItemFreqDecayerTask               0x00007fbe077e1970  Item frequency count decayer task
              913    5  S   bucket-1      0:00.26    0:00.00   0:00.15      4645  NonIO   ConnNotifierCallback              0x00007fbe077e52c0  DCP connection notifier
              914    8  S   bucket-1      0:00.29    0:00.00   0:00.18      4645  NonIO   ConnManager                       0x00007fbcc4de4b90  Connection Manager
            75825    1  R   bucket-1      0:00.00  *48:54.80   0:00.00       760  NonIO   ItemPagerVisitor                  0x00007fbcc516ef10  Item pager on vb:972
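
            As an aside, a generic sketch (assumptions only, not kv_engine's actual visitor interface) of what pausing between vBuckets could look like, so that a single visitor run would not have to hold its task slot for the whole sweep:

            #include <cstddef>
            #include <functional>
            #include <vector>

            struct VBucket { /* placeholder */ };

            // Visits vBuckets in order but can yield between them; the next run
            // resumes from where the previous one stopped.
            class PausableVBucketVisitor {
            public:
                explicit PausableVBucketVisitor(std::function<bool()> shouldYield)
                    : shouldYield(std::move(shouldYield)) {
                }

                // Returns true when all vBuckets have been visited, false if the
                // visitor paused and should be rescheduled.
                bool visitAll(std::vector<VBucket>& vbuckets) {
                    for (; next < vbuckets.size(); ++next) {
                        visit(vbuckets[next]);
                        if (shouldYield()) {
                            ++next; // resume from the following vBucket
                            return false;
                        }
                    }
                    next = 0;
                    return true;
                }

            private:
                void visit(VBucket&) { /* eject items for this vBucket */ }

                std::function<bool()> shouldYield;
                std::size_t next = 0;
            };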
            

            This is probably not related, but I spotted that we could end up running the item pager twice in a row with the same pager_phase. When we construct a PagingVisitor from ItemPager::run we pass it two values, available and pager_phase. In PagingVisitor we rename available to stateFinalizer; the TL;DR of what available does is ensure that only one PagingVisitor is actively running at a time (although 2 objects may exist). The purpose of pager_phase is to flip between paging out only replica items and only active and pending ones. We do replicas first, but otherwise the code looks to handle vBuckets of a different type fine. By setting available before pager_phase in PagingVisitor::complete we expose a small race condition where we could create a new PagingVisitor with a VBucketFilter containing the wrong vBuckets. There's no real adverse effect here, just an oddity.

            void PagingVisitor::complete() {
            ...
                bool inverse = false;
                (*stateFinalizer).compare_exchange_strong(inverse, true);
             
                if (owner == ITEM_PAGER && !isBelowLowWaterMark) {
                    if (*pager_phase == REPLICA_ONLY) {
                        *pager_phase = ACTIVE_AND_PENDING_ONLY;
                    } else if (*pager_phase == ACTIVE_AND_PENDING_ONLY && !isEphemeral) {
                        *pager_phase = REPLICA_ONLY;
                    }
                }
            ...
                if (ITEM_PAGER == owner) {
                    // Re-check memory which may wake up the ItemPager and schedule
                    // a new PagingVisitor with the next phase/memory target etc...
                    // This is done after we've signalled 'completion' by clearing
                    // the stateFinalizer, which ensures the ItemPager doesn't just
                    // ignore a request.
                    store.checkAndMaybeFreeMemory();
                }
            }
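
            To make the ordering concrete, here is a standalone simplification (assumed, not kv_engine code and not a proposed patch) showing why releasing the stateFinalizer before flipping pager_phase opens that window, and how swapping the two steps would close it:

            #include <atomic>
            #include <iostream>

            enum Phase { REPLICA_ONLY, ACTIVE_AND_PENDING_ONLY };

            std::atomic<bool> available{false};          // plays the role of stateFinalizer
            std::atomic<Phase> pagerPhase{REPLICA_ONLY};

            // Simplified current ordering: release first, flip the phase second.
            void completeCurrentOrder() {
                bool expected = false;
                available.compare_exchange_strong(expected, true);
                // <-- if ItemPager::run() interleaves here it can build a new
                //     PagingVisitor that still sees REPLICA_ONLY
                pagerPhase = ACTIVE_AND_PENDING_ONLY;
            }

            // Swapped ordering: flip the phase first, then release, so any visitor
            // created afterwards observes the already-updated phase.
            void completeSwappedOrder() {
                pagerPhase = ACTIVE_AND_PENDING_ONLY;
                bool expected = false;
                available.compare_exchange_strong(expected, true);
            }

            int main() {
                completeSwappedOrder();
                std::cout << "phase after completion: "
                          << (pagerPhase == ACTIVE_AND_PENDING_ONLY ? "active+pending"
                                                                    : "replica-only")
                          << '\n';
            }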
            

            ben.huddleston Ben Huddleston added a comment - - edited

            Another observation about ItemPager behaviour. I don't think it's entirely relevant, but writing these things down helps me to understand.

            We will only attempt to eject items for a vBucket if we are currently above the low watermark when the ItemPagerVisitor visits it. This is fine, as we generally want our cache to be as heavily populated as possible. One potential issue is that we never move the position at which we start attempting to eject. This means that if we iterate from vBucket 0 to 1023 and find that we satisfy our memory condition before we start processing vBucket 1, then we could end up only ejecting items from vBucket 0. Whilst it might be nice to eject evenly across vBuckets, that is probably a bunch more complexity, so I can see why this solution may be preferred. However, it causes another issue once we no longer satisfy the memory condition: we then have to iterate over the resident items in vBucket 0 before reaching the items in vBucket 1 and ejecting those. This can result in the residency ratios below (per vBucket). To actually eject anything we may need to iterate over a bunch of vBuckets for which there is little to eject, i.e. we could be doing more meaningful work if we were smarter about which vBuckets we chose to visit (a rough sketch of one alternative follows the stats dump below).

             vb_512:num_items:                        450081
             vb_512:num_non_resident:                 381703
             vb_513:num_items:                        450663
             vb_513:num_non_resident:                 382150
             vb_514:num_items:                        450726
             vb_514:num_non_resident:                 381939
             vb_515:num_items:                        450625
             vb_515:num_non_resident:                 381404
             vb_516:num_items:                        450000
             vb_516:num_non_resident:                 380575
             vb_517:num_items:                        450276
             vb_517:num_non_resident:                 381110
             vb_518:num_items:                        450521
             vb_518:num_non_resident:                 381414
             vb_519:num_items:                        449949
             vb_519:num_non_resident:                 380683
             vb_520:num_items:                        451530
             vb_520:num_non_resident:                 381974
             vb_521:num_items:                        451128
             vb_521:num_non_resident:                 381769
             vb_522:num_items:                        449920
             vb_522:num_non_resident:                 380838
             vb_523:num_items:                        450072
             vb_523:num_non_resident:                 380892
             vb_524:num_items:                        449612
             vb_524:num_non_resident:                 380937
             vb_525:num_items:                        450378
             vb_525:num_non_resident:                 381125
             vb_526:num_items:                        450408
             vb_526:num_non_resident:                 380890
             vb_527:num_items:                        450318
             vb_527:num_non_resident:                 381092
             vb_528:num_items:                        450414
             vb_528:num_non_resident:                 381215
             vb_529:num_items:                        449818
             vb_529:num_non_resident:                 380783
             vb_530:num_items:                        450280
             vb_530:num_non_resident:                 381356
             vb_531:num_items:                        449199
             vb_531:num_non_resident:                 379993
             vb_532:num_items:                        450861
             vb_532:num_non_resident:                 381849
             vb_533:num_items:                        450923
             vb_533:num_non_resident:                 381760
             vb_534:num_items:                        450848
             vb_534:num_non_resident:                 381815
             vb_535:num_items:                        449856
             vb_535:num_non_resident:                 380731
             vb_536:num_items:                        450347
             vb_536:num_non_resident:                 380929
             vb_537:num_items:                        450469
             vb_537:num_non_resident:                 381286
             vb_538:num_items:                        451326
             vb_538:num_non_resident:                 381781
             vb_539:num_items:                        450742
             vb_539:num_non_resident:                 380965
             vb_540:num_items:                        450546
             vb_540:num_non_resident:                 381443
             vb_541:num_items:                        450651
             vb_541:num_non_resident:                 380935
             vb_542:num_items:                        450378
             vb_542:num_non_resident:                 380745
             vb_543:num_items:                        449096
             vb_543:num_non_resident:                 380765
             vb_544:num_items:                        450630
             vb_544:num_non_resident:                 381212
             vb_545:num_items:                        449394
             vb_545:num_non_resident:                 380054
             vb_546:num_items:                        450716
             vb_546:num_non_resident:                 381346
             vb_547:num_items:                        449296
             vb_547:num_non_resident:                 380083
             vb_548:num_items:                        449780
             vb_548:num_non_resident:                 380511
             vb_549:num_items:                        450928
             vb_549:num_non_resident:                 381382
             vb_550:num_items:                        450952
             vb_550:num_non_resident:                 381425
             vb_551:num_items:                        450586
             vb_551:num_non_resident:                 381320
             vb_552:num_items:                        449970
             vb_552:num_non_resident:                 380720
             vb_553:num_items:                        450295
             vb_553:num_non_resident:                 380903
             vb_554:num_items:                        448957
             vb_554:num_non_resident:                 379767
             vb_555:num_items:                        450728
             vb_555:num_non_resident:                 381556
             vb_556:num_items:                        450115
             vb_556:num_non_resident:                 381008
             vb_557:num_items:                        449960
             vb_557:num_non_resident:                 380599
             vb_558:num_items:                        451207
             vb_558:num_non_resident:                 381730
             vb_559:num_items:                        450257
             vb_559:num_non_resident:                 381020
             vb_560:num_items:                        451045
             vb_560:num_non_resident:                 380890
             vb_561:num_items:                        451164
             vb_561:num_non_resident:                 381684
             vb_562:num_items:                        451750
             vb_562:num_non_resident:                 382231
             vb_563:num_items:                        449295
             vb_563:num_non_resident:                 380084
             vb_564:num_items:                        450125
             vb_564:num_non_resident:                 380313
             vb_565:num_items:                        449872
             vb_565:num_non_resident:                 380083
             vb_566:num_items:                        450060
             vb_566:num_non_resident:                 380965
             vb_567:num_items:                        449167
             vb_567:num_non_resident:                 379772
             vb_568:num_items:                        450353
             vb_568:num_non_resident:                 381161
             vb_569:num_items:                        450292
             vb_569:num_non_resident:                 380822
             vb_570:num_items:                        450184
             vb_570:num_non_resident:                 381129
             vb_571:num_items:                        450797
             vb_571:num_non_resident:                 380936
             vb_572:num_items:                        450152
             vb_572:num_non_resident:                 380469
             vb_573:num_items:                        450435
             vb_573:num_non_resident:                 380791
             vb_574:num_items:                        450997
             vb_574:num_non_resident:                 381364
             vb_575:num_items:                        450288
             vb_575:num_non_resident:                 380443
             vb_576:num_items:                        451140
             vb_576:num_non_resident:                 381473
             vb_577:num_items:                        450265
             vb_577:num_non_resident:                 380396
             vb_578:num_items:                        450665
             vb_578:num_non_resident:                 380896
             vb_579:num_items:                        450293
             vb_579:num_non_resident:                 380821
             vb_580:num_items:                        450737
             vb_580:num_non_resident:                 381338
             vb_581:num_items:                        449903
             vb_581:num_non_resident:                 380656
             vb_582:num_items:                        451139
             vb_582:num_non_resident:                 381495
             vb_583:num_items:                        451672
             vb_583:num_non_resident:                 381846
             vb_584:num_items:                        450945
             vb_584:num_non_resident:                 381504
             vb_585:num_items:                        450751
             vb_585:num_non_resident:                 380885
             vb_586:num_items:                        451265
             vb_586:num_non_resident:                 382089
             vb_587:num_items:                        451187
             vb_587:num_non_resident:                 381593
             vb_588:num_items:                        450684
             vb_588:num_non_resident:                 381105
             vb_589:num_items:                        451041
             vb_589:num_non_resident:                 380618
             vb_590:num_items:                        449456
             vb_590:num_non_resident:                 380135
             vb_591:num_items:                        449351
             vb_591:num_non_resident:                 380076
             vb_592:num_items:                        450654
             vb_592:num_non_resident:                 380725
             vb_593:num_items:                        450183
             vb_593:num_non_resident:                 380701
             vb_594:num_items:                        449797
             vb_594:num_non_resident:                 380579
             vb_595:num_items:                        449495
             vb_595:num_non_resident:                 380039
             vb_596:num_items:                        451094
             vb_596:num_non_resident:                 381595
             vb_597:num_items:                        450803
             vb_597:num_non_resident:                 381851
             vb_598:num_items:                        448500
             vb_598:num_non_resident:                 379597
             vb_599:num_items:                        450283
             vb_599:num_non_resident:                 380499
             vb_600:num_items:                        451035
             vb_600:num_non_resident:                 381653
             ...
            omitted for character limit constraint
             ...
             vb_900:num_items:                        450018
             vb_900:num_non_resident:                 284438
             vb_901:num_items:                        450399
             vb_901:num_non_resident:                 290288
             vb_902:num_items:                        450467
             vb_902:num_non_resident:                 288072
             vb_903:num_items:                        450693
             vb_903:num_non_resident:                 288046
             vb_904:num_items:                        449874
             vb_904:num_non_resident:                 284826
             vb_905:num_items:                        450560
             vb_905:num_non_resident:                 269810
             vb_906:num_items:                        450237
             vb_906:num_non_resident:                 270552
             vb_907:num_items:                        450479
             vb_907:num_non_resident:                 290102
             vb_908:num_items:                        450866
             vb_908:num_non_resident:                 290009
             vb_909:num_items:                        450559
             vb_909:num_non_resident:                 290911
             vb_910:num_items:                        450382
             vb_910:num_non_resident:                 291237
             vb_911:num_items:                        448973
             vb_911:num_non_resident:                 290752
             vb_912:num_items:                        449540
             vb_912:num_non_resident:                 282413
             vb_913:num_items:                        450460
             vb_913:num_non_resident:                 291336
             vb_914:num_items:                        450020
             vb_914:num_non_resident:                 281948
             vb_915:num_items:                        450348
             vb_915:num_non_resident:                 290427
             vb_916:num_items:                        449567
             vb_916:num_non_resident:                 281547
             vb_917:num_items:                        450243
             vb_917:num_non_resident:                 290004
             vb_918:num_items:                        450249
             vb_918:num_non_resident:                 295570
             vb_919:num_items:                        451401
             vb_919:num_non_resident:                 292040
             vb_920:num_items:                        450590
             vb_920:num_non_resident:                 284400
             vb_921:num_items:                        450408
             vb_921:num_non_resident:                 366718
             vb_922:num_items:                        451398
             vb_922:num_non_resident:                 371324
             vb_923:num_items:                        449490
             vb_923:num_non_resident:                 355837
             vb_924:num_items:                        449630
             vb_924:num_non_resident:                 296929
             vb_925:num_items:                        450274
             vb_925:num_non_resident:                 291541
             vb_926:num_items:                        450269
             vb_926:num_non_resident:                 290205
             vb_927:num_items:                        450895
             vb_927:num_non_resident:                 361227
             vb_928:num_items:                        450141
             vb_928:num_non_resident:                 295525
             vb_929:num_items:                        450719
             vb_929:num_non_resident:                 287819
             vb_930:num_items:                        449627
             vb_930:num_non_resident:                 296770
             vb_931:num_items:                        449820
             vb_931:num_non_resident:                 288032
             vb_932:num_items:                        450349
             vb_932:num_non_resident:                 297549
             vb_933:num_items:                        449864
             vb_933:num_non_resident:                 290804
             vb_934:num_items:                        449863
             vb_934:num_non_resident:                 297688
             vb_935:num_items:                        449605
             vb_935:num_non_resident:                 293406
             vb_936:num_items:                        449423
             vb_936:num_non_resident:                 285962
             vb_937:num_items:                        450883
             vb_937:num_non_resident:                 296426
             vb_938:num_items:                        451128
             vb_938:num_non_resident:                 290208
             vb_939:num_items:                        451093
             vb_939:num_non_resident:                 298281
             vb_940:num_items:                        450243
             vb_940:num_non_resident:                 288772
             vb_941:num_items:                        450521
             vb_941:num_non_resident:                 300273
             vb_942:num_items:                        451034
             vb_942:num_non_resident:                 292728
             vb_943:num_items:                        450023
             vb_943:num_non_resident:                 302783
             vb_944:num_items:                        449668
             vb_944:num_non_resident:                 279161
             vb_945:num_items:                        450769
             vb_945:num_non_resident:                 284896
             vb_946:num_items:                        450503
             vb_946:num_non_resident:                 293236
             vb_947:num_items:                        449669
             vb_947:num_non_resident:                 300920
             vb_948:num_items:                        450081
             vb_948:num_non_resident:                 279190
             vb_949:num_items:                        452479
             vb_949:num_non_resident:                 379809
             vb_950:num_items:                        451306
             vb_950:num_non_resident:                 289955
             vb_951:num_items:                        450818
             vb_951:num_non_resident:                 296086
             vb_952:num_items:                        450700
             vb_952:num_non_resident:                 305796
             vb_953:num_items:                        450603
             vb_953:num_non_resident:                 230290
             vb_954:num_items:                        450639
             vb_954:num_non_resident:                 284876
             vb_955:num_items:                        451442
             vb_955:num_non_resident:                 292939
             vb_956:num_items:                        448697
             vb_956:num_non_resident:                 300888
             vb_957:num_items:                        448788
             vb_957:num_non_resident:                 291602
             vb_958:num_items:                        451092
             vb_958:num_non_resident:                 285614
             vb_959:num_items:                        450962
             vb_959:num_non_resident:                 294350
             vb_960:num_items:                        450396
             vb_960:num_non_resident:                 301478
             vb_961:num_items:                        450327
             vb_961:num_non_resident:                 307386
             vb_962:num_items:                        450939
             vb_962:num_non_resident:                 286271
             vb_963:num_items:                        449433
             vb_963:num_non_resident:                 290914
             vb_964:num_items:                        450672
             vb_964:num_non_resident:                 301881
             vb_965:num_items:                        450489
             vb_965:num_non_resident:                 307013
             vb_966:num_items:                        450345
             vb_966:num_non_resident:                 283056
             vb_967:num_items:                        449835
             vb_967:num_non_resident:                 289686
             vb_968:num_items:                        450564
             vb_968:num_non_resident:                 312011
             vb_969:num_items:                        451624
             vb_969:num_non_resident:                 290523
             vb_970:num_items:                        450204
             vb_970:num_non_resident:                 297988
             vb_971:num_items:                        450167
             vb_971:num_non_resident:                 309843
             vb_972:num_items:                        450014
             vb_972:num_non_resident:                 101070
             vb_973:num_items:                        450327
             vb_973:num_non_resident:                 0
             vb_974:num_items:                        450493
             vb_974:num_non_resident:                 0
             vb_975:num_items:                        449555
             vb_975:num_non_resident:                 0
             vb_976:num_items:                        451809
             vb_976:num_non_resident:                 0
             vb_977:num_items:                        451384
             vb_977:num_non_resident:                 0
             vb_978:num_items:                        450292
             vb_978:num_non_resident:                 0
             vb_979:num_items:                        450374
             vb_979:num_non_resident:                 0
             vb_980:num_items:                        451438
             vb_980:num_non_resident:                 0
             vb_981:num_items:                        450397
             vb_981:num_non_resident:                 0
             vb_982:num_items:                        450977
             vb_982:num_non_resident:                 0
             vb_983:num_items:                        450836
             vb_983:num_non_resident:                 0
             vb_984:num_items:                        449955
             vb_984:num_non_resident:                 0
             vb_985:num_items:                        449671
             vb_985:num_non_resident:                 0
             vb_986:num_items:                        450592
             vb_986:num_non_resident:                 0
             vb_987:num_items:                        450581
             vb_987:num_non_resident:                 0
             vb_988:num_items:                        449746
             vb_988:num_non_resident:                 0
             vb_989:num_items:                        450621
             vb_989:num_non_resident:                 0
             vb_990:num_items:                        448733
             vb_990:num_non_resident:                 0
             vb_991:num_items:                        450500
             vb_991:num_non_resident:                 0
             vb_992:num_items:                        449949
             vb_992:num_non_resident:                 0
             vb_993:num_items:                        450673
             vb_993:num_non_resident:                 0
             vb_994:num_items:                        449951
             vb_994:num_non_resident:                 0
             vb_995:num_items:                        450864
             vb_995:num_non_resident:                 212122
             vb_996:num_items:                        450406
             vb_996:num_non_resident:                 0
             vb_997:num_items:                        450036
             vb_997:num_non_resident:                 0
            ...
            omitted for character limit constraint
            ...
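
            As mentioned above, here is a rough standalone sketch (an illustration only, not kv_engine code or a proposed fix) of a pass that remembers where it stopped and resumes from there, so ejection work is spread more evenly across vBuckets:

            #include <cstddef>

            // Visits vBuckets starting from wherever the previous pass finished and
            // stops as soon as the memory target is met; the next pass carries on
            // from that point instead of starting at vb:0 again.
            class RotatingEjectionPass {
            public:
                explicit RotatingEjectionPass(std::size_t numVBuckets)
                    : numVBuckets(numVBuckets) {
                }

                template <class EjectFn, class MemOkFn>
                void run(EjectFn ejectFrom, MemOkFn memoryRecovered) {
                    for (std::size_t visited = 0; visited < numVBuckets; ++visited) {
                        const std::size_t vbid = (start + visited) % numVBuckets;
                        ejectFrom(vbid);
                        if (memoryRecovered()) {
                            start = (vbid + 1) % numVBuckets; // resume point
                            return;
                        }
                    }
                }

            private:
                std::size_t numVBuckets;
                std::size_t start = 0;
            };

            With something like this, a pass that stops at vb:0 would make the next pass begin at vb:1 rather than re-scanning the already mostly non-resident vBuckets from the start.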
            


            ben.huddleston Ben Huddleston added a comment -

            Nothing in the code is jumping out as a new cause for this. Given that the magma builds are 6.5.0 plus a couple of cherry-picked commits, we know this is not a regression, which makes it hard to pinpoint the issue. Probably best now to re-run the test and see what I can find from some live debugging. In particular, I am interested in seeing whether the ItemPagerVisitor is progressing over vBuckets during the period of low ops/s or whether it is stuck on one vBucket.

            drigby Dave Rigby made changes -
            Attachment image.png [ 85115 ]
            drigby Dave Rigby made changes -
            Attachment image.png [ 85115 ]

            ben.huddleston Ben Huddleston added a comment -

            Due to this code in PagingVisitor::visitBucket(), if we have no replicas configured and stats with the empty string key have already been requested from the engine (I believe ns_server does this every second), we will skip item paging of active vBuckets until we hit the high water mark. Once over it, though, we start paging active vBuckets, and we can see this in the test. We should probably change this.

             // skip active vbuckets if active resident ratio is lower than replica
                auto current = static_cast<double>(stats.getEstimatedTotalMemoryUsed());
                auto lower = static_cast<double>(stats.mem_low_wat);
                auto high = static_cast<double>(stats.mem_high_wat);
                if (vb->getState() == vbucket_state_active && current < high &&
                    store.getActiveResidentRatio() < store.getReplicaResidentRatio()) {
                    return;
                }
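
            To spell that out, a tiny sketch with illustrative numbers, under the assumption that the replica resident ratio is reported as 100 (fully resident) when there are no replica items:

            #include <iostream>

            int main() {
                const double memUsed = 60.0;     // GB, below the high water mark (illustrative)
                const double memHighWat = 65.0;  // GB (illustrative)
                const double activeRR = 27.0;    // % resident, as seen in this test
                const double replicaRR = 100.0;  // % resident; no replica items (assumed)

                const bool skipActive = memUsed < memHighWat && activeRR < replicaRR;
                // Prints "true": active vBuckets are skipped until we cross the HWM.
                std::cout << "skip active vBucket: " << std::boolalpha << skipActive
                          << '\n';
            }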
            


            Live debugging of this last night was very helpful in spotting what went wrong. Thanks Dave Rigby for the help!

            We first spotted that the pager was progressing across vBuckets very slowly. It never got stuck, but it was taking ~30 seconds per vBucket.

            We attached gdb and found that the ItemPager had the following backtrace.

            (gdb) bt
            #0  load (__m=std::memory_order_seq_cst, this=<optimized out>) at /usr/local/include/c++/7.3.0/bits/atomic_base.h:396
            #1  operator std::__atomic_base<unsigned int>::__int_type (this=<optimized out>) at /usr/local/include/c++/7.3.0/bits/atomic_base.h:259
            #2  valueSize (this=<optimized out>) at /home/couchbase/jenkins/workspace/toy-unix/kv_engine/engines/ep/src/blob.h:68
            #3  valuelen (this=0x7fdcb724fc90) at /home/couchbase/jenkins/workspace/toy-unix/kv_engine/engines/ep/src/stored-value.h:503
            #4  size (this=0x7fdcb724fc90) at /home/couchbase/jenkins/workspace/toy-unix/kv_engine/engines/ep/src/stored-value.h:518
            #5  _ZN9HashTable10Statistics21StoredValuePropertiesC4EPK11StoredValue (this=this@entry=0x7fe397ffd1d0, sv=0x7fdcb724fc90) at /home/couchbase/jenkins/workspace/toy-unix/kv_engine/engines/ep/src/hash_table.cc:437
            #6  0x00007fe5a3826333 in StoredValueProperties (sv=<optimized out>, this=0x7fe397ffd1d0) at /home/couchbase/jenkins/workspace/toy-unix/kv_engine/engines/ep/src/hash_table.cc:429
            #7  prologue (v=<optimized out>, this=0x7fe316784870) at /home/couchbase/jenkins/workspace/toy-unix/kv_engine/engines/ep/src/hash_table.cc:450
            #8  HashTable::unlocked_ejectItem (this=0x7fe316784818, vptr=@0x7fe397ffd238: 0x7fdcb724fc90, policy=Full) at /home/couchbase/jenkins/workspace/toy-unix/kv_engine/engines/ep/src/hash_table.cc:1077
            #9  0x00007fe5a385a4be in PagingVisitor::doEviction (this=this@entry=0x7fe5a4e20a80, lh=..., v=<optimized out>, v@entry=0x7fdcb724fc90) at /home/couchbase/jenkins/workspace/toy-unix/kv_engine/engines/ep/src/paging_visitor.cc:367
            #10 0x00007fe5a385a654 in PagingVisitor::visit (this=0x7fe5a4e20a80, lh=..., v=...) at /home/couchbase/jenkins/workspace/toy-unix/kv_engine/engines/ep/src/paging_visitor.cc:139
            #11 0x00007fe5a38283a0 in HashTable::pauseResumeVisit (this=this@entry=0x7fe316784818, visitor=..., start_pos=...) at /home/couchbase/jenkins/workspace/toy-unix/kv_engine/engines/ep/src/hash_table.cc:1036
            #12 0x00007fe5a3828681 in HashTable::visit (this=0x7fe316784818, visitor=...) at /home/couchbase/jenkins/workspace/toy-unix/kv_engine/engines/ep/src/hash_table.cc:933
            #13 0x00007fe5a385b6a3 in PagingVisitor::visitBucket (this=0x7fe5a4e20a80, vb=...) at /home/couchbase/jenkins/workspace/toy-unix/kv_engine/engines/ep/src/paging_visitor.cc:239
            #14 0x00007fe5a383fe74 in VBCBAdaptor::run (this=0x7fe314f2ad70) at /home/couchbase/jenkins/workspace/toy-unix/kv_engine/engines/ep/src/kv_bucket.cc:2312
            #15 0x00007fe5a381bd84 in ExecutorThread::run (this=0x7fe4e310d780) at /home/couchbase/jenkins/workspace/toy-unix/kv_engine/engines/ep/src/executorthread.cc:190
            #16 0x00007fe5ac534f17 in run (this=0x7fe4e3341130) at /home/couchbase/jenkins/workspace/toy-unix/platform/src/cb_pthreads.cc:58
            #17 platform_thread_wrap (arg=0x7fe4e3341130) at /home/couchbase/jenkins/workspace/toy-unix/platform/src/cb_pthreads.cc:71
            #18 0x00007fe5a9f55e65 in start_thread () from /lib64/libpthread.so.0
            #19 0x00007fe5a9c7e88d in clone () from /lib64/libc.so.6
            

            As a one-off this isn't a particularly interesting backtrace, but we continued the process and stopped it a few more times, and every time it had the same backtrace (though looking at different StoredValues). The issue here is that it's taking a very long time to load the Blob object to get the size.
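            For context, a minimal sketch of why each visited item can stall the pager, assuming simplified stand-ins for the Blob and StoredValue types seen in the backtrace (these are not the actual kv_engine definitions): reading the value size dereferences a separately heap-allocated Blob, so if that allocation's page has been swapped out the load becomes a major page fault served from disk.

            #include <atomic>
            #include <cstddef>
            #include <cstdint>

            // Simplified stand-ins for the types in the backtrace (blob.h / stored-value.h).
            struct Blob {
                std::atomic<uint32_t> size;  // value length; lives in a separate heap allocation
                // ... value bytes follow in the real implementation
            };

            struct StoredValue {
                Blob* value;  // may point at memory the kernel has paged out

                size_t valuelen() const {
                    // This load touches the Blob's page; under swap it can take
                    // milliseconds instead of nanoseconds, which is what serialised
                    // the ItemPager here.
                    return value ? value->size.load() : 0;
                }

                size_t size() const { return sizeof(*this) + valuelen(); }
            };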

            Observed in top that vsize of memcached is high (110GB+) when we have a quota of just under 70GB. This could be down to the allocator and allocation size, but the difference is huge and vmstat confirms that we are going into swap.

            [root@cen-sa34 benh]# vmstat -w 1
            procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
             r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st
             0  1      1642072      1367916            0     13436428   12   21   339  2180  118  122   8  11  78   2   0
             1  1      1641024      1365028            0     13438324 1044    0  1044    25 7458 11248   2   0  96   2   0
             0  1      1640080      1368380            0     13438328  944    0   944     0 5715 8735   1   0  97   2   0
             1  1      1639016      1363956            0     13438380 1072    0  1072   112 6548 22226   1   1  96   2   0
             0  1      1638024      1353268            0     13421316  980    0   980    48 9723 18208   2   2  95   1   0
             0  1      1637036      1357848            0     13421292  988    0   988     0 4380 8996   1   0  98   1   0
             0  1      1635956      1359276            0     13421316 1060    0  1060     0 3901 7577   1   0  98   1   0
             0  1      1634804      1366064            0     13421304 1132    0  1132    24 4734 11646   1   0  98   1   0
             1  1      1633720      1378112            0     13421312 1076    0  1076    24 5359 13432   1   1  97   1   0
            11  1      1632676      1366676            0     13421424 1036    0  1036    71 8865 17870   2   2  95   1   0
             5  1      1631712      1356148            0     13421440  944    0   944    24 3915 9648   1   1  98   1   0
             3  1      1630668      1372672            0     13421404 1048    0  1048     0 2959 9997   0   0  99   1   0
             2  1      1629684      1373872            0     13421408  968    0   968   113 2787 7271   0   0  99   1   0
             3  1      1628652      1352396            0     13421440 1032    0  1032    55 4012 11587   1   1  98   1   0
             2  1      1627608      1365132            0     13408972 1032    0  1032    74 7306 18997   2   2  95   1   0
             1  1      1626592      1368312            0     13408976 1016    0  1016     0 2698 7696   0   0  99   1   0
             1  1      1625580      1369728            0     13408984 1000    0  1000     0 2477 7138   0   0  99   1   0
             2  1      1624644      1374112            0     13408940  940    0   940     0 3250 11294   1   0  98   1   0
             1  1      1623976      1374200            0     13409004  652    0   652    80 3089 9337   0   0  98   1   0
             6  1      1622924      1379852            0     13398912 1044    0  1044 15442 8059 20857   2   2  95   1   0
             1  1      1621636      1375612            0     13398908 1284    0  1284   268 4515 11280   1   1  96   2   0
             2 15      1619632      1330248            0     13436692 2092    0 11908 31884 115684 390292   5  22  60  13   0
             2  1      1618556      1300520            0     13459088 1024    0  2788 10680 19594 87897   2   7  90   1   0
             2  1      1617552      1286060            0     13459080  976    0   976    41 7322 14254   1   1  97   1   0
             1  1      1616596      1295828            0     13447032  952    0   952    48 8554 17904   2   1  96   1   0
             1  1      1615608      1294072            0     13447060  976    0   976    20 5134 9822   1   1  98   1   0
             0  1      1614536      1296592            0     13447072 1072    0  1072    24 5037 20011   1   0  98   1   0
             0  1      1613492      1299264            0     13447064 1028    0  1028     0 3804 9500   1   0  98   1   0
             1  1      1612452      1300452            0     13447124 1032    0  1032    40 4460 10995   1   0  98   1   0
             0  1      1611328      1295972            0     13447096 1124    0  1124  3912 8701 19543   2   2  96   1   0
             1  2      1610096      1296420            0     13447176 1216    0  1216    24 5366 11185   1   1  97   1   0
             4  2      1608552      1295536            0     13447200 1532    0  1532   251 3804 9426   1   0  97   2   0
            60  4      1605356      1201252            0     13539184 3300    0 22180 55048 204821 493961   8  24  57  11   0
             1  1      1603952      1152516            0     13590292 1416    0  6140 11922 22464 97023   2   6  88   3   0
             2  1      1603012      1142484            0     13590288  936    0   936 23120 9636 18338   2   1  96   1   0
             3  1      1601976      1145092            0     13590304 1032    0  1032     0 6745 11139   1   1  98   1   0
             1  1      1600928      1145932            0     13590308 1040    0  1040    25 5130 9841   1   0  98   1   0
             0  1      1599876      1148984            0     13590312 1052    0  1052    73 4743 18776   1   0  98   1   0
             3  1      1598852      1146296            0     13590340 1012    0  1012    15 4967 12693   1   0  98   1   0
             1  1      1597792      1139832            0     13590400 1048    0  1048   153 7942 19681   2   2  96   1   0
             2  1      1596732      1139608            0     13590396 1052    0  1052    25 5371 10064   1   1  97   1   0
             2  1      1595764      1133724            0     13590428  964    0   964    25 5256 10780   1   1  97   1   0
             0  1      1594660      1132572            0     13590464 1084    0  1084   184 4255 8671   1   0  97   2   0
             2  2      1593532      1132488            0     13590628 1128    0  1128     0 4426 11861   1   1  97   2   0
             2  2      1592024      1127744            0     13578892 1504    0  1504    56 6478 16446   1   1  94   3   0
             7  1      1590948      1131708            0     13578900 1060    0  1060     0 3806 10202   1   0  97   2   0
             4  3      1589968      1135400            0     13578880  964    0  1008  1502 5461 17701   1   1  97   2   0
             5  2      1587644      1054152            0     13663292 2380    0 18944 49436 188902 486739   7  29  54  10   0
             1  1      1586652       994556            0     13703448  992    0  4876 10110 22103 87929   3   5  90   2   0
             0  1      1585628       982532            0     13703528 1020    0  1020    49 10401 19044   2   1  95   2   0
             2  1      1584648       989112            0     13703500  980    0   980     0 6842 11985   1   0  97   2   0
             1  1      1583768       986736            0     13703552  880    0   880    50 8096 12151   1   1  96   2   0
             1  1      1582740       992416            0     13703656 1024    0  1136  9852 6420 18544   1   1  97   2   0
             1  1      1581708       989876            0     13703716 1012    0  1012   399 5863 12318   1   0  97   2   0
             0  1      1580624       999480            0     13691968 1080    0  1080 11622 8253 16440   2   2  95   2   0
             0  1      1579600      1010652            0     13691956 1020    0  1020    24 3829 10467   1   0  97   2   0
             0  1      1578612      1013836            0     13691996 1000    0  1000     0 4620 9498   1   0  97   2   0
            

            Looks like some memory is not being tracked correctly (causing us to use more and run into swap).

            ben.huddleston Ben Huddleston made changes -
            Assignee Ben Huddleston [ ben.huddleston ] Sarath Lakshman [ sarath ]
            sarath Sarath Lakshman added a comment - - edited

            Thanks for looking into the problem.

            I noticed the partial swap while I ran the test, but I didn't expect it to be significant enough to stall the process (there weren't any increments in the allocstall counter when the slowdown started).

            We were already facing this issue and had been tracking the direct reclaim and allocation stall problems separately. The page cache was unable to keep up with the rate at which kernel-cached objects were being allocated and deallocated (even when there was sufficient free memory and all pages were non-dirty). We had to switch to direct I/O as the filesystem/kernel/XFS caching isn't scaling.
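            For illustration, a minimal sketch of the direct I/O approach mentioned above (the file path and sizes are made up, and this is not the magma implementation): O_DIRECT bypasses the kernel page cache but requires block-aligned buffers, offsets and lengths.

            #ifndef _GNU_SOURCE
            #define _GNU_SOURCE
            #endif
            #include <fcntl.h>
            #include <unistd.h>
            #include <cstdlib>

            int main() {
                // Hypothetical file path, purely for illustration.
                int fd = open("/data/magma/example.data", O_RDONLY | O_DIRECT);
                if (fd < 0) {
                    return 1;
                }
                // O_DIRECT requires the buffer, offset and length to be aligned to
                // the logical block size (4096 assumed here).
                void* buf = nullptr;
                if (posix_memalign(&buf, 4096, 1 << 20) != 0) {
                    close(fd);
                    return 1;
                }
                // The read goes straight to the device, avoiding page cache churn.
                ssize_t n = pread(fd, buf, 1 << 20, 0);
                free(buf);
                close(fd);
                return n < 0 ? 1 : 0;
            }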

            Reran the test with a toy build including direct i/o enhancements. The test is progressing well without issues. Loaded 1B+ so far without any stall.

            The virtual memory size of the process doesn't seem to be an accounting issue. The process resident and jemalloc resident stats are matching.

            Isn't it usual for jemalloc to have a large virtual memory size but a much smaller resident size? (At least I noticed that in higher ops/sec tests with zero swap usage.)

            sarath Sarath Lakshman added a comment - - edited

            The run with direct i/o progressed up to loading 10.2B docs and ran into swapping.

            Magma created a significant number of files (5M), as expected. The ns_server godu process iterates through all of those files every second (or at similarly short intervals). This brings all of the inodes and dentries into memory; godu effectively acts as an infinite loop that tries to keep 100% of the filesystem metadata in memory. This creates vmpressure, since the entire filesystem metadata cannot be kept fully in memory, and it caused swapping.
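            As an illustration of the effect (this is not the godu code, which is a Go process; the path and interval are assumptions), a du-style scan that stats every file on a short interval forces the kernel to keep every dentry and inode resident:

            #include <chrono>
            #include <cstdint>
            #include <filesystem>
            #include <system_error>
            #include <thread>

            int main() {
                namespace fs = std::filesystem;
                for (;;) {
                    std::uintmax_t total = 0;
                    // Walking ~5M files touches ~5M dentries/inodes on every pass,
                    // so the kernel ends up trying to keep all of that metadata cached.
                    for (const auto& entry : fs::recursive_directory_iterator(
                                 "/data", fs::directory_options::skip_permission_denied)) {
                        std::error_code ec;
                        if (entry.is_regular_file(ec) && !ec) {
                            auto sz = entry.file_size(ec);  // stat() on every file
                            if (!ec) {
                                total += sz;
                            }
                        }
                    }
                    (void)total;  // a real du would report this
                    std::this_thread::sleep_for(std::chrono::seconds(1));
                }
            }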

            We have a plan to reduce the number of files. Still, it is going to be on the order of 16MB or 32MB per file, and the godu process will not scale (vmpressure-wise) with the number of files.

            sarath Sarath Lakshman added a comment - - edited

             

            slabtop output:

             
            Active / Total Objects (% used)    : 41794189 / 60179910 (69.4%)
             Active / Total Slabs (% used)      : 1378377 / 1378377 (100.0%)
             Active / Total Caches (% used)     : 68 / 100 (68.0%)
             Active / Total Size (% used)       : 10199466.90K / 13702587.58K (74.4%)
             Minimum / Average / Maximum Object : 0.01K / 0.23K / 16.69K
             
              OBJS ACTIVE  USE OBJ SIZE  SLABS OBJ/SLAB CACHE SIZE NAME
            26787432 9544323  35%    0.19K 637796       42   5102368K dentry
            13761856 13759786  99%    0.06K 215029       64    860116K kmalloc-64
            5630655 5630276  99%    0.08K 110405       51    441620K Acpi-State
            5610222 5609944  99%    1.06K 265662       30   8501184K xfs_inode
            5285107 5284903  99%    0.15K  99719       53    797752K xfs_ili
            374912 289450  77%    0.03K   2929      128     11716K kmalloc-32
            373248 211264  56%    0.02K   1458      256      5832K kmalloc-16
            318045 106955  33%    0.10K   8155       39     32620K buffer_head
            283458 145845  51%    0.38K   6749       42    107984K mnt_cache
            266000  39069  14%    0.50K   4165       64    133280K kmalloc-512
            250560  77845  31%    0.25K   3915       64     62640K kmalloc-256
            205800 205577  99%    0.07K   3675       56     14700K Acpi-Operand
            145887 142911  97%    0.57K   3656       56    116992K radix_tree_node
            145350 145350 100%    0.02K    855      170      3420K scsi_data_buffer
            114176 114176 100%    0.01K    223      512       892K kmalloc-8
            110292  32043  29%    0.09K   2626       42     10504K kmalloc-96
            
            

             

            sarath Sarath Lakshman made changes -
            Link This issue relates to MB-38079 [ MB-38079 ]
            sarath Sarath Lakshman made changes -
            Component/s couchbase-bucket [ 10173 ]
            wayne Wayne Siu made changes -
            Labels magma magma-dp affects-cc-testing magma magma-dp
            wayne Wayne Siu made changes -
            Labels affects-cc-testing magma magma-dp magma magma-dp
            sarath Sarath Lakshman made changes -
            Epic Link MB-30659 [ 88207 ] CBSS-305 [ 112304 ]
            sarath Sarath Lakshman made changes -
            Sprint KV Sprint 2020-February [ 979 ] Magma: Jan 20 - Feb 2 [ 968 ]
            sarath Sarath Lakshman made changes -
            Rank Ranked higher
            sarath Sarath Lakshman made changes -
            Summary Write ops/s dropped from 230k/s to 400/s with tmp_oom errors even when there is sufficient memory 1% DGM Test: Write ops/s dropped from 230k/s to 400/s due to swapping
            srinath.duvuru Srinath Duvuru made changes -
            Sprint Magma: Jan 20 - Feb 2 [ 968 ] Magma: Jan 20 - Feb 2, ForestDB: Oct 26 - Nov 14 [ 968, 999 ]
            srinath.duvuru Srinath Duvuru made changes -
            Sprint Magma: Jan 20 - Feb 2, ForestDB: Oct 26 - Nov 14 [ 968, 999 ] Magma: Jan 20 - Feb 2 [ 968 ]
            sarath Sarath Lakshman made changes -
            Sprint Magma: Jan 20 - Feb 2 [ 968 ] Magma: Jan 20 - Feb 2, Magma: Jan 20 - Feb 3 [ 968, 1000 ]
            bo-chun.wang Bo-Chun Wang added a comment -

            I re-ran the 1% DGM test with direct I/O enabled. The build I used is 1006.5.1-1071, and the test ran with replica=1.

            http://perf.jenkins.couchbase.com/job/rhea-dev/239/

            The test took more than 2 days to load 2.5B docs. I have collected logs here.

            https://s3.amazonaws.com/bugdb/jira/qe/collectinfo-2020-03-02T181159-ns_1%40172.23.97.26.zip

            sarath Sarath Lakshman added a comment - - edited

            We made some tweaks to ensure that the number of files, and the godu process reading file metadata, do not cause swapping.
            The test ran into swap again. The stats indicate that memcached RSS is growing above the high watermark and running into swap. In this test we have allocated 40% of the memory quota to magma, but magma is only using around 20GB:

            $ grep -o 'TotalMemUsed":[0-9]*' cbstats_kvstore.txt | cut -d: -f2 | awk '{ s+=$1 } END{ print s }'
            21568815839
            

             

            [root@cen-sa35 ~]# cat cbstats_memory.txt
             bytes: 67991669576
             ep_blob_num: 21745903
             ep_blob_overhead: 1961746832
             ep_item_num: 21746297
             ep_kv_size: 16700614293
             ep_max_size: 68719476736
             ep_mem_high_wat: 58411555225
             ep_mem_high_wat_percent: 0.85
             ep_mem_low_wat: 51539607552
             ep_mem_low_wat_percent: 0.75
             ep_oom_errors: 0
             ep_overhead: 17108587431
             ep_storedval_num: 4977
             ep_storedval_overhead: 1961746832
             ep_storedval_size: 557424
             ep_tmp_oom_errors: 425353
             ep_value_size: 16700111616
             mem_used: 67991669576
             mem_used_estimate: 67996003920
             mem_used_merge_threshold: 6135667
             total_allocated_bytes: 68104601056
             total_fragmentation_bytes: 12496273952
             total_heap_bytes: 81810956288
             total_metadata_bytes: 1148322352
             total_resident_bytes: 81737576448
             total_retained_bytes: 54502875136

             

            We are trying to figure out why memcached is using a higher amount of memory on one node.

            The nodes are showing the resident ratio as 0% in the UI, but the replica checkpoints seem to be consuming a lot of the memory as per the stats.

             

            mem_used = 67991669576

            magma_usage = 21568815839

            checkpoint_usage = 47710644649 (replica checkpoint memory + overhead)

            magma + checkpoint usage (~69.3GB) approximately matches mem_used (~68.0GB).

            [root@cen-sa35 ~]# grep memory cbstats_all.txt
            ep_checkpoint_memory: 32755579609
            ep_checkpoint_memory_overhead: 14955246160
            ep_checkpoint_memory_unreferenced: 0
            ep_cursor_memory_freed: 242780
            ep_diskqueue_memory: 0
            ep_meta_data_memory: 10593890
            ep_warmup_min_memory_threshold: 100
            vb_active_checkpoint_memory: 106368
            vb_active_checkpoint_memory_overhead: 74752

            vb_active_checkpoint_memory_unreferenced: 0
            vb_active_ht_memory: 330752
            vb_active_itm_memory: 0
            vb_active_itm_memory_uncompressed: 0
            vb_active_meta_data_memory: 0
            vb_active_queue_memory: 0
            vb_pending_checkpoint_memory: 0
            vb_pending_checkpoint_memory_overhead: 0
            vb_pending_checkpoint_memory_unreferenced: 0
            vb_pending_ht_memory: 0
            vb_pending_itm_memory: 0
            vb_pending_itm_memory_uncompressed: 0
            vb_pending_meta_data_memory: 0
            vb_pending_queue_memory: 0
            vb_replica_checkpoint_memory: 32755473241
            vb_replica_checkpoint_memory_overhead: 14955171408

            vb_replica_checkpoint_memory_unreferenced: 0
            vb_replica_ht_memory: 972928
            vb_replica_itm_memory: 80745940
            vb_replica_itm_memory_uncompressed: 112943380
            vb_replica_meta_data_memory: 10593890
            vb_replica_queue_memory: 0

             

            The write queue is ~0, replica queue is ~21k and I do not see any other backlog building up.

            Ben Huddleston, could you take a look?

             

            I didn't want to run cbcollect as the box is already in swap. The system is currently live (172.23.97.26).

            sarath Sarath Lakshman made changes -
            Assignee Sarath Lakshman [ sarath ] Ben Huddleston [ ben.huddleston ]
            sarath Sarath Lakshman made changes -
            Component/s couchbase-bucket [ 10173 ]
            sarath Sarath Lakshman made changes -
            Link This issue relates to MB-36370 [ MB-36370 ]
            ben.huddleston Ben Huddleston made changes -
            ben.huddleston Ben Huddleston made changes -
            ben.huddleston Ben Huddleston made changes -
            ben.huddleston Ben Huddleston added a comment - - edited

            Logs - https://supportal.couchbase.com/snapshot/25a28b542bd29f28159f19a7d5635fd3%3A%3A0

            We've got a 2-node cluster again; in this run node 25 ended up with mem usage stuck above the high water mark.

            Whilst the test did at some point start using swap, it ended up in a livelock state with 0 swap usage. vmstat from node 25:

            [root@cen-sa34 ~]# vmstat -w 1
            procs -----------------------memory---------------------- ---swap-- -----io---- -system-- --------cpu--------
             r  b         swpd         free         buff        cache   si   so    bi    bo   in   cs  us  sy  id  wa  st
             3  0       147096      2531052            0      1673056    0    0  2398  3106    7   10  13   5  79   3   0
             2  0       147096      2536880            0      1673080    0    0     0     0 5298 28510   4   1  95   0   0
             3  0       147096      2533416            0      1673080    0    0     0    48 7506 29642   4   1  95   0   0
             4  0       147096      2527632            0      1673168    0    0     0    24 6141 31915   4   1  95   0   0
             4  0       147096      2529232            0      1673168    0    0     0    76 5819 31291   4   1  95   0   0
             4  0       147096      2532092            0      1673196    0    0     0   244 6225 27959   4   1  95   0   0
             2  0       147096      2538680            0      1673236    0    0     0    92 5856 31254   4   1  95   0   0
             4  0       147096      2537208            0      1672492    0    0     0   857 7133 31722   4   1  95   0   0
             5  0       147096      2532876            0      1673260    0    0     0    24 6718 31717   4   1  95   0   0
             4  0       147096      2529096            0      1673260    0    0     0    87 7977 36804   4   2  94   0   0

            Observed in the UI that node 25 has 350k DCP backoffs per second. Node 25 also has a 0 disk write queue and 0 replication queue. Node 26 has a 0 disk write queue and a ~21k replication queue. The replication queue stat is a touch misleading as it only accounts for the in-memory queue; we can see below that the actual outstanding number of items is much larger (500M+).

            Node 26 isn't getting any replica items as node 25 is above the HWM.

            2020-03-04T01:24:49.659863-08:00 INFO (bucket-1) DCP backfilling task temporarily suspended because the current memory usage is too high
            

            Spotted that the test was run with the config parameter "cursor_dropping_checkpoint_mem_upper_mark" set to 100. This prevents us from dropping cursors based on checkpoint memory usage. We will still cursor drop when over 95% memory usage.
            What is happening here is that we are dropping cursors when we go above 95% mem_used, forcing streams to backfill from disk. At first this is fine as the disk backfills are relatively small. The disk backfills quickly get larger though, and we end up in a state where we have to send a large disk snapshot/checkpoint to the replica. This causes an issue on node 25 as we end up with a large amount of replica checkpoint memory overhead. The overhead memory is the keyIndex map. This is used for de-duplication of items on the active node, and for sanity checks on the replica node. We added these sanity checks for SyncWrites, but it might be worth revisiting them (at least for disk checkpoints) as this has led this cluster to end up in an unrecoverable state. I'll investigate this further. See also MB-35889.
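            For reference, a simplified sketch of the two cursor-dropping triggers discussed in this comment (parameter names follow the config/stat keys mentioned here; the real kv_engine logic has more conditions, so treat this as an approximation rather than the actual implementation): with the checkpoint upper mark set to 100% of the quota, only the overall memory-usage trigger can fire.

            // Approximate cursor-dropping decision, expressed in terms of the
            // settings shown in this comment. Not the actual kv_engine code.
            bool shouldDropCursors(double memUsed,
                                   double bucketQuota,
                                   double checkpointMem,
                                   double cursorDroppingUpperMarkPct,      // ~95 by default
                                   double checkpointMemUpperMarkPct) {     // 100 in this test
                const bool overallPressure =
                        memUsed > (cursorDroppingUpperMarkPct / 100.0) * bucketQuota;
                const bool checkpointPressure =
                        checkpointMem > (checkpointMemUpperMarkPct / 100.0) * bucketQuota;
                return overallPressure || checkpointPressure;  // second never fires at 100%
            }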

            ep_checkpoint_memory:                                  27491511930
             ep_checkpoint_memory_overhead:                         27491358336
             ep_checkpoint_memory_unreferenced:                     0
             ep_cursor_dropping_checkpoint_mem_lower_mark:          30
             ep_cursor_dropping_checkpoint_mem_upper_mark:          100
             vb_active_checkpoint_memory:                           106368
             vb_active_checkpoint_memory_overhead:                  74752
             vb_active_checkpoint_memory_unreferenced:              0
             vb_pending_checkpoint_memory:                          0
             vb_pending_checkpoint_memory_overhead:                 0
             vb_pending_checkpoint_memory_unreferenced:             0
             vb_replica_checkpoint_memory:                          27491405562
             vb_replica_checkpoint_memory_overhead:                 27491283584
            

            We could also argue that this is a sizing/test issue though. Whilst memcached/kv_engine is capable of ingesting items faster than we can replicate them, it is not possible to do so for the sustained periods that we try to in this test. This relates to MB-36370.

            Suggested that this test might work better with cursor dropping disabled entirely. The cluster might end up in a stuck state if our memory usage gets above the backoff for replication, but if not then this should put backpressure on the clients to reduce the ops/s.

            ben.huddleston Ben Huddleston made changes -

            ben.huddleston Ben Huddleston added a comment -

            Discovered that our checkpoint memory overhead accounting is not quite correct. In the graph below we are tracking ~27GB of memory usage as checkpoint memory overhead, but this doesn't tally up with the overall bucket quota. This tracking does not include the memory used for the keys of the keyIndex map if they have been allocated on the heap (i.e. over 16 bytes). In this test they look to fall in the 48-byte size class.
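            To make the gap concrete, a hypothetical sketch (the container and entry types are stand-ins, not the real CheckpointManager code): the overhead stat effectively counts the map entry, but a key longer than the string's inline (SSO) buffer carries an extra heap allocation — here landing in jemalloc's 48-byte bin — which is never added to the stat.

            #include <cstddef>
            #include <cstdint>
            #include <string>
            #include <unordered_map>

            struct IndexEntry { int64_t mutationId; };  // hypothetical payload

            using KeyIndex = std::unordered_map<std::string, IndexEntry>;

            // Roughly what the overhead stat captures per entry: the map node itself.
            size_t trackedBytes(const KeyIndex& index) {
                return index.size() * sizeof(std::pair<const std::string, IndexEntry>);
            }

            // What it misses: keys too long for the string's inline buffer own a
            // separate heap allocation (the 48-byte bin in this test).
            size_t untrackedKeyBytes(const KeyIndex& index) {
                size_t bytes = 0;
                for (const auto& kv : index) {
                    const std::string& key = kv.first;
                    if (key.capacity() > 15) {  // 15 = typical libstdc++ SSO capacity (an assumption)
                        bytes += key.capacity() + 1;  // ~300M such keys here => 14+GB untracked
                    }
                }
                return bytes;
            }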

            We have 27GB tracked checkpoint memory overhead and 300M items in checkpoints (all expelled).

            EMEAUKMAC01:cbcollect_info_ns_1@172.23.97.25_20200304-094130 benhuddleston$ grep -E "num_checkpoint_items" stats.log | awk '{sum+=$2} END {print sum}'
            306795137
            

            If we look at the jemalloc bins we can see that we have ~300M live allocations in each of the 48-byte and 80-byte bins. The 48-byte bin alone accounts for ~14.8GB, so we are failing to track a substantial amount of memory in this stat (14+GB).

            bins:           size ind    allocated      nmalloc (#/sec)      ndalloc (#/sec)    nrequests   (#/sec)  nshards      curregs     curslabs  nonfull_slabs regs pgs   util       nfills (#/sec)     nflushes (#/sec)       nslabs     nreslabs (#/sec)      n_lock_ops (#/sec)       n_waiting (#/sec)      n_spin_acq (#/sec)  n_owner_switch (#/sec)   total_wait_ns   (#/sec)     max_wait_ns  max_n_thds
                               8   0      1926648     37355684     670     37114853     666   4785767274     85962        1       240831          471              2  512   1  0.998     34095458     612     25375131     455          506     34000638     610        59921306    1076            2205       0          103886       1        43881071     788       665999965     11962         7999999          29
                              16   1       358736    894023238   16058    894000817   16058  18882685395    339171        1        22421          217            189  256   1  0.403    141763627    2546     42108314     756        13604    158128281    2840       184349647    3311           21857       0          583891      10        76549539    1374      3057999822     54927        11999999          36
                              32   2    387728320   6528711472  117268   6516594962  117051 293136306865   5265322        1     12116510       300351         281186  128   1  0.315    244389628    4389     99747117    1791      1060328   3102906741   55734       346429320    6222        17918594     321         7891174     141       169292408    3040    757713957567  13610079        46999997          45
                              48   3  14800610496  16305665661  292882  15997319609  287344  53515270042    961242        1    308346052      1204648           4656  256   3  0.999    271659621    4879    166924684    2998      2369791   7993264885  143575       442582105    7949        45681278     820        14907929     267       269387825    4838   2720663846225  48868640        33999998          71
                              64   4     42451456   1869308009   33576   1868644705   33564  29098736660    522672        1       663304        10845           1489   64   1  0.955    127773395    2295     57620784    1034        33693   1453084193   26100       185900894    3339         3134399      56         3034303      54       104610733    1879    116309993702   2089163        38999998          64
                              80   5  24596145680   3655260945   65655   3347809124   60133   7842356661    140864        1    307451821      1200985             28  256   5  0.999    188892937    3392     45798976     822      2313219    963237264   17301       238567540    4285         5936948     106         2935421      52        74723924    1342    205235988554   3686454        21999999          28
            

            Raised MB-38173 for this.


            ben.huddleston Ben Huddleston added a comment -

            Will continue the investigation into the checkpoint memory overhead as part of MB-35889. That ticket already attempted to address the same issues, but the code had to be reverted as it caused issues with sync writes.

            Recommend re-running this test either with cursor dropping disabled entirely (cursor_dropping_upper_mark=100), with a lower mutation_mem_threshold to try to get the clients to back off, or just with the client throughput reduced to just below the throughput of in-memory DCP replication.
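            For reference, a generic sketch of the client-side behaviour that a lower mutation_mem_threshold relies on (purely illustrative; store() and its status are hypothetical stand-ins, not an SDK API): when the server returns a temporary failure the writer backs off and retries, which throttles the load rate.

            #include <algorithm>
            #include <chrono>
            #include <thread>

            enum class Status { Success, TempFail };

            // Stub standing in for whatever SDK call the load generator makes; a
            // real client would see TempFail when the server rejects the mutation
            // because memory usage is above mutation_mem_threshold.
            Status store(const char* /*key*/, const char* /*value*/) { return Status::Success; }

            void storeWithBackoff(const char* key, const char* value) {
                auto delay = std::chrono::milliseconds(1);
                while (store(key, value) == Status::TempFail) {
                    // Back off and retry; this is what reduces the overall ops/s
                    // until the server recovers memory.
                    std::this_thread::sleep_for(delay);
                    delay = std::min(delay * 2, std::chrono::milliseconds(1000));
                }
            }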

            ben.huddleston Ben Huddleston made changes -
            Link This issue relates to MB-35889 [ MB-35889 ]
            ben.huddleston Ben Huddleston added a comment - - edited

            Apologies for the slow updates on this ticket. We re-ran the test with a patch to allow erasing of the key indexes of disk checkpoints, so that we can backfill from disk for DCP and avoid the cluster locking up until we have better control over checkpoint memory usage. This test still ran into swap somewhere in the access phase. I don't have the logs; one node wouldn't cbcollect as it was too heavily in swap. I investigated all of our memory accounting statistics to try to make sense of what was going wrong, and it looked like something wasn't quite adding up: there was negligible memory usage in checkpoints, the memory usage in the hash table was high (~30GB), and we had ~22GB reported by magma across all shards. I spotted that the UI was reporting numbers different to cbstats for mem_used and the watermarks, which threw me off for a bit; not sure why that was the case. Sarath Lakshman started running the test again with jemalloc profiling so that we can investigate.

            I tried a "repro" of this on a small scale. Created a bucket with a 500MB quota and loaded enough docs to force the residency ratio down to 0%. Cbstats showed negligible memory usage in checkpoints and the hash table. Magma showed ~30MB memory usage, which seemed reasonable, but a jemalloc profile showed that magma's actual memory usage was much higher (~355MB). Sarath Lakshman is investigating this further, as we need magma to stay within the given memory limits or we risk going into swap / being OOM-killed.

            jemalloc.svg

            ben.huddleston Ben Huddleston made changes -
            Attachment jemalloc.svg [ 88270 ]
            ben.huddleston Ben Huddleston made changes -
            Assignee Ben Huddleston [ ben.huddleston ] Sarath Lakshman [ sarath ]
            srinath.duvuru Srinath Duvuru made changes -
            Sprint Magma: Jan 20 - Feb 2, Magma: Apr2-Apr16 [ 968, 1000 ] Magma: Jan 20 - Feb 2 [ 968 ]
            srinath.duvuru Srinath Duvuru made changes -
            Rank Ranked higher
            sarath Sarath Lakshman added a comment - - edited

            We have resolved the issue causing the filesystem memory pressure (by reducing the number of files used by magma). We need to verify it by rerunning the test.

            CBSS-411 has been opened to investigate the magma memory accounting stats.

            sarath Sarath Lakshman made changes -
            Resolution Fixed [ 1 ]
            Status Open [ 1 ] Resolved [ 5 ]
            sarath Sarath Lakshman made changes -
            Assignee Sarath Lakshman [ sarath ] Bo-Chun Wang [ bo-chun.wang ]
            sarath Sarath Lakshman made changes -
            Epic Link CBSS-305 [ 112304 ] CBSS-376 [ 121213 ]
            sarath Sarath Lakshman made changes -
            Status Resolved [ 5 ] Closed [ 6 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s 7.0.0 [ 17233 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]

            People

              Assignee: bo-chun.wang Bo-Chun Wang
              Reporter: sarath Sarath Lakshman