Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-57741

GSI node auto failover - Indexer RSS beyond quota

    XMLWordPrintable

Details

    Description

      We are running the following test CBPS-1184 

      Setup:

      a. 3 KV + 3 indexer nodes 

      b. 10 buckets in the cluster, each bucket has 10 collections on 10 indexes - Total of 1000 indexes per node

      c. Create indexes with "num_replica": 2 so that all indexes have almost same load

      d. Document size around 1KB - Index fields can vary from few bytes to 128 bytes

      Seeing auto-failover  trigger for GSI node

      2023-06-20T18:30:59.131Z, failover:0:info:message(ns_1@cen-s723.perf.couchbase.com) - Starting failing over ['ns_1@cen-s710.perf.couchbase.com']
      2023-06-20T18:30:59.135Z, ns_orchestrator:0:info:message(ns_1@cen-s723.perf.couchbase.com) - Starting failover of nodes ['ns_1@cen-s710.perf.couchbase.com']. Operation Id = 700e757f122d98a4f3ced994f590effb 

      GSI logs on s710

      2023-06-20T18:30:00.112+00:00 [Info] Plasma: Adaptive memory quota tuning RSS:100024037376, freePercent:12.8761323118435, currentQuota=74505980754, heapSz=92311445504, limit=false
      2023-06-20T18:30:00.179+00:00 [Info] Plasma: Adaptive memory quota tuning (decrementing): RSS:100627058688, freePercent:12.360477687799184, currentQuota=74505980754, newQuota=74282722285, netGrowth=223258469, percent=0
      2023-06-20T18:30:00.675+00:00 [Info] bucket-1/b1_index3-7/Backstore#17441508732912831320:0 Plasma: SMR reclaim pending is higher than expected: pending = 94 KB (expected = 51 KB), wCtxCnt = 105, objCnt 3, changed reclaimList flush threshold from 0 to 0, changed reclaimSize flush threshold from 1 KB to 0 KB.
      2023-06-20T18:30:00.869+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.748558% blocked
      2023-06-20T18:30:00.872+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.744441% blocked
      2023-06-20T18:30:01.562+00:00 [Info] LSS /data/@2i/shards/shard6/data/recovery(shard6) : recoveryCleaner: starting... frag 34, data: 644336351, used: 980791738 log:(611513356 - 1593114624) head:611513356
      2023-06-20T18:30:01.562+00:00 [Info] LSS /data/@2i/shards/shard6/data/recovery(shard6) : recoveryCleaner: completed... frag 34, data: 647324269, used: 984995670, disk: 985877492, relocated: 0, retries: 0, skipped: 698059335, cleaned:0, log:(611513356 - 1597390848), disk:(611513356-1597390848), activeFragRatio:30, activeMaxFragRatio:80, run:299 duration:190937 ms elapsed: 299530 ms
      2023-06-20T18:30:04.480+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.789255% blocked
      2023-06-20T18:30:05.056+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.781808% blocked
      2023-06-20T18:30:05.445+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.789278% blocked
      2023-06-20T18:30:07.413+00:00 [Info] LSS /data/@2i/shards/shard10/data(shard10) : logCleaner: starting... frag 31, data: 25344498263, used: 36732220724 log:(29071636706 - 65803882496) head:29071636706
      2023-06-20T18:30:07.413+00:00 [Info] LSS /data/@2i/shards/shard10/data(shard10) : logCleaner: completed... frag 30, data: 26421784064, used: 38292201042, disk: 38293336159, relocated: 134924142, retries: 19170, skipped: 93181547, cleaned:295233271, log:(29366869977 - 67659083776), disk:(29365747617-67659083776), activeFragRatio:30, activeMaxFragRatio:80, run:135 duration:7413 ms elapsed: 2227344 ms
      2023-06-20T18:30:09.416+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.832454% blocked
      2023-06-20T18:30:09.930+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.824558% blocked
      2023-06-20T18:30:10.462+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.833474% blocked
      2023-06-20T18:30:10.959+00:00 [Info] Plasma: Adaptive memory quota tuning RSS:100627058688, freePercent:12.360477687799184, currentQuota=74282722285, heapSz=92713385984, limit=false
      2023-06-20T18:30:11.101+00:00 [Info] Plasma: Adaptive memory quota tuning (decrementing): RSS:101117591552, freePercent:11.930352129222687, currentQuota=74282722285, newQuota=74075598221, netGrowth=207124064, percent=0
      2023-06-20T18:30:14.638+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.878805% blocked
      2023-06-20T18:30:14.933+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.867217% blocked
      2023-06-20T18:30:15.446+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.876949% blocked
      2023-06-20T18:30:18.482+00:00 [Info] LSS /data/@2i/shards/shard12/data/recovery(shard12) : recoveryCleaner: starting... frag 34, data: 632676682, used: 968144286 log:(640213818 - 1608757248) head:640213818
      2023-06-20T18:30:18.523+00:00 [Info] LSS /data/@2i/shards/shard12/data/recovery(shard12) : recoveryCleaner: completed... frag 34, data: 631998646, used: 968353182, disk: 968752326, relocated: 0, retries: 0, skipped: 730643576, cleaned:0, log:(640213818 - 1608966144), disk:(640213818-1608966144), activeFragRatio:30, activeMaxFragRatio:80, run:299 duration:207297 ms elapsed: 299394 ms
      2023-06-20T18:30:19.341+00:00 [Info] janitor: running cleanup.
      2023-06-20T18:30:19.590+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.922641% blocked
      2023-06-20T18:30:19.811+00:00 [Info] memstats {"Alloc":7436909112, "TotalAlloc":10083467138608, "Sys":10992412152, "Lookups":0, "Mallocs":104730307961,"Frees":104679650487, "HeapAlloc":7436909112, "HeapSys":8582037504, "HeapIdle":1050099712, "HeapInuse":7531937792,"HeapReleased":554999808, "HeapObjects":50657474,"MSpanInuse":126544320, "MSpanSys": 131898240, "StackInuse": 2019262464,"GCSys":235511752, "LastGC":1687285770644133042,"PauseTotalNs":11683505695, "PauseNs":[66338393, 3869186, 26673511, 921387, 50970032, 21736397, 24213317, 775925, 51058767, 624253, 46718731, 801674, 15434029, 667072], "NumGC":7955}
      2023-06-20T18:30:19.917+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.911290% blocked
      2023-06-20T18:30:20.804+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.924539% blocked
      2023-06-20T18:30:22.451+00:00 [Info] LSS /data/@2i/shards/shard19/data/recovery(shard19) : recoveryCleaner: starting... frag 40, data: 303698748, used: 509784132 log:(793822494 - 1304342528) head:792764342
      2023-06-20T18:30:22.451+00:00 [Info] LSS /data/@2i/shards/shard19/data/recovery(shard19) : recoveryCleaner: completed... frag 40, data: 300976700, used: 509657220, disk: 511568610, relocated: 0, retries: 0, skipped: 2039941843, cleaned:1051682, log:(794874176 - 1305391104), disk:(793822494-1305391104), activeFragRatio:30, activeMaxFragRatio:80, run:299 duration:214222 ms elapsed: 299284 ms
      2023-06-20T18:30:22.845+00:00 [Info] RebalanceServiceManager::GetTaskList []
      2023-06-20T18:30:22.854+00:00 [Info] RebalanceServiceManager::GetTaskList returns &{Rev:[0 0 0 0 0 0 0 3] Tasks:[]}
      2023-06-20T18:30:22.859+00:00 [Info] RebalanceServiceManager::GetTaskList [0 0 0 0 0 0 0 3]
      2023-06-20T18:30:22.882+00:00 [Info] RebalanceServiceManager::GetCurrentTopology []
      2023-06-20T18:30:22.882+00:00 [Info] RebalanceServiceManager::GetCurrentTopology returns &{Rev:[0 0 0 0 0 0 0 3] Nodes:[934a8f6699f5188b2b79b67348b13327 3315b8b046040f66bc5f8fcdeb51ab5d f908fffed22299b346b7dd11d8bd6e4a] IsBalanced:true Messages:[]}
      2023-06-20T18:30:22.889+00:00 [Info] RebalanceServiceManager::GetCurrentTopology [0 0 0 0 0 0 0 3]
      2023-06-20T18:30:23.493+00:00 [Info] Plasma: Adaptive memory quota tuning RSS:101117591552, freePercent:11.930352129222687, currentQuota=74075598221, heapSz=93124968448, limit=false
      2023-06-20T18:30:23.434+00:00 [Info] RebalanceServiceManager::rebalanceJanitor Running Periodic Cleanup
      2023-06-20T18:30:23.796+00:00 [Info] Plasma: Adaptive memory quota tuning (decrementing): RSS:101751967744, freePercent:11.44839549002602, currentQuota=74075598221, newQuota=73767786671, netGrowth=307811550, percent=0
      2023-06-20T18:30:24.138+00:00 [Info] bucket-4/b4_index10-4/Backstore#13095880066195656553:0 Plasma: SMR reclaim pending is higher than expected: pending = 77 KB (expected = 51 KB), wCtxCnt = 99, objCnt 49, changed reclaimList flush threshold from 2 to 0, changed reclaimSize flush threshold from 0 KB to 0 KB.
      2023-06-20T18:30:23.745+00:00 [Warn] AutofailoverServiceManager::HealthCheck: Slow heartbeat 2.129938684s. priorTime: 2023-06-20 18:30:21.615451297 +0000 UTC m=+13142.744758263, callTime: 2023-06-20 18:30:23.745389972 +0000 UTC m=+13144.874696947, healthInfo: {DiskFailures:0}
      2023-06-20T18:30:24.200+00:00 [Warn] AutofailoverServiceManager::HealthCheck: Slow call 455.150526ms. callTime: 2023-06-20 18:30:23.745389972 +0000 UTC m=+13144.874696947, doneTime: 2023-06-20 18:30:24.200540509 +0000 UTC m=+13145.329847473, healthInfo: {DiskFailures:0}
      2023-06-20T18:30:24.250+00:00 [Warn] util::traceLockLog: Long hold of svcMgrMu.Lock() 713.420329ms in RebalanceServiceManager::rebalanceJanitor
      2023-06-20T18:30:24.435+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.965870% blocked
      2023-06-20T18:30:24.993+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.956240% blocked
      2023-06-20T18:30:26.282+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.973137% blocked
      2023-06-20T18:30:28.044+00:00 [Warn] AutofailoverServiceManager::HealthCheck: Slow heartbeat 2.418479804s. priorTime: 2023-06-20 18:30:25.626186936 +0000 UTC m=+13146.755493878, callTime: 2023-06-20 18:30:28.044666737 +0000 UTC m=+13149.173973682, healthInfo: {DiskFailures:0}
      2023-06-20T18:30:29.514+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.010743% blocked
      2023-06-20T18:30:30.030+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.998196% blocked
      2023-06-20T18:30:30.448+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.006952% blocked
      2023-06-20T18:30:34.620+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.054416% blocked
      2023-06-20T18:30:34.967+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.041347% blocked
      2023-06-20T18:30:35.520+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.051634% blocked
      2023-06-20T18:30:36.126+00:00 [Info] Plasma: Adaptive memory quota tuning RSS:101751967744, freePercent:11.44839549002602, currentQuota=73767786671, heapSz=93601521664, limit=false
      2023-06-20T18:30:36.219+00:00 [Info] Plasma: Adaptive memory quota tuning (decrementing): RSS:102228393984, freePercent:11.101295079979167, currentQuota=73767786671, newQuota=73577460498, netGrowth=190326173, percent=0
      2023-06-20T18:30:36.253+00:00 [Info] bucket-2/b2_index4-6/Backstore#13402402523673797046:0 Plasma: SMR reclaim pending is higher than expected: pending = 120 KB (expected = 51 KB), wCtxCnt = 90, objCnt 6, changed reclaimList flush threshold from 0 to 0, changed reclaimSize flush threshold from 1 KB to 0 KB.
      2023-06-20T18:30:39.501+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.097707% blocked
      2023-06-20T18:30:40.022+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.085504% blocked
      2023-06-20T18:30:40.031+00:00 [Info] LSS /data/@2i/shards/shard15/data/recovery(shard15) : recoveryCleaner: starting... frag 41, data: 292171216, used: 501637688 log:(831935830 - 1333690368) head:829836960
      2023-06-20T18:30:40.031+00:00 [Info] LSS /data/@2i/shards/shard15/data/recovery(shard15) : recoveryCleaner: completed... frag 42, data: 289522076, used: 501379022, disk: 504901984, relocated: 0, retries: 0, skipped: 2116580399, cleaned:1053136, log:(832988966 - 1334738944), disk:(829836960-1334738944), activeFragRatio:30, activeMaxFragRatio:80, run:297 duration:232370 ms elapsed: 299465 ms
      2023-06-20T18:30:40.453+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.095254% blocked
      2023-06-20T18:30:44.529+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.141605% blocked
      2023-06-20T18:30:44.971+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.128936% blocked
      2023-06-20T18:30:45.558+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.139974% blocked
      2023-06-20T18:30:48.832+00:00 [Info] Plasma: Adaptive memory quota tuning RSS:102228393984, freePercent:11.101295079979167, currentQuota=73577460498, heapSz=94071918592, limit=false
      2023-06-20T18:30:49.058+00:00 [Info] Plasma: Adaptive memory quota tuning (decrementing): RSS:102533611520, freePercent:10.773975613743879, currentQuota=73577460498, newQuota=73424456988, netGrowth=153003510, percent=0
      2023-06-20T18:30:50.423+00:00 [Info] bucket-9/b9_index7-5/Backstore#2853482345901207500:0 Plasma: SMR reclaim pending is higher than expected: pending = 61 KB (expected = 51 KB), wCtxCnt = 105, objCnt 48, changed reclaimList flush threshold from 0 to 0, changed reclaimSize flush threshold from 0 KB to 0 KB.
      2023-06-20T18:30:52.523+00:00 [Warn] AutofailoverServiceManager::HealthCheck: Slow heartbeat 4.895058402s. priorTime: 2023-06-20 18:30:47.628581763 +0000 UTC m=+13168.757888733, callTime: 2023-06-20 18:30:52.523640185 +0000 UTC m=+13173.652947135, healthInfo: {DiskFailures:0}
      2023-06-20T18:30:52.631+00:00 [Info] bucket-3/b3_index2-5/Mainstore#81001212506724495:0 Plasma: SMR reclaim pending is higher than expected: pending = 65 KB (expected = 51 KB), wCtxCnt = 84, objCnt 1, changed reclaimList flush threshold from 0 to 0, changed reclaimSize flush threshold from 1 KB to 0 KB.
      2023-06-20T18:30:52.859+00:00 [Info] RebalanceServiceManager::GetTaskList []
      2023-06-20T18:30:52.859+00:00 [Info] RebalanceServiceManager::GetTaskList returns &{Rev:[0 0 0 0 0 0 0 3] Tasks:[]}
      2023-06-20T18:30:52.894+00:00 [Info] RebalanceServiceManager::GetCurrentTopology []
      2023-06-20T18:30:52.894+00:00 [Info] RebalanceServiceManager::GetCurrentTopology returns &{Rev:[0 0 0 0 0 0 0 3] Nodes:[934a8f6699f5188b2b79b67348b13327 3315b8b046040f66bc5f8fcdeb51ab5d f908fffed22299b346b7dd11d8bd6e4a] IsBalanced:true Messages:[]}
      2023-06-20T18:30:52.901+00:00 [Info] RebalanceServiceManager::GetTaskList [0 0 0 0 0 0 0 3]
      2023-06-20T18:30:52.936+00:00 [Info] RebalanceServiceManager::GetCurrentTopology [0 0 0 0 0 0 0 3]
      2023-06-20T18:30:54.045+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.208152% blocked
      2023-06-20T18:30:54.078+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.226087% blocked
      2023-06-20T18:30:54.080+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.213081% blocked
      2023-06-20T18:30:54.494+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.229792% blocked
      2023-06-20T18:30:54.515+00:00 [Info] RebalanceServiceManager::rebalanceJanitor Running Periodic Cleanup
      2023-06-20T18:30:55.825+00:00 [Info] bucket-8/b8_index7-5/Mainstore#15074528002066371286:0 Plasma: SMR reclaim pending is higher than expected: pending = 71 KB (expected = 51 KB), wCtxCnt = 105, objCnt 28, changed reclaimList flush threshold from 0 to 0, changed reclaimSize flush threshold from 0 KB to 0 KB.
      2023-06-20T18:30:59.163+00:00 [Warn] AutofailoverServiceManager::HealthCheck: Slow heartbeat 6.639719677s. priorTime: 2023-06-20 18:30:52.523640185 +0000 UTC m=+13173.652947135, callTime: 2023-06-20 18:30:59.163359868 +0000 UTC m=+13180.292666812, healthInfo: {DiskFailures:0}
      2023-06-20T18:30:59.509+00:00 [Warn] util::traceLockLog: Long hold of svcMgrMu.Lock() 4.994162677s in RebalanceServiceManager::rebalanceJanitor
       

      Run: http://perf.jenkins.couchbase.com/job/hemera/8207/ 

      Attachments

        1. Screenshot 2023-07-19 at 3.00.53 PM-1.png
          232 kB
          Jinesh Parakh
        2. Screenshot 2023-07-19 at 3.00.53 PM.png
          232 kB
          Jinesh Parakh
        3. Screenshot 2023-07-06 at 13.57.56-1.png
          84 kB
          Varun Velamuri
        4. Screenshot 2023-07-06 at 13.57.56.png
          84 kB
          Varun Velamuri
        5. Screenshot 2023-06-28 at 11.42.02.png
          59 kB
          Varun Velamuri
        6. Screenshot 2023-06-28 at 11.40.28.png
          82 kB
          Varun Velamuri
        7. newplot (17).png
          51 kB
          Jinesh Parakh
        8. newplot (16).png
          57 kB
          Jinesh Parakh
        9. newplot (15).png
          53 kB
          Jinesh Parakh
        10. newplot (14).png
          89 kB
          Jinesh Parakh
        11. newplot (13).png
          73 kB
          Jinesh Parakh
        12. newplot (12).png
          77 kB
          Jinesh Parakh
        13. image-2023-07-06-12-40-12-098.png
          431 kB
          Varun Velamuri

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              varun.velamuri Varun Velamuri
              varun.velamuri Varun Velamuri
              Votes:
              0 Vote for this issue
              Watchers:
              7 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty