Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-57623

GSI node auto failover

    XMLWordPrintable

Details

    Description

      We are running the following test CBPS-1184 

      Setup:

      a. 3 KV + 3 indexer nodes 

      b. 10 buckets in the cluster, each bucket has 10 collections on 10 indexes - Total of 1000 indexes per node

      c. Create indexes with "num_replica": 2 so that all indexes have almost same load

      d. Document size around 1KB - Index fields can vary from few bytes to 128 bytes

      Seeing auto-failover  trigger for GSI node

      2023-06-20T18:30:59.131Z, failover:0:info:message(ns_1@cen-s723.perf.couchbase.com) - Starting failing over ['ns_1@cen-s710.perf.couchbase.com']
      2023-06-20T18:30:59.135Z, ns_orchestrator:0:info:message(ns_1@cen-s723.perf.couchbase.com) - Starting failover of nodes ['ns_1@cen-s710.perf.couchbase.com']. Operation Id = 700e757f122d98a4f3ced994f590effb 

      GSI logs on s710

      2023-06-20T18:30:00.112+00:00 [Info] Plasma: Adaptive memory quota tuning RSS:100024037376, freePercent:12.8761323118435, currentQuota=74505980754, heapSz=92311445504, limit=false
      2023-06-20T18:30:00.179+00:00 [Info] Plasma: Adaptive memory quota tuning (decrementing): RSS:100627058688, freePercent:12.360477687799184, currentQuota=74505980754, newQuota=74282722285, netGrowth=223258469, percent=0
      2023-06-20T18:30:00.675+00:00 [Info] bucket-1/b1_index3-7/Backstore#17441508732912831320:0 Plasma: SMR reclaim pending is higher than expected: pending = 94 KB (expected = 51 KB), wCtxCnt = 105, objCnt 3, changed reclaimList flush threshold from 0 to 0, changed reclaimSize flush threshold from 1 KB to 0 KB.
      2023-06-20T18:30:00.869+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.748558% blocked
      2023-06-20T18:30:00.872+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.744441% blocked
      2023-06-20T18:30:01.562+00:00 [Info] LSS /data/@2i/shards/shard6/data/recovery(shard6) : recoveryCleaner: starting... frag 34, data: 644336351, used: 980791738 log:(611513356 - 1593114624) head:611513356
      2023-06-20T18:30:01.562+00:00 [Info] LSS /data/@2i/shards/shard6/data/recovery(shard6) : recoveryCleaner: completed... frag 34, data: 647324269, used: 984995670, disk: 985877492, relocated: 0, retries: 0, skipped: 698059335, cleaned:0, log:(611513356 - 1597390848), disk:(611513356-1597390848), activeFragRatio:30, activeMaxFragRatio:80, run:299 duration:190937 ms elapsed: 299530 ms
      2023-06-20T18:30:04.480+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.789255% blocked
      2023-06-20T18:30:05.056+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.781808% blocked
      2023-06-20T18:30:05.445+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.789278% blocked
      2023-06-20T18:30:07.413+00:00 [Info] LSS /data/@2i/shards/shard10/data(shard10) : logCleaner: starting... frag 31, data: 25344498263, used: 36732220724 log:(29071636706 - 65803882496) head:29071636706
      2023-06-20T18:30:07.413+00:00 [Info] LSS /data/@2i/shards/shard10/data(shard10) : logCleaner: completed... frag 30, data: 26421784064, used: 38292201042, disk: 38293336159, relocated: 134924142, retries: 19170, skipped: 93181547, cleaned:295233271, log:(29366869977 - 67659083776), disk:(29365747617-67659083776), activeFragRatio:30, activeMaxFragRatio:80, run:135 duration:7413 ms elapsed: 2227344 ms
      2023-06-20T18:30:09.416+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.832454% blocked
      2023-06-20T18:30:09.930+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.824558% blocked
      2023-06-20T18:30:10.462+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.833474% blocked
      2023-06-20T18:30:10.959+00:00 [Info] Plasma: Adaptive memory quota tuning RSS:100627058688, freePercent:12.360477687799184, currentQuota=74282722285, heapSz=92713385984, limit=false
      2023-06-20T18:30:11.101+00:00 [Info] Plasma: Adaptive memory quota tuning (decrementing): RSS:101117591552, freePercent:11.930352129222687, currentQuota=74282722285, newQuota=74075598221, netGrowth=207124064, percent=0
      2023-06-20T18:30:14.638+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.878805% blocked
      2023-06-20T18:30:14.933+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.867217% blocked
      2023-06-20T18:30:15.446+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.876949% blocked
      2023-06-20T18:30:18.482+00:00 [Info] LSS /data/@2i/shards/shard12/data/recovery(shard12) : recoveryCleaner: starting... frag 34, data: 632676682, used: 968144286 log:(640213818 - 1608757248) head:640213818
      2023-06-20T18:30:18.523+00:00 [Info] LSS /data/@2i/shards/shard12/data/recovery(shard12) : recoveryCleaner: completed... frag 34, data: 631998646, used: 968353182, disk: 968752326, relocated: 0, retries: 0, skipped: 730643576, cleaned:0, log:(640213818 - 1608966144), disk:(640213818-1608966144), activeFragRatio:30, activeMaxFragRatio:80, run:299 duration:207297 ms elapsed: 299394 ms
      2023-06-20T18:30:19.341+00:00 [Info] janitor: running cleanup.
      2023-06-20T18:30:19.590+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.922641% blocked
      2023-06-20T18:30:19.811+00:00 [Info] memstats {"Alloc":7436909112, "TotalAlloc":10083467138608, "Sys":10992412152, "Lookups":0, "Mallocs":104730307961,"Frees":104679650487, "HeapAlloc":7436909112, "HeapSys":8582037504, "HeapIdle":1050099712, "HeapInuse":7531937792,"HeapReleased":554999808, "HeapObjects":50657474,"MSpanInuse":126544320, "MSpanSys": 131898240, "StackInuse": 2019262464,"GCSys":235511752, "LastGC":1687285770644133042,"PauseTotalNs":11683505695, "PauseNs":[66338393, 3869186, 26673511, 921387, 50970032, 21736397, 24213317, 775925, 51058767, 624253, 46718731, 801674, 15434029, 667072], "NumGC":7955}
      2023-06-20T18:30:19.917+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.911290% blocked
      2023-06-20T18:30:20.804+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.924539% blocked
      2023-06-20T18:30:22.451+00:00 [Info] LSS /data/@2i/shards/shard19/data/recovery(shard19) : recoveryCleaner: starting... frag 40, data: 303698748, used: 509784132 log:(793822494 - 1304342528) head:792764342
      2023-06-20T18:30:22.451+00:00 [Info] LSS /data/@2i/shards/shard19/data/recovery(shard19) : recoveryCleaner: completed... frag 40, data: 300976700, used: 509657220, disk: 511568610, relocated: 0, retries: 0, skipped: 2039941843, cleaned:1051682, log:(794874176 - 1305391104), disk:(793822494-1305391104), activeFragRatio:30, activeMaxFragRatio:80, run:299 duration:214222 ms elapsed: 299284 ms
      2023-06-20T18:30:22.845+00:00 [Info] RebalanceServiceManager::GetTaskList []
      2023-06-20T18:30:22.854+00:00 [Info] RebalanceServiceManager::GetTaskList returns &{Rev:[0 0 0 0 0 0 0 3] Tasks:[]}
      2023-06-20T18:30:22.859+00:00 [Info] RebalanceServiceManager::GetTaskList [0 0 0 0 0 0 0 3]
      2023-06-20T18:30:22.882+00:00 [Info] RebalanceServiceManager::GetCurrentTopology []
      2023-06-20T18:30:22.882+00:00 [Info] RebalanceServiceManager::GetCurrentTopology returns &{Rev:[0 0 0 0 0 0 0 3] Nodes:[934a8f6699f5188b2b79b67348b13327 3315b8b046040f66bc5f8fcdeb51ab5d f908fffed22299b346b7dd11d8bd6e4a] IsBalanced:true Messages:[]}
      2023-06-20T18:30:22.889+00:00 [Info] RebalanceServiceManager::GetCurrentTopology [0 0 0 0 0 0 0 3]
      2023-06-20T18:30:23.493+00:00 [Info] Plasma: Adaptive memory quota tuning RSS:101117591552, freePercent:11.930352129222687, currentQuota=74075598221, heapSz=93124968448, limit=false
      2023-06-20T18:30:23.434+00:00 [Info] RebalanceServiceManager::rebalanceJanitor Running Periodic Cleanup
      2023-06-20T18:30:23.796+00:00 [Info] Plasma: Adaptive memory quota tuning (decrementing): RSS:101751967744, freePercent:11.44839549002602, currentQuota=74075598221, newQuota=73767786671, netGrowth=307811550, percent=0
      2023-06-20T18:30:24.138+00:00 [Info] bucket-4/b4_index10-4/Backstore#13095880066195656553:0 Plasma: SMR reclaim pending is higher than expected: pending = 77 KB (expected = 51 KB), wCtxCnt = 99, objCnt 49, changed reclaimList flush threshold from 2 to 0, changed reclaimSize flush threshold from 0 KB to 0 KB.
      2023-06-20T18:30:23.745+00:00 [Warn] AutofailoverServiceManager::HealthCheck: Slow heartbeat 2.129938684s. priorTime: 2023-06-20 18:30:21.615451297 +0000 UTC m=+13142.744758263, callTime: 2023-06-20 18:30:23.745389972 +0000 UTC m=+13144.874696947, healthInfo: {DiskFailures:0}
      2023-06-20T18:30:24.200+00:00 [Warn] AutofailoverServiceManager::HealthCheck: Slow call 455.150526ms. callTime: 2023-06-20 18:30:23.745389972 +0000 UTC m=+13144.874696947, doneTime: 2023-06-20 18:30:24.200540509 +0000 UTC m=+13145.329847473, healthInfo: {DiskFailures:0}
      2023-06-20T18:30:24.250+00:00 [Warn] util::traceLockLog: Long hold of svcMgrMu.Lock() 713.420329ms in RebalanceServiceManager::rebalanceJanitor
      2023-06-20T18:30:24.435+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.965870% blocked
      2023-06-20T18:30:24.993+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.956240% blocked
      2023-06-20T18:30:26.282+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.973137% blocked
      2023-06-20T18:30:28.044+00:00 [Warn] AutofailoverServiceManager::HealthCheck: Slow heartbeat 2.418479804s. priorTime: 2023-06-20 18:30:25.626186936 +0000 UTC m=+13146.755493878, callTime: 2023-06-20 18:30:28.044666737 +0000 UTC m=+13149.173973682, healthInfo: {DiskFailures:0}
      2023-06-20T18:30:29.514+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.010743% blocked
      2023-06-20T18:30:30.030+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 12.998196% blocked
      2023-06-20T18:30:30.448+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.006952% blocked
      2023-06-20T18:30:34.620+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.054416% blocked
      2023-06-20T18:30:34.967+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.041347% blocked
      2023-06-20T18:30:35.520+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.051634% blocked
      2023-06-20T18:30:36.126+00:00 [Info] Plasma: Adaptive memory quota tuning RSS:101751967744, freePercent:11.44839549002602, currentQuota=73767786671, heapSz=93601521664, limit=false
      2023-06-20T18:30:36.219+00:00 [Info] Plasma: Adaptive memory quota tuning (decrementing): RSS:102228393984, freePercent:11.101295079979167, currentQuota=73767786671, newQuota=73577460498, netGrowth=190326173, percent=0
      2023-06-20T18:30:36.253+00:00 [Info] bucket-2/b2_index4-6/Backstore#13402402523673797046:0 Plasma: SMR reclaim pending is higher than expected: pending = 120 KB (expected = 51 KB), wCtxCnt = 90, objCnt 6, changed reclaimList flush threshold from 0 to 0, changed reclaimSize flush threshold from 1 KB to 0 KB.
      2023-06-20T18:30:39.501+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.097707% blocked
      2023-06-20T18:30:40.022+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.085504% blocked
      2023-06-20T18:30:40.031+00:00 [Info] LSS /data/@2i/shards/shard15/data/recovery(shard15) : recoveryCleaner: starting... frag 41, data: 292171216, used: 501637688 log:(831935830 - 1333690368) head:829836960
      2023-06-20T18:30:40.031+00:00 [Info] LSS /data/@2i/shards/shard15/data/recovery(shard15) : recoveryCleaner: completed... frag 42, data: 289522076, used: 501379022, disk: 504901984, relocated: 0, retries: 0, skipped: 2116580399, cleaned:1053136, log:(832988966 - 1334738944), disk:(829836960-1334738944), activeFragRatio:30, activeMaxFragRatio:80, run:297 duration:232370 ms elapsed: 299465 ms
      2023-06-20T18:30:40.453+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.095254% blocked
      2023-06-20T18:30:44.529+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.141605% blocked
      2023-06-20T18:30:44.971+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.128936% blocked
      2023-06-20T18:30:45.558+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.139974% blocked
      2023-06-20T18:30:48.832+00:00 [Info] Plasma: Adaptive memory quota tuning RSS:102228393984, freePercent:11.101295079979167, currentQuota=73577460498, heapSz=94071918592, limit=false
      2023-06-20T18:30:49.058+00:00 [Info] Plasma: Adaptive memory quota tuning (decrementing): RSS:102533611520, freePercent:10.773975613743879, currentQuota=73577460498, newQuota=73424456988, netGrowth=153003510, percent=0
      2023-06-20T18:30:50.423+00:00 [Info] bucket-9/b9_index7-5/Backstore#2853482345901207500:0 Plasma: SMR reclaim pending is higher than expected: pending = 61 KB (expected = 51 KB), wCtxCnt = 105, objCnt 48, changed reclaimList flush threshold from 0 to 0, changed reclaimSize flush threshold from 0 KB to 0 KB.
      2023-06-20T18:30:52.523+00:00 [Warn] AutofailoverServiceManager::HealthCheck: Slow heartbeat 4.895058402s. priorTime: 2023-06-20 18:30:47.628581763 +0000 UTC m=+13168.757888733, callTime: 2023-06-20 18:30:52.523640185 +0000 UTC m=+13173.652947135, healthInfo: {DiskFailures:0}
      2023-06-20T18:30:52.631+00:00 [Info] bucket-3/b3_index2-5/Mainstore#81001212506724495:0 Plasma: SMR reclaim pending is higher than expected: pending = 65 KB (expected = 51 KB), wCtxCnt = 84, objCnt 1, changed reclaimList flush threshold from 0 to 0, changed reclaimSize flush threshold from 1 KB to 0 KB.
      2023-06-20T18:30:52.859+00:00 [Info] RebalanceServiceManager::GetTaskList []
      2023-06-20T18:30:52.859+00:00 [Info] RebalanceServiceManager::GetTaskList returns &{Rev:[0 0 0 0 0 0 0 3] Tasks:[]}
      2023-06-20T18:30:52.894+00:00 [Info] RebalanceServiceManager::GetCurrentTopology []
      2023-06-20T18:30:52.894+00:00 [Info] RebalanceServiceManager::GetCurrentTopology returns &{Rev:[0 0 0 0 0 0 0 3] Nodes:[934a8f6699f5188b2b79b67348b13327 3315b8b046040f66bc5f8fcdeb51ab5d f908fffed22299b346b7dd11d8bd6e4a] IsBalanced:true Messages:[]}
      2023-06-20T18:30:52.901+00:00 [Info] RebalanceServiceManager::GetTaskList [0 0 0 0 0 0 0 3]
      2023-06-20T18:30:52.936+00:00 [Info] RebalanceServiceManager::GetCurrentTopology [0 0 0 0 0 0 0 3]
      2023-06-20T18:30:54.045+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.208152% blocked
      2023-06-20T18:30:54.078+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.226087% blocked
      2023-06-20T18:30:54.080+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.213081% blocked
      2023-06-20T18:30:54.494+00:00 [Info] DATP[->dataport ":9105"] DATP -> Indexer 13.229792% blocked
      2023-06-20T18:30:54.515+00:00 [Info] RebalanceServiceManager::rebalanceJanitor Running Periodic Cleanup
      2023-06-20T18:30:55.825+00:00 [Info] bucket-8/b8_index7-5/Mainstore#15074528002066371286:0 Plasma: SMR reclaim pending is higher than expected: pending = 71 KB (expected = 51 KB), wCtxCnt = 105, objCnt 28, changed reclaimList flush threshold from 0 to 0, changed reclaimSize flush threshold from 0 KB to 0 KB.
      2023-06-20T18:30:59.163+00:00 [Warn] AutofailoverServiceManager::HealthCheck: Slow heartbeat 6.639719677s. priorTime: 2023-06-20 18:30:52.523640185 +0000 UTC m=+13173.652947135, callTime: 2023-06-20 18:30:59.163359868 +0000 UTC m=+13180.292666812, healthInfo: {DiskFailures:0}
      2023-06-20T18:30:59.509+00:00 [Warn] util::traceLockLog: Long hold of svcMgrMu.Lock() 4.994162677s in RebalanceServiceManager::rebalanceJanitor
       

      Run: http://perf.jenkins.couchbase.com/job/hemera/8207/ 

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              vikas.chaudhary Vikas Chaudhary
              vikas.chaudhary Vikas Chaudhary
              Votes:
              0 Vote for this issue
              Watchers:
              16 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty