Details
-
Bug
-
Resolution: Fixed
-
Major
-
Cheshire-Cat
-
Untriaged
-
-
1
-
Yes
Description
Build : 7.0.0-3090
Test : -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml
Scale : 3
Iteration : 1st
The query logs on 172.23.97.148 are flooded with msgs like the following since 2020-09-11T16:08:57.
2020-09-11T16:08:56.659-07:00 [Info] GsiClient::UpdateUsecjson: using collatejson as data format between indexer and GsiClient
|
2020-09-11T16:08:56.858-07:00 [Info] GSIC[default/default-_default-_default-1599860544594313464] 95b0325f-e651-48a0-9159-06c579bb28c4 new temp file ... /opt/couchbase/var/lib/couchbase/tmp/scan-results114623839578745
|
2020-09-11T16:08:56.885-07:00 [Info] GSIC[default/default-_default-_default-1599860544594313464] 1813bb35-25b3-4bbf-a492-3682a952aa0d new temp file ... /opt/couchbase/var/lib/couchbase/tmp/scan-results114623070685572
|
2020-09-11T16:08:56.965-07:00 [INFO] Using plain authentication for user <ud>@cbq-engine</ud>
|
2020-09-11T16:08:56.971-07:00 [INFO] Using plain authentication for user <ud>@cbq-engine</ud>
|
2020-09-11T16:08:57.126-07:00 [Info] GSIC[default/DISTRICT-_default-_default-1599860577111098260] logstats "DISTRICT" {"gsi_scan_count":13116,"gsi_scan_duration":752612993488,"gsi_throttle_duration":0,"gsi_prime_duration":2415703966999,"gsi_blocked_duration":0,"gsi_total_temp_files":0}
|
2020-09-11T16:08:57.171-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210'
|
2020-09-11T16:08:57.171-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210'
|
2020-09-11T16:08:57.171-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210'
|
2020-09-11T16:08:57.174-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210'
|
2020-09-11T16:08:57.175-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210'
|
2020-09-11T16:08:57.175-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210'
|
2020-09-11T16:08:57.178-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210'
|
2020-09-11T16:08:57.178-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210'
|
2020-09-11T16:08:57.180-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210'
|
2020-09-11T16:08:57.188-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210'
|
2020-09-11T16:08:57.188-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210'
|
2020-09-11T16:08:57.189-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210'
|
2020-09-11T16:08:57.197-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210'
|
172.23.96.48 is an Eventing node.
A lot delete and select queries have failed, probably because of the same underlying issue. Also, there are a 2000+ queries stuck at the moment (checked on active_requests). One of the stuck queries is - create primary index on `default` using GSI. Due to this, the test is not able to proceed.
There was also an OOM kill reported on the other query node - 172.23.97.149
[root@localhost ~]# dmesg -T | grep cbq-engine
|
[Sat Sep 12 00:47:41 2020] [114181] 996 114181 6545941 5326305 12205 840390 0 cbq-engine
|
[Sat Sep 12 00:47:41 2020] Out of memory: Kill process 114181 (cbq-engine) score 946 or sacrifice child
|
[Sat Sep 12 00:47:45 2020] cbq-engine invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
|
[Sat Sep 12 00:47:45 2020] cbq-engine cpuset=/ mems_allowed=0
|
[Sat Sep 12 00:47:45 2020] CPU: 6 PID: 121981 Comm: cbq-engine Kdump: loaded Not tainted 3.10.0-1062.9.1.el7.x86_64 #1
|
[Sat Sep 12 00:47:45 2020] [114181] 996 114181 6545941 5327535 12210 841855 0 cbq-engine
|
[Sat Sep 12 00:47:45 2020] Out of memory: Kill process 114181 (cbq-engine) score 947 or sacrifice child
|
[Sat Sep 12 00:47:45 2020] [114181] 996 114181 6545941 5326271 12210 843119 0 cbq-engine
|
[Sat Sep 12 00:47:45 2020] Out of memory: Kill process 114181 (cbq-engine) score 947 or sacrifice child
|
[Sat Sep 12 00:47:45 2020] Killed process 114181 (cbq-engine), UID 996, total-vm:26183764kB, anon-rss:21305084kB, file-rss:0kB, shmem-rss:0kB
|
Also, the cbq-engine process on 172.23.97.148 is hogging too much CPU and RAM. Output from top :
top - 11:39:37 up 5 days, 20:00, 1 user, load average: 21.25, 21.72, 22.00
|
Tasks: 1 total, 0 running, 1 sleeping, 0 stopped, 0 zombie
|
%Cpu(s): 13.2 us, 86.3 sy, 0.0 ni, 0.0 id, 0.0 wa, 0.0 hi, 0.3 si, 0.1 st
|
KiB Mem : 22431768 total, 15434376 free, 4669404 used, 2327988 buff/cache
|
KiB Swap: 3670012 total, 3670012 free, 0 used. 17405344 avail Mem
|
|
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
|
114623 couchba+ 20 0 5278928 3.4g 28664 S 738.5 15.8 8038:56 cbq-engine
|
Attaching the following :
1. Initial cbcollect that has log entries from 2020-09-11T16:08:57
2. Query goroutine dump, memory, cpu and heap profile, and dump of active and completed requests from both the query nodes taken at 9/12 11:26 AM
Note : This is the same test and hardware as the one used to certify Mad-hatter release. No changes are done to this test.
This test was last run successfully with 7.0.0-2792 (and saw MB-40805 there). After that the test couldn't run much because of another blocker issue in Analytics - MB-40832. Hence treating this as a regression from 7.0.0-2792 and 7.0.0-3090
Attachments
For Gerrit Dashboard: MB-41420 | ||||||
---|---|---|---|---|---|---|
# | Subject | Branch | Project | Status | CR | V |
138318,6 | MB-41420 diag | master | go-couchbase | Status: MERGED | +2 | +1 |
138320,1 | MB-41420 diag | master | query | Status: ABANDONED | 0 | 0 |
138321,5 | MB-41420 diag | master | query | Status: ABANDONED | 0 | 0 |
138489,2 | Revert "MB-41420 diag" | master | go-couchbase | Status: MERGED | +2 | +1 |
138542,2 | MB-41420 diag | master | query | Status: MERGED | +2 | +1 |
138546,2 | MB-41420 diag | master | query | Status: MERGED | +2 | +1 |
138547,2 | MB-41420 diag | master | query | Status: MERGED | +2 | +1 |
138552,2 | MB-41420 diag | master | query | Status: MERGED | +2 | +1 |
138561,1 | undo MB-41420 diag | master | query | Status: ABANDONED | 0 | 0 |
138562,1 | undo MB-41420 diag | master | query | Status: ABANDONED | 0 | 0 |
138773,2 | Fix 172.23.96.48 as the eventing node on centos2 cluster for debugging MB-41420. This can be reverted once the issue is debugged and fix verified if needed. | master | sequoia | Status: MERGED | +2 | +1 |