Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-41420

[System Test] : cbq-engine issues - 2000+ stuck queries

    XMLWordPrintable

Details

    Description

      Build : 7.0.0-3090
      Test : -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml
      Scale : 3
      Iteration : 1st

      The query logs on 172.23.97.148 are flooded with msgs like the following since 2020-09-11T16:08:57.

      2020-09-11T16:08:56.659-07:00 [Info] GsiClient::UpdateUsecjson: using collatejson as data format between indexer and GsiClient
      2020-09-11T16:08:56.858-07:00 [Info] GSIC[default/default-_default-_default-1599860544594313464] 95b0325f-e651-48a0-9159-06c579bb28c4 new temp file ... /opt/couchbase/var/lib/couchbase/tmp/scan-results114623839578745
      2020-09-11T16:08:56.885-07:00 [Info] GSIC[default/default-_default-_default-1599860544594313464] 1813bb35-25b3-4bbf-a492-3682a952aa0d new temp file ... /opt/couchbase/var/lib/couchbase/tmp/scan-results114623070685572
      2020-09-11T16:08:56.965-07:00 [INFO] Using plain authentication for user <ud>@cbq-engine</ud> 
      2020-09-11T16:08:56.971-07:00 [INFO] Using plain authentication for user <ud>@cbq-engine</ud> 
      2020-09-11T16:08:57.126-07:00 [Info] GSIC[default/DISTRICT-_default-_default-1599860577111098260] logstats "DISTRICT" {"gsi_scan_count":13116,"gsi_scan_duration":752612993488,"gsi_throttle_duration":0,"gsi_prime_duration":2415703966999,"gsi_blocked_duration":0,"gsi_total_temp_files":0}
      2020-09-11T16:08:57.171-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210' 
      2020-09-11T16:08:57.171-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210' 
      2020-09-11T16:08:57.171-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210' 
      2020-09-11T16:08:57.174-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210' 
      2020-09-11T16:08:57.175-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210' 
      2020-09-11T16:08:57.175-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210' 
      2020-09-11T16:08:57.178-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210' 
      2020-09-11T16:08:57.178-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210' 
      2020-09-11T16:08:57.180-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210' 
      2020-09-11T16:08:57.188-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210' 
      2020-09-11T16:08:57.188-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210' 
      2020-09-11T16:08:57.189-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210' 
      2020-09-11T16:08:57.197-07:00 [INFO] Pool Get returned default: Unable to find given hostport in cbauth database: `172.23.96.48:11210' 
      

      172.23.96.48 is an Eventing node.

      A lot delete and select queries have failed, probably because of the same underlying issue. Also, there are a 2000+ queries stuck at the moment (checked on active_requests). One of the stuck queries is - create primary index on `default` using GSI. Due to this, the test is not able to proceed.

      There was also an OOM kill reported on the other query node - 172.23.97.149

      [root@localhost ~]# dmesg -T | grep cbq-engine
      [Sat Sep 12 00:47:41 2020] [114181]   996 114181  6545941  5326305   12205   840390             0 cbq-engine
      [Sat Sep 12 00:47:41 2020] Out of memory: Kill process 114181 (cbq-engine) score 946 or sacrifice child
      [Sat Sep 12 00:47:45 2020] cbq-engine invoked oom-killer: gfp_mask=0x200da, order=0, oom_score_adj=0
      [Sat Sep 12 00:47:45 2020] cbq-engine cpuset=/ mems_allowed=0
      [Sat Sep 12 00:47:45 2020] CPU: 6 PID: 121981 Comm: cbq-engine Kdump: loaded Not tainted 3.10.0-1062.9.1.el7.x86_64 #1
      [Sat Sep 12 00:47:45 2020] [114181]   996 114181  6545941  5327535   12210   841855             0 cbq-engine
      [Sat Sep 12 00:47:45 2020] Out of memory: Kill process 114181 (cbq-engine) score 947 or sacrifice child
      [Sat Sep 12 00:47:45 2020] [114181]   996 114181  6545941  5326271   12210   843119             0 cbq-engine
      [Sat Sep 12 00:47:45 2020] Out of memory: Kill process 114181 (cbq-engine) score 947 or sacrifice child
      [Sat Sep 12 00:47:45 2020] Killed process 114181 (cbq-engine), UID 996, total-vm:26183764kB, anon-rss:21305084kB, file-rss:0kB, shmem-rss:0kB
      

      Also, the cbq-engine process on 172.23.97.148 is hogging too much CPU and RAM. Output from top :

      top - 11:39:37 up 5 days, 20:00,  1 user,  load average: 21.25, 21.72, 22.00
      Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
      %Cpu(s): 13.2 us, 86.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.1 st
      KiB Mem : 22431768 total, 15434376 free,  4669404 used,  2327988 buff/cache
      KiB Swap:  3670012 total,  3670012 free,        0 used. 17405344 avail Mem
       
         PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
      114623 couchba+  20   0 5278928   3.4g  28664 S 738.5 15.8   8038:56 cbq-engine
      

      Attaching the following :
      1. Initial cbcollect that has log entries from 2020-09-11T16:08:57
      2. Query goroutine dump, memory, cpu and heap profile, and dump of active and completed requests from both the query nodes taken at 9/12 11:26 AM

      Note : This is the same test and hardware as the one used to certify Mad-hatter release. No changes are done to this test.

      This test was last run successfully with 7.0.0-2792 (and saw MB-40805 there). After that the test couldn't run much because of another blocker issue in Analytics - MB-40832. Hence treating this as a regression from 7.0.0-2792 and 7.0.0-3090

      Attachments

        1. 172.23.120.73_vbucketmap.json
          26 kB
        2. 172.23.97.149_vbucketmap.json
          26 kB
        3. goroutine_dump_148_1205.txt
          27.21 MB
        4. goroutine_dump_148.txt
          11.83 MB
        5. goroutine_dump_149_1205.txt
          38.99 MB
        6. goroutine_dump_149.txt
          2.00 MB
        7. heap_148_1205.prof
          357 kB
        8. heap_148.prof
          329 kB
        9. heap_149_1205.prof
          353 kB
        10. heap_149.prof
          230 kB
        11. query_148_1205.mprof
          51.39 MB
        12. query_148.mprof
          43.33 MB
        13. query_149_1205.mprof
          47.52 MB
        14. query_149.mprof
          26.06 MB
        15. query_active_requests_148_1205.json
          649 kB
        16. query_active_requests_148.json
          967 kB
        17. query_active_requests_149_1205.json
          706 kB
        18. query_active_requests_149.json
          0.0 kB
        19. query_completed_requests_148_1205.json
          2.90 MB
        20. query_completed_requests_148.json
          2.47 MB
        21. query_completed_requests_149_1205.json
          2.88 MB
        22. query_completed_requests_149.json
          2.44 MB
        23. query_cpu_profile_148_1205.pprof
          45 kB
        24. query_cpu_profile_148.pprof
          88 kB
        25. query_cpu_profile_149_1205.pprof
          52 kB
        26. query_cpu_profile_149.pprof
          86 kB
        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          marco.greco Marco Greco added a comment -

          Mihir Kamdar from the goroutine dump the lastest stall looks like MB-43070, which I am progressing separately

          marco.greco Marco Greco added a comment - Mihir Kamdar from the goroutine dump the lastest stall looks like MB-43070 , which I am progressing separately
          marco.greco Marco Greco added a comment -

          No other occurrence of vbucketmaps pointing to non kv nodes, resolving.

          marco.greco Marco Greco added a comment - No other occurrence of vbucketmaps pointing to non kv nodes, resolving.

          Verified on 7.0.0-4226. Not seeing this issue any more.

          mihir.kamdar Mihir Kamdar (Inactive) added a comment - Verified on 7.0.0-4226. Not seeing this issue any more.

          Build sync_gateway-3.0.0-223 contains go-couchbase commit 9d3a2a7 with commit message:
          Revert "MB-41420 diag"

          build-team Couchbase Build Team added a comment - Build sync_gateway-3.0.0-223 contains go-couchbase commit 9d3a2a7 with commit message: Revert " MB-41420 diag"

          Build sync_gateway-3.0.0-223 contains go-couchbase commit 0d6ea75 with commit message:
          MB-41420 diag

          build-team Couchbase Build Team added a comment - Build sync_gateway-3.0.0-223 contains go-couchbase commit 0d6ea75 with commit message: MB-41420 diag

          People

            marco.greco Marco Greco
            mihir.kamdar Mihir Kamdar (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty