Couchbase Server / MB-43613

[N1QL TXNS][SYSTEM TEST] memcached sacrificed due to OOM while running n1ql txns


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: Cheshire-Cat
    • Fix Version/s: 7.0.0
    • Component/s: test-execution
    • Environment:
      7.0.0-4170

      Description

      We are seeing memcached crashes on 4 separate nodes during our n1ql transactions system test. For the node examined here the crash is at 08:13:26; we see the same behavior on the other data nodes that crashed. Logs for all 4 nodes will be attached, along with the crash times listed below. Also attaching logs from a query node, 172.23.104.21.

      172.23.104.16 : 08:13:26
      172.23.104.17: 08:11:44
      172.23.104.18: 08:16:05
      172.23.104.19: 08:15:17

      In the test itself, transactions began running at 06:28:46 and finished at 08:48:15. Later in the system test we see similar behavior, with memcached on the data nodes crashing during the transaction runs.

      /opt/couchbase/var/lib/couchbase/logs/info.log:[user:info,2021-01-12T08:13:26.597-08:00,ns_1@172.23.104.16:<0.11086.0>:ns_log:crash_consumption_loop:69]Service 'memcached' exited with status 137. Restarting. Messages:
      
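      Exit status 137 is 128 + 9, i.e. memcached was terminated by signal 9 (SIGKILL) rather than exiting of its own accord; as a quick sanity check, bash's kill builtin maps exit codes above 128 back to the signal name:

      # Exit codes above 128 encode the fatal signal: 137 - 128 = 9 (SIGKILL)
      kill -l 137   # prints: KILL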

      If we check dmesg around this timestamp, we see that the node has run out of memory:

      [root@cnt7-qe1 ~]# dmesg -T | grep -i "out of memory"
      [Tue Jan 12 08:13:20 2021] Out of memory: Kill process 64087 (memcached) score 790 or sacrifice child
      [Tue Jan 12 08:13:20 2021] Out of memory: Kill process 64089 (memcached) score 790 or sacrifice child
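
      The full OOM report, including the per-process memory table quoted in the comments below, can be pulled from the kernel ring buffer in one pass (a sketch; the exact message wording varies between kernel versions):

      # Print everything from the oom-killer invocation through the kernel's final kill decision
      dmesg -T | sed -n '/invoked oom-killer/,/Killed process/p'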
      
      

      If we check the memcached logs around this time, we see slow operation warnings such as these:

      2021-01-12T08:13:00.667540-08:00 WARNING 340: Slow operation. {"cid":"172.23.97.83:46408/0","duration":"5129 ms","trace":"request=1038140068077781:5129845","command":"GET_ALL_VB_SEQNOS","peer":{"ip":"172.23.97.83","port":46408},"bucket":"bucket3","packet":{"bodylen":4,"cas":0,"datatype":"raw","extlen":4,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":0,"opcode":"GET_ALL_VB_SEQNOS","vbucket":0}}
      2021-01-12T08:13:00.667517-08:00 WARNING 231: Slow operation. {"cid":"172.23.96.30:35746/0","duration":"5533 ms","trace":"request=1038139640104401:5533234","command":"GET_ALL_VB_SEQNOS","peer":{"ip":"172.23.96.30","port":35746},"bucket":"bucket3","packet":{"bodylen":4,"cas":0,"datatype":"raw","extlen":4,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":0,"opcode":"GET_ALL_VB_SEQNOS","vbucket":0}}
      2021-01-12T08:13:00.695827-08:00 WARNING 221: Slow operation. {"cid":"c049e821a4a7ad18/0c97291ef3d93c8c/d12","duration":"1011 ms","trace":"request=1038148297012481:1011078","command":"GET_CLUSTER_CONFIG","peer":{"ip":"172.23.104.23","port":34214},"bucket":"bucket4","packet":{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":302841856,"opcode":"GET_CLUSTER_CONFIG","vbucket":0}}
      2021-01-12T08:13:00.715680-08:00 WARNING 81: Slow operation. {"cid":"127.0.0.1:38518/0","duration":"12 s","trace":"request=1038137513143626:12175295","command":"STAT","peer":{"ip":"127.0.0.1","port":38518},"bucket":"bucket2","packet":{"bodylen":8,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>dcpagg.:</ud>","keylen":8,"magic":"ClientRequest","opaque":0,"opcode":"STAT","vbucket":0}}
      2021-01-12T08:13:00.803597-08:00 INFO (bucket2) DCP (Producer) eq_dcpq:secidx:proj-bucket2-MAINT_STREAM_TOPIC_5769e97ec1395454814a6ff5e5bfb7e2-18405484233030147622/3 - Notifying paused connection now that DcpProducer::BufferLog is no longer full; ackedBytes:2959667395, bytesSent:18653581, maxBytes:20971520
      2021-01-12T08:13:13.610630-08:00 WARNING 235: Slow operation. {"cid":"172.23.97.83:41558/0","duration":"3714 ms","trace":"request=1038155381409997:3714817","command":"GET_ALL_VB_SEQNOS","peer":{"ip":"172.23.97.83","port":41558},"bucket":"bucket2","packet":{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":0,"opcode":"GET_ALL_VB_SEQNOS","vbucket":0}}
      2021-01-12T08:13:13.610157-08:00 WARNING 93: Slow operation. {"cid":"127.0.0.1:46420/0","duration":"9263 ms","trace":"request=1038149817263470:9263155","command":"STAT","peer":{"ip":"127.0.0.1","port":46420},"bucket":"bucket3","packet":{"bodylen":7,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>topkeys</ud>","keylen":7,"magic":"ClientRequest","opaque":0,"opcode":"STAT","vbucket":0}}
      

        Attachments


          Activity

          Dave Rigby added a comment - edited

          We are seeing memcached crashes on 4 separate nodes during our n1ql transactions system test...

          This is incorrect. You're seeing memcached being forcefully terminated (via SIGKILL) by the Linux kernel out-of-memory killer. memcached isn't crashing (a crash would imply it performed some illegal operation).

          The Linux OOM-killer is invoked once total RAM usage exhausts the available RAM; however, the process which is "sacrificed" is an essentially arbitrary choice (typically the process with the largest RAM usage). Given that memcached generally has a large RAM quota, it isn't necessarily doing anything wrong; it's simply the victim.

          In this case, the Data Service quota (via Nutshell) is:

          [BAD]   Total Server Quota  : Total Quota (21536 MB) 90.00% of Total RAM (23929 MB)
                                        Max recommended is 80.00% (19143.2 MB)
          

          Note the warning from Nutshell - the Data Service has been allocated 90% of total RAM, leaving only 2,393 MB for all other processes and the kernel.
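
          For reference, the arithmetic behind that headroom figure (using the Nutshell numbers above):

          # Total RAM minus the Data Service quota = RAM left for the kernel and every other process
          echo $(( 23929 - 21536 ))   # => 2393 MB (~10% of total RAM)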

          From the Linux oom-killer message, we can see the following per-process memory usage at the point the node ran out of memory (note: all numbers are in 4 KiB pages):

          [Tue Jan 12 08:13:21 2021] 7_scheduler invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
          ...
          [Tue Jan 12 08:13:21 2021] [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
          [Tue Jan 12 08:13:21 2021] [  502]     0   502    48336    29580     100        0             0 systemd-journal
          [Tue Jan 12 08:13:21 2021] [  524]     0   524    50848     1079      34        0             0 lvmetad
          [Tue Jan 12 08:13:21 2021] [  537]     0   537    11750      537      22        0         -1000 systemd-udevd
          [Tue Jan 12 08:13:21 2021] [  687]     0   687    13877      111      28        0         -1000 auditd
          [Tue Jan 12 08:13:21 2021] [  709]     0   709   144010    15188     152        0             0 rsyslogd
          [Tue Jan 12 08:13:21 2021] [  710]  1001   710   179373     2125      19        0             0 node_exporter
          [Tue Jan 12 08:13:21 2021] [  712]     0   712     5420       96      14        0             0 irqbalance
          [Tue Jan 12 08:13:21 2021] [  714]   998   714   133543     1625      58        0             0 polkitd
          [Tue Jan 12 08:13:21 2021] [  715]    81   715     8206      133      17        0          -900 dbus-daemon
          [Tue Jan 12 08:13:21 2021] [  717]    38   717    11292      179      27        0             0 ntpd
          [Tue Jan 12 08:13:21 2021] [  728]     0   728   138114     7218      95        0             0 NetworkManager
          [Tue Jan 12 08:13:21 2021] [  729]     0   729     6062       78      16        0             0 systemd-logind
          [Tue Jan 12 08:13:21 2021] [  733]     0   733     6476       52      18        0             0 atd
          [Tue Jan 12 08:13:21 2021] [  735]     0   735    31571      160      19        0             0 crond
          [Tue Jan 12 08:13:21 2021] [  753]     0   753    27522       33      11        0             0 agetty
          [Tue Jan 12 08:13:21 2021] [  782]     0   782    27522       34      11        0             0 agetty
          [Tue Jan 12 08:13:21 2021] [  939]     0   939   140608     3174      92        0             0 tuned
          [Tue Jan 12 08:13:21 2021] [  940]     0   940    26511      246      55        0         -1000 sshd
          [Tue Jan 12 08:13:21 2021] [ 1001]     0  1001   197616      351       6        0             0 xe-daemon
          [Tue Jan 12 08:13:21 2021] [ 1015]     0  1015     1091       27       7        0             0 logger
          [Tue Jan 12 08:13:21 2021] [ 1257]     0  1257    22396      260      42        0             0 master
          [Tue Jan 12 08:13:21 2021] [ 1259]    89  1259    22466      266      45        0             0 qmgr
          [Tue Jan 12 08:13:21 2021] [62187]     0 62187    28353     3122      57        0             0 dhclient
          [Tue Jan 12 08:13:21 2021] [63740]  1000 63740  1043268     7778     116        0             0 beam.smp
          [Tue Jan 12 08:13:21 2021] [63753]  1000 63753     2907       40      11        0             0 epmd
          [Tue Jan 12 08:13:21 2021] [63858]  1000 63858     1088       23       8        0             0 erl_child_setup
          [Tue Jan 12 08:13:21 2021] [63889]  1000 63889     2399      270      10        0             0 gosecrets
          [Tue Jan 12 08:13:21 2021] [63895]  1000 63895  1217453   133108     547        0             0 beam.smp
          [Tue Jan 12 08:13:21 2021] [63917]  1000 63917     1088       23       8        0             0 erl_child_setup
          [Tue Jan 12 08:13:21 2021] [63948]  1000 63948    28295       54      12        0             0 sh
          [Tue Jan 12 08:13:21 2021] [63950]  1000 63950     1086       24       8        0             0 memsup
          [Tue Jan 12 08:13:21 2021] [63951]  1000 63951     1086       18       8        0             0 cpu_sup
          [Tue Jan 12 08:13:21 2021] [64057]  1000 64057     2897       29      11        0             0 inet_gethost
          [Tue Jan 12 08:13:21 2021] [64058]  1000 64058     4484       42      14        0             0 inet_gethost
          [Tue Jan 12 08:13:21 2021] [64062]  1000 64062    55317      230      19        0             0 saslauthd-port
          [Tue Jan 12 08:13:21 2021] [64072]  1000 64072     2550      421       9        0             0 goport
          [Tue Jan 12 08:13:21 2021] [64078]  1000 64078   106494     1645      41        0             0 goxdcr
          [Tue Jan 12 08:13:21 2021] [64087]  1000 64087  6777131  4832510   10632        0             0 memcached
          [Tue Jan 12 08:13:21 2021] [64114]  1000 64114   241024     1197      79        0             0 js-evaluator
          [Tue Jan 12 08:13:21 2021] [64129]  1000 64129   241024     1200      78        0             0 js-evaluator
          [Tue Jan 12 08:13:21 2021] [65153]  1000 65153   851714    22454     180        0             0 beam.smp
          [Tue Jan 12 08:13:21 2021] [65160]  1000 65160     1088       24       8        0             0 erl_child_setup
          [Tue Jan 12 08:13:21 2021] [65210]  1000 65210    28295       51      12        0             0 sh
          [Tue Jan 12 08:13:21 2021] [65211]  1000 65211     1086       24       7        0             0 memsup
          [Tue Jan 12 08:13:21 2021] [65213]  1000 65213     1086       18       7        0             0 cpu_sup
          [Tue Jan 12 08:13:21 2021] [65222]  1000 65222     2478     1406       9        0             0 godu
          [Tue Jan 12 08:13:21 2021] [65228]  1000 65228    28295       54      12        0             0 sh
          [Tue Jan 12 08:13:21 2021] [65229]  1000 65229      845       89       7        0             0 godu
          [Tue Jan 12 08:13:21 2021] [65236]  1000 65236     1597      177       7        0             0 goport
          [Tue Jan 12 08:13:21 2021] [65242]  1000 65242   295888    21627     121        0             0 prometheus
          [Tue Jan 12 08:13:21 2021] [65255]  1000 65255     2155       43       9        0             0 sigar_port
          [Tue Jan 12 08:13:21 2021] [65256]  1000 65256     2155       52      10        0             0 sigar_port
          [Tue Jan 12 08:13:21 2021] [65259]  1000 65259     2550      925       9        0             0 goport
          [Tue Jan 12 08:13:21 2021] [65265]  1000 65265  2211450   688604    4077        0             0 projector
          [Tue Jan 12 08:13:21 2021] [65330]  1000 65330     2897       25      11        0             0 inet_gethost
          [Tue Jan 12 08:13:21 2021] [65331]  1000 65331     2897       31      11        0             0 inet_gethost
          [Tue Jan 12 08:13:21 2021] [83558]  1000 83558     4484       43      14        0             0 inet_gethost
          [Tue Jan 12 08:13:21 2021] [109821]    89 109821    22422      253      44        0             0 pickup
          

          memcached is only consuming 4832510 pages of RSS, or 18,876 MB - well below its allowed quota of 21,536 MB. The next biggest process is projector, which is consuming 688604 pages, or 2,689 MB of RAM - i.e. more than the entire memory left over for all non-memcached processes.
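
          The conversions can be reproduced directly from the rss column (a quick sketch, given the 4 KiB page size noted above):

          # rss is in 4 KiB pages; multiply by 4 and divide by 1024 to get MB
          echo $(( 4832510 * 4 / 1024 ))   # memcached: 18876 MB
          echo $((  688604 * 4 / 1024 ))   # projector:  2689 MB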

          (After that there's nothing that high; next highest is ns_server at around 519MB).

          This is either an environmental sizing issue (as per the Nutshell warning, leaving only 2,393 MB for all other processes is bad), or possibly an issue with the projector, if this is a test which previously passed on this environment.
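
          To verify the sizing, the configured quota can be compared against the 80% recommendation via the cluster REST API (a hedged sketch; credentials and host are placeholders, field names as returned by /pools/default):

          # memoryQuota is reported in MB; storageTotals.ram.total is in bytes
          curl -s -u Administrator:password http://172.23.104.16:8091/pools/default |
            jq '{quota_mb: .memoryQuota, total_ram_mb: (.storageTotals.ram.total / 1048576 | floor)}'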

          Assigning back to the reporter to check sizing / whether this is a regression relative to a previous test run.

          Raju Suravarjjala added a comment

          Ajay Bhullar Are you able to reproduce this issue? What are the next steps?

          Mihir Kamdar added a comment

          This issue is no longer seen after the test was adjusted for load.


            People

             Assignee:
             Ajay Bhullar
             Reporter:
             Ajay Bhullar
             Votes:
             0
             Watchers:
             5


                Gerrit Reviews

                There are no open Gerrit changes
