Couchbase Server / MB-26833

Rebalance slows down and eventually fails as memcached goes OOM (9 nodes, 10TB)

Details

    Description

      Test scenario:

      • 9 nodes
      • 1 bucket, 1 replica, full eviction
      • 10B items (~1KB), 1% resident ratio
      • 15K ops/sec (90% read, 10% update), 10% cache miss ratio (before rebalance)
      • Swap rebalance of one node (172.23.96.108 -> 172.23.96.109)
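For orientation, the scenario figures above can be turned into rough per-node numbers. This is only a back-of-envelope sketch, not something measured in the test: it assumes 1 KB = 1000 bytes, an even spread of active and replica data across the 9 nodes, that "resident" covers item values only (no metadata overhead), and that the 10% cache miss ratio applies to reads. The function and key names are made up for illustration.

```python
# Rough sizing of the test scenario. Assumptions, not taken from the report:
# 1 KB = 1000 bytes, data is spread evenly across nodes, "resident" counts
# item values only, and the cache miss ratio applies to reads only.

ITEMS = 10_000_000_000   # 10B items
ITEM_BYTES = 1_000       # ~1KB each
REPLICAS = 1
NODES = 9
RESIDENT_PCT = 1         # 1% resident ratio
OPS_PER_SEC = 15_000
READ_PCT = 90            # 90% read, 10% update
MISS_PCT = 10            # 10% cache miss ratio


def scenario_sizing():
    """Return integer estimates derived from the scenario bullets."""
    active = ITEMS * ITEM_BYTES                  # active data set
    total = active * (1 + REPLICAS)              # active + replica copies
    per_node = total // NODES                    # data held by each node
    resident_per_node = per_node * RESIDENT_PCT // 100
    reads = OPS_PER_SEC * READ_PCT // 100
    misses = reads * MISS_PCT // 100             # reads that must go to disk
    return {
        "active_bytes": active,
        "total_bytes": total,
        "per_node_bytes": per_node,
        "resident_per_node_bytes": resident_per_node,
        "reads_per_sec": reads,
        "disk_fetches_per_sec": misses,
    }


if __name__ == "__main__":
    for name, value in scenario_sizing().items():
        print(f"{name}: {value:,}")
```

Under these assumptions the active data set is about 10 TB (20 TB with the replica), each node holds roughly 2.2 TB, and a swap rebalance has to move about that much onto the incoming node.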

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=titan_500-3519_rebalance_734b

      [user:error,2017-11-13T23:07:39.113-08:00,ns_1@172.23.96.100:<0.1876.0>:ns_orchestrator:do_log_rebalance_completion:1249]Rebalance exited with reason {unexpected_exit,
                                    {'EXIT',<0.10489.188>,
                                     {wait_seqno_persisted_failed,"bucket-1",109,
                                      9761353,
                                      [{'ns_1@172.23.96.109',
                                        {'EXIT',
                                         {{badmatch,{error,closed}},
                                          {gen_server,call,
                                           [{'janitor_agent-bucket-1',
                                             'ns_1@172.23.96.109'},
                                            {if_rebalance,<0.20946.163>,
                                             {wait_seqno_persisted,109,9761353}},
                                            infinity]}}}}]}}}
      

      Nov 13 23:07:34 172-23-96-109 kernel: sshd invoked oom-killer: gfp_mask=0x3000d0, order=2, oom_score_adj=-1000
      Nov 13 23:07:34 172-23-96-109 kernel: sshd cpuset=/ mems_allowed=0
      Nov 13 23:07:34 172-23-96-109 kernel: CPU: 11 PID: 1559 Comm: sshd Not tainted 3.10.0-514.2.2.el7.x86_64 #1
      Nov 13 23:07:34 172-23-96-109 kernel: Hardware name: Dell Inc. PowerEdge R430/03XKDV, BIOS 1.2.6 06/08/2015
      Nov 13 23:07:34 172-23-96-109 kernel: ffff880ff2a5af10 00000000156a25bf ffff880ffd283ae8 ffffffff816861cc
      Nov 13 23:07:34 172-23-96-109 kernel: ffff880ffd283b78 ffffffff81681177 ffffffff812ae87b ffff880ff984e830
      Nov 13 23:07:34 172-23-96-109 kernel: ffff880ffd1c2220 65a455fb00000020 fffeefff00000000 0000000000000001
      Nov 13 23:07:34 172-23-96-109 kernel: Call Trace:
      Nov 13 23:07:34 172-23-96-109 kernel: [<ffffffff816861cc>] dump_stack+0x19/0x1b
      Nov 13 23:07:34 172-23-96-109 kernel: [<ffffffff81681177>] dump_header+0x8e/0x225
      Nov 13 23:07:34 172-23-96-109 kernel: [<ffffffff812ae87b>] ? cred_has_capability+0x6b/0x120
      Nov 13 23:07:34 172-23-96-109 kernel: [<ffffffff8118476e>] oom_kill_process+0x24e/0x3c0
      Nov 13 23:07:34 172-23-96-109 kernel: [<ffffffff810937ee>] ? has_capability_noaudit+0x1e/0x30
      Nov 13 23:07:34 172-23-96-109 kernel: [<ffffffff81184fa6>] out_of_memory+0x4b6/0x4f0
      Nov 13 23:07:34 172-23-96-109 kernel: [<ffffffff81681c80>] __alloc_pages_slowpath+0x5d7/0x725
      Nov 13 23:07:34 172-23-96-109 kernel: [<ffffffff8118b0b5>] __alloc_pages_nodemask+0x405/0x420
      Nov 13 23:07:34 172-23-96-109 kernel: [<ffffffff810831cd>] copy_process+0x1dd/0x1960
      Nov 13 23:07:34 172-23-96-109 kernel: [<ffffffff81084b01>] do_fork+0x91/0x2c0
      Nov 13 23:07:34 172-23-96-109 kernel: [<ffffffff81084db6>] SyS_clone+0x16/0x20
      Nov 13 23:07:34 172-23-96-109 kernel: [<ffffffff81696b19>] stub_clone+0x69/0x90
      Nov 13 23:07:34 172-23-96-109 kernel: [<ffffffff816967c9>] ? system_call_fastpath+0x16/0x1b
      Nov 13 23:07:34 172-23-96-109 kernel: Mem-Info:
      Nov 13 23:07:34 172-23-96-109 kernel: active_anon:14912087 inactive_anon:758328 isolated_anon:0#012 active_file:150 inactive_file:0 isolated_file:2#012 unevictable:0 dirty:0 writeback:63 unstable:0#012 slab_reclaimable:192854 slab_unreclaimable:37216#012 mapped:2694 shmem:43091 pagetables:35213 bounce:0#012 free:89791 free_pcp:121 free_cma:0
      Nov 13 23:07:34 172-23-96-109 kernel: Node 0 DMA free:15864kB min:16kB low:20kB high:24kB active_anon:0kB inactive_anon:0kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15980kB managed:15896kB mlocked:0kB dirty:0kB writeback:0kB mapped:0kB shmem:0kB slab_reclaimable:0kB slab_unreclaimable:32kB kernel_stack:0kB pagetables:0kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      Nov 13 23:07:34 172-23-96-109 kernel: lowmem_reserve[]: 0 1692 63128 63128
      Nov 13 23:07:34 172-23-96-109 kernel: Node 0 DMA32 free:247144kB min:1808kB low:2260kB high:2712kB active_anon:511864kB inactive_anon:521132kB active_file:0kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:1982164kB managed:1734352kB mlocked:0kB dirty:0kB writeback:0kB mapped:8kB shmem:1256kB slab_reclaimable:410860kB slab_unreclaimable:20556kB kernel_stack:2096kB pagetables:2856kB unstable:0kB bounce:0kB free_pcp:0kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? yes
      Nov 13 23:07:34 172-23-96-109 kernel: lowmem_reserve[]: 0 0 61436 61436
      Nov 13 23:07:34 172-23-96-109 kernel: Node 0 Normal free:96156kB min:65756kB low:82192kB high:98632kB active_anon:59136484kB inactive_anon:2512180kB active_file:600kB inactive_file:0kB unevictable:0kB isolated(anon):0kB isolated(file):8kB present:63963136kB managed:62910988kB mlocked:0kB dirty:0kB writeback:252kB mapped:10768kB shmem:171108kB slab_reclaimable:360556kB slab_unreclaimable:128276kB kernel_stack:16352kB pagetables:137996kB unstable:0kB bounce:0kB free_pcp:484kB local_pcp:0kB free_cma:0kB writeback_tmp:0kB pages_scanned:902 all_unreclaimable? yes
      Nov 13 23:07:35 172-23-96-109 kernel: lowmem_reserve[]: 0 0 0 0
      Nov 13 23:07:35 172-23-96-109 kernel: Node 0 DMA: 0*4kB 1*8kB (U) 1*16kB (U) 1*32kB (U) 1*64kB (U) 1*128kB (U) 1*256kB (U) 0*512kB 1*1024kB (U) 1*2048kB (M) 3*4096kB (M) = 15864kB
      Nov 13 23:07:35 172-23-96-109 kernel: Node 0 DMA32: 14456*4kB (UEM) 6505*8kB (UEM) 3136*16kB (UEM) 678*32kB (UEM) 154*64kB (UEM) 86*128kB (UEM) 56*256kB (UM) 37*512kB (UEM) 11*1024kB (M) 0*2048kB 0*4096kB = 247144kB
      Nov 13 23:07:35 172-23-96-109 kernel: Node 0 Normal: 23736*4kB (UEM) 203*8kB (UEM) 79*16kB (UEM) 19*32kB (UEM) 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 98440kB
      Nov 13 23:07:35 172-23-96-109 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=1048576kB
      Nov 13 23:07:35 172-23-96-109 kernel: Node 0 hugepages_total=0 hugepages_free=0 hugepages_surp=0 hugepages_size=2048kB
      Nov 13 23:07:35 172-23-96-109 kernel: 359060 total pagecache pages
      Nov 13 23:07:35 172-23-96-109 kernel: 314584 pages in swap cache
      Nov 13 23:07:35 172-23-96-109 kernel: Swap cache stats: add 7942786, delete 7628202, find 1601153/2403101
      Nov 13 23:07:35 172-23-96-109 kernel: Free swap  = 0kB
      Nov 13 23:07:35 172-23-96-109 kernel: Total swap = 4194300kB
      Nov 13 23:07:35 172-23-96-109 kernel: 16490320 pages RAM
      Nov 13 23:07:35 172-23-96-109 kernel: 0 pages HighMem/MovableOnly
      Nov 13 23:07:35 172-23-96-109 kernel: 325011 pages reserved
      Nov 13 23:07:35 172-23-96-109 kernel: [ pid ]   uid  tgid total_vm      rss nr_ptes swapents oom_score_adj name
      Nov 13 23:07:35 172-23-96-109 kernel: [  904]     0   904    11535     2991      27       48             0 systemd-journal
      Nov 13 23:07:35 172-23-96-109 kernel: [  918]     0   918    48781        0      28      906             0 lvmetad
      Nov 13 23:07:35 172-23-96-109 kernel: [  937]     0   937    11844       11      23      737         -1000 systemd-udevd
      Nov 13 23:07:35 172-23-96-109 kernel: [ 1235]     0  1235    13854       30      27       83         -1000 auditd
      Nov 13 23:07:35 172-23-96-109 kernel: [ 1261]     0  1261     4853       68      14       58             0 irqbalance
      Nov 13 23:07:35 172-23-96-109 kernel: [ 1265]     0  1265     5700       20      13       85             0 ipmievd
      Nov 13 23:07:35 172-23-96-109 kernel: [ 1270]    81  1270     8239      122      16       39          -900 dbus-daemon
      Nov 13 23:07:35 172-23-96-109 kernel: [ 1276]     0  1276   131313      335      69      314             0 NetworkManager
      Nov 13 23:07:35 172-23-96-109 kernel: [ 1278]    70  1278     7093       98      18       44             0 avahi-daemon
      Nov 13 23:07:35 172-23-96-109 kernel: [ 1279]    70  1279     6995        8      17       52             0 avahi-daemon
      Nov 13 23:07:35 172-23-96-109 kernel: [ 1285]     0  1285    31555       41      17      118             0 crond
      Nov 13 23:07:35 172-23-96-109 kernel: [ 1288]    38  1288    11161       37      26      136             0 ntpd
      Nov 13 23:07:35 172-23-96-109 kernel: [ 1294]     0  1294    27509        1       9       33             0 agetty
      Nov 13 23:07:35 172-23-96-109 kernel: [ 1319]   999  1319   131879      213      52     1407             0 polkitd
      Nov 13 23:07:35 172-23-96-109 kernel: [ 1559]     0  1559    26370       24      53      223         -1000 sshd
      Nov 13 23:07:35 172-23-96-109 kernel: [ 1561]     0  1561    95899     2174      80      111             0 rsyslogd
      Nov 13 23:07:35 172-23-96-109 kernel: [ 4805]  1000  4805   451269     4878     176     5914             0 beam.smp
      Nov 13 23:07:35 172-23-96-109 kernel: [ 4818]  1000  4818     2892       12      11       33             0 epmd
      Nov 13 23:07:35 172-23-96-109 kernel: [ 4964]  1000  4964     2038      908       8      108             0 gosecrets
      Nov 13 23:07:35 172-23-96-109 kernel: [ 4969]  1000  4969   580792    80511     385     9070             0 beam.smp
      Nov 13 23:07:35 172-23-96-109 kernel: [ 5043]  1000  5043    28282       41      12       23             0 sh
      Nov 13 23:07:35 172-23-96-109 kernel: [ 5044]  1000  5044     1073       15       8        9             0 memsup
      Nov 13 23:07:35 172-23-96-109 kernel: [ 5045]  1000  5045     1073        0       8       22             0 cpu_sup
      Nov 13 23:07:35 172-23-96-109 kernel: [ 5047]  1000  5047     2884        5      11       25             0 inet_gethost
      Nov 13 23:07:35 172-23-96-109 kernel: [ 5048]  1000  5048     2884        2      11       31             0 inet_gethost
      Nov 13 23:07:35 172-23-96-109 kernel: [ 5170]  1000  5170    28301      119      17      213             0 saslauthd-port
      Nov 13 23:07:35 172-23-96-109 kernel: [ 5192]  1000  5192 18622809 15205939   32725   994580             0 memcached
      Nov 13 23:07:35 172-23-96-109 kernel: [ 5474]  1000  5474     2884        0      11       30             0 inet_gethost
      Nov 13 23:07:35 172-23-96-109 kernel: [ 5476]  1000  5476     2884        0      11       33             0 inet_gethost
      Nov 13 23:07:35 172-23-96-109 kernel: [ 5482]  1000  5482   352155     4081     168     7174             0 beam.smp
      Nov 13 23:07:35 172-23-96-109 kernel: [18040]  1000 18040     2884        3      11       30             0 inet_gethost
      Nov 13 23:07:35 172-23-96-109 kernel: [18041]  1000 18041     2884       10      11       23             0 inet_gethost
      Nov 13 23:07:35 172-23-96-109 kernel: [18042]  1000 18042     2884       10      11       23             0 inet_gethost
      Nov 13 23:07:35 172-23-96-109 kernel: [18080]  1000 18080   397631     6445     221     8635             0 beam.smp
      Nov 13 23:07:35 172-23-96-109 kernel: [18259]  1000 18259    28283       29      12       23             0 sh
      Nov 13 23:07:35 172-23-96-109 kernel: [18260]  1000 18260     1074       12       8       12             0 memsup
      Nov 13 23:07:35 172-23-96-109 kernel: [18261]  1000 18261     1074        0       8       24             0 cpu_sup
      Nov 13 23:07:35 172-23-96-109 kernel: [18275]  1000 18275     6088      961      16        9             0 godu
      Nov 13 23:07:35 172-23-96-109 kernel: [18276]  1000 18276    28282       31      11       22             0 sh
      Nov 13 23:07:35 172-23-96-109 kernel: [18277]  1000 18277      846       47       7      109             0 godu
      Nov 13 23:07:35 172-23-96-109 kernel: [18307]  1000 18307     2279      196      10        6             0 sigar_port
      Nov 13 23:07:35 172-23-96-109 kernel: [18319]  1000 18319     2251      562       8       75             0 goport
      Nov 13 23:07:35 172-23-96-109 kernel: [18323]  1000 18323    57293     1044      28       54             0 goxdcr
      Nov 13 23:07:35 172-23-96-109 kernel: [18368]  1000 18368     1923      323       8       82             0 goport
      Nov 13 23:07:35 172-23-96-109 kernel: [18372]  1000 18372   643851      996     112      224             0 projector
      Nov 13 23:07:35 172-23-96-109 kernel: [18376]  1000 18376     2885        0      10       30             0 inet_gethost
      Nov 13 23:07:35 172-23-96-109 kernel: [18378]  1000 18378     2885        1      10       33             0 inet_gethost
      Nov 13 23:07:35 172-23-96-109 kernel: [28931]     0 28931    36655      318      70        0             0 sshd
      Nov 13 23:07:35 172-23-96-109 kernel: [28933]     0 28933    36655      318      69        0             0 sshd
      Nov 13 23:07:35 172-23-96-109 kernel: [28935]    74 28935    26727      272      49        0             0 sshd
      Nov 13 23:07:35 172-23-96-109 kernel: [28936]    74 28936    26727      272      50        0             0 sshd
      Nov 13 23:07:35 172-23-96-109 kernel: [28948]     0 28948    36655      318      72        0             0 sshd
      Nov 13 23:07:35 172-23-96-109 kernel: [28950]     0 28950    36655      318      73        0             0 sshd
      Nov 13 23:07:35 172-23-96-109 kernel: [28968]     0 28968    36959      331      74        0             0 sshd
      Nov 13 23:07:35 172-23-96-109 kernel: [28973]    74 28973    26727      272      52        0             0 sshd
      Nov 13 23:07:35 172-23-96-109 kernel: [28974]    74 28974    26727      272      52        0             0 sshd
      Nov 13 23:07:35 172-23-96-109 kernel: [28983]     0 28983    26706      248      54        0             0 sshd
      Nov 13 23:07:35 172-23-96-109 kernel: Out of memory: Kill process 5192 (memcached) score 943 or sacrifice child
      Nov 13 23:07:35 172-23-96-109 kernel: Killed process 5192 (memcached) total-vm:74491236kB, anon-rss:60823756kB, file-rss:0kB, shmem-rss:0kB
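The process table above reports `total_vm`, `rss`, `nr_ptes` and `swapents` in pages, while the final "Killed process" line reports kB. The sketch below cross-checks the two for memcached and approximates the score of 943 printed by the kernel. It assumes a 4 kB page size (the x86_64 default), and the score formula is an assumption about how 3.10-era kernels compute OOM badness for a non-root process with `oom_score_adj` 0, not something stated in this log. All numbers are copied from the log; the helper names are made up for illustration.

```python
PAGE_KB = 4  # assumed page size on x86_64

# memcached row (pid 5192) from the process table, all values in pages.
MEMCACHED = {
    "total_vm": 18622809,
    "rss": 15205939,
    "nr_ptes": 32725,
    "swapents": 994580,
}

# System-wide figures from the same dump.
PAGES_RAM = 16490320
PAGES_RESERVED = 325011
TOTAL_SWAP_KB = 4194300


def pages_to_kb(pages):
    """Convert a page count to kB."""
    return pages * PAGE_KB


def approx_oom_score(proc):
    """Approximate the 0..1000 score shown in the 'Kill process' line.

    Assumed formula: (rss + page-table pages + swap entries) * 1000
    divided by (usable RAM pages + swap pages).
    """
    usable_ram_pages = PAGES_RAM - PAGES_RESERVED
    swap_pages = TOTAL_SWAP_KB // PAGE_KB
    points = proc["rss"] + proc["nr_ptes"] + proc["swapents"]
    return points * 1000 // (usable_ram_pages + swap_pages)


if __name__ == "__main__":
    print("rss kB:     ", pages_to_kb(MEMCACHED["rss"]))
    print("total_vm kB:", pages_to_kb(MEMCACHED["total_vm"]))
    print("approx score:", approx_oom_score(MEMCACHED))
```

With the values above, `rss` converts to 60823756 kB and `total_vm` to 74491236 kB, which match the `anon-rss` and `total-vm` figures in the kill line, and the approximated score comes out at 943. In other words, memcached alone accounted for about 94% of usable RAM and swap combined when the OOM killer fired.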
      

      Attachments

        1. mem_used.png (359 kB)
        2. memcached_rss.png (37 kB)
        3. Screen Shot 2018-01-19 at 16.36.14.png (77 kB)
        4. Screen Shot 2018-01-19 at 16.47.19.png (192 kB)
        5. swap_used.png (36 kB)
          People

            pavelpaulau Pavel Paulau (Inactive)