  1. Couchbase Server
  2. MB-19780

ns_couchdb dying (eheap_alloc) and couchdb OOMs on restart

Details

    • Untriaged
    • Unknown

    Description

      Two and a half days into the test, I ran out of disk space because core files were generated each time couchdb was killed by the OOM killer. I'm not sure why the OOM killer is kicking in, because the test has a cleanup phase to prevent a hard out-of-memory condition on the bucket.

      I expected couchdb to come back after being killed this way, but it keeps getting killed after each restart:

      [ns_server:info,2016-05-29T04:36:20.836-07:00,ns_1@172.23.105.61:ns_couchdb_port<0.22870.351>:ns_port_server:log:210]ns_couchdb<0.22870.351>: Apache CouchDB  (LogLevel=info) is starting.
      [ns_server:info,2016-05-29T04:36:21.175-07:00,ns_1@172.23.105.61:ns_couchdb_port<0.22870.351>:ns_port_server:log:210]ns_couchdb<0.22870.351>: Apache CouchDB has started. Time to relax.
      [ns_server:info,2016-05-29T04:38:18.170-07:00,ns_1@172.23.105.61:ns_couchdb_port<0.28610.351>:ns_port_server:log:210]ns_couchdb<0.28610.351>: Apache CouchDB  (LogLevel=info) is starting.
      [ns_server:info,2016-05-29T04:38:18.516-07:00,ns_1@172.23.105.61:ns_couchdb_port<0.28610.351>:ns_port_server:log:210]ns_couchdb<0.28610.351>: Apache CouchDB has started. Time to relax.
      [ns_server:info,2016-05-29T04:40:01.897-07:00,ns_1@172.23.105.61:ns_couchdb_port<0.1110.352>:ns_port_server:log:210]ns_couchdb<0.1110.352>: Apache CouchDB  (LogLevel=info) is starting.
      ns_couchdb<0.1110.352>: Apache CouchDB has started. Time to relax.
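
To quantify the restart loop, the "is starting" lines can be pulled out of the ns_server log and the gaps between them computed. This is only a minimal sketch: the regular expression is fitted to the three lines quoted above, and the function names (`restart_times`, `restart_gaps`) are hypothetical helpers, not part of any Couchbase tooling.

```python
import re
from datetime import datetime

# Fitted to the "is starting" lines quoted in this ticket; other log
# formats or log levels may need a different pattern.
START_RE = re.compile(
    r"\[ns_server:info,(?P<ts>[^,]+),.*Apache CouchDB\s+\(LogLevel=\w+\) is starting\."
)


def restart_times(lines):
    """Return the timestamp of every 'CouchDB is starting' log line."""
    times = []
    for line in lines:
        match = START_RE.search(line)
        if match:
            times.append(datetime.fromisoformat(match.group("ts")))
    return times


def restart_gaps(times):
    """Seconds elapsed between consecutive restarts."""
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


sample = [
    "[ns_server:info,2016-05-29T04:36:20.836-07:00,ns_1@172.23.105.61:ns_couchdb_port<0.22870.351>:ns_port_server:log:210]ns_couchdb<0.22870.351>: Apache CouchDB  (LogLevel=info) is starting.",
    "[ns_server:info,2016-05-29T04:38:18.170-07:00,ns_1@172.23.105.61:ns_couchdb_port<0.28610.351>:ns_port_server:log:210]ns_couchdb<0.28610.351>: Apache CouchDB  (LogLevel=info) is starting.",
    "[ns_server:info,2016-05-29T04:40:01.897-07:00,ns_1@172.23.105.61:ns_couchdb_port<0.1110.352>:ns_port_server:log:210]ns_couchdb<0.1110.352>: Apache CouchDB  (LogLevel=info) is starting.",
]

gaps = restart_gaps(restart_times(sample))
print([round(gap, 3) for gap in gaps])
```

On the three lines above, the gaps come out at a little under two minutes each, so couchdb is not staying up for long after each start.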
      

      Tracing the series of events, ns_server first got a {badrpc,nodedown} error from couchdb:

      [ns_server:error,2016-05-29T04:29:58.801-07:00,ns_1@172.23.105.61:<0.6395.342>:menelaus_web:loop:189]Server error during processing: ["web request failed",
                                       {path,
                                        "/pools/default/buckets/WAREHOUSE/ddocs"},
                                       {method,'GET'},
                                       {type,exit},
                                       {what,{error,{badrpc,nodedown}}},
                                       {trace,
                                        [{ns_couchdb_api,rpc_couchdb_node,4,
                                          [{file,"src/ns_couchdb_api.erl"},
                                           {line,162}]},
                                         {capi_utils,full_live_ddocs,3,
                                          [{file,"src/capi_utils.erl"},{line,172}]},
      ...
      

      I couldn't find a reason in the logs for couchdb being down, other than this error printed just before:

      [couchdb:error,2016-05-29T04:27:34.465-07:00,couchdb_ns_1@127.0.0.1:<0.335.0>:couch_log:error:44]Cleanup process <0.22776.195> for set view `ORDER_LINE`, replica (prod) group `_design/all`, died with reason: stopped
      

      Here couchdb is restarted after an eheap_alloc error:

      [ns_server:info,2016-05-29T04:35:47.762-07:00,ns_1@172.23.105.61:ns_couchdb_port<0.8218.32>:ns_port_server:log:210]ns_couchdb<0.8218.32>:
      ns_couchdb<0.8218.32>: Crash dump was written to: erl_crash.dump.1464283581.8932.ns_couchdb
      ns_couchdb<0.8218.32>: eheap_alloc: Cannot allocate 8162366936 bytes of memory (of type "old_heap").
       
      [ns_server:error,2016-05-29T04:36:18.098-07:00,ns_1@172.23.105.61:wait_link_to_couchdb_node<0.22631.351>:ns_server_nodes_sup:do_wait_link_to_couchdb_node:163]ns_couchdb_port(<0.8218.32>) died with reason {abnormal,134}
      [ns_server:info,2016-05-29T04:36:20.836-07:00,ns_1@172.23.105.61:ns_couchdb_port<0.22870.351>:ns_port_server:log:210]ns_couchdb<0.22870.351>: Apache CouchDB  (LogLevel=info) is starting.
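
For scale, the size in the eheap_alloc message can be converted to GiB. The sketch below parses the message exactly as it is quoted above; `parse_eheap_alloc` is a hypothetical helper name, and the pattern is only known to fit this one line.

```python
import re

# Pattern fitted to the eheap_alloc line quoted in this ticket.
EHEAP_RE = re.compile(
    r'eheap_alloc: Cannot allocate (?P<bytes>\d+) bytes of memory '
    r'\(of type "(?P<type>[^"]+)"\)'
)


def parse_eheap_alloc(line):
    """Return the allocation type and size, or None if the line does not match."""
    match = EHEAP_RE.search(line)
    if match is None:
        return None
    size = int(match.group("bytes"))
    return {"type": match.group("type"), "bytes": size, "gib": size / 2**30}


line = ('ns_couchdb<0.8218.32>: eheap_alloc: Cannot allocate 8162366936 '
        'bytes of memory (of type "old_heap").')
info = parse_eheap_alloc(line)
print(f'{info["type"]}: {info["gib"]:.2f} GiB')
```

The single failed request is roughly 7.6 GiB, which suggests one very large heap allocation inside the VM rather than many small ones adding up.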
      

      This happens a few times, and then couchdb is killed by the OOM killer (core attached):

      ...
      May 29 04:43:30 kvm-s63705 kernel: [5679446.330644] [ 1585]  1000  1585  7632490  5334196   10554        0             0 beam.smp
      May 29 04:43:30 kvm-s63705 kernel: [5679446.330646] [ 1622]  1000  1622     1462      147       8        0             0 goport
      May 29 04:43:30 kvm-s63705 kernel: [5679446.330647] [ 1627]  1000  1627   109769    21401      69        0             0 goxdcr
      May 29 04:43:30 kvm-s63705 kernel: [5679446.330649] [ 1636]  1000  1636     1113      175       7        0             0 sh
      May 29 04:43:30 kvm-s63705 kernel: [5679446.330650] [ 1638]  1000  1638     1084      351       8        0             0 memsup
      May 29 04:43:30 kvm-s63705 kernel: [5679446.330652] [ 1639]  1000  1639     1084      183       8        0             0 cpu_sup
      May 29 04:43:30 kvm-s63705 kernel: [5679446.330653] [ 1645]  1000  1645     2516     1210      10        0             0 godu
      May 29 04:43:30 kvm-s63705 kernel: [5679446.330654] [ 1646]  1000  1646     1112      166       7        0             0 sh
      May 29 04:43:30 kvm-s63705 kernel: [5679446.330656] [ 1647]  1000  1647     1330      105       8        0             0 godu
      May 29 04:43:30 kvm-s63705 kernel: [5679446.330657] [ 1659]  1000  1659    43359     2573      37        0             0 moxi
      May 29 04:43:30 kvm-s63705 kernel: [5679446.330659] [ 1660]  1000  1660     2155      381      10        0             0 sigar_port
      May 29 04:43:30 kvm-s63705 kernel: [5679446.330660] [ 1661]  1000  1661     1867      220       9        0             0 inet_gethost
      May 29 04:43:30 kvm-s63705 kernel: [5679446.330662] [ 1662]  1000  1662     2391      393      10        0             0 inet_gethost
      May 29 04:43:30 kvm-s63705 kernel: [5679446.330663] Out of memory: Kill process 1585 (beam.smp) score 692 or sacrifice child
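
The task dump above can be turned into per-process memory figures. This is a sketch under two assumptions that the excerpt does not confirm: the header row is elided here, so the column order (pid, uid, tgid, total_vm, rss, nr_ptes, swapents, oom_score_adj, name) is taken from the usual kernel OOM dump layout of that era, and total_vm and rss are assumed to be counts of 4 KiB pages. `Task`, `parse_oom_rows` and `rss_gib` are hypothetical names.

```python
import re
from collections import namedtuple

PAGE_SIZE = 4096  # assumption: 4 KiB pages on this host

# Assumed column order; the header row is not part of the excerpt.
Task = namedtuple(
    "Task", "pid uid tgid total_vm rss nr_ptes swapents oom_score_adj name"
)

# "[ pid ]" followed by seven integer columns and the process name.
ROW_RE = re.compile(
    r"\[\s*(\d+)\]\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(-?\d+)\s+(\S+)\s*$"
)


def parse_oom_rows(lines):
    """Parse per-task rows of a kernel OOM dump, skipping everything else."""
    tasks = []
    for line in lines:
        match = ROW_RE.search(line)
        if match:
            *numbers, name = match.groups()
            tasks.append(Task(*map(int, numbers), name))
    return tasks


def rss_gib(task):
    """Resident set size in GiB, given the page-size assumption above."""
    return task.rss * PAGE_SIZE / 2**30


sample = [
    "May 29 04:43:30 kvm-s63705 kernel: [5679446.330644] [ 1585]  1000  1585  7632490  5334196   10554        0             0 beam.smp",
    "May 29 04:43:30 kvm-s63705 kernel: [5679446.330647] [ 1627]  1000  1627   109769    21401      69        0             0 goxdcr",
    "May 29 04:43:30 kvm-s63705 kernel: [5679446.330663] Out of memory: Kill process 1585 (beam.smp) score 692 or sacrifice child",
]

tasks = parse_oom_rows(sample)
largest = max(tasks, key=lambda task: task.rss)
print(f"{largest.name} (pid {largest.pid}): {rss_gib(largest):.1f} GiB resident")
```

If those assumptions hold, beam.smp (pid 1585) was holding about 20 GiB resident when it was killed, far more than any other process in the dump.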
      

      At this time the test was rebalancing in 2 nodes:

      ok 261 - [2016-05-29T04:19:06-07:00, 1419c28:57f2a1] server-add -c 172.23.106.14 --server-add 172.23.105.83 -u Administrator -p password --server-add-username Administrator --server-add-password password
      ok 262 - [2016-05-29T04:19:12-07:00, 1419c28:22ba5c] server-add -c 172.23.106.14 --server-add 172.23.105.63 -u Administrator -p password --server-add-username Administrator --server-add-password password
      *not ok* 263 - [2016-05-29T04:29:31-07:00, 1419c28:c4557f] rebalance -c 172.23.106.14 -u Administrator -p password
      

      With cores enabled, I ran out of disk space from beam creating 15GB cores.


            People

              vmx Volker Mische
              tommie Tommie McAfee (Inactive)