Couchbase Server / MB-7824

Entire cluster goes down because the system couldn't provision memory for Erlang: "Out of memory: Kill process 6783 (beam.smp)"

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.1
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Labels:
      None
    • Environment:
      2.0.1-160-rel
      2 x (4-core, 30G, SSD) nodes + 2 x (8-core, 30G, SSD) nodes
      CentOS
      4 buckets

      Description

      • Continuous load running on all 4 buckets.
      • Data compaction running on one of the buckets, 'MsgsCalls'.
      • Ran a curl POST to purge deleted-key information on all 4 buckets (one request per bucket; see the loop sketch after the process listing below):

      "curl -i -u Administrator:password -X POST http://<IP>:<PORT>/pools/default/buckets/<BUCKET_NAME>/controller/unsafePurgeBucket"

      • Noticed 2 data compactions then running on 'MsgsCalls', and 1 each for the rest of the buckets.
      • Soon afterwards, the entire cluster went down with an OOM kill.
      • High Erlang (beam.smp) memory usage on all nodes, per top:

      10.6.2.68
      20641 couchbas 20 0 21.9g 20g 39m S 99.9 65.6 1556:01 beam.smp

      10.6.2.66:
      21770 couchbas 20 0 15.0g 13g 39m S 100.2 46.2 1618:29 beam.smp

      10.6.2.69:
      6783 couchbas 20 0 25.2g 23g 39m S 146.0 77.2 1361:58 beam.smp

      10.6.2.89:
      28802 couchbas 20 0 23.3g 21g 39m S 99.8 73.5 1579:18 beam.smp
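
      For reference, the purge above was issued once per bucket; a minimal loop sketch, assuming the same <IP>:<PORT> placeholders as the quoted command (bucket names other than 'MsgsCalls' are hypothetical):

      # Hypothetical bucket names; substitute the actual four buckets.
      for BUCKET in MsgsCalls bucket2 bucket3 bucket4; do
          curl -i -u Administrator:password -X POST \
              "http://<IP>:<PORT>/pools/default/buckets/$BUCKET/controller/unsafePurgeBucket"
      done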

      From /var/log/messages on 10.6.2.69:

      Feb 25 15:58:38 pine-11804 kernel: Out of memory: Kill process 6783 (beam.smp) score 716 or sacrifice child
      Feb 25 15:58:38 pine-11804 kernel: Killed process 6819, UID 498, (memcached) total-vm:9476628kB, anon-rss:7074864kB, file-rss:988kB

      cbcollect_info from all nodes:
      https://s3.amazonaws.com/bugdb/MB~/10_6_2_66.zip
      https://s3.amazonaws.com/bugdb/MB~/10_6_2_68.zip
      https://s3.amazonaws.com/bugdb/MB~/10_6_2_69.zip
      https://s3.amazonaws.com/bugdb/MB~/10_6_2_89.zip
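
      The archives above come from the cbcollect_info tool shipped with the server; a sketch of how each one was generated on its node, assuming the default install path:

      /opt/couchbase/bin/cbcollect_info 10_6_2_69.zip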

      Is the fact that 2 data compactors were running over the same bucket the reason why Erlang memory usage shot up?
      As per the UI, a total of 5 data compactions were running before the cluster went down.

      Swap per node: 2064376k total
      vm.swappiness = 0
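
      These figures can be reconfirmed on each node with standard tools; a sketch (nothing Couchbase-specific assumed):

      free -k                        # swap total/used, in kB
      sysctl vm.swappiness           # expected to report vm.swappiness = 0 here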

      • Rebooted the nodes:
        All nodes came back up.
        However, beam.smp still shows high memory consumption (up to 16 GB), even with the cluster idle right now.
        Is it the garbage collector not kicking in that's causing the memory not to be freed? (See the forced-GC sketch below.)
      • Couldn't find any message in the logs indicating that compaction was killed.
      • Live cluster (after the reboot) available at: http://10.6.2.68:8091
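
      To test the garbage-collector theory, a full GC can be forced on every process in the ns_server Erlang VM. A hedged sketch, assuming the undocumented /diag/eval endpoint (which evaluates an Erlang expression inside the VM of the node it is posted to):

      # Assumption: /diag/eval is reachable with admin credentials on port 8091.
      curl -u Administrator:password -X POST http://<IP>:8091/diag/eval \
          -d '[erlang:garbage_collect(P) || P <- erlang:processes()]'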
      Attachments

      • Screen Shot 2013-02-25 at 4.53.53 PM.png (131 kB)

        Activity

        thuan Thuan Nguyen added a comment - edited

        FYI, this bug may be the same as MB-7748 and MB-7799.
        abhinav Abhinav Dangeti added a comment -

        Alk, I've attached the erl_crash_dump file you wanted.
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        It appears all the cbcollect_info archives have an empty diag.log.

        May I ask you to grab diag manually and wait long enough?
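
        For reference, a diag can also be grabbed manually from the REST API; a sketch, assuming the /diag endpoint on port 8091 (the dump can take a while to stream, hence the request to wait):

        curl -u Administrator:password http://<IP>:8091/diag -o <IP>-8091-diag.txt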
        abhinav Abhinav Dangeti added a comment -

        Sure, here are the diags from 10.6.2.68 and 10.6.2.89:
        https://s3.amazonaws.com/bugdb/MB~/10.6.2.68-8091-diag.txt.gz
        https://s3.amazonaws.com/bugdb/MB~/10.6.2.89-8091-diag.txt.gz
        abhinav Abhinav Dangeti added a comment - edited

        Diags from 10.6.2.68, after triggering Erlang's garbage collection:
        https://s3.amazonaws.com/bugdb/MB~/10.6.2.68-8091-diag_post_gc.txt.gz

        Move to MB-7828.
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Tracked this down as a duplicate of MB-7828.

        We were logging a huge binary that couch_file received, and that caused our logging processes to eat tons of RAM.

          People

          • Assignee:
            alkondratenko Aleksey Kondratenko (Inactive)
          • Reporter:
            abhinav Abhinav Dangeti
          • Votes: 0
          • Watchers: 5

            Dates

            • Created:
            • Updated:
            • Resolved:

              Gerrit Reviews

              There are no open Gerrit changes