Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-6618

beam.smp eats up all memory on node beyond the membase quota and causes oom killer

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 2.0
    • 1.8.1
    • ns_server, storage-engine
    • Security Level: Public
    • None

    Description

      I have a cluster of membase boxes running 1.8.1.

      Node specs:
      4x vcpu
      32GB memory
      1.5GB swap

      Membase configuration:

      per-node quota: 10GB (yes, only 10GB)
      less than a dozen couchbase buckets with replica=1 or replica=0
      one couchbase bucket has 4million records, the rest varies from a couple thousand to hundreds of thousands

      Description of issue:

      For one of our buckets that is being used as a cache for our application, we update the keys with new refreshed data every 3 hours. The number of keys being inserted/updated is about 40k in total, replica=1, and total on-disk usage for this bucket is about 16GB (this includes replica). This definitely causes some significant load, as we run the inserts via two parallel threads. System stats show significant IOwait on the box while this operation is happening (takes about 20m), but the 'disk write queue' counter in membase shows in the hundreds and sometimes peaks at single digit thousands, so we were not concerned. Also, since our memory quote per-node is under 50% of available system memory, we considered that adequate for any additional memory that might be needed for queuing.

      About every 10 days, we have had an outage for membase due to oom killer being involved. We were unsure what was the cause, but I did add some sar-type stats to the membase machines and I found that the cause was actually membase, and not something outside of it.

      Apparently beam.smp starts to eat up all available memory, which eventually exhausts all available memory. beam.smp was stabilized at 2% of total memory (~600k) and within a 30m timeframe, eats up all available memory. My stats last reported that beam.smp was eating up 74% of all total memory (~23GB!) before the machine had to be rebooted.

      Here is the output of atopsar demonstrating the memory utilization that led to the node eventually killing itself due to exhaustion of memory. The 3 columns are the top 3 consumers of memory (by percentage) on the machine every 10 minutes. The "logging restarted" line indicates my reboot of the machine at the time it went unresponsive.

      -------------------------- analysis date: 2012/09/12 --------------------------

      06:42:31 pid command mem% | pid command mem% | pid command mem%top3
      06:52:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      07:02:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      07:12:31 970 beam.smp 15% | 1137 memcache 10% | 1136 moxi 0%
      07:22:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      07:32:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      07:42:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      07:52:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      08:02:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      08:12:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      08:22:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      08:32:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      08:42:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      08:52:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      09:02:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      09:12:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      09:22:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      09:32:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      09:42:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      09:52:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      10:02:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      10:12:31 970 beam.smp 45% | 1137 memcache 10% | 1136 moxi 0%
      10:22:31 970 beam.smp 62% | 1137 memcache 10% | 1136 moxi 0%
      10:32:31 970 beam.smp 74% | 1137 memcache 10% | 1136 moxi 0%
      10:48:33 ......................... logging restarted .........................
      10:58:32 1249 memcache 8% | 1107 moxi 1% | 967 beam.smp 1%
      11:08:32 1249 memcache 8% | 1107 moxi 1% | 967 beam.smp 1%
      11:18:32 1249 memcache 8% | 967 beam.smp 1% | 1107 moxi 1%
      11:28:32 1249 memcache 8% | 967 beam.smp 1% | 1107 moxi 1%

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            alkondratenko Aleksey Kondratenko (Inactive)
            keromeister F KUO
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty