Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-6618

beam.smp eats up all memory on node beyond the membase quota and causes oom killer

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 1.8.1
    • Fix Version/s: 2.0
    • Component/s: ns_server, storage-engine
    • Security Level: Public
    • Labels:
      None

      Description

      I have a cluster of membase boxes running 1.8.1.

      Node specs:
      4x vcpu
      32GB memory
      1.5GB swap

      Membase configuration:

      per-node quota: 10GB (yes, only 10GB)
      less than a dozen couchbase buckets with replica=1 or replica=0
      one couchbase bucket has 4million records, the rest varies from a couple thousand to hundreds of thousands

      Description of issue:

      For one of our buckets that is being used as a cache for our application, we update the keys with new refreshed data every 3 hours. The number of keys being inserted/updated is about 40k in total, replica=1, and total on-disk usage for this bucket is about 16GB (this includes replica). This definitely causes some significant load, as we run the inserts via two parallel threads. System stats show significant IOwait on the box while this operation is happening (takes about 20m), but the 'disk write queue' counter in membase shows in the hundreds and sometimes peaks at single digit thousands, so we were not concerned. Also, since our memory quote per-node is under 50% of available system memory, we considered that adequate for any additional memory that might be needed for queuing.

      About every 10 days, we have had an outage for membase due to oom killer being involved. We were unsure what was the cause, but I did add some sar-type stats to the membase machines and I found that the cause was actually membase, and not something outside of it.

      Apparently beam.smp starts to eat up all available memory, which eventually exhausts all available memory. beam.smp was stabilized at 2% of total memory (~600k) and within a 30m timeframe, eats up all available memory. My stats last reported that beam.smp was eating up 74% of all total memory (~23GB!) before the machine had to be rebooted.

      Here is the output of atopsar demonstrating the memory utilization that led to the node eventually killing itself due to exhaustion of memory. The 3 columns are the top 3 consumers of memory (by percentage) on the machine every 10 minutes. The "logging restarted" line indicates my reboot of the machine at the time it went unresponsive.

      -------------------------- analysis date: 2012/09/12 --------------------------

      06:42:31 pid command mem% | pid command mem% | pid command mem%top3
      06:52:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      07:02:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      07:12:31 970 beam.smp 15% | 1137 memcache 10% | 1136 moxi 0%
      07:22:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      07:32:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      07:42:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      07:52:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      08:02:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      08:12:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      08:22:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      08:32:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      08:42:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      08:52:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      09:02:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      09:12:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      09:22:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      09:32:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      09:42:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      09:52:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      10:02:31 1137 memcache 10% | 970 beam.smp 2% | 1136 moxi 0%
      10:12:31 970 beam.smp 45% | 1137 memcache 10% | 1136 moxi 0%
      10:22:31 970 beam.smp 62% | 1137 memcache 10% | 1136 moxi 0%
      10:32:31 970 beam.smp 74% | 1137 memcache 10% | 1136 moxi 0%
      10:48:33 ......................... logging restarted .........................
      10:58:32 1249 memcache 8% | 1107 moxi 1% | 967 beam.smp 1%
      11:08:32 1249 memcache 8% | 1107 moxi 1% | 967 beam.smp 1%
      11:18:32 1249 memcache 8% | 967 beam.smp 1% | 1107 moxi 1%
      11:28:32 1249 memcache 8% | 967 beam.smp 1% | 1107 moxi 1%

      1. atopsar.crashlog
        66 kB
        F KUO
      2. atopsar.crashlog-091412
        138 kB
        F KUO
      3. ns-diag-20120912110828.txt.gz
        19.04 MB
        F KUO
      4. ns-diag-20120912215027-node-215.txt.gz
        17.99 MB
        F KUO
      5. ns-diag-20120914174013-node10.txt.gz
        19.04 MB
        F KUO
      No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

        Hide
        keromeister F KUO added a comment -

        Hi again,

        This just happened again. This time, beam.smp ate up memory much faster and I definitely didn't have enough time to catch it.

        beam.smp went from 2% of memory usage (600k) to 37% of memory usage (11.8+GB) in just 10 minutes between stats gathering.

        I will attach the diagnostic log from the affected node.

        -------------------------- analysis date: 2012/09/14 --------------------------

        06:52:21 pid command mem% | pid command mem% | pid command mem%top3
        07:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        07:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        07:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        07:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        07:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        07:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        08:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        08:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        08:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        08:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        08:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        08:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        09:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        09:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        09:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        09:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        09:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        09:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        10:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        10:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        10:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        10:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        10:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        10:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        11:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        11:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        11:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        11:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        11:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        11:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        12:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        12:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        12:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        12:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        12:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        12:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        13:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        13:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        13:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        13:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        13:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        13:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        14:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        14:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        14:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        14:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        14:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        14:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        15:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        15:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        15:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        15:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        15:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        15:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        16:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        16:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        16:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        16:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        16:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        16:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        17:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        17:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2%
        17:22:21 932 beam.smp 37% | 29819 memcache 10% | 29818 moxi 9%
        17:39:17 ......................... logging restarted .........................
        17:49:17 1425 memcache 6% | 1189 moxi 1% | 990 beam.smp 1%

        Show
        keromeister F KUO added a comment - Hi again, This just happened again. This time, beam.smp ate up memory much faster and I definitely didn't have enough time to catch it. beam.smp went from 2% of memory usage (600k) to 37% of memory usage (11.8+GB) in just 10 minutes between stats gathering. I will attach the diagnostic log from the affected node. -------------------------- analysis date: 2012/09/14 -------------------------- 06:52:21 pid command mem% | pid command mem% | pid command mem% top3 07:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 07:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 07:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 07:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 07:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 07:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 08:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 08:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 08:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 08:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 08:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 08:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 09:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 09:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 09:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 09:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 09:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 09:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 10:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 10:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 10:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 10:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 10:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 10:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 11:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 11:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 11:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 11:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 11:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 11:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 12:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 12:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 12:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 12:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 12:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 12:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 13:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 13:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 13:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 13:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 13:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 13:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 14:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 14:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 14:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 14:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 14:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 14:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 15:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 15:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 15:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 15:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 15:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 15:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 16:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 16:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 16:22:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 16:32:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 16:42:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 16:52:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 17:02:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 17:12:21 29819 memcache 11% | 29818 moxi 9% | 932 beam.smp 2% 17:22:21 932 beam.smp 37% | 29819 memcache 10% | 29818 moxi 9% 17:39:17 ......................... logging restarted ......................... 17:49:17 1425 memcache 6% | 1189 moxi 1% | 990 beam.smp 1%
        Hide
        keromeister F KUO added a comment -

        diagnostic log from the crashed node attached. Unfortunately it was taken after it was forcibly rebooted; not sure how much insight will be available....

        Show
        keromeister F KUO added a comment - diagnostic log from the crashed node attached. Unfortunately it was taken after it was forcibly rebooted; not sure how much insight will be available....
        Hide
        keromeister F KUO added a comment -

        attaching output of 'atopsar -AM' on affected node to show system health leading up to incident.

        Show
        keromeister F KUO added a comment - attaching output of 'atopsar -AM' on affected node to show system health leading up to incident.
        Hide
        keromeister F KUO added a comment -

        Also, there was quite an IOwait spike and a large spike in disk write queue on couchbase. I cannot explain it.

        Show
        keromeister F KUO added a comment - Also, there was quite an IOwait spike and a large spike in disk write queue on couchbase. I cannot explain it.
        Hide
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Some progress was made for 2.0.0 and we plan further work under MB-8211

        Show
        alkondratenko Aleksey Kondratenko (Inactive) added a comment - Some progress was made for 2.0.0 and we plan further work under MB-8211

          People

          • Assignee:
            alkondratenko Aleksey Kondratenko (Inactive)
            Reporter:
            keromeister F KUO
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes