Couchbase Server / MB-58143

couchbase-server unit file should specify OOM policy of CONTINUE (was: systemd kills `couchbase-server` on any process getting OOM killed)

Details

    • Build Team 2023 Sprint 10

    Description

      In some cases I've seen that when the indexer was OOM killed by the kernel, systemd also killed the whole `couchbase-server` service, so the entire node became unavailable.

      I've been seeing this behaviour in the new setup by the perf team, where they have upgraded to Ubuntu 20.04 (or 22.04) and have enabled cgroup v2 on the system. From my understanding, the installation steps don't specify any cgroups by default, yet the Couchbase services run in their own cgroup. From my limited research, we need to set `OOMPolicy=continue` in the service's systemd unit so that systemd does not kill the cgroup slice. Some versions of systemd have a bug where this setting is not honoured, but the latest versions have it fixed. Recent systemd releases also ship systemd-oomd, which follows the same policy and can kill the cgroup slice or allow it to continue when a process in the unit (such as indexer or memcached) gets OOM killed.

      Opening this ticket to see if we can set `OOMPolicy=continue` in the service's unit file and achieve the desired behaviour, where the entire server slice is not killed. `OOMPolicy` was introduced in systemd v243 and is only valid for slices which are under cgroup v2. Systemd commit - https://github.com/systemd/systemd/commit/afcfaa695cd00b713e7d57e1829da90b692ac6f8
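
      As a rough illustration of the proposed setting, here is a minimal sketch of a systemd drop-in, assuming the unit is named `couchbase-server.service`. The drop-in path and the file name `oom-policy.conf` are hypothetical; the actual fix may instead put the setting directly in the packaged unit file.

      ```ini
      # Hypothetical drop-in: /etc/systemd/system/couchbase-server.service.d/oom-policy.conf
      # Assumes systemd v243 or later and cgroup v2, as noted above.
      [Service]
      # Let the kernel kill only the process it selects; do not stop the unit.
      OOMPolicy=continue
      ```

      After adding such a file, `systemctl daemon-reload` and a restart of the service would be needed for it to take effect, and `systemctl show couchbase-server -p OOMPolicy` should then report the effective value.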

      More info on OOMPolicy and its enum values:

      typedef enum OOMPolicy {
              OOM_CONTINUE,          /* The kernel kills the process it wants to kill, and that's it */
              OOM_STOP,              /* The kernel kills the process it wants to kill, and we stop the unit */
              OOM_KILL,              /* The kernel kills the process it wants to kill, and all others in the unit, and we stop the unit */
              _OOM_POLICY_MAX,
              _OOM_POLICY_INVALID = -1
      } OOMPolicy;
      

      The default value of OOMPolicy (taken from DefaultOOMPolicy) is OOM_STOP.

      Logs from previous MBs -

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-Cloud-Tester-Dev-1331/ec2-18-234-108-17.compute-1.amazonaws.com.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-Cloud-Tester-Dev-1331/ec2-34-207-245-80.compute-1.amazonaws.com.zip
      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-Cloud-Tester-Dev-1331/ec2-54-211-67-109.compute-1.amazonaws.com.zip

          People

            ashwin.govindarajulu Ashwin Govindarajulu
            dhruvil.ketanshah Dhruvil Shah
            Votes: 0
            Watchers: 10

              Time Tracking

                Estimated: Not Specified
                Remaining: 0h
                Logged: 2h