Details
-
Improvement
-
Resolution: Fixed
-
Major
-
7.6.0, 7.2.1, 7.1.5
-
0
-
Build Team 2023 Sprint 10
Description
In some cases I've seen that when the indexer was OOM-killed by the kernel, systemd also killed the `couchbase-server` service, so the entire node became unavailable.
I've been seeing this behaviour in the perf team's new setup, where they have upgraded to Ubuntu 20.04 (or 22.04) and enabled cgroup v2 on the system. From my understanding, the installation steps don't specify any cgroups by default, yet the Couchbase services run in their own cgroup by default. From my limited research, we need to set `OOMPolicy=continue` in the systemd unit for the service so that systemd does not kill the cgroup slice. Some versions of systemd have a bug where this setting is not honoured, but the latest versions have it fixed. Newer systemd releases also include systemd-oomd, which follows this policy as well and can kill the cgroup slice or allow it to continue when a process in the unit (such as indexer or memcached) gets OOM-killed.
Opening this ticket to see if we can set `OOMPolicy=continue` in the service's config file and achieve the desired behaviour, where the entire server slice is not killed. `OOMPolicy` was introduced in systemd v243 and is only valid for slices under cgroup v2. Systemd commit: https://github.com/systemd/systemd/commit/afcfaa695cd00b713e7d57e1829da90b692ac6f8
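As a rough sketch of what this could look like, the setting could be applied through a systemd drop-in for the `couchbase-server` unit. The drop-in path and file name below are assumptions for illustration only; the actual fix may instead edit the unit file shipped by the installer.

```ini
# Hypothetical drop-in: /etc/systemd/system/couchbase-server.service.d/oom-policy.conf
# (path and file name are assumptions, not taken from the Couchbase packaging)
[Service]
# Let the kernel kill only the process it selects; do not stop the whole unit.
OOMPolicy=continue
```

After a `systemctl daemon-reload` and a service restart, `systemctl show couchbase-server -p OOMPolicy` should report the effective value, assuming the installed systemd is v243 or later.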
More info on `OOMPolicy` and its enum values:
typedef enum OOMPolicy {
        OOM_CONTINUE, /* The kernel kills the process it wants to kill, and that's it */
        OOM_STOP, /* The kernel kills the process it wants to kill, and we stop the unit */
        OOM_KILL, /* The kernel kills the process it wants to kill, and all others in the unit, and we stop the unit */
        _OOM_POLICY_MAX,
        _OOM_POLICY_INVALID = -1
} OOMPolicy;
The default value of both `OOMPolicy` and `DefaultOOMPolicy` is `OOM_STOP`.
Logs from previous MBs:
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-Cloud-Tester-Dev-1331/ec2-18-234-108-17.compute-1.amazonaws.com.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-Cloud-Tester-Dev-1331/ec2-34-207-245-80.compute-1.amazonaws.com.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-Cloud-Tester-Dev-1331/ec2-54-211-67-109.compute-1.amazonaws.com.zip