Details

Type: Improvement
Resolution: Fixed
Priority: Major
Fix Version/s: 7.6.0
Affects Version/s: 7.6.0, 7.2.1, 7.1.5
Component/s: build, installer, ns_server
Labels:
- approved-for-trinity
- pm-internal

Story Points:
0
Sprint:
Build Team 2023 Sprint 10

Description

In some cases I've seen that when the indexer was OOM-killed by the kernel, systemd also killed the `couchbase-server` service, so the entire node became unavailable.

I've been seeing this behaviour in the perf team's new setup, where they have upgraded to Ubuntu 20.04 (or 22.04) and enabled cgroup v2 on the system. From my understanding, the installation steps don't specify any cgroups by default, yet the Couchbase services run in their own cgroup by default. From my limited research, we need to configure the systemd unit setting `OOMPolicy=continue` for the service so that systemd does not kill the cgroup slice. Some versions of systemd have a bug where this setting is not honoured, but the latest versions have it fixed. Recent distribution releases also ship systemd-oomd, which follows the same policy and can kill the cgroup slice or allow it to continue when a process in the unit (such as indexer or memcached) gets OOM-killed.

Opening this ticket to see if we can set `OOMPolicy=continue` in the service's unit file and achieve the desired behaviour, where the entire server slice is not killed. `OOMPolicy` was introduced in systemd v243 and is only valid for slices under cgroup v2. Systemd commit: https://github.com/systemd/systemd/commit/afcfaa695cd00b713e7d57e1829da90b692ac6f8
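
For illustration, here is a minimal sketch of how the setting could be applied on an existing node using a systemd drop-in, without editing the packaged unit file. The drop-in file name `oom-policy.conf` is hypothetical, and the unit name `couchbase-server.service` is assumed from the service name mentioned above. This is not necessarily how the packaged unit file sets it.

```ini
# Hypothetical drop-in: /etc/systemd/system/couchbase-server.service.d/oom-policy.conf
# Assumes the unit is named couchbase-server.service.
[Service]
# If the kernel OOM-kills one process in the unit, leave the rest of the unit running.
OOMPolicy=continue
```

After adding a drop-in like this, `systemctl daemon-reload` makes systemd re-read the unit, and `systemctl show -p OOMPolicy couchbase-server` should report the effective value. The setting only takes effect on systemd v243 or later.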

More info on OOMPolicy and its enum values:

typedef enum OOMPolicy {
        OOM_CONTINUE,          /* The kernel kills the process it wants to kill, and that's it */
        OOM_STOP,              /* The kernel kills the process it wants to kill, and we stop the unit */
        OOM_KILL,              /* The kernel kills the process it wants to kill, and all others in the unit, and we stop the unit */
        _OOM_POLICY_MAX,
        _OOM_POLICY_INVALID = -1
} OOMPolicy;

The default value of OOMPolicy (taken from DefaultOOMPolicy) is OOM_STOP.

Logs from previous MBs:

https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-Cloud-Tester-Dev-1331/ec2-18-234-108-17.compute-1.amazonaws.com.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-Cloud-Tester-Dev-1331/ec2-34-207-245-80.compute-1.amazonaws.com.zip
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-Cloud-Tester-Dev-1331/ec2-54-211-67-109.compute-1.amazonaws.com.zip

Attachments

oom_restart_172_16_12_104.zip
18.40 MB
09/Aug/23 9:42 AM

People

Assignee: Ashwin Govindarajulu

Reporter: Dhruvil Shah

Votes: 0

Watchers: 10

Dates

Created: 04/Aug/23 6:22 AM

Updated: 24/Nov/23 12:06 AM

Resolved: 09/Aug/23 9:45 AM

couchbase-server unit file should specify OOM policy of CONTINUE (was: systemd kills `couchbase-server` on any process getting OOM killed)
