Issue observed only from build 2440 onwards
The following issue can be observed clearly in eventing's CI tests starting from build 2440 and is reproducible every time. For comparison, no KV failures are observed for the same tests on build 2434.
Steps to reproduce:
- Setup a single node cluster with services: kv, eventing
- Create 3 buckets: default (will be used as source bucket for the eventing function), eventing (used to store metadata), hello-world (used as destination bucket binding)
- Set the memory quota of the destination bucket to 500 MB.
- Create the following function. For each mutation on the src bucket, this function upserts six 15 MB docs via bucket ops and via N1QL to the destination bucket (hello-world).
- Deploy the function and create 10-20 documents on the src bucket, which in turn should upsert 50 docs of 15 MB each to the bucket "hello-world" (destination bucket).
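As a rough sanity check on the load these steps generate, the sketch below estimates the data volume written to hello-world against the 500 MB quota. It assumes every upsert writes a distinct key and ignores per-item metadata overhead; the function name and constants are illustrative only and are not taken from the eventing test code.

```python
# Back-of-the-envelope estimate of data upserted to the destination bucket.
# Assumption: each upsert targets a distinct key (no overwrites).

DOC_SIZE_MB = 15          # size of each upserted doc, from the steps above
DOCS_PER_MUTATION = 6     # docs upserted per source mutation, from the steps above
BUCKET_QUOTA_MB = 500     # memory quota of the destination bucket
REPORTED_DOCS = 50        # total doc count quoted in the last step


def upserted_mb(source_docs, docs_per_mutation=DOCS_PER_MUTATION,
                doc_size_mb=DOC_SIZE_MB):
    """Return the total MB written for a given number of source documents."""
    return source_docs * docs_per_mutation * doc_size_mb


if __name__ == "__main__":
    for n in (10, 20):
        total = upserted_mb(n)
        print(f"{n} source docs: {total} MB written, quota {BUCKET_QUOTA_MB} MB")
    reported_total = REPORTED_DOCS * DOC_SIZE_MB
    print(f"{REPORTED_DOCS} docs as quoted: {reported_total} MB written")
```

Under any of these readings the written volume exceeds the bucket quota. That alone should not be an error, since KV can normally recover memory under pressure, but it does mean the test depends on memory recovery keeping up with the upserts.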
Observation based on eventing CI tests:
- Everything works without issues on build 2434.
- libcouchbase (the client used by eventing for upserts to the destination bucket) reports a large number of tmpfail / tmp oom errors starting from build 2440. We have tested up to build 2470, and the same errors are observed there.
Attached is the cbstats output from one of the KV nodes for the destination bucket: hello-world_cbstats_n1.log, in which KV reports a high number of tmp_oom errors.
Changelog: shows no changes in eventing, only a few unrelated changes in ns_server and 2 patches in kv:
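To pull the relevant counters out of a cbstats dump quickly, something like the following sketch may help. It assumes the log uses the usual `name: value` layout of `cbstats all` output; the sample text and its values are made up for illustration and are not taken from the attached log.

```python
import re

# Matches lines such as " ep_tmp_oom_errors:        1234".
STAT_LINE = re.compile(r"^\s*([\w:.-]+):\s+(.+?)\s*$")

# Hypothetical sample, NOT the contents of the attached log.
SAMPLE = """
 ep_tmp_oom_errors:                       1234
 ep_oom_errors:                           0
 ep_mem_high_wat:                         445644800
"""


def parse_cbstats(text):
    """Parse 'name: value' lines into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        match = STAT_LINE.match(line)
        if match:
            stats[match.group(1)] = match.group(2)
    return stats


def oom_counters(stats):
    """Return the integer-valued stats whose name mentions 'oom'."""
    return {name: int(value) for name, value in stats.items()
            if "oom" in name and value.isdigit()}


if __name__ == "__main__":
    for name, value in sorted(oom_counters(parse_cbstats(SAMPLE)).items()):
        print(name, value)
```

To run it against the attachment, pass the file contents instead of `SAMPLE`, for example `parse_cbstats(open("hello-world_cbstats_n1.log").read())`. Comparing the counters between a 2434 run and a 2440 run would show whether the increase is confined to tmp_oom.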
|For Gerrit Dashboard: MB-51408|
|172215,5||MB-51408: Don't miss closing the open checkpoint at memory recovery||neo||kv_engine||Status: MERGED||+2||+1|
|172499,1||Merge branch 'neo' into 'master'||master||kv_engine||Status: MERGED||+2||+1|
|174252,5||MB-50984: Ensure CheckpointMemRecoveryTask attempts checkpoint creation||master||kv_engine||Status: MERGED||+2||+1|