Details
Bug
Resolution: Unresolved
Major
7.1.3
Untriaged
0
Yes
Description
While measuring SyncWrite latency on modestly sized nodes (EC2 r5.2xlarge, 8 CPU cores), we observed periodic jumps in the worst-case (p100) SyncWrite latency every 10 minutes.
Looking at the tasks which run every 10 minutes, there is a very direct correlation with when the ExpiryPager is scheduled to run (for the 7 buckets on this cluster); i.e. when the ExpiryPager starts to run for a bucket, the maximum SyncWrite latency suffers.
This appears to be due to contention on the NonIO thread pool: on an 8-core system we create 2 NonIO threads, and the ExpiryPager runs 2 tasks per bucket, so those tasks can occupy both threads at the same time.
Indeed, the latency increase is (almost) entirely eliminated if the number of NonIO threads is increased from 2 to 3, so that there is still a "spare" NonIO thread while the ExpiryPager tasks are running. Note that the thread count was changed at the dotted blue line.
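To illustrate the suspected contention, here is a minimal sketch of a non-preemptive FIFO thread pool. It is a toy model, not the actual executor code: the function name `short_task_wait`, the 50 ms task duration, and the assumption that a latency-sensitive SyncWrite step queues behind the two ExpiryPager tasks on the same pool are all hypothetical, chosen only to show why a third thread would remove the wait.

```python
import heapq


def short_task_wait(num_threads, long_tasks, long_duration_ms):
    """Return how long (ms) a short task waits for a free thread when it is
    queued behind `long_tasks` long-running tasks in a FIFO pool.

    Toy model: tasks are never preempted and all arrive at t=0.
    """
    # Each heap entry is the time at which one pool thread becomes free.
    free_at = [0.0] * num_threads
    heapq.heapify(free_at)

    # The long tasks (e.g. the 2 ExpiryPager tasks for one bucket) are
    # dispatched first, each to the thread that frees up earliest.
    for _ in range(long_tasks):
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + long_duration_ms)

    # The short task starts as soon as the earliest thread is free.
    return free_at[0]


# Hypothetical numbers: 2 long tasks of 50 ms each, pool of 2 vs 3 threads.
for threads in (2, 3):
    wait = short_task_wait(threads, long_tasks=2, long_duration_ms=50.0)
    print(f"{threads} threads: short task waits {wait} ms")
```

In this model the short task waits for the full duration of a long task when the pool has 2 threads, and does not wait at all with 3 threads, which matches the shape (not the magnitude) of the behaviour described above.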