Add per-bucket |max_cas - wall_clock| > threshold ep-engine counter
Activity
CB robot October 17, 2016 at 10:52 AM
Build 4.7.0-1233 contains ep-engine commit 3ba9f54be46e6d439608dce69b873dc5f56bf049 with commit message:
: A single total for drift ahead exceptions
https://github.com/couchbase/ep-engine/commit/3ba9f54be46e6d439608dce69b873dc5f56bf049
Jim Walker October 12, 2016 at 3:09 PM
ep_clock_cas_drift_threshold_exceeded is incremented for any CAS received from the future that is above the threshold.
This increments regardless of the bucket configuration.
CB robot October 12, 2016 at 8:00 AM
Build 4.6.0-3368 contains ep-engine commit 3ba9f54be46e6d439608dce69b873dc5f56bf049 with commit message:
: A single total for drift ahead exceptions
https://github.com/couchbase/ep-engine/commit/3ba9f54be46e6d439608dce69b873dc5f56bf049
Dave Finlay October 4, 2016 at 1:07 AM (edited)
Got it, Jim, thanks. Let's keep this ticket open to track this for the time being. In the event that http://review.couchbase.org/#/c/68272 (or one of the other changes in that stack) resolves it, then we can close.
Jim Walker October 3, 2016 at 9:47 AM (edited)
So far all stat work has been under the pre-existing and the first set of stats is designed to give some useful granularity (hence the ahead/behind/active/replica split) for post-mortem analysis.
Notes:
The stats are per bucket (we also have vbucket-details for the full per VB picture).
I've split things into active/replica to aid post-mortem analysis; it's easier to tell whether something is wrong because of DCP vs XDCR (set_w_meta).
I've split ahead/behind into separate counters too, again for the full picture on post-mortem analysis.
In terms of this task, a new counter will be added that is the sum of active/replica. That will give a single counter which should correctly trigger alerts whenever max_cas drifts ahead of the wall clock by more than the threshold, as per this task.
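As a rough sketch of the single total described in the notes above, assuming the four split counters are kept as per-bucket atomics: the struct and member names here are hypothetical and are not taken from ep-engine.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical per-bucket drift counters, split as described in the notes:
// ahead/behind, and active/replica. Names are illustrative only.
struct DriftCounters {
    std::atomic<uint64_t> activeAhead{0};
    std::atomic<uint64_t> replicaAhead{0};
    std::atomic<uint64_t> activeBehind{0};
    std::atomic<uint64_t> replicaBehind{0};
};

// Single total for "drift ahead" exceptions: the sum of the active and
// replica ahead counters. The behind counters are deliberately excluded.
inline uint64_t totalAheadExceptions(const DriftCounters& c) {
    return c.activeAhead.load(std::memory_order_relaxed) +
           c.replicaAhead.load(std::memory_order_relaxed);
}
```

The two loads are not taken atomically as a pair, so the total may lag by an increment under concurrent updates; for an alerting counter that is assumed to be acceptable.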
Details
Assignee: Jim Walker
Reporter: Dave Finlay
Priority: Major

On any vbucket, at the point we update the max_cas, if its value differs greatly from the current wall clock time, this is a clear indication that the wall clock on some other node is skewed from the clock on this node by approximately that amount.
To allow ns_server to alert admins on this (that they need to examine the clocks across the replication topology) I propose that we track the following stat in ep-engine:
ep_clock_cas_drift_threshold_exceeded
: counter that is incremented every time the max_cas is updated and its value is greater than the wall clock by more than a threshold value. Note that since the max_cas is always at least as large as the wall clock when it's updated, the difference won't actually be negative.
I propose the default value for the threshold to be 5 seconds. I'll file a separate ticket to track making this threshold configurable and dynamically changeable.
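A minimal sketch of the proposed check, for illustration only. It assumes max_cas and the wall clock are both expressed as nanosecond timestamps; the type and function names are hypothetical and do not reflect ep-engine's actual code.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical tracker for the proposed ep_clock_cas_drift_threshold_exceeded
// stat. Assumption: CAS values and the wall clock are nanosecond timestamps.
struct CasDriftTracker {
    // Default threshold proposed in the ticket: 5 seconds.
    std::chrono::nanoseconds threshold{std::chrono::seconds(5)};
    std::atomic<uint64_t> thresholdExceeded{0};

    // Intended to be called at the point max_cas is updated.
    void onMaxCasUpdate(uint64_t newMaxCasNs, uint64_t wallClockNs) {
        // Guard against unsigned underflow, even though the ticket notes
        // max_cas should never be behind the wall clock at update time.
        if (newMaxCasNs <= wallClockNs) {
            return;
        }
        const uint64_t drift = newMaxCasNs - wallClockNs;
        if (drift > static_cast<uint64_t>(threshold.count())) {
            thresholdExceeded.fetch_add(1, std::memory_order_relaxed);
        }
    }
};
```

The comparison is strict, so a drift of exactly the threshold does not increment the counter; whether the boundary should count is a design choice the ticket leaves open.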