Add per-bucket |max_cas - wall_clock| > threshold ep-engine counter

Description

On any vbucket, at the point where we update max_cas, a new value that differs greatly from the current wall clock time is a clear indication that the wall clock on some other node is skewed from the clock on this node by approximately that amount.

To allow ns_server to alert admins to this (so that they know to examine the clocks across the replication topology), I propose that we track the following stat in ep-engine:

ep_clock_cas_drift_threshold_exceeded: a counter that is incremented every time max_cas is updated and its value is greater than the wall clock by more than a threshold value. Note that since max_cas is always at least as large as the wall clock when it's updated, the difference won't actually be negative.

I propose a default threshold of 5 seconds. I'll file a separate ticket to track making this threshold configurable and dynamically changeable.
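The proposed check can be sketched roughly as follows. This is a minimal illustration, not ep-engine's actual code: the `DriftTracker` type, its member names, and the assumption that both max_cas and the wall clock are expressed as nanoseconds since the epoch are all hypothetical.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical per-bucket tracker; the names here are illustrative only.
struct DriftTracker {
    // Proposed default threshold of 5 seconds, held in nanoseconds.
    std::chrono::nanoseconds threshold{std::chrono::seconds(5)};

    // Backs the proposed ep_clock_cas_drift_threshold_exceeded stat.
    std::atomic<uint64_t> thresholdExceeded{0};

    // Intended to be called at the point max_cas is updated.
    // Assumption: both values are nanoseconds since the epoch.
    // Returns true if the counter was incremented.
    bool onMaxCasUpdate(uint64_t newMaxCas, uint64_t wallClockNs) {
        // max_cas should never be behind the wall clock here, but guard
        // anyway so the unsigned subtraction below cannot wrap around.
        if (newMaxCas <= wallClockNs) {
            return false;
        }
        const uint64_t ahead = newMaxCas - wallClockNs;
        if (ahead > static_cast<uint64_t>(threshold.count())) {
            thresholdExceeded.fetch_add(1, std::memory_order_relaxed);
            return true;
        }
        return false;
    }
};
```

A relaxed atomic increment is assumed to be sufficient because the value is only read as a statistic; ns_server would poll it and alert when it changes.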

Environment

None

Release Notes Description

None

Activity

CB robot October 17, 2016 at 10:52 AM

Build 4.7.0-1233 contains ep-engine commit 3ba9f54be46e6d439608dce69b873dc5f56bf049 with commit message:
: A single total for drift ahead exceptions
https://github.com/couchbase/ep-engine/commit/3ba9f54be46e6d439608dce69b873dc5f56bf049

Jim Walker October 12, 2016 at 3:09 PM

ep_clock_cas_drift_threshold_exceeded is incremented for any CAS received from the future that is above the threshold.

This increments regardless of the bucket configuration.

CB robot October 12, 2016 at 8:00 AM

Build 4.6.0-3368 contains ep-engine commit 3ba9f54be46e6d439608dce69b873dc5f56bf049 with commit message:
: A single total for drift ahead exceptions
https://github.com/couchbase/ep-engine/commit/3ba9f54be46e6d439608dce69b873dc5f56bf049

Dave Finlay October 4, 2016 at 1:07 AM
Edited

Got it, Jim, thanks. Let's keep this ticket open to track this for the time being. In the event that http://review.couchbase.org/#/c/68272 (or one of the other changes in that stack) resolves it, then we can close.

Jim Walker October 3, 2016 at 9:47 AM
Edited

So far all stat work has been under the pre-existing and the first set of stats is designed to give some useful granularity (hence why we have ahead/behind/active/replica, etc.) for post-mortem analysis.

Notes:

  1. The stats are per bucket (we also have vbucket-details for the full per VB picture).

  2. I've split things into active/replica to aid post-mortem analysis: it's easier to tell whether something is wrong because of DCP or XDCR (set_w_meta).

  3. I've split ahead/behind into separate counters too, again for the full picture on post-mortem analysis.

In terms of this task, a new counter will be added that is the sum of the active and replica drift-ahead counters. That will give a single counter which should correctly trigger alerts whenever max_cas drifts ahead of the wall clock by more than the threshold, as per this task.
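As a rough sketch of that single total, assuming two hypothetical per-bucket counters for the active and replica "ahead" cases (the struct and member names below are illustrative, not the actual ep-engine stat names):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical per-bucket counters, split by vbucket state as described above.
struct DriftAheadCounters {
    std::atomic<uint64_t> activeAheadExceptions{0};
    std::atomic<uint64_t> replicaAheadExceptions{0};

    // Single total for alerting: the sum of the active and replica counters.
    // The two loads are not taken atomically together, which is assumed to
    // be acceptable for a monotonically increasing statistic.
    uint64_t totalAheadExceptions() const {
        return activeAheadExceptions.load(std::memory_order_relaxed) +
               replicaAheadExceptions.load(std::memory_order_relaxed);
    }
};
```

Keeping the split counters and deriving the total preserves the granularity needed for post-mortem analysis while still giving ns_server one value to watch.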

Fixed

Created September 29, 2016 at 6:31 PM
Updated October 17, 2016 at 10:52 AM
Resolved October 12, 2016 at 3:09 PM