Details
- Bug
- Resolution: Fixed
- Major
- 4.6.0
- None
- Build 3430
- Untriaged
- Unknown
Description
This is a clone of MB-21533, which Perry filed after seeing the "drift threshold exceeded" alert on non-LWW buckets. The part of that bug where the alert shows for non-LWW buckets is fixed; however, there is a separate problem: the alert should not be showing up at all. Details of the cluster:
- No XDCR
- Clocks are synced using NTP
- Whenever any amount of load is run, the alert shows up immediately and consistently
Details of the NTP setup, taken from the logs, are below. All of the nodes sync off the same source (ec2-54-187-254-), and their offsets (the difference between the client server and the source) are on the order of milliseconds. (cf. https://pthree.org/2013/11/05/real-life-ntp/)
$ fgrep -h -A 6 "ntpq -p" */couchbase.log
ntpq -p
==============================================================================
remote           refid           st t when poll reach    delay   offset  jitter
==============================================================================
+ec2-54-86-63-15 132.163.4.103    2 u   65  256   377   76.034    3.106   8.778
+ec2-54-86-63-15 132.163.4.103    2 u   37  256   377   76.473    7.853   3.852
*ec2-54-187-254- 128.138.140.44   2 u  143  256   377    0.899    6.642   0.214
--
ntpq -p
==============================================================================
remote           refid           st t when poll reach    delay   offset  jitter
==============================================================================
+ec2-54-86-63-15 132.163.4.103    2 u    -  256   377   80.047    2.521   3.567
+ec2-54-86-63-15 132.163.4.103    2 u   51  128   377   69.978   -7.430   3.463
*ec2-54-187-254- 128.138.140.44   2 u    6  128   377    0.900   -5.065   0.062
--
ntpq -p
==============================================================================
remote           refid           st t when poll reach    delay   offset  jitter
==============================================================================
+ec2-54-86-63-15 132.163.4.103    2 u   19  128   377   81.551  -30.455   3.745
+ec2-54-86-63-15 132.163.4.103    2 u   35  128   377   66.821  -27.995   4.095
*ec2-54-187-254- 128.138.140.44   2 u   23  128   377    0.374  -31.611   0.935
--
ntpq -p
==============================================================================
remote           refid           st t when poll reach    delay   offset  jitter
==============================================================================
+ec2-54-86-63-15 132.163.4.103    2 u   26   64   377   90.378   10.203   2.235
+ec2-54-86-63-15 132.163.4.103    2 u   34   64   377   89.424    7.238   0.670
*ec2-54-187-254- 128.138.140.44   2 u   28   64   377    0.388    2.918   0.588
--
ntpq -p
==============================================================================
remote           refid           st t when poll reach    delay   offset  jitter
==============================================================================
+ec2-54-86-63-15 132.163.4.103    2 u  172  256   377   89.698    6.565   0.144
+ec2-54-86-63-15 132.163.4.103    2 u   16  256   377   90.743    7.670   2.190
*ec2-54-187-254- 128.138.140.44   2 u    7  256   377    0.400    1.488   0.224
--
ntpq -p
==============================================================================
remote           refid           st t when poll reach    delay   offset  jitter
==============================================================================
+ec2-54-86-63-15 132.163.4.103    2 u    6   64   377   75.074    2.331  11.261
+ec2-54-86-63-15 132.163.4.103    2 u   64   64   377   77.545   -0.088   9.470
*ec2-54-187-254- 128.138.140.44   2 u   63   64   377    0.844   13.787   8.117
--
ntpq -p
==============================================================================
remote           refid           st t when poll reach    delay   offset  jitter
==============================================================================
+ec2-54-86-63-15 132.163.4.103    2 u    7  256   377   84.790   13.940   1.160
+ec2-54-86-63-15 132.163.4.103    2 u   11  256   377   81.518    5.039   2.448
*ec2-54-187-254- 128.138.140.44   2 u    1  256   377    0.399    3.018   0.656
--
ntpq -p
==============================================================================
remote           refid           st t when poll reach    delay   offset  jitter
==============================================================================
+ec2-54-86-63-15 132.163.4.103    2 u   36  128   377   76.419    3.093   3.013
+ec2-54-86-63-15 132.163.4.103    2 u   76  128   377   74.800   -1.465   6.089
*ec2-54-187-254- 128.138.140.44   2 u   13  128   377    0.887    1.636   0.265
--
ntpq -p
==============================================================================
remote           refid           st t when poll reach    delay   offset  jitter
==============================================================================
+ec2-54-86-63-15 132.163.4.103    2 u  130  256   377   83.314    7.269   6.740
+ec2-54-86-63-15 132.163.4.103    2 u  140  256   377   67.922   -2.552   4.371
*ec2-54-187-254- 128.138.140.44   2 u   14  256   377    0.407   -3.317   0.212
--
ntpq -p
==============================================================================
remote           refid           st t when poll reach    delay   offset  jitter
==============================================================================
+ec2-54-86-63-15 132.163.4.103    2 u  148  256   377   87.519  -13.582   1.074
+ec2-54-86-63-15 132.163.4.103    2 u   82  256   377   89.582  -13.907   0.081
*ec2-54-187-254- 128.138.140.44   2 u   69  256   377    0.914  -16.796   0.074
The attached screenshots show that sets per second and the rate of replica ahead exceptions are perfectly correlated. Since there is no XDCR in the picture, it must be the replica ahead exceptions that are causing the alerts. This was under substantial load; however, Perry reported that he can get exceptions at under 100 sets/s.
Something is fishy here. The clocks are decently synchronized, with offsets of tens of milliseconds at most, so how can replica ahead exceptions occur with a threshold of 5 seconds?
Logs:
https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-160-109-132.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-160-175-223.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-160-39-229.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-161-151-185.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-161-182-8.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-161-226-15.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-162-34-124.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-162-6-24.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-162-69-200.us-west-2.compute.amazonaws.com.zip
https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-162-71-57.us-west-2.compute.amazonaws.com.zip