  1. Couchbase Server
  2. MB-21574

Prevent XDCR from non-LWW buckets into LWW buckets

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 4.6.0
    • 4.6.0
    • couchbase-bucket
    • None
    • Build 3430
    • Untriaged
    • Unknown

    Description

      This is a clone of MB-21533, the bug Perry filed after seeing the "drift threshold exceeded" alert on non-LWW buckets. The bug that made the alert show for non-LWW buckets is fixed; however, a separate problem remains: the alert shows up at all. Details on the cluster are:

      1. No XDCR
      2. Clocks are synced using NTP
      3. Under any amount of load, the alert shows up immediately and consistently

      Details on the NTP setup from the logs are below. As you can see, all of the nodes sync off the same source (ec2-54-187-254-) and have offsets (the difference between the node's clock and the source) in the millisecond range, at most about 32 ms. (cf: https://pthree.org/2013/11/05/real-life-ntp/)

      $ fgrep -h -A 6 "ntpq -p" */couchbase.log
      ntpq -p
      ==============================================================================
           remote           refid      st t when poll reach   delay   offset  jitter
      ==============================================================================
      +ec2-54-86-63-15 132.163.4.103    2 u   65  256  377   76.034    3.106   8.778
      +ec2-54-86-63-15 132.163.4.103    2 u   37  256  377   76.473    7.853   3.852
      *ec2-54-187-254- 128.138.140.44   2 u  143  256  377    0.899    6.642   0.214
      --
      ntpq -p
      ==============================================================================
           remote           refid      st t when poll reach   delay   offset  jitter
      ==============================================================================
      +ec2-54-86-63-15 132.163.4.103    2 u    -  256  377   80.047    2.521   3.567
      +ec2-54-86-63-15 132.163.4.103    2 u   51  128  377   69.978   -7.430   3.463
      *ec2-54-187-254- 128.138.140.44   2 u    6  128  377    0.900   -5.065   0.062
      --
      ntpq -p
      ==============================================================================
           remote           refid      st t when poll reach   delay   offset  jitter
      ==============================================================================
      +ec2-54-86-63-15 132.163.4.103    2 u   19  128  377   81.551  -30.455   3.745
      +ec2-54-86-63-15 132.163.4.103    2 u   35  128  377   66.821  -27.995   4.095
      *ec2-54-187-254- 128.138.140.44   2 u   23  128  377    0.374  -31.611   0.935
      --
      ntpq -p
      ==============================================================================
           remote           refid      st t when poll reach   delay   offset  jitter
      ==============================================================================
      +ec2-54-86-63-15 132.163.4.103    2 u   26   64  377   90.378   10.203   2.235
      +ec2-54-86-63-15 132.163.4.103    2 u   34   64  377   89.424    7.238   0.670
      *ec2-54-187-254- 128.138.140.44   2 u   28   64  377    0.388    2.918   0.588
      --
      ntpq -p
      ==============================================================================
           remote           refid      st t when poll reach   delay   offset  jitter
      ==============================================================================
      +ec2-54-86-63-15 132.163.4.103    2 u  172  256  377   89.698    6.565   0.144
      +ec2-54-86-63-15 132.163.4.103    2 u   16  256  377   90.743    7.670   2.190
      *ec2-54-187-254- 128.138.140.44   2 u    7  256  377    0.400    1.488   0.224
      --
      ntpq -p
      ==============================================================================
           remote           refid      st t when poll reach   delay   offset  jitter
      ==============================================================================
      +ec2-54-86-63-15 132.163.4.103    2 u    6   64  377   75.074    2.331  11.261
      +ec2-54-86-63-15 132.163.4.103    2 u   64   64  377   77.545   -0.088   9.470
      *ec2-54-187-254- 128.138.140.44   2 u   63   64  377    0.844   13.787   8.117
      --
      ntpq -p
      ==============================================================================
           remote           refid      st t when poll reach   delay   offset  jitter
      ==============================================================================
      +ec2-54-86-63-15 132.163.4.103    2 u    7  256  377   84.790   13.940   1.160
      +ec2-54-86-63-15 132.163.4.103    2 u   11  256  377   81.518    5.039   2.448
      *ec2-54-187-254- 128.138.140.44   2 u    1  256  377    0.399    3.018   0.656
      --
      ntpq -p
      ==============================================================================
           remote           refid      st t when poll reach   delay   offset  jitter
      ==============================================================================
      +ec2-54-86-63-15 132.163.4.103    2 u   36  128  377   76.419    3.093   3.013
      +ec2-54-86-63-15 132.163.4.103    2 u   76  128  377   74.800   -1.465   6.089
      *ec2-54-187-254- 128.138.140.44   2 u   13  128  377    0.887    1.636   0.265
      --
      ntpq -p
      ==============================================================================
           remote           refid      st t when poll reach   delay   offset  jitter
      ==============================================================================
      +ec2-54-86-63-15 132.163.4.103    2 u  130  256  377   83.314    7.269   6.740
      +ec2-54-86-63-15 132.163.4.103    2 u  140  256  377   67.922   -2.552   4.371
      *ec2-54-187-254- 128.138.140.44   2 u   14  256  377    0.407   -3.317   0.212
      --
      ntpq -p
      ==============================================================================
           remote           refid      st t when poll reach   delay   offset  jitter
      ==============================================================================
      +ec2-54-86-63-15 132.163.4.103    2 u  148  256  377   87.519  -13.582   1.074
      +ec2-54-86-63-15 132.163.4.103    2 u   82  256  377   89.582  -13.907   0.081
      *ec2-54-187-254- 128.138.140.44   2 u   69  256  377    0.914  -16.796   0.074
      

      The attached screenshots show that sets per second and the rate of replica ahead exceptions are perfectly correlated. Since there's no XDCR in the picture, it must be the replica ahead exceptions that are causing the alerts. This was under substantial load; however, Perry reported that he can get exceptions under 100 sets / s.

      There's something fishy. The clocks are decently synchronized. How can replica ahead exceptions occur with a threshold of 5 seconds?
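
      To make the gap concrete, here is a minimal Python sketch that pulls the offset column out of `ntpq -p` output and compares it with the 5 second threshold. The helper names (`peer_offsets_ms`, `worst_offset_ms`, `spread_ms`) are made up for illustration and are not part of Couchbase or NTP tooling, and the sample text is copied from two of the nodes listed above. The spread is only a rough stand-in for node-to-node skew, on the assumption that every node measures its offset against the same source.

```python
# Drift threshold quoted in this report (5 seconds), expressed in milliseconds.
THRESHOLD_MS = 5000.0

# Peer rows copied from two of the nodes in the ntpq output above.
SAMPLE = """\
ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
+ec2-54-86-63-15 132.163.4.103    2 u   19  128  377   81.551  -30.455   3.745
+ec2-54-86-63-15 132.163.4.103    2 u   35  128  377   66.821  -27.995   4.095
*ec2-54-187-254- 128.138.140.44   2 u   23  128  377    0.374  -31.611   0.935
--
ntpq -p
     remote           refid      st t when poll reach   delay   offset  jitter
+ec2-54-86-63-15 132.163.4.103    2 u    6   64  377   75.074    2.331  11.261
+ec2-54-86-63-15 132.163.4.103    2 u   64   64  377   77.545   -0.088   9.470
*ec2-54-187-254- 128.138.140.44   2 u   63   64  377    0.844   13.787   8.117
"""


def peer_offsets_ms(ntpq_text):
    """Return the offset column (milliseconds) of every peer row."""
    offsets = []
    for line in ntpq_text.splitlines():
        fields = line.split()
        # Peer rows have 10 columns; skip the header, separators and commands.
        if len(fields) != 10 or fields[0] == "remote":
            continue
        try:
            offsets.append(float(fields[8]))
        except ValueError:
            continue  # not a numeric offset, so not a peer row
    return offsets


def worst_offset_ms(ntpq_text):
    """Largest absolute offset from any peer, in milliseconds."""
    return max(abs(o) for o in peer_offsets_ms(ntpq_text))


def spread_ms(ntpq_text):
    """Distance between the most-ahead and most-behind offsets."""
    offsets = peer_offsets_ms(ntpq_text)
    return max(offsets) - min(offsets)


if __name__ == "__main__":
    print("worst offset (ms):", worst_offset_ms(SAMPLE))
    print("spread (ms):", spread_ms(SAMPLE))
    print("below threshold:", spread_ms(SAMPLE) < THRESHOLD_MS)
```

      On these rows both the worst offset and the spread are tens of milliseconds, roughly two orders of magnitude below the threshold, which is why clock skew alone does not look like a plausible cause of the exceptions.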

      Logs:
      https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-160-109-132.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-160-175-223.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-160-39-229.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-161-151-185.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-161-182-8.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-161-226-15.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-162-34-124.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-162-6-24.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-162-69-200.us-west-2.compute.amazonaws.com.zip
      https://s3.amazonaws.com/bugdb/df-perry-connect-demo/collectinfo-2016-11-02T182545-ns_1%40ec2-35-162-71-57.us-west-2.compute.amazonaws.com.zip

          People

            bharath.gp Bharath G P
            perry Perry Krug