Couchbase Kubernetes / K8S-531

Couchbase operator event order is sometimes inconsistent

Details

    Description

      The order of Couchbase cluster events changes between successive reads of the cluster description.

      In the scenario below, pods 0001 and 0005 are killed, and the initial cluster description lists the "MemberDown" events in that order. Later, once pod 0001 has recovered, the order of the "MemberDown" events is swapped.

      Cluster description with the MemberDown events in the order 0001 followed by 0005:

      Ashwins-MacBook-Pro:couchbase-operator]$ kubectl describe cbc
      Name:         test-couchbase-l477x
      Namespace:    ashwin
      Labels:       <none>
      Annotations:  <none>
      API Version:  couchbase.com/v1
      Kind:         CouchbaseCluster
      Metadata:
        Cluster Name:
        Creation Timestamp:  2018-08-11T15:41:30Z
        Generate Name:       test-couchbase-
        Generation:          1
        Resource Version:    3645172
        Self Link:           /apis/couchbase.com/v1/namespaces/ashwin/couchbaseclusters/test-couchbase-l477x
        UID:                 045e0cd7-9d7d-11e8-8cfd-080027ee3776
      Spec:
        Auth Secret:  basic-test-secret
        Base Image:   couchbase/server
        Buckets:
          Conflict Resolution:  seqno
          Enable Flush:         true
          Eviction Policy:      fullEviction
          Io Priority:          high
          Memory Quota:         100
          Name:                 PVBucket
          Replicas:             2
          Type:                 couchbase
        Cluster:
          Analytics Service Memory Quota:                 1024
          Auto Failover Max Count:                        3
          Auto Failover On Data Disk Issues:              false
          Auto Failover On Data Disk Issues Time Period:  120
          Auto Failover Server Group:                     false
          Auto Failover Timeout:                          30
          Cluster Name:                                   test-couchbase-l477x
          Data Service Memory Quota:                      256
          Eventing Service Memory Quota:                  256
          Index Service Memory Quota:                     256
          Index Storage Setting:                          memory_optimized
          Search Service Memory Quota:                    256
        Expose Admin Console:                             false
        Security Context:
          Fs Group:  1000
        Servers:
          Name:  test_config_1
          Pod:
            Resources:
            Volume Mounts:
              Data:     couchbase
              Default:  couchbase
          Services:
            data
            query
            index
          Size:                         6
        Software Update Notifications:  false
        Version:                        enterprise-5.5.0
        Volume Claim Templates:
          Metadata:
            Creation Timestamp:  <nil>
            Name:                couchbase
          Spec:
            Resources:
              Requests:
                Storage:         2Gi
            Storage Class Name:  standard
          Status:
      Status:
        Buckets:
          PV Bucket:
            Conflict Resolution:  seqno
            Enable Flush:         true
            Eviction Policy:      fullEviction
            Io Priority:          high
            Memory Quota:         100
            Name:                 PVBucket
            Replicas:             2
            Type:                 couchbase
        Cluster Id:               3b7d4d629beaf9454dd3f55ffd2f218b
        Conditions:
          Available:
            Last Transition Time:  2018-08-11T15:44:22Z
            Last Update Time:      2018-08-11T15:44:22Z
            Message:               The following nodes are down and not serving requests: http://test-couchbase-l477x-0001.test-couchbase-l477x.ashwin.svc:8091, http://test-couchbase-l477x-0005.test-couchbase-l477x.ashwin.svc:8091
            Reason:                Cluster partially available
            Status:                False
          Balanced:
            Last Transition Time:  2018-08-11T15:43:43Z
            Last Update Time:      2018-08-11T15:43:43Z
            Message:               Data is equally distributed across all nodes in the cluster
            Reason:                Cluster is balanced
            Status:                True
        Control Paused:            false
        Current Version:           enterprise-5.5.0
        Members:
          Index:  6
          Ready:
            Name:  test-couchbase-l477x-0000
            Name:  test-couchbase-l477x-0002
            Name:  test-couchbase-l477x-0003
            Name:  test-couchbase-l477x-0004
          Unready:
            Name:  test-couchbase-l477x-0001
            Name:  test-couchbase-l477x-0005
        Phase:     Running
        Reason:
        Size:      6
      Events:
        Type     Reason              Age               From                                 Message
        ----     ------              ----              ----                                 -------
        Normal   NewMemberAdded      3m                couchbase-operator-7f558f9849-8dj5h  New member test-couchbase-l477x-0000 added to cluster
        Normal   NewMemberAdded      2m                couchbase-operator-7f558f9849-8dj5h  New member test-couchbase-l477x-0001 added to cluster
        Normal   NewMemberAdded      2m                couchbase-operator-7f558f9849-8dj5h  New member test-couchbase-l477x-0002 added to cluster
        Normal   NewMemberAdded      2m                couchbase-operator-7f558f9849-8dj5h  New member test-couchbase-l477x-0003 added to cluster
        Normal   NewMemberAdded      1m                couchbase-operator-7f558f9849-8dj5h  New member test-couchbase-l477x-0004 added to cluster
        Normal   NewMemberAdded      1m                couchbase-operator-7f558f9849-8dj5h  New member test-couchbase-l477x-0005 added to cluster
        Normal   RebalanceStarted    1m                couchbase-operator-7f558f9849-8dj5h  A rebalance has been started to balance data across the cluster
        Normal   RebalanceCompleted  1m                couchbase-operator-7f558f9849-8dj5h  A rebalance has completed
        Normal   BucketCreated       1m                couchbase-operator-7f558f9849-8dj5h  A new bucket `PVBucket` was created
        Warning  MemberDown          1s (x6 over 46s)  couchbase-operator-7f558f9849-8dj5h  Existing member test-couchbase-l477x-0001 down
        Warning  MemberDown          1s (x6 over 46s)  couchbase-operator-7f558f9849-8dj5h  Existing member test-couchbase-l477x-0005 down
      Ashwins-MacBook-Pro:couchbase-operator]$
       
       
      Cluster description with the MemberDown event order changed (0005 followed by 0001):
      

      Ashwins-MacBook-Pro:couchbase-operator]$ kubectl describe cbc
      Name:         test-couchbase-l477x
      Namespace:    ashwin
      Labels:       <none>
      Annotations:  <none>
      API Version:  couchbase.com/v1
      Kind:         CouchbaseCluster
      Metadata:
        Cluster Name:
        Creation Timestamp:  2018-08-11T15:41:30Z
        Generate Name:       test-couchbase-
        Generation:          1
        Resource Version:    3645263
        Self Link:           /apis/couchbase.com/v1/namespaces/ashwin/couchbaseclusters/test-couchbase-l477x
        UID:                 045e0cd7-9d7d-11e8-8cfd-080027ee3776
      Spec:
        Auth Secret:  basic-test-secret
        Base Image:   couchbase/server
        Buckets:
          Conflict Resolution:  seqno
          Enable Flush:         true
          Eviction Policy:      fullEviction
          Io Priority:          high
          Memory Quota:         100
          Name:                 PVBucket
          Replicas:             2
          Type:                 couchbase
        Cluster:
          Analytics Service Memory Quota:                 1024
          Auto Failover Max Count:                        3
          Auto Failover On Data Disk Issues:              false
          Auto Failover On Data Disk Issues Time Period:  120
          Auto Failover Server Group:                     false
          Auto Failover Timeout:                          30
          Cluster Name:                                   test-couchbase-l477x
          Data Service Memory Quota:                      256
          Eventing Service Memory Quota:                  256
          Index Service Memory Quota:                     256
          Index Storage Setting:                          memory_optimized
          Search Service Memory Quota:                    256
        Expose Admin Console:                             false
        Security Context:
          Fs Group:  1000
        Servers:
          Name:  test_config_1
          Pod:
            Resources:
            Volume Mounts:
              Data:     couchbase
              Default:  couchbase
          Services:
            data
            query
            index
          Size:                         6
        Software Update Notifications:  false
        Version:                        enterprise-5.5.0
        Volume Claim Templates:
          Metadata:
            Creation Timestamp:  <nil>
            Name:                couchbase
          Spec:
            Resources:
              Requests:
                Storage:         2Gi
            Storage Class Name:  standard
          Status:
      Status:
        Buckets:
          PV Bucket:
            Conflict Resolution:  seqno
            Enable Flush:         true
            Eviction Policy:      fullEviction
            Io Priority:          high
            Memory Quota:         100
            Name:                 PVBucket
            Replicas:             2
            Type:                 couchbase
        Cluster Id:               3b7d4d629beaf9454dd3f55ffd2f218b
        Conditions:
          Available:
            Last Transition Time:  2018-08-11T15:45:24Z
            Last Update Time:      2018-08-11T15:45:24Z
            Message:               The following nodes are down and not serving requests: http://test-couchbase-l477x-0005.test-couchbase-l477x.ashwin.svc:8091, http://test-couchbase-l477x-0001.test-couchbase-l477x.ashwin.svc:8091
            Reason:                Cluster partially available
            Status:                False
          Balanced:
            Last Transition Time:  2018-08-11T15:43:43Z
            Last Update Time:      2018-08-11T15:43:43Z
            Message:               Data is equally distributed across all nodes in the cluster
            Reason:                Cluster is balanced
            Status:                True
        Control Paused:            false
        Current Version:           enterprise-5.5.0
        Members:
          Index:  6
          Ready:
            Name:  test-couchbase-l477x-0000
            Name:  test-couchbase-l477x-0002
            Name:  test-couchbase-l477x-0003
            Name:  test-couchbase-l477x-0004
          Unready:
            Name:  test-couchbase-l477x-0001
            Name:  test-couchbase-l477x-0005
        Phase:     Running
        Reason:
        Size:      6
      Events:
        Type     Reason              Age               From                                 Message
        ----     ------              ----              ----                                 -------
        Normal   NewMemberAdded      3m                couchbase-operator-7f558f9849-8dj5h  New member test-couchbase-l477x-0000 added to cluster
        Normal   NewMemberAdded      3m                couchbase-operator-7f558f9849-8dj5h  New member test-couchbase-l477x-0001 added to cluster
        Normal   NewMemberAdded      3m                couchbase-operator-7f558f9849-8dj5h  New member test-couchbase-l477x-0002 added to cluster
        Normal   NewMemberAdded      2m                couchbase-operator-7f558f9849-8dj5h  New member test-couchbase-l477x-0003 added to cluster
        Normal   NewMemberAdded      2m                couchbase-operator-7f558f9849-8dj5h  New member test-couchbase-l477x-0004 added to cluster
        Normal   NewMemberAdded      2m                couchbase-operator-7f558f9849-8dj5h  New member test-couchbase-l477x-0005 added to cluster
        Normal   RebalanceStarted    2m                couchbase-operator-7f558f9849-8dj5h  A rebalance has been started to balance data across the cluster
        Normal   RebalanceCompleted  1m                couchbase-operator-7f558f9849-8dj5h  A rebalance has completed
        Normal   BucketCreated       1m                couchbase-operator-7f558f9849-8dj5h  A new bucket `PVBucket` was created
        Warning  MemberDown          22s (x7 over 1m)  couchbase-operator-7f558f9849-8dj5h  Existing member test-couchbase-l477x-0005 down
        Warning  MemberDown          13s (x8 over 1m)  couchbase-operator-7f558f9849-8dj5h  Existing member test-couchbase-l477x-0001 down
        Normal   MemberRecovered     3s                couchbase-operator-7f558f9849-8dj5h  Existing member test-couchbase-l477x-0001 recovered
      Ashwins-MacBook-Pro:couchbase-operator]$


          Activity

            simon.murray Simon Murray added a comment -

            Looking at the logs:

            - count: 8
              eventTime: null
              firstTimestamp: 2018-08-11T15:44:22Z
              involvedObject:
                apiVersion: couchbase.com/v1
                kind: CouchbaseCluster
                name: test-couchbase-l477x
                namespace: ashwin
                resourceVersion: "3645172"
                uid: 045e0cd7-9d7d-11e8-8cfd-080027ee3776
              lastTimestamp: 2018-08-11T15:45:25Z
              message: Existing member test-couchbase-l477x-0001 down
              metadata:
                creationTimestamp: 2018-08-11T15:44:22Z
                name: test-couchbase-l477x.1549deee29e7e7b1
                namespace: ashwin
                resourceVersion: "3645264"
                selfLink: /api/v1/namespaces/ashwin/events/test-couchbase-l477x.1549deee29e7e7b1
                uid: 6af71080-9d7d-11e8-8cfd-080027ee3776
              reason: MemberDown
              reportingComponent: ""
              reportingInstance: ""
              source:
                component: couchbase-operator-7f558f9849-8dj5h
              type: Warning
            - count: 8
              eventTime: null
              firstTimestamp: 2018-08-11T15:44:22Z
              involvedObject:
                apiVersion: couchbase.com/v1
                kind: CouchbaseCluster
                name: test-couchbase-l477x
                namespace: ashwin
                resourceVersion: "3645172"
                uid: 045e0cd7-9d7d-11e8-8cfd-080027ee3776
              lastTimestamp: 2018-08-11T15:45:43Z
              message: Existing member test-couchbase-l477x-0005 down
              metadata:
                creationTimestamp: 2018-08-11T15:44:22Z
                name: test-couchbase-l477x.1549deee2a6d0ba1
                namespace: ashwin
                resourceVersion: "3645301"
                selfLink: /api/v1/namespaces/ashwin/events/test-couchbase-l477x.1549deee2a6d0ba1
                uid: 6b2775f5-9d7d-11e8-8cfd-080027ee3776
              reason: MemberDown
              reportingComponent: ""
              reportingInstance: ""
              source:
                component: couchbase-operator-7f558f9849-8dj5h
              type: Warning

            Both events were created at the same time (given that the timestamp resolution is one second). The last timestamps differ but are in the "correct" order, so they do not determine the order in which the events appear. These events are created by a third-party library, which we cannot alter to impose an artificial order by adding delays. Basically, you are at the mercy of the arbitrary order in which etcd returns them.

            As such, you should use https://issues.couchbase.com/browse/K8S-519 to specify that they may occur in any sequence.
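To make the tie concrete, here is a minimal Python sketch (not operator code) that groups the two events quoted above by `firstTimestamp`. The helper names `parse_ts` and `tied_groups` are hypothetical, and the event dictionaries only copy the fields shown in the log excerpt. Because both events share the same one-second timestamp, a sort on that field cannot separate them, and the result simply mirrors whatever order the input arrived in.

```python
from datetime import datetime, timezone
from itertools import groupby

# Field values copied from the two MemberDown events quoted in this comment.
events = [
    {"message": "Existing member test-couchbase-l477x-0001 down",
     "firstTimestamp": "2018-08-11T15:44:22Z",
     "lastTimestamp": "2018-08-11T15:45:25Z"},
    {"message": "Existing member test-couchbase-l477x-0005 down",
     "firstTimestamp": "2018-08-11T15:44:22Z",
     "lastTimestamp": "2018-08-11T15:45:43Z"},
]


def parse_ts(value):
    """Parse a second-resolution UTC timestamp such as 2018-08-11T15:44:22Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def tied_groups(items):
    """Sort by firstTimestamp and group events that share the same value.

    Python's sort is stable, so events inside a group keep their input order;
    the timestamp alone gives no basis for ordering them.
    """
    def key(event):
        return parse_ts(event["firstTimestamp"])

    ordered = sorted(items, key=key)
    return [[e["message"] for e in group] for _, group in groupby(ordered, key=key)]


groups = tied_groups(events)
swapped = tied_groups(list(reversed(events)))
print(groups)   # one group holding both events, 0001 first
print(swapped)  # the same single group, 0005 first
```

Any check that expects a fixed order inside such a group is therefore depending on something the timestamps do not guarantee.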

             


            mikew Mike Wiederhold [X] (Inactive) added a comment -

            Description for release notes:

            Known Issue: When the same event occurs on different Couchbase nodes, the event order is not reported consistently. This may affect applications that rely on a consistent event order when checking a CouchbaseCluster's status.

            Workaround: None.

            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            We are using the event schema validation changes (K8S-519) in similar test cases.

            This changes how we validate our test cases, but the behavior at the actual Kubernetes event level does not change.
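The K8S-519 schema itself is not shown in this ticket, so the following is only a hypothetical Python sketch of what order-tolerant validation in a test could look like. The function name `validate_events` and the expected-sequence format are assumptions for illustration: a plain string must appear at exactly that position, while a list marks a group of events that may appear in any order.

```python
from collections import Counter


def validate_events(observed, expected_steps):
    """Check observed event messages against an expected sequence.

    A step that is a string must match the next observed event exactly.
    A step that is a list matches the next len(step) observed events in
    any order (compared as a multiset).
    """
    pos = 0
    for step in expected_steps:
        if isinstance(step, str):
            if pos >= len(observed) or observed[pos] != step:
                return False
            pos += 1
        else:
            window = observed[pos:pos + len(step)]
            if Counter(window) != Counter(step):
                return False
            pos += len(step)
    # Reject trailing events that the expectation does not account for.
    return pos == len(observed)


expected = [
    "A new bucket `PVBucket` was created",
    ["Existing member test-couchbase-l477x-0001 down",
     "Existing member test-couchbase-l477x-0005 down"],
    "Existing member test-couchbase-l477x-0001 recovered",
]

# Tail of the event list from the second cluster description (0005 before 0001).
observed = [
    "A new bucket `PVBucket` was created",
    "Existing member test-couchbase-l477x-0005 down",
    "Existing member test-couchbase-l477x-0001 down",
    "Existing member test-couchbase-l477x-0001 recovered",
]

print(validate_events(observed, expected))
```

With this shape, either ordering of the two MemberDown events passes, while a MemberRecovered event that appears before them would still fail.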

            People

              ashwin.govindarajulu Ashwin Govindarajulu
              ashwin.govindarajulu Ashwin Govindarajulu