Couchbase Kubernetes / K8S-3479

[Couchbase Operator 2.6.4] Potential data loss during EKS Upgrade.


Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 2.6.4
    • None
    • operator
    • None
    • 3

    Description

      Couchbase Cluster Description

      • Set up the cluster as per the required specifications (an illustrative configuration sketch follows this list).
      • Each node is an m5.4xlarge instance (16 vCPUs, 64 GB RAM).
      • 6 Data Service nodes and 4 Index + Query Service nodes.
      • 10 buckets (1 replica each), full eviction, auto-failover timeout set to 5s.
      • ~210 GB of data per bucket → ~2 TB of data loaded onto the cluster.
      • 50 primary indexes with 1 replica each (100 indexes total).
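
      A minimal sketch of what this setup might look like as Couchbase Operator resources, assuming managed buckets; names, quotas, and image tag are illustrative and are not taken from the actual manifest used in this exercise:

      apiVersion: couchbase.com/v2
      kind: CouchbaseCluster
      metadata:
        name: cb-example
      spec:
        image: couchbase/server:7.2.4
        cluster:
          autoFailoverTimeout: 5s          # auto-failover after 5 seconds, per the description
        buckets:
          managed: true
        servers:
          - name: data
            size: 6                        # 6 Data Service nodes
            services:
              - data
          - name: index-query
            size: 4                        # 4 Index + Query Service nodes
            services:
              - index
              - query
      ---
      apiVersion: couchbase.com/v2
      kind: CouchbaseBucket
      metadata:
        name: bucket-01                    # one of the 10 buckets (name assumed)
      spec:
        memoryQuota: 4Gi                   # quota assumed for illustration
        replicas: 1                        # single replica, as described
        evictionPolicy: fullEviction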

      Couchbase Autonomous Operator (CAO) 2.6.4
      Steps performed:

      1. Upgraded Couchbase Server 7.2.4 to 7.2.5 (successful).
      2. Upgraded EKS 1.25 to 1.26.

      The Couchbase Server upgrade from 7.2.4 to 7.2.5 completed successfully, but during the EKS upgrade from 1.25 to 1.26 (on 7.2.5) we noticed that node 0004 went down while the rebalance for the re-addition of node 0003 was still in progress.

      Topology of CB Cluster Pods

      CB Server Pod        Worker node address
      cb-example-0000 ip-10-0-1-100.us-east-2.compute.internal
      cb-example-0001 ip-10-0-1-121.us-east-2.compute.internal
      cb-example-0002 ip-10-0-2-164.us-east-2.compute.internal
      cb-example-0003 ip-10-0-3-47.us-east-2.compute.internal
      cb-example-0004 ip-10-0-2-57.us-east-2.compute.internal
      cb-example-0005 ip-10-0-3-64.us-east-2.compute.internal
      cb-example-0006 ip-10-0-1-188.us-east-2.compute.internal
      cb-example-0007 ip-10-0-3-151.us-east-2.compute.internal
      cb-example-0008 ip-10-0-1-44.us-east-2.compute.internal
      cb-example-0009 ip-10-0-2-57.us-east-2.compute.internal

      Note: all Couchbase pods are scheduled on different worker nodes.

      Note: the UpdateConfig of the Amazon EKS managed node group is left at its default of 1 (maxUnavailable), so only one worker node is upgraded to the new Kubernetes version at a time (an illustrative snippet follows).
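
      For illustration, this default rolling-update behaviour corresponds to an eksctl managed node group definition along these lines (cluster and node group names are assumed, not taken from the environment):

      apiVersion: eksctl.io/v1alpha5
      kind: ClusterConfig
      metadata:
        name: cb-eks-cluster            # assumed cluster name
        region: us-east-2
      managedNodeGroups:
        - name: cb-workers              # assumed node group name
          instanceType: m5.4xlarge
          desiredCapacity: 10
          updateConfig:
            maxUnavailable: 1           # default: only one worker node is replaced at a time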

      Log analysis:

      • It appears that at around 2024-05-13T13:22 the EKS upgrade activity started, beginning by bringing down the worker node hosting cb-example-0009.

      Auto-failover completed for node 0009:

      {"timestamp":"2024-05-13T13:22:30.984Z","event_id":15,"component":"ns_server","description":"Auto failover completed","severity":"info","node":"cb-example-0000.cb-example.default.svc","otp_node":"ns_1@cb-example-0000.cb-example.default.svc","uuid":"0f1ee0d1-dc10-4131-b36c-1dab790c8844","extra_attributes":{"operation_id":"1cbfce813fcf0d808141db836e199551","nodes_info":
      {"active_nodes":["ns_1@cb-example-0000.cb-example.default.svc","ns_1@cb-example-0001.cb-example.default.svc","ns_1@cb-example-0002.cb-example.default.svc","ns_1@cb-example-0003.cb-example.default.svc","ns_1@cb-example-0004.cb-example.default.svc","ns_1@cb-example-0005.cb-example.default.svc","ns_1@cb-example-0006.cb-example.default.svc","ns_1@cb-example-0007.cb-example.default.svc","ns_1@cb-example-0008.cb-example.default.svc","ns_1@cb-example-0009.cb-example.default.svc"],"failover_nodes":["ns_1@cb-example-0009.cb-example.default.svc"],"master_node":"ns_1@cb-example-0000.cb-example.default.svc"},"time_taken":2376,"completion_message":"Failover completed successfully."}}
      

       
      The node was not rebalanced out of the cluster. Its worker node was upgraded, the node was added back to the cluster, and a rebalance was initiated.

      {"timestamp":"2024-05-13T13:24:15.566Z","event_id":2,"component":"ns_server","description":"Rebalance initiated","severity":"info","node":"cb-example-0000.cb-example.default.svc","otp_node":"ns_1@cb-example-0000.cb-example.default.svc","uuid":"dc2f90d8-8f4d-4fc4-a1c2-c98f540f7d55","extra_attributes":{"operation_id":"0a7034f9884ad197d6017aae4b0abdae","nodes_info":
      {"active_nodes":["ns_1@cb-example-0000.cb-example.default.svc","ns_1@cb-example-0001.cb-example.default.svc","ns_1@cb-example-0002.cb-example.default.svc","ns_1@cb-example-0003.cb-example.default.svc","ns_1@cb-example-0004.cb-example.default.svc","ns_1@cb-example-0005.cb-example.default.svc","ns_1@cb-example-0006.cb-example.default.svc","ns_1@cb-example-0007.cb-example.default.svc","ns_1@cb-example-0008.cb-example.default.svc","ns_1@cb-example-0009.cb-example.default.svc"],"keep_nodes":["ns_1@cb-example-0000.cb-example.default.svc","ns_1@cb-example-0001.cb-example.default.svc","ns_1@cb-example-0002.cb-example.default.svc","ns_1@cb-example-0003.cb-example.default.svc","ns_1@cb-example-0004.cb-example.default.svc","ns_1@cb-example-0005.cb-example.default.svc","ns_1@cb-example-0006.cb-example.default.svc","ns_1@cb-example-0007.cb-example.default.svc","ns_1@cb-example-0008.cb-example.default.svc","ns_1@cb-example-0009.cb-example.default.svc"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]}}}
      

      Rebalance completed 

      {"timestamp":"2024-05-13T13:24:52.795Z","event_id":3,"component":"ns_server","description":"Rebalance completed","severity":"info","node":"cb-example-0000.cb-example.default.svc","otp_node":"ns_1@cb-example-0000.cb-example.default.svc","uuid":"559b0a99-2c9b-4137-a81f-f0dfcdc383b9","extra_attributes":{"operation_id":"0a7034f9884ad197d6017aae4b0abdae","nodes_info":
      {"active_nodes":["ns_1@cb-example-0000.cb-example.default.svc","ns_1@cb-example-0001.cb-example.default.svc","ns_1@cb-example-0002.cb-example.default.svc","ns_1@cb-example-0003.cb-example.default.svc","ns_1@cb-example-0004.cb-example.default.svc","ns_1@cb-example-0005.cb-example.default.svc","ns_1@cb-example-0006.cb-example.default.svc","ns_1@cb-example-0007.cb-example.default.svc","ns_1@cb-example-0008.cb-example.default.svc","ns_1@cb-example-0009.cb-example.default.svc"],"keep_nodes":["ns_1@cb-example-0000.cb-example.default.svc","ns_1@cb-example-0001.cb-example.default.svc","ns_1@cb-example-0002.cb-example.default.svc","ns_1@cb-example-0003.cb-example.default.svc","ns_1@cb-example-0004.cb-example.default.svc","ns_1@cb-example-0005.cb-example.default.svc","ns_1@cb-example-0006.cb-example.default.svc","ns_1@cb-example-0007.cb-example.default.svc","ns_1@cb-example-0008.cb-example.default.svc","ns_1@cb-example-0009.cb-example.default.svc"],"eject_nodes":[],"delta_nodes":[],"failed_nodes":[]},"time_taken":37229,"completion_message":"Rebalance completed successfully."}}
      

      Later, node 0007 was upgraded in the same fashion.
      Node 0003 was next; it was failed over at 2024-05-13T13:28.

      {"timestamp":"2024-05-13T13:28:44.297Z","event_id":15,"component":"ns_server","description":"Auto failover completed","severity":"info","node":"cb-example-0000.cb-example.default.svc","otp_node":"ns_1@cb-example-0000.cb-example.default.svc","uuid":"cc4ad459-8fe0-4aa8-be4f-4b33ebb361b1","extra_attributes":{"operation_id":"aa9d780ab8070a717de6f8178b4efd15","nodes_info":
      {"active_nodes":["ns_1@cb-example-0000.cb-example.default.svc","ns_1@cb-example-0001.cb-example.default.svc","ns_1@cb-example-0002.cb-example.default.svc","ns_1@cb-example-0003.cb-example.default.svc","ns_1@cb-example-0004.cb-example.default.svc","ns_1@cb-example-0005.cb-example.default.svc","ns_1@cb-example-0006.cb-example.default.svc","ns_1@cb-example-0007.cb-example.default.svc","ns_1@cb-example-0008.cb-example.default.svc","ns_1@cb-example-0009.cb-example.default.svc"],"failover_nodes":["ns_1@cb-example-0003.cb-example.default.svc"],"master_node":"ns_1@cb-example-0000.cb-example.default.svc"},"time_taken":1015,"completion_message":"Failover completed successfully."}}
      

      A rebalance was initiated at 2024-05-13T13:35 to delta-recover node 0003.

      {"timestamp":"2024-05-13T13:35:40.425Z","event_id":2,"component":"ns_server","description":"Rebalance initiated","severity":"info","node":"cb-example-0000.cb-example.default.svc","otp_node":"ns_1@cb-example-0000.cb-example.default.svc","uuid":"30a70a64-8a13-417d-a832-1e553b457c85","extra_attributes":{"operation_id":"b7d215118f0ab480859a1b351f16c00d","nodes_info":
      {"active_nodes":["ns_1@cb-example-0000.cb-example.default.svc","ns_1@cb-example-0001.cb-example.default.svc","ns_1@cb-example-0002.cb-example.default.svc","ns_1@cb-example-0003.cb-example.default.svc","ns_1@cb-example-0004.cb-example.default.svc","ns_1@cb-example-0005.cb-example.default.svc","ns_1@cb-example-0006.cb-example.default.svc","ns_1@cb-example-0007.cb-example.default.svc","ns_1@cb-example-0008.cb-example.default.svc","ns_1@cb-example-0009.cb-example.default.svc"],"keep_nodes":["ns_1@cb-example-0000.cb-example.default.svc","ns_1@cb-example-0001.cb-example.default.svc","ns_1@cb-example-0002.cb-example.default.svc","ns_1@cb-example-0003.cb-example.default.svc","ns_1@cb-example-0004.cb-example.default.svc","ns_1@cb-example-0005.cb-example.default.svc","ns_1@cb-example-0006.cb-example.default.svc","ns_1@cb-example-0007.cb-example.default.svc","ns_1@cb-example-0008.cb-example.default.svc","ns_1@cb-example-0009.cb-example.default.svc"],"eject_nodes":[],"delta_nodes":["ns_1@cb-example-0003.cb-example.default.svc"],"failed_nodes":[]}}}
      

      While this rebalance was still in progress, node 0004 went down for its upgrade. It could not be auto failed over because node 0003 had already been failed over and was not yet back in the cluster, so the auto-failover limit of one node had been reached (see the sketch after the log excerpt below).

      {"timestamp":"2024-05-13T13:37:30.281Z","event_id":20,"component":"ns_server","description":"Node Down","severity":"warning","node":"cb-example-0009.cb-example.default.svc","otp_node":"ns_1@cb-example-0009.cb-example.default.svc","uuid":"10315522-d098-421b-b76b-57a03947f7ef","extra_attributes":{"down_node":"ns_1@cb-example-0004.cb-example.default.svc","reason":"[{nodedown_reason,connection_closed}]"}} {"timestamp":"2024-05-13T13:37:35.863Z","event_id":17,"component":"ns_server","description":"Auto failover warning","severity":"warn","node":"cb-example-0000.cb-example.default.svc","otp_node":"ns_1@cb-example-0000.cb-example.default.svc","uuid":"432bafe8-7fab-4514-8834-d454a48a6db9","extra_attributes":{"reason":"Maximum number of auto-failover nodes (1) has been reached.","nodes":["ns_1@cb-example-0004.cb-example.default.svc"]}}
      

      The rebalance then failed because node 0004 was down.
       

      {"timestamp":"2024-05-13T13:41:13.197Z","event_id":4,"component":"ns_server","description":"Rebalance failed","severity":"error","node":"cb-example-0000.cb-example.default.svc","otp_node":"ns_1@cb-example-0000.cb-example.default.svc","uuid":"c868553f-e41b-4e8e-8b69-4c2f88e0b854","extra_attributes":{"operation_id":"b7d215118f0ab480859a1b351f16c00d","nodes_info":{"active_nodes":["ns_1@cb-example-0000.cb-example.default.svc","ns_1@cb-example-0001.cb-example.default.svc","ns_1@cb-example-0002.cb-example.default.svc","ns_1@cb-example-0003.cb-example.default.svc","ns_1@cb-example-0004.cb-example.default.svc","ns_1@cb-example-0005.cb-example.default.svc","ns_1@cb-example-0006.cb-example.default.svc","ns_1@cb-example-0007.cb-example.default.svc","ns_1@cb-example-0008.cb-example.default.svc","ns_1@cb-example-0009.cb-example.default.svc"],"keep_nodes":["ns_1@cb-example-0000.cb-example.default.svc","ns_1@cb-example-0001.cb-example.default.svc","ns_1@cb-example-0002.cb-example.default.svc","ns_1@cb-example-0003.cb-example.default.svc","ns_1@cb-example-0004.cb-example.default.svc","ns_1@cb-example-0005.cb-example.default.svc","ns_1@cb-example-0006.cb-example.default.svc","ns_1@cb-example-0007.cb-example.default.svc","ns_1@cb-example-0008.cb-example.default.svc","ns_1@cb-example-0009.cb-example.default.svc"],"eject_nodes":[],"delta_nodes":["ns_1@cb-example-0003.cb-example.default.svc"],"failed_nodes":[]},"time_taken":333084,"completion_message":"Rebalance exited with reason {{badmatch,\n {error,\n
      {failed_nodes,\n ['ns_1@cb-example-0003.cb-example.default.svc']}}},\n [{ns_janitor,cleanup_apply_config_body,4,\n [{file,\"src/ns_janitor.erl\"},\n {line,295}]},\n {ns_janitor,'-cleanup_apply_config/4-fun-0-',\n 4,\n [{file,\"src/ns_janitor.erl\"},\n {line,215}]},\n {async,'-async_init/4-fun-1-',3,\n [{file,\"src/async.erl\"},\n {line,191}]}]}."}}
      

      Now node 0004 started coming up.

      {"timestamp":"2024-05-13T13:43:42.437Z","event_id":1,"component":"ns_server","description":"Service started","severity":"info","node":"cb-example-0004.cb-example.default.svc","otp_node":"ns_1@cb-example-0004.cb-example.default.svc","uuid":"4472c76e-cebe-4bd7-87e2-04c0d01e1f90","extra_attributes":\{"name":"ns_server"}}
      

      After this, both nodes 0003 and 0004 were added back to the cluster. 
       

      {"timestamp":"2024-05-13T13:51:54.166Z","event_id":3,"component":"ns_server","description":"Rebalance completed","severity":"info","node":"cb-example-0000.cb-example.default.svc","otp_node":"ns_1@cb-example-0000.cb-example.default.svc","uuid":"b97c3919-3daa-4446-8d34-42b1b63974e6","extra_attributes":{"operation_id":"b8bd1d7c5b169bb59910247e63eca6b9","nodes_info":{"active_nodes":["ns_1@cb-example-0000.cb-example.default.svc","ns_1@cb-example-0001.cb-example.default.svc","ns_1@cb-example-0002.cb-example.default.svc","ns_1@cb-example-0003.cb-example.default.svc","ns_1@cb-example-0004.cb-example.default.svc","ns_1@cb-example-0005.cb-example.default.svc","ns_1@cb-example-0006.cb-example.default.svc","ns_1@cb-example-0007.cb-example.default.svc","ns_1@cb-example-0008.cb-example.default.svc","ns_1@cb-example-0009.cb-example.default.svc"],"keep_nodes":["ns_1@cb-example-0000.cb-example.default.svc","ns_1@cb-example-0001.cb-example.default.svc","ns_1@cb-example-0002.cb-example.default.svc","ns_1@cb-example-0003.cb-example.default.svc","ns_1@cb-example-0004.cb-example.default.svc","ns_1@cb-example-0005.cb-example.default.svc","ns_1@cb-example-0006.cb-example.default.svc","ns_1@cb-example-0007.cb-example.default.svc","ns_1@cb-example-0008.cb-example.default.svc","ns_1@cb-example-0009.cb-
      

      If at any point a second node (B) goes down or is failed over while a previous node (A) is still rebalancing back in, node A's data may be lost if its only replica copies reside on node B (with #Replicas = 1, a vBucket whose active copy was on A and whose replica was on B has no surviving copy).

      Need to investigate:
      1. Why did node 0004 go down during the rebalance for node 0003's re-addition?
      2. Was this part of the EKS upgrade, or caused by some other issue?

       

      CB Collect - https://supportal.couchbase.com/snapshot/8387b5feeb1f87c29dfab6ecb4add56a::1

      Operator logs: cbopinfo-20240513T192014+0530.tar.gz

      Screenshot of the node group UpdateConfig is attached.

       

      Attachments

        1. cbopinfo-20240513T192014+0530.tar.gz (517 kB)
        2. image-2024-05-14-23-48-31-420.png (1.14 MB)
        3. screenshot-1.png (45 kB)
        4. screenshot-2.png (85 kB)
        5. Screenshot 2024-05-14 at 11.45.11 PM.png (1.14 MB)
        6. Screenshot 2024-05-15 at 4.13.15 PM.png (962 kB)
        7. screenshot-3.png (79 kB)
        8. screenshot-4.png (87 kB)
        9. screenshot-5.png (116 kB)


            People

              justin.ashworth Justin Ashworth
              manik.mahajan Manik Mahajan
              Votes: 0
              Watchers: 11
