Couchbase Kubernetes / K8S-3515

[operator 2.6.4-119] Query+Index pod is not added back upon restart of EC2 machine during delta recovery upgrade

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 2.6.4
    • None
    • operator
    • None
    • 11 - Race to Crashpoint Tower
    • 2

    Description

      Kubernetes Version 1.25
      Couchbase Server 7.2.5 → 7.6.1
      Operator 2.6.4-119

      Cluster Setup

      • Each node is an m5.4xlarge instance. (16 vCPUs and 64GB RAM) 
      • 6 Data Service nodes and 4 Index + Query Service nodes.
      • 10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
      • ~2TiB data loaded onto cluster before the beginning of upgrade.
      • 50 Primary Indexes with 1 Replica each. (Total 100 Indexes)

      Experiment :-

      1. Restart the EC2 node hosting a Data pod during the upgrade.
      2. Restart the EC2 node hosting a Query+Index pod during the upgrade.

      Observation:-

      1. Graceful failover of the Data pod exited prematurely because the node went down. The node later came back with the new Couchbase version and was added back through delta recovery.
      2. After the node hosting pod cb-example-0007 (Query + Index) restarted, the pod was never re-added to the cluster.
      3. The cluster is running at reduced capacity because one Index+Query pod is effectively in an unknown state.

      Follow ups:-

      1. Why was the Query node not added back to the cluster?
      2. From my analysis, cb-0007 was upgraded to version 7.6.1, but the operator is stuck creating the pod for cb-0007, and the cluster status still shows as 'upgrading'. How can we resolve this kind of issue?

       

      CB logs - http://supportal.couchbase.com/snapshot/04bbaa961d67e0c9b6cd01145ce0be72::0

       

      Operator logs -

      cbopinfo-20240529T200123+0530.tar.gz

      Analysis :-

      CB server's perspective:-

      Failover was initiated on 0007 for the upgrade:

      Starting failing over ['ns_1@cb-example-0007.cb-example.default.svc']    failover 000    ns_1@cb-example-0006.cb-example.default.svc    10:45:08 AM 29 May, 2024

       

      The failover completed successfully:

      Failover completed successfully.
      Rebalance Operation Id = 77809f491a566aa32c488a62fabae607    ns_orchestrator 000    ns_1@cb-example-0006.cb-example.default.svc    10:45:09 AM 29 May, 2024

       

      0007 came up with 7.6.1 and then tried to be added back:

      Couchbase Server has started on web port 8091 on node 'ns_1@cb-example-0007.cb-example.default.svc'. Version: "7.6.1-3200-enterprise".    menelaus_web_sup 001    ns_1@cb-example-0007.cb-example.default.svc    10:45:20 AM 29 May, 2024
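The three server-side events above can be placed on one timeline to see how quickly 0007 came back after the failover. This is a minimal sketch, assuming the timestamps keep the layout shown in the UI log lines (`10:45:08 AM 29 May, 2024`) and an English locale. The event labels are illustrative and not taken from the logs.

```python
from datetime import datetime

# Timestamp layout as it appears in the UI log lines above (assumed stable).
UI_TIME_FORMAT = "%I:%M:%S %p %d %B, %Y"

# (label, timestamp) pairs copied from the three log entries; labels are illustrative.
events = [
    ("failover started", "10:45:08 AM 29 May, 2024"),
    ("failover completed", "10:45:09 AM 29 May, 2024"),
    ("0007 started on 7.6.1", "10:45:20 AM 29 May, 2024"),
]


def seconds_since_first(entries):
    """Return (label, seconds elapsed since the first entry) for each entry."""
    parsed = [(label, datetime.strptime(ts, UI_TIME_FORMAT)) for label, ts in entries]
    start = parsed[0][1]
    return [(label, int((when - start).total_seconds())) for label, when in parsed]


timeline = seconds_since_first(events)
for label, offset in timeline:
    print(f"+{offset:>3}s  {label}")
```

The same helper could take the operator timestamps as well, provided both sources are first converted to a common time zone.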

       

      The node hosting pod 0007 restarted; the operator tried to spawn a new pod, which never reached Running status.

       

      Operator's perspective:-

      1. The operator tries to upgrade 0007 with delta recovery.
      2. It is unable to perform a graceful failover and reverts to hard failover.
      3. It is stuck creating the pod.

       

      {"level":"info","ts":"2024-05-29T10:45:08Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0007"],"target-version":"7.6.1"}
       
      {"level":"info","ts":"2024-05-29T10:45:08Z","logger":"cluster","msg":"Unable to perform graceful failover on node. Reverting to hard failover.","cluster":"default/cb-example","name":"cb-example-0007"}
       
      {"level":"info","ts":"2024-05-29T10:45:09Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0007","image":"couchbase/server:7.6.1"}
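When going through a full cbopinfo archive, it may help to pull every operator log entry that mentions one pod into a single ordered timeline. The sketch below assumes the operator log is one JSON object per line with the `ts`, `logger`, `msg`, `name` and `names` fields seen in the lines quoted above. The helper names are hypothetical and are not part of any Couchbase tool.

```python
import json

# Operator log lines as quoted above (one JSON object per line).
RAW_LINES = [
    '{"level":"info","ts":"2024-05-29T10:45:08Z","logger":"cluster","msg":"Upgrading pods with DeltaRecovery","cluster":"default/cb-example","names":["cb-example-0007"],"target-version":"7.6.1"}',
    '{"level":"info","ts":"2024-05-29T10:45:08Z","logger":"cluster","msg":"Unable to perform graceful failover on node. Reverting to hard failover.","cluster":"default/cb-example","name":"cb-example-0007"}',
    '{"level":"info","ts":"2024-05-29T10:45:09Z","logger":"kubernetes","msg":"Creating pod","cluster":"default/cb-example","name":"cb-example-0007","image":"couchbase/server:7.6.1"}',
]


def parse_line(line):
    """Parse one log line; restore a leading brace lost in copy/paste."""
    text = line.strip()
    if not text:
        return None
    if not text.startswith("{"):
        text = "{" + text
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        return None


def events_for_pod(lines, pod):
    """Return (ts, logger, msg) for entries mentioning the pod, ordered by timestamp."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry is None:
            continue
        # The pod appears under "name" (single pod) or "names" (list) in the quoted lines.
        if entry.get("name") == pod or pod in entry.get("names", []):
            hits.append((entry["ts"], entry["logger"], entry["msg"]))
    # Stable sort on the timestamp keeps the file order for entries in the same second.
    return sorted(hits, key=lambda hit: hit[0])


pod_timeline = events_for_pod(RAW_LINES, "cb-example-0007")
for ts, logger, msg in pod_timeline:
    print(ts, logger, msg)
```

For this pod, no entry follows "Creating pod" in the quoted excerpt, which matches the observation that the operator is stuck at that step.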

       

      The operator still shows the cluster at 7.2.5, even though it has been upgraded to 7.6.1.

      - lastTransitionTime: "2024-05-29T09:33:03Z"
        lastUpdateTime: "2024-05-29T09:33:03Z"
        message: Cluster upgrading (progress 9/10)
        reason: Upgrading
        status: "True"
        type: Upgrading
      currentVersion: 7.2.5
      members:
        ready:
        - cb-example-0000
        - cb-example-0001
        - cb-example-0002
        - cb-example-0003
        - cb-example-0004
        - cb-example-0005
        - cb-example-0006
        - cb-example-0007
        - cb-example-0008
        - cb-example-0009
      size: 10
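The status excerpt above lists cb-example-0007 as ready even though its pod never reached Running. A small consistency check can make that mismatch explicit. This is a sketch only: the dictionary keys are illustrative names for the fields copied from the excerpt, and the set of running pods is an assumption filled in by hand from the observation in this report, not something read from the cluster.

```python
import re

# Fields copied from the status excerpt above; key names are illustrative.
status = {
    "condition_message": "Cluster upgrading (progress 9/10)",
    "current_version": "7.2.5",
    "ready_members": [f"cb-example-{i:04d}" for i in range(10)],
    "size": 10,
}

# Assumption: pods observed in Running state, gathered separately
# (per this report, cb-example-0007 never reached Running).
running_pods = {f"cb-example-{i:04d}" for i in range(10)} - {"cb-example-0007"}


def upgrade_progress(message):
    """Extract (done, total) from a 'progress N/M' condition message, or None."""
    match = re.search(r"progress (\d+)/(\d+)", message)
    return (int(match.group(1)), int(match.group(2))) if match else None


def stale_ready_members(ready, running):
    """Members the status reports as ready that are not actually running."""
    return sorted(set(ready) - set(running))


done, total = upgrade_progress(status["condition_message"])
stale = stale_ready_members(status["ready_members"], running_pods)
print(f"upgrade progress: {done}/{total}")
print("reported ready but not running:", stale)
```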

       

      Screenshot: all pods

      Screenshot: cluster

       

      0007 details - 0007.json

            People

              justin.ashworth Justin Ashworth
              usamah.jassat Usamah Jassat