  1. Couchbase Kubernetes
  2. K8S-3488

[operator:2.6.4-109] node got auto failover due to memory pressure during steady state data loading

Details

    • Bug
    • Resolution: Won't Fix
    • Major
    • 2.6.4
    • None
    • operator
    • None
    • High
    • 0

    Description

      Couchbase Cluster Description

      • Set up the cluster as per the required specifications
      • Each node is an m5.4xlarge instance (16 vCPUs and 64GB RAM).
      • 6 Data Service nodes and 4 Index Service and Query Service nodes.
      • 10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
      • ~215GB data per bucket → ~1.3TB data loaded onto cluster.
      • 50 Primary Indexes with 1 Replica each. (Total 100 Indexes)

      We planned to test the upgrade from EKS 1.26 to 1.27, but a failover occurred when the cb-example-0002 pod was evicted. We have collected logs both before and after the failover event.

      CB logs before failover: http://supportal.couchbase.com/snapshot/d98f94df31477f6c622956049790e725::0

      CB logs after failover:
      https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0000.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0001.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0002.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0003.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0004.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0005.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0006.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0007.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0008.cb-example.default.svc.zip
      https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0009.cb-example.default.svc.zip

      K8s topology before failover

      kubectl get pods -o wide                                                                                     
      NAME                                           READY   STATUS      RESTARTS        AGE    IP            NODE                                        NOMINATED NODE   READINESS GATES
      cb-example-0000                                1/1     Running     0               106m   10.0.4.144    ip-10-0-4-32.us-east-2.compute.internal     <none>           1/1
      cb-example-0001                                1/1     Running     0               104m   10.0.11.47    ip-10-0-11-233.us-east-2.compute.internal   <none>           1/1
      cb-example-0002                                1/1     Running     0               104m   10.0.7.211    ip-10-0-5-41.us-east-2.compute.internal     <none>           1/1
      cb-example-0003                                1/1     Running     0               105m   10.0.12.142   ip-10-0-15-18.us-east-2.compute.internal    <none>           1/1
      cb-example-0004                                1/1     Running     0               105m   10.0.13.132   ip-10-0-15-234.us-east-2.compute.internal   <none>           1/1
      cb-example-0005                                1/1     Running     0               105m   10.0.10.10    ip-10-0-10-74.us-east-2.compute.internal    <none>           1/1
      cb-example-0006                                1/1     Running     0               105m   10.0.9.129    ip-10-0-10-12.us-east-2.compute.internal    <none>           1/1
      cb-example-0007                                1/1     Running     0               105m   10.0.15.10    ip-10-0-12-126.us-east-2.compute.internal   <none>           1/1
      cb-example-0008                                1/1     Running     0               105m   10.0.7.208    ip-10-0-5-41.us-east-2.compute.internal     <none>           1/1
      cb-example-0009                                1/1     Running     0               105m   10.0.7.138    ip-10-0-7-8.us-east-2.compute.internal      <none>           1/1
      cbq-bucket0-fp2dj                              0/1     Completed   0               89m    10.0.10.64    ip-10-0-11-233.us-east-2.compute.internal   <none>           <none>
      cbq-bucket1-h5r8f                              0/1     Completed   0               89m    10.0.12.16    ip-10-0-12-126.us-east-2.compute.internal   <none>           <none>
      cbq-bucket2-rb62l                              0/1     Completed   0               89m    10.0.14.248   ip-10-0-15-18.us-east-2.compute.internal    <none>           <none>
      cbq-bucket3-gdfb5                              0/1     Completed   0               89m    10.0.4.79     ip-10-0-7-103.us-east-2.compute.internal    <none>           <none>
      cbq-bucket4-dhjs5                              0/1     Completed   0               89m    10.0.11.137   ip-10-0-10-74.us-east-2.compute.internal    <none>           <none>
      cbq-bucket5-m2cr6                              0/1     Completed   0               89m    10.0.13.195   ip-10-0-15-234.us-east-2.compute.internal   <none>           <none>
      cbq-bucket6-s2g9x                              0/1     Completed   0               89m    10.0.7.51     ip-10-0-4-32.us-east-2.compute.internal     <none>           <none>
      cbq-bucket7-z4rhk                              0/1     Completed   0               89m    10.0.7.29     ip-10-0-7-8.us-east-2.compute.internal      <none>           <none>
      cbq-bucket8-5t48s                              0/1     Completed   0               89m    10.0.10.86    ip-10-0-10-12.us-east-2.compute.internal    <none>           <none>
      cbq-bucket9-nvpxz                              0/1     Completed   0               89m    10.0.9.148    ip-10-0-10-109.us-east-2.compute.internal   <none>           <none>
      couchbase-operator-5f949645cc-g9nzp            1/1     Running     21 (117m ago)   4h1m   10.0.5.226    ip-10-0-7-103.us-east-2.compute.internal    <none>           <none>
      couchbase-operator-admission-84dcbd656-dtc5p   1/1     Running     0               4h1m   10.0.11.204   ip-10-0-10-109.us-east-2.compute.internal   <none>           <none>
      queryapp-bucket0-pkbtc                         1/1     Running     0               85m    10.0.10.187   ip-10-0-10-109.us-east-2.compute.internal   <none>           <none>
      queryapp-bucket1-sf5gn                         1/1     Running     0               85m    10.0.11.158   ip-10-0-10-109.us-east-2.compute.internal   <none>           <none>
      sirius-deployment-b76bd6b85-jdbsg              1/1     Running     0               3h     10.0.13.69    ip-10-0-14-221.us-east-2.compute.internal   <none>           <none>
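
The listing above can be checked for Couchbase Server pods that share a Kubernetes node, since co-located pods compete for the same node memory. The following is a minimal sketch, not part of the original report: the pod and node columns are copied from the output above, and the helper names `pods_by_node` and `shared_nodes` are illustrative.

```python
from collections import defaultdict

# Pod name and node columns copied from the `kubectl get pods -o wide`
# output above (Couchbase Server pods only).
PODS = """\
cb-example-0000 ip-10-0-4-32.us-east-2.compute.internal
cb-example-0001 ip-10-0-11-233.us-east-2.compute.internal
cb-example-0002 ip-10-0-5-41.us-east-2.compute.internal
cb-example-0003 ip-10-0-15-18.us-east-2.compute.internal
cb-example-0004 ip-10-0-15-234.us-east-2.compute.internal
cb-example-0005 ip-10-0-10-74.us-east-2.compute.internal
cb-example-0006 ip-10-0-10-12.us-east-2.compute.internal
cb-example-0007 ip-10-0-12-126.us-east-2.compute.internal
cb-example-0008 ip-10-0-5-41.us-east-2.compute.internal
cb-example-0009 ip-10-0-7-8.us-east-2.compute.internal
"""


def pods_by_node(listing):
    """Group pod names by the node they are scheduled on."""
    nodes = defaultdict(list)
    for line in listing.splitlines():
        if not line.strip():
            continue
        pod, node = line.split()
        nodes[node].append(pod)
    return dict(nodes)


def shared_nodes(listing):
    """Return only the nodes that host more than one pod."""
    return {n: p for n, p in pods_by_node(listing).items() if len(p) > 1}


if __name__ == "__main__":
    for node, pods in shared_nodes(PODS).items():
        print(node, pods)
```

For this listing the only shared node is ip-10-0-5-41.us-east-2.compute.internal, which hosts both cb-example-0002 (the pod that was later evicted) and cb-example-0008.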

      K8s topology after update

      cb-example-0000                                1/1     Running   0               158m    10.0.4.144    ip-10-0-4-32.us-east-2.compute.internal     <none>           1/1
      cb-example-0001                                1/1     Running   0               156m    10.0.11.47    ip-10-0-11-233.us-east-2.compute.internal   <none>           1/1
      cb-example-0002                                1/1     Running   0               18m     10.0.4.79     ip-10-0-7-103.us-east-2.compute.internal    <none>           0/1
      cb-example-0003                                1/1     Running   0               156m    10.0.12.142   ip-10-0-15-18.us-east-2.compute.internal    <none>           1/1
      cb-example-0004                                1/1     Running   0               156m    10.0.13.132   ip-10-0-15-234.us-east-2.compute.internal   <none>           1/1
      cb-example-0005                                1/1     Running   0               156m    10.0.10.10    ip-10-0-10-74.us-east-2.compute.internal    <none>           1/1
      cb-example-0006                                1/1     Running   0               156m    10.0.9.129    ip-10-0-10-12.us-east-2.compute.internal    <none>           1/1
      cb-example-0007                                1/1     Running   0               156m    10.0.15.10    ip-10-0-12-126.us-east-2.compute.internal   <none>           1/1
      cb-example-0008                                1/1     Running   0               156m    10.0.7.208    ip-10-0-5-41.us-east-2.compute.internal     <none>           1/1
      cb-example-0009                                1/1     Running   0               156m    10.0.7.138    ip-10-0-7-8.us-east-2.compute.internal      <none>           1/1
      couchbase-operator-5f949645cc-g9nzp            1/1     Running   21 (168m ago)   4h53m   10.0.5.226    ip-10-0-7-103.us-east-2.compute.internal    <none>           <none>
      couchbase-operator-admission-84dcbd656-dtc5p   1/1     Running   0               4h53m   10.0.11.204   ip-10-0-10-109.us-east-2.compute.internal   <none>           <none>
      queryapp-bucket0-pkbtc                         1/1     Running   0               137m    10.0.10.187   ip-10-0-10-109.us-east-2.compute.internal   <none>           <none>
      queryapp-bucket1-sf5gn                         1/1     Running   0               137m    10.0.11.158   ip-10-0-10-109.us-east-2.compute.internal   <none>           <none>
      sirius-deployment-b76bd6b85-7g5cr              1/1     Running   0               21m     10.0.6.76     ip-10-0-7-8.us-east-2.compute.internal      <none>           <none>

      K8s events after failover:

      kubectl get event --field-selector involvedObject.name=cb-example-0002
      LAST SEEN   TYPE      REASON                   OBJECT                MESSAGE
      14m         Warning   Evicted                  pod/cb-example-0002   The node was low on resource: memory. Container couchbase-server was using 39596260Ki, which exceeds its request of 0.
      14m         Normal    Killing                  pod/cb-example-0002   Stopping container couchbase-server
      14m         Warning   ExceededGracePeriod      pod/cb-example-0002   Container runtime did not kill the pod within specified grace period.
      14m         Normal    Scheduled                pod/cb-example-0002   Successfully assigned default/cb-example-0002 to ip-10-0-7-103.us-east-2.compute.internal
      14m         Warning   FailedAttachVolume       pod/cb-example-0002   Multi-Attach error for volume "pvc-45bb00a1-9919-4599-8a9f-4c7cd3635ed8" Volume is already exclusively attached to one node and can't be attached to another
      14m         Warning   FailedAttachVolume       pod/cb-example-0002   Multi-Attach error for volume "pvc-141fa984-4eac-446d-8a7c-bc68e5f73033" Volume is already exclusively attached to one node and can't be attached to another
      14m         Normal    SuccessfulAttachVolume   pod/cb-example-0002   AttachVolume.Attach succeeded for volume "pvc-141fa984-4eac-446d-8a7c-bc68e5f73033"
      13m         Normal    SuccessfulAttachVolume   pod/cb-example-0002   AttachVolume.Attach succeeded for volume "pvc-45bb00a1-9919-4599-8a9f-4c7cd3635ed8"
      13m         Normal    Pulling                  pod/cb-example-0002   Pulling image "couchbase/server:7.2.5"
      13m         Normal    Pulled                   pod/cb-example-0002   Successfully pulled image "couchbase/server:7.2.5" in 14.785882244s (14.785900344s including waiting)
      13m         Normal    Created                  pod/cb-example-0002   Created container couchbase-server-init
      13m         Normal    Started                  pod/cb-example-0002   Started container couchbase-server-init
      13m         Normal    Pulled                   pod/cb-example-0002   Container image "couchbase/server:7.2.5" already present on machine
      13m         Normal    Created                  pod/cb-example-0002   Created container couchbase-server
      13m         Normal    Started                  pod/cb-example-0002   Started container couchbase-server
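
The Evicted event above reports memory usage as a Kubernetes quantity (39596260Ki) against a request of 0. As a rough aid for reading that message, the sketch below converts the quantity to GiB. It is an illustration only: the function names are hypothetical, not part of any Kubernetes tooling, and only integer quantities with binary suffixes are handled.

```python
import re

# Message text copied from the Evicted event above.
EVICTION_MESSAGE = (
    "The node was low on resource: memory. Container couchbase-server was "
    "using 39596260Ki, which exceeds its request of 0."
)

# Binary suffixes used by Kubernetes resource quantities.
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def quantity_to_bytes(quantity):
    """Convert an integer quantity such as '39596260Ki' or '0' to bytes."""
    match = re.fullmatch(r"(\d+)(Ki|Mi|Gi|Ti)?", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    number, suffix = match.groups()
    return int(number) * BINARY_SUFFIXES.get(suffix, 1)


def parse_eviction(message):
    """Return (usage_bytes, request_bytes) from an eviction message."""
    match = re.search(
        r"was using ([0-9A-Za-z]+), which exceeds its request of ([0-9A-Za-z]+)",
        message,
    )
    if match is None:
        raise ValueError("not a memory eviction message")
    usage, request = match.groups()
    return quantity_to_bytes(usage), quantity_to_bytes(request)


if __name__ == "__main__":
    usage, request = parse_eviction(EVICTION_MESSAGE)
    print(f"usage: {usage / 2**30:.2f} GiB, request: {request} bytes")
```

The container was using roughly 37.76 GiB on a 64GB node. A request of 0 means the couchbase-server container declared no memory request, so any usage exceeds it; under node memory pressure the kubelet ranks pods whose usage exceeds their requests ahead of others when choosing what to evict.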
      

      Key Logs -

      9:02:52 AM 17 May, 2024

      Rebalance exited with reason {prepare_delta_recovery_failed,"bucket0", {error, {failed_nodes, [{'ns_1@cb-example-0002.cb-example.default.svc', {error,aborted}}]}}}. Rebalance Operation Id = 215d34246dcb1253b3c64186bde4082b
      ns_orchestrator 000  ns_1@cb-example-0000.cb-example.default.svc
      

      9:02:47 AM 17 May, 2024

      Starting rebalance, KeepNodes = ['ns_1@cb-example-0000.cb-example.default.svc', 'ns_1@cb-example-0001.cb-example.default.svc', 'ns_1@cb-example-0002.cb-example.default.svc', 'ns_1@cb-example-0003.cb-example.default.svc', 'ns_1@cb-example-0004.cb-example.default.svc', 'ns_1@cb-example-0005.cb-example.default.svc', 'ns_1@cb-example-0006.cb-example.default.svc', 'ns_1@cb-example-0007.cb-example.default.svc', 'ns_1@cb-example-0008.cb-example.default.svc', 'ns_1@cb-example-0009.cb-example.default.svc'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@cb-example-0002.cb-example.default.svc'], Delta recovery buckets = all; Operation Id = 215d34246dcb1253b3c641...
      ns_orchestrator 000  ns_1@cb-example-0000.cb-example.default.svc
      

      9:02:40 AM 17 May, 2024

      Node ('ns_1@cb-example-0002.cb-example.default.svc') was automatically failed over. Reason: The data service did not respond for the duration of the auto-failover threshold. Either none of the buckets have warmed up or there is an issue with the data service.
      auto_failover 000  ns_1@cb-example-0000.cb-example.default.svc
      

      9:02:40 AM 17 May, 2024

      Failover completed successfully. Rebalance Operation Id = ef66aa620825d9e872ac3aaedbd13537
      ns_orchestrator 000  ns_1@cb-example-0000.cb-example.default.svc
      

      9:02:39 AM 17 May, 2024

      Deactivating failed over nodes ['ns_1@cb-example-0002.cb-example.default.svc']
      failover 000  ns_1@cb-example-0000.cb-example.default.svc
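
The key log entries above are listed newest first. To read them as a timeline, a minimal sketch like the following puts them in chronological order and measures the gaps between events. The message strings are shortened summaries of the entries above (an assumption made for readability, not the verbatim log text), and the helper names are illustrative. Parsing the AM marker and the month name assumes an English or C locale.

```python
from datetime import datetime

# (timestamp, shortened summary) pairs, in the same newest-first order
# as the key log entries above.
LOG = [
    ("9:02:52 AM 17 May, 2024", "Rebalance exited: prepare_delta_recovery_failed (bucket0)"),
    ("9:02:47 AM 17 May, 2024", "Starting rebalance with delta recovery of cb-example-0002"),
    ("9:02:40 AM 17 May, 2024", "Node cb-example-0002 was automatically failed over"),
    ("9:02:40 AM 17 May, 2024", "Failover completed successfully"),
    ("9:02:39 AM 17 May, 2024", "Deactivating failed over nodes"),
]


def parse_ts(text):
    """Parse a timestamp in the format used by the log entries above."""
    return datetime.strptime(text, "%I:%M:%S %p %d %B, %Y")


def timeline(entries):
    """Return (datetime, message) pairs oldest first.

    The input is reversed before the stable sort so that entries sharing
    a timestamp keep their original relative (oldest-first) order.
    """
    parsed = [(parse_ts(ts), msg) for ts, msg in reversed(entries)]
    return sorted(parsed, key=lambda entry: entry[0])


if __name__ == "__main__":
    events = timeline(LOG)
    start = events[0][0]
    for when, message in events:
        offset = int((when - start).total_seconds())
        print(f"+{offset:>2}s  {message}")
```

Read this way, the delta-recovery rebalance started 7 seconds after the auto-failover completed and failed 5 seconds after it started, 13 seconds after the first entry.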
      

      CAO logs pre-upgrade: K8S-3486_pre_control_plane.tar.gz
      CAO logs after failover: cbopinfo-20240517T150223+0530.tar.gz

      Attachments

        1. cbopinfo-20240517T150223+0530.tar.gz
          1.56 MB
          Manik Mahajan
        2. K8S-3486_pre_control_plane.tar.gz
          1.51 MB
          Manik Mahajan
        3. screenshot-1.png
          79 kB
          Dave Finlay
        4. screenshot-2.png
          88 kB
          Dave Finlay

            People

              yusuf.ramzan Yusuf Ramzan
              manik.mahajan Manik Mahajan