Description
Couchbase Cluster Description
- Set up the cluster as per the required specifications
- Each node is an m5.4xlarge instance (16 vCPUs and 64GB RAM).
- 6 Data Service nodes and 4 Index and Query Service nodes.
- 10 Buckets (with 1 replica), Full Eviction and Auto-failover set to 5s.
- ~215GB data per bucket → ~1.3TB data loaded onto cluster.
- 50 Primary Indexes with 1 Replica each. (Total 100 Indexes)
We planned to test the upgrade from EKS 1.26 to 1.27, but a failover occurred when the cb-example-0002 pod was evicted. We have collected logs both before and after the failover event.
CB logs before failover - http://supportal.couchbase.com/snapshot/d98f94df31477f6c622956049790e725::0
CB Logs after failover -
https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0000.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0001.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0002.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0003.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0004.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0005.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0006.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0007.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0008.cb-example.default.svc.zip
https://cb-engineering.s3.amazonaws.com/K8s_pod_eviction_during_data_load/collectinfo-2024-05-17T092108-ns_1%40cb-example-0009.cb-example.default.svc.zip
K8 topology before failover
kubectl get pods -o wide
NAME                                          READY  STATUS     RESTARTS       AGE   IP           NODE                                       NOMINATED NODE  READINESS GATES
cb-example-0000                               1/1    Running    0              106m  10.0.4.144   ip-10-0-4-32.us-east-2.compute.internal    <none>          1/1
cb-example-0001                               1/1    Running    0              104m  10.0.11.47   ip-10-0-11-233.us-east-2.compute.internal  <none>          1/1
cb-example-0002                               1/1    Running    0              104m  10.0.7.211   ip-10-0-5-41.us-east-2.compute.internal    <none>          1/1
cb-example-0003                               1/1    Running    0              105m  10.0.12.142  ip-10-0-15-18.us-east-2.compute.internal   <none>          1/1
cb-example-0004                               1/1    Running    0              105m  10.0.13.132  ip-10-0-15-234.us-east-2.compute.internal  <none>          1/1
cb-example-0005                               1/1    Running    0              105m  10.0.10.10   ip-10-0-10-74.us-east-2.compute.internal   <none>          1/1
cb-example-0006                               1/1    Running    0              105m  10.0.9.129   ip-10-0-10-12.us-east-2.compute.internal   <none>          1/1
cb-example-0007                               1/1    Running    0              105m  10.0.15.10   ip-10-0-12-126.us-east-2.compute.internal  <none>          1/1
cb-example-0008                               1/1    Running    0              105m  10.0.7.208   ip-10-0-5-41.us-east-2.compute.internal    <none>          1/1
cb-example-0009                               1/1    Running    0              105m  10.0.7.138   ip-10-0-7-8.us-east-2.compute.internal     <none>          1/1
cbq-bucket0-fp2dj                             0/1    Completed  0              89m   10.0.10.64   ip-10-0-11-233.us-east-2.compute.internal  <none>          <none>
cbq-bucket1-h5r8f                             0/1    Completed  0              89m   10.0.12.16   ip-10-0-12-126.us-east-2.compute.internal  <none>          <none>
cbq-bucket2-rb62l                             0/1    Completed  0              89m   10.0.14.248  ip-10-0-15-18.us-east-2.compute.internal   <none>          <none>
cbq-bucket3-gdfb5                             0/1    Completed  0              89m   10.0.4.79    ip-10-0-7-103.us-east-2.compute.internal   <none>          <none>
cbq-bucket4-dhjs5                             0/1    Completed  0              89m   10.0.11.137  ip-10-0-10-74.us-east-2.compute.internal   <none>          <none>
cbq-bucket5-m2cr6                             0/1    Completed  0              89m   10.0.13.195  ip-10-0-15-234.us-east-2.compute.internal  <none>          <none>
cbq-bucket6-s2g9x                             0/1    Completed  0              89m   10.0.7.51    ip-10-0-4-32.us-east-2.compute.internal    <none>          <none>
cbq-bucket7-z4rhk                             0/1    Completed  0              89m   10.0.7.29    ip-10-0-7-8.us-east-2.compute.internal     <none>          <none>
cbq-bucket8-5t48s                             0/1    Completed  0              89m   10.0.10.86   ip-10-0-10-12.us-east-2.compute.internal   <none>          <none>
cbq-bucket9-nvpxz                             0/1    Completed  0              89m   10.0.9.148   ip-10-0-10-109.us-east-2.compute.internal  <none>          <none>
couchbase-operator-5f949645cc-g9nzp           1/1    Running    21 (117m ago)  4h1m  10.0.5.226   ip-10-0-7-103.us-east-2.compute.internal   <none>          <none>
couchbase-operator-admission-84dcbd656-dtc5p  1/1    Running    0              4h1m  10.0.11.204  ip-10-0-10-109.us-east-2.compute.internal  <none>          <none>
queryapp-bucket0-pkbtc                        1/1    Running    0              85m   10.0.10.187  ip-10-0-10-109.us-east-2.compute.internal  <none>          <none>
queryapp-bucket1-sf5gn                        1/1    Running    0              85m   10.0.11.158  ip-10-0-10-109.us-east-2.compute.internal  <none>          <none>
sirius-deployment-b76bd6b85-jdbsg             1/1    Running    0              3h    10.0.13.69   ip-10-0-14-221.us-east-2.compute.internal  <none>          <none>
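In the listing above, cb-example-0002 and cb-example-0008 were both scheduled on ip-10-0-5-41, so two Couchbase Server pods were sharing one 64GB worker node. Whether this co-location caused the memory pressure is not established here. The sketch below is only a quick way to spot such co-location from the pod-to-node mapping. The `PLACEMENT` dictionary and the `colocated` helper are illustrative names, not part of any Couchbase or Kubernetes tooling.

```python
from collections import defaultdict

# Pod-to-node placement copied from the pre-failover listing.
# The ".us-east-2.compute.internal" suffix is omitted for brevity.
PLACEMENT = {
    "cb-example-0000": "ip-10-0-4-32",
    "cb-example-0001": "ip-10-0-11-233",
    "cb-example-0002": "ip-10-0-5-41",
    "cb-example-0003": "ip-10-0-15-18",
    "cb-example-0004": "ip-10-0-15-234",
    "cb-example-0005": "ip-10-0-10-74",
    "cb-example-0006": "ip-10-0-10-12",
    "cb-example-0007": "ip-10-0-12-126",
    "cb-example-0008": "ip-10-0-5-41",
    "cb-example-0009": "ip-10-0-7-8",
}


def colocated(placement):
    """Return {node: [pods]} for every node that hosts more than one pod."""
    by_node = defaultdict(list)
    for pod, node in placement.items():
        by_node[node].append(pod)
    return {node: sorted(pods) for node, pods in by_node.items() if len(pods) > 1}


print(colocated(PLACEMENT))
```

For the placement above, the only node reported is ip-10-0-5-41, hosting cb-example-0002 and cb-example-0008.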
K8 topology after update
cb-example-0000                               1/1    Running    0              158m   10.0.4.144   ip-10-0-4-32.us-east-2.compute.internal    <none>          1/1
cb-example-0001                               1/1    Running    0              156m   10.0.11.47   ip-10-0-11-233.us-east-2.compute.internal  <none>          1/1
cb-example-0002                               1/1    Running    0              18m    10.0.4.79    ip-10-0-7-103.us-east-2.compute.internal   <none>          0/1
cb-example-0003                               1/1    Running    0              156m   10.0.12.142  ip-10-0-15-18.us-east-2.compute.internal   <none>          1/1
cb-example-0004                               1/1    Running    0              156m   10.0.13.132  ip-10-0-15-234.us-east-2.compute.internal  <none>          1/1
cb-example-0005                               1/1    Running    0              156m   10.0.10.10   ip-10-0-10-74.us-east-2.compute.internal   <none>          1/1
cb-example-0006                               1/1    Running    0              156m   10.0.9.129   ip-10-0-10-12.us-east-2.compute.internal   <none>          1/1
cb-example-0007                               1/1    Running    0              156m   10.0.15.10   ip-10-0-12-126.us-east-2.compute.internal  <none>          1/1
cb-example-0008                               1/1    Running    0              156m   10.0.7.208   ip-10-0-5-41.us-east-2.compute.internal    <none>          1/1
cb-example-0009                               1/1    Running    0              156m   10.0.7.138   ip-10-0-7-8.us-east-2.compute.internal     <none>          1/1
couchbase-operator-5f949645cc-g9nzp           1/1    Running    21 (168m ago)  4h53m  10.0.5.226   ip-10-0-7-103.us-east-2.compute.internal   <none>          <none>
couchbase-operator-admission-84dcbd656-dtc5p  1/1    Running    0              4h53m  10.0.11.204  ip-10-0-10-109.us-east-2.compute.internal  <none>          <none>
queryapp-bucket0-pkbtc                        1/1    Running    0              137m   10.0.10.187  ip-10-0-10-109.us-east-2.compute.internal  <none>          <none>
queryapp-bucket1-sf5gn                        1/1    Running    0              137m   10.0.11.158  ip-10-0-10-109.us-east-2.compute.internal  <none>          <none>
sirius-deployment-b76bd6b85-7g5cr             1/1    Running    0              21m    10.0.6.76    ip-10-0-7-8.us-east-2.compute.internal     <none>          <none>
K8 events after failover:
kubectl get event --field-selector involvedObject.name=cb-example-0002
LAST SEEN  TYPE     REASON                  OBJECT               MESSAGE
14m        Warning  Evicted                 pod/cb-example-0002  The node was low on resource: memory. Container couchbase-server was using 39596260Ki, which exceeds its request of 0.
14m        Normal   Killing                 pod/cb-example-0002  Stopping container couchbase-server
14m        Warning  ExceededGracePeriod     pod/cb-example-0002  Container runtime did not kill the pod within specified grace period.
14m        Normal   Scheduled               pod/cb-example-0002  Successfully assigned default/cb-example-0002 to ip-10-0-7-103.us-east-2.compute.internal
14m        Warning  FailedAttachVolume      pod/cb-example-0002  Multi-Attach error for volume "pvc-45bb00a1-9919-4599-8a9f-4c7cd3635ed8" Volume is already exclusively attached to one node and can't be attached to another
14m        Warning  FailedAttachVolume      pod/cb-example-0002  Multi-Attach error for volume "pvc-141fa984-4eac-446d-8a7c-bc68e5f73033" Volume is already exclusively attached to one node and can't be attached to another
14m        Normal   SuccessfulAttachVolume  pod/cb-example-0002  AttachVolume.Attach succeeded for volume "pvc-141fa984-4eac-446d-8a7c-bc68e5f73033"
13m        Normal   SuccessfulAttachVolume  pod/cb-example-0002  AttachVolume.Attach succeeded for volume "pvc-45bb00a1-9919-4599-8a9f-4c7cd3635ed8"
13m        Normal   Pulling                 pod/cb-example-0002  Pulling image "couchbase/server:7.2.5"
13m        Normal   Pulled                  pod/cb-example-0002  Successfully pulled image "couchbase/server:7.2.5" in 14.785882244s (14.785900344s including waiting)
13m        Normal   Created                 pod/cb-example-0002  Created container couchbase-server-init
13m        Normal   Started                 pod/cb-example-0002  Started container couchbase-server-init
13m        Normal   Pulled                  pod/cb-example-0002  Container image "couchbase/server:7.2.5" already present on machine
13m        Normal   Created                 pod/cb-example-0002  Created container couchbase-server
13m        Normal   Started                 pod/cb-example-0002  Started container couchbase-server
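The Evicted event reports the couchbase-server container's memory usage in Ki and a memory request of 0. With no request set, the pod's usage always exceeds its request, which is one of the criteria the kubelet uses to rank pods for eviction under node memory pressure. As a rough aid for reading the number, the sketch below converts the reported quantity to GiB. `ki_to_gib` is a hypothetical helper written for this ticket and handles only the `Ki` suffix seen in the event.

```python
def ki_to_gib(quantity: str) -> float:
    """Convert a Kubernetes memory quantity with a 'Ki' suffix to GiB."""
    if not quantity.endswith("Ki"):
        raise ValueError(f"expected a 'Ki' quantity, got {quantity!r}")
    kibibytes = int(quantity[:-2])
    return kibibytes / (1024 * 1024)


# Usage reported in the Evicted event for cb-example-0002.
usage_gib = ki_to_gib("39596260Ki")
print(f"couchbase-server was using about {usage_gib:.2f} GiB")
```

This puts the container at roughly 37.76 GiB at eviction time, on an instance type listed in the cluster description as having 64GB RAM.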
Key Logs (newest first) -

9:02:52 AM 17 May, 2024
Rebalance exited with reason {prepare_delta_recovery_failed,"bucket0", {error, {failed_nodes, [{'ns_1@cb-example-0002.cb-example.default.svc', {error,aborted}}]}}}. Rebalance Operation Id = 215d34246dcb1253b3c64186bde4082b
Module code: ns_orchestrator 000, Node: ns_1@cb-example-0000.cb-example.default.svc

9:02:47 AM 17 May, 2024
Starting rebalance, KeepNodes = ['ns_1@cb-example-0000.cb-example.default.svc', 'ns_1@cb-example-0001.cb-example.default.svc', 'ns_1@cb-example-0002.cb-example.default.svc', 'ns_1@cb-example-0003.cb-example.default.svc', 'ns_1@cb-example-0004.cb-example.default.svc', 'ns_1@cb-example-0005.cb-example.default.svc', 'ns_1@cb-example-0006.cb-example.default.svc', 'ns_1@cb-example-0007.cb-example.default.svc', 'ns_1@cb-example-0008.cb-example.default.svc', 'ns_1@cb-example-0009.cb-example.default.svc'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@cb-example-0002.cb-example.default.svc'], Delta recovery buckets = all; Operation Id = 215d34246dcb1253b3c641...
Module code: ns_orchestrator 000, Node: ns_1@cb-example-0000.cb-example.default.svc

9:02:40 AM 17 May, 2024
Node ('ns_1@cb-example-0002.cb-example.default.svc') was automatically failed over. Reason: The data service did not respond for the duration of the auto-failover threshold. Either none of the buckets have warmed up or there is an issue with the data service.
Module code: auto_failover 000, Node: ns_1@cb-example-0000.cb-example.default.svc

9:02:40 AM 17 May, 2024
Failover completed successfully. Rebalance Operation Id = ef66aa620825d9e872ac3aaedbd13537
Module code: ns_orchestrator 000, Node: ns_1@cb-example-0000.cb-example.default.svc

9:02:39 AM 17 May, 2024
Deactivating failed over nodes ['ns_1@cb-example-0002.cb-example.default.svc']
Module code: failover 000, Node: ns_1@cb-example-0000.cb-example.default.svc

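To make the sequence in the key logs easier to read, the sketch below orders the entries chronologically and prints each one as an offset from the first. The timestamps are copied from the log entries; the short labels and the `offsets` helper are illustrative only. Parsing `%p` and `%B` assumes an English locale.

```python
from datetime import datetime

# Timestamp format used by the log entries, e.g. "9:02:39 AM 17 May, 2024".
FMT = "%I:%M:%S %p %d %B, %Y"

# Timestamps copied from the key logs, oldest first; labels are paraphrased.
EVENTS = [
    ("9:02:39 AM 17 May, 2024", "deactivating failed-over node cb-example-0002"),
    ("9:02:40 AM 17 May, 2024", "auto-failover completed"),
    ("9:02:47 AM 17 May, 2024", "delta-recovery rebalance started"),
    ("9:02:52 AM 17 May, 2024", "rebalance exited: prepare_delta_recovery_failed"),
]


def offsets(events):
    """Return (seconds since first event, label) for each event."""
    parsed = [(datetime.strptime(ts, FMT), label) for ts, label in events]
    start = parsed[0][0]
    return [(int((t - start).total_seconds()), label) for t, label in parsed]


for seconds, label in offsets(EVENTS):
    print(f"+{seconds:>2}s  {label}")
```

Read this way, the delta-recovery rebalance started about 7 seconds after the auto-failover completed and aborted about 5 seconds later.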
CAO logs pre upgrade :- K8S-3486_pre_control_plane.tar.gz
CAO logs after failover :- cbopinfo-20240517T150223+0530.tar.gz