Details
Description
Steps to Repro
- Run the following volume test from the TAF repository on a Capella GCP cluster -
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/capella.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.hostedHospital.Murphy.test_rebalance_disk,graceful=True,skip_cleanup=True,num_buckets=1,num_doc_per_collections=125000000,skip_default=True,xdcr_remote_clusters=0,backup_nodes=0,bucket_names=GleamBook,bucket_type=membase,eviction_policy=fullEviction,iterations=1,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,maxttl=10,pc=30,gsi_nodes=3,cbas_nodes=0,fts_nodes=0,kv_nodes=3,n1ql_nodes=3,eventing_nodes=3,mutation_perc=100,key_type=RandomKey,capella_run=true,services=data-query-index-eventing,max_rebl_nodes=27,kv_compute=n2-standard-8,gsi_compute=n2-standard-8,n1ql_compute=n2-standard-8,eventing_compute=n2-standard-8,cbas_compute=n2-standard-8,kv_disk=800,n1ql_disk=50,gsi_disk=800,cbas_disk=800,eventing_disk=700,provider=GCP,region=us-central1,type=PD-SSD,skip_teardown_cleanup=true,wait_timeout=14400,index_timeout=28800,runtype=dedicated,track_failures=False -m rest'
- The cluster configuration is as follows -
+----------------------------------------------------------------------+------+----------+--------+-----------+-----------+---------------------+-------------------+---------------------------------+
| Nodes | Zone | Services | CPU | Mem_total | Mem_free | Swap_mem_used | Active / Replica | Version / Config |
+----------------------------------------------------------------------+------+----------+--------+-----------+-----------+---------------------+-------------------+---------------------------------+
| svc-e-node-010.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | eventing | 0.7724 | 31.35 GiB | 30.67 GiB | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.4-3632-enterprise / default |
| svc-d-node-002.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | kv | 0.8870 | 31.35 GiB | 30.67 GiB | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.4-3632-enterprise / default |
| svc-d-node-001.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | kv | 0.6311 | 31.35 GiB | 30.69 GiB | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.4-3632-enterprise / default |
| svc-i-node-009.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | index | 0.7033 | 31.35 GiB | 30.63 GiB | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.4-3632-enterprise / default |
| svc-q-node-005.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | n1ql | 0.4806 | 31.35 GiB | 30.63 GiB | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.4-3632-enterprise / default |
| svc-q-node-004.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | n1ql | 0.8567 | 31.35 GiB | 30.64 GiB | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.4-3632-enterprise / default |
| svc-i-node-007.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | index | 1.2142 | 31.35 GiB | 30.60 GiB | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.4-3632-enterprise / default |
| svc-i-node-008.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | index | 0.9745 | 31.35 GiB | 30.63 GiB | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.4-3632-enterprise / default |
| svc-q-node-006.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | n1ql | 0.7349 | 31.35 GiB | 30.64 GiB | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.4-3632-enterprise / default |
| svc-d-node-003.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | kv | 0.5935 | 31.35 GiB | 30.57 GiB | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.4-3632-enterprise / default |
| svc-e-node-012.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | eventing | 1.0358 | 31.35 GiB | 30.67 GiB | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.4-3632-enterprise / default |
| svc-e-node-011.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | eventing | 0.4844 | 31.35 GiB | 30.68 GiB | 0.0 Byte / 0.0 Byte | 0 / 0 | 7.1.4-3632-enterprise / default |
+----------------------------------------------------------------------+------+----------+--------+-----------+-----------+---------------------+-------------------+---------------------------------+
- Create and deploy 5 Eventing functions -
- bucket-op
- curl
- n1ql
- sbm
- timers
- Data loading runs continuously on the source collection.
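
The `-t` argument in the command above packs the test name and all test parameters into one comma-separated string, which makes it hard to see the Eventing-related settings or to spot repeated keys. A minimal sketch for splitting it into a readable mapping follows. `parse_test_spec` is a hypothetical helper written for this ticket, not TAF's own parser, and the `spec` value is a shortened excerpt of the real argument.

```python
def parse_test_spec(spec):
    """Split 'module.Class.test,key=value,...' into (test_name, params, duplicates).

    Hypothetical helper for reading the -t argument; not part of TAF.
    """
    test_name, *pairs = spec.split(",")
    params = {}
    duplicates = []
    for pair in pairs:
        key, _, value = pair.partition("=")
        if key in params:
            duplicates.append(key)  # remember keys given more than once
        params[key] = value         # the last occurrence wins in this sketch
    return test_name, params, duplicates


# Shortened excerpt of the -t argument from the repro command.
spec = (
    "aGoodDoctor.hostedHospital.Murphy.test_rebalance_disk,"
    "graceful=True,skip_cleanup=True,num_buckets=1,"
    "kv_nodes=3,gsi_nodes=3,n1ql_nodes=3,eventing_nodes=3,"
    "eventing_compute=n2-standard-8,eventing_disk=700,"
    "provider=GCP,skip_cleanup=True"
)

test_name, params, duplicates = parse_test_spec(spec)
for key in ("eventing_nodes", "eventing_compute", "eventing_disk"):
    print(key, "=", params[key])
print("repeated keys:", duplicates)
```

Values are kept as strings on purpose, since how TAF converts them (for example `True` or `700`) is not shown in this ticket.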
Observation
All 3 Eventing nodes in the cluster repeatedly go down due to high memory usage.
Example -
2023-05-05T08:07:04.520Z, auto_failover:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Node ('ns_1@svc-e-node-030.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com') was automatically failed over. Reason: The cluster manager did not respond for the duration of the auto-failover threshold.
2023-05-05T08:07:20.861Z, menelaus_web_alerts_srv:0:info:message(ns_1@svc-e-node-029.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - CRITICAL: On node svc-e-node-029.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com system memory use is 97.29% of total available memory, above the critical threshold of 95%.
2023-05-05T08:07:42.578Z, failover:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Starting failing over ['ns_1@svc-e-node-028.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com']
2023-05-05T08:07:42.578Z, ns_orchestrator:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Starting failover of nodes ['ns_1@svc-e-node-028.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com']. Operation Id = e175c8640c2db3826d64c190e190622c
2023-05-05T08:07:42.867Z, failover:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Failed over ['ns_1@svc-e-node-028.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com']: ok
2023-05-05T08:07:44.877Z, failover:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Deactivating failed over nodes ['ns_1@svc-e-node-028.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com']
2023-05-05T08:07:45.034Z, ns_orchestrator:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Failover completed successfully.
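
To check how often this pattern repeats over a longer run, the cluster event lines above can be reduced to a list of failed-over nodes and the memory percentage reported in each critical alert. The sketch below is one possible way to do that. The regular expressions are assumptions inferred only from the sample lines in this ticket, `summarize` is a hypothetical helper, and the hostnames in `sample` are shortened stand-ins for the real ones.

```python
import re

# Assumed layout, inferred from the sample lines:
# <timestamp>, <module>:<code>:<level>:message(<reporting node>) - <text>
LINE_RE = re.compile(
    r"^(?P<ts>\S+), (?P<module>\w+):\d+:(?P<level>\w+):"
    r"message\((?P<reporter>[^)]+)\) - (?P<text>.*)$"
)
NODE_RE = re.compile(r"'ns_1@([^']+)'")
MEM_RE = re.compile(r"On node (\S+) system memory use is ([\d.]+)% of total")


def summarize(lines):
    """Return (failed_over_nodes, memory_alerts) from cluster event lines."""
    failed_over = []    # nodes in the order their failover was reported
    memory_alerts = {}  # node -> last reported memory use in percent
    for line in lines:
        # Tolerate a trailing table-border '|' if the line was copied from a table.
        match = LINE_RE.match(line.strip().rstrip("|").rstrip())
        if not match:
            continue
        text = match["text"]
        if "was automatically failed over" in text or text.startswith("Failed over"):
            failed_over.extend(NODE_RE.findall(text))
        mem = MEM_RE.search(text)
        if mem:
            memory_alerts[mem.group(1)] = float(mem.group(2))
    return failed_over, memory_alerts


# Shortened stand-ins for the log lines in this ticket.
sample = [
    "2023-05-05T08:07:04.520Z, auto_failover:0:info:message(ns_1@svc-d-node-022.example) - "
    "Node ('ns_1@svc-e-node-030.example') was automatically failed over. Reason: ...",
    "2023-05-05T08:07:20.861Z, menelaus_web_alerts_srv:0:info:message(ns_1@svc-e-node-029.example) - "
    "CRITICAL: On node svc-e-node-029.example system memory use is 97.29% of total available memory, "
    "above the critical threshold of 95%.",
    "2023-05-05T08:07:42.578Z, failover:0:info:message(ns_1@svc-d-node-022.example) - "
    "Starting failing over ['ns_1@svc-e-node-028.example']",
    "2023-05-05T08:07:42.867Z, failover:0:info:message(ns_1@svc-d-node-022.example) - "
    "Failed over ['ns_1@svc-e-node-028.example']: ok",
]

failed_over, memory_alerts = summarize(sample)
print("failed over:", failed_over)
print("memory alerts:", memory_alerts)
```

Only the "was automatically failed over" and "Failed over [...]" messages are counted, so the "Starting" and "Deactivating" lines for the same node do not produce duplicate entries.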
NOTE
I have collected and uploaded logs for 2 of the 3 Eventing nodes, because CP currently does not support log collection for a failed-over node.