
Details

Type: Bug
Resolution: Unresolved
Priority: Critical
Fix Version/s: Morpheus
Affects Version/s: 7.1.4
Component/s: eventing
Labels:
- nimbus
Environment:
Enterprise Edition 7.1.4 build 3632

Triage:
Untriaged
Story Points:
0
Is this a Regression?:
Unknown

Description

Steps to Repro

Run the following volume test from the TAF repository on a Capella GCP cluster:

guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/capella.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.hostedHospital.Murphy.test_rebalance_disk,graceful=True,skip_cleanup=True,num_buckets=1,num_doc_per_collections=125000000,skip_default=True,xdcr_remote_clusters=0,backup_nodes=0,bucket_names=GleamBook,bucket_type=membase,eviction_policy=fullEviction,iterations=1,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,maxttl=10,pc=30,gsi_nodes=3,cbas_nodes=0,fts_nodes=0,kv_nodes=3,n1ql_nodes=3,eventing_nodes=3,mutation_perc=100,key_type=RandomKey,capella_run=true,services=data-query-index-eventing,max_rebl_nodes=27,kv_compute=n2-standard-8,gsi_compute=n2-standard-8,n1ql_compute=n2-standard-8,eventing_compute=n2-standard-8,cbas_compute=n2-standard-8,kv_disk=800,n1ql_disk=50,gsi_disk=800,cbas_disk=800,eventing_disk=700,provider=GCP,region=us-central1,type=PD-SSD,skip_teardown_cleanup=true,wait_timeout=14400,index_timeout=28800,runtype=dedicated,track_failures=False -m rest'
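The test parameters in the command above are passed as one long comma-separated `key=value` string, which makes repeated keys easy to miss. The following is a minimal sketch for inspecting such a string. The helper name `parse_params` is hypothetical, the input is a shortened excerpt of the `-t` argument rather than the full string, and how TAF itself resolves repeated keys is not covered here.

```python
from collections import Counter


def parse_params(param_string):
    """Split 'k1=v1,k2=v2' into a dict and a sorted list of keys that repeat."""
    pairs = [item.split("=", 1) for item in param_string.split(",") if "=" in item]
    counts = Counter(key for key, _ in pairs)
    repeated = sorted(key for key, n in counts.items() if n > 1)
    # dict() keeps the last value seen for a repeated key; TAF may behave differently.
    return dict(pairs), repeated


# Shortened excerpt of the -t parameters from the command above.
excerpt = (
    "graceful=True,skip_cleanup=True,num_buckets=1,rerun=False,"
    "skip_cleanup=True,kv_nodes=3,eventing_nodes=3"
)
params, repeated = parse_params(excerpt)
print(params["eventing_nodes"], repeated)
```

In this excerpt `skip_cleanup` appears twice with the same value, so the repetition is harmless here.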

The cluster configuration is as follows:

+----------------------------------------------------------------------+------+----------+--------+-----------+-----------+---------------------+-------------------+---------------------------------+
| Nodes                                                                | Zone | Services | CPU    | Mem_total | Mem_free  | Swap_mem_used       | Active / Replica  | Version / Config                |
+----------------------------------------------------------------------+------+----------+--------+-----------+-----------+---------------------+-------------------+---------------------------------+
| svc-e-node-010.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | eventing | 0.7724 | 31.35 GiB | 30.67 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
| svc-d-node-002.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | kv       | 0.8870 | 31.35 GiB | 30.67 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
| svc-d-node-001.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | kv       | 0.6311 | 31.35 GiB | 30.69 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
| svc-i-node-009.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | index    | 0.7033 | 31.35 GiB | 30.63 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
| svc-q-node-005.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | n1ql     | 0.4806 | 31.35 GiB | 30.63 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
| svc-q-node-004.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | n1ql     | 0.8567 | 31.35 GiB | 30.64 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
| svc-i-node-007.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | index    | 1.2142 | 31.35 GiB | 30.60 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
| svc-i-node-008.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | index    | 0.9745 | 31.35 GiB | 30.63 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
| svc-q-node-006.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | n1ql     | 0.7349 | 31.35 GiB | 30.64 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
| svc-d-node-003.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | kv       | 0.5935 | 31.35 GiB | 30.57 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
| svc-e-node-012.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | eventing | 1.0358 | 31.35 GiB | 30.67 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
| svc-e-node-011.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | eventing | 0.4844 | 31.35 GiB | 30.68 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
+----------------------------------------------------------------------+------+----------+--------+-----------+-----------+---------------------+-------------------+---------------------------------+

Create and deploy 5 Eventing functions:
- bucket-op
- curl
- n1ql
- sbm
- timers
Data is loaded continuously into the source collection while the functions are deployed.

Observation

All 3 Eventing nodes in the cluster go down repeatedly because of high memory usage and are failed over.

Example log entries:

2023-05-05T08:07:04.520Z, auto_failover:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Node ('ns_1@svc-e-node-030.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com') was automatically failed over. Reason: The cluster manager did not respond for the duration of the auto-failover threshold.

2023-05-05T08:07:20.861Z, menelaus_web_alerts_srv:0:info:message(ns_1@svc-e-node-029.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - CRITICAL: On node svc-e-node-029.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com system memory use is 97.29% of total available memory, above the critical threshold of 95%.

2023-05-05T08:07:42.578Z, failover:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Starting failing over ['ns_1@svc-e-node-028.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com']

2023-05-05T08:07:42.578Z, ns_orchestrator:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Starting failover of nodes ['ns_1@svc-e-node-028.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com']. Operation Id = e175c8640c2db3826d64c190e190622c

2023-05-05T08:07:42.867Z, failover:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Failed over ['ns_1@svc-e-node-028.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com']: ok

2023-05-05T08:07:44.877Z, failover:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Deactivating failed over nodes ['ns_1@svc-e-node-028.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com']

2023-05-05T08:07:45.034Z, ns_orchestrator:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Failover completed successfully.
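To see how the memory alerts line up with the failovers across a longer log, the entries above can be scanned with two regular expressions. This is a minimal sketch that assumes the message wording matches the two sample lines quoted here; the names `MEM_ALERT`, `AUTO_FAILOVER`, and `summarize` are hypothetical and are not part of any Couchbase tooling.

```python
import re

# Patterns derived from the wording of the log entries quoted above (an assumption).
MEM_ALERT = re.compile(
    r"On node (\S+) system memory use is ([\d.]+)% of total available memory"
)
AUTO_FAILOVER = re.compile(r"Node \('ns_1@([^']+)'\) was automatically failed over")


def summarize(lines):
    """Return ({node: last reported memory %}, [auto-failed-over nodes in order])."""
    memory = {}
    failed_over = []
    for line in lines:
        alert = MEM_ALERT.search(line)
        if alert:
            memory[alert.group(1)] = float(alert.group(2))
        failover = AUTO_FAILOVER.search(line)
        if failover:
            failed_over.append(failover.group(1))
    return memory, failed_over


sample = [
    "2023-05-05T08:07:04.520Z, auto_failover:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Node ('ns_1@svc-e-node-030.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com') was automatically failed over. Reason: The cluster manager did not respond for the duration of the auto-failover threshold.",
    "2023-05-05T08:07:20.861Z, menelaus_web_alerts_srv:0:info:message(ns_1@svc-e-node-029.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - CRITICAL: On node svc-e-node-029.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com system memory use is 97.29% of total available memory, above the critical threshold of 95%.",
]
memory, failed_over = summarize(sample)
print(memory)
print(failed_over)
```

The sketch only covers automatic failovers and memory alerts; the orchestrator-driven failover of svc-e-node-028 uses different message text and would need its own pattern.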

NOTE

I have collected and uploaded logs for only 2 of the 3 Eventing nodes, because CP currently does not support log collection for a failed-over node.

Attachments


Eventing nodes are getting failed over continuously on a GCP cluster due to high memory usage
