Couchbase Server / MB-56778

Eventing nodes are getting failed over continuously on a GCP cluster due to high memory usage

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Critical
    • Fix Version: Morpheus
    • Affects Version: 7.1.4
    • Component: eventing
    • Environment: Enterprise Edition 7.1.4 build 3632
    • Triage: Untriaged
    • 0
    • Unknown

    Description

      Steps to Repro

      1. Run the following volume test from the TAF repository on a Capella GCP cluster:

        guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/capella.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.hostedHospital.Murphy.test_rebalance_disk,graceful=True,skip_cleanup=True,num_buckets=1,num_doc_per_collections=125000000,skip_default=True,xdcr_remote_clusters=0,backup_nodes=0,bucket_names=GleamBook,bucket_type=membase,eviction_policy=fullEviction,iterations=1,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,maxttl=10,pc=30,gsi_nodes=3,cbas_nodes=0,fts_nodes=0,kv_nodes=3,n1ql_nodes=3,eventing_nodes=3,mutation_perc=100,key_type=RandomKey,capella_run=true,services=data-query-index-eventing,max_rebl_nodes=27,kv_compute=n2-standard-8,gsi_compute=n2-standard-8,n1ql_compute=n2-standard-8,eventing_compute=n2-standard-8,cbas_compute=n2-standard-8,kv_disk=800,n1ql_disk=50,gsi_disk=800,cbas_disk=800,eventing_disk=700,provider=GCP,region=us-central1,type=PD-SSD,skip_teardown_cleanup=true,wait_timeout=14400,index_timeout=28800,runtype=dedicated,track_failures=False -m rest'
        

      2. The cluster configuration is as follows:

        +----------------------------------------------------------------------+------+----------+--------+-----------+-----------+---------------------+-------------------+---------------------------------+
        | Nodes                                                                | Zone | Services | CPU    | Mem_total | Mem_free  | Swap_mem_used       | Active / Replica  | Version / Config                |
        +----------------------------------------------------------------------+------+----------+--------+-----------+-----------+---------------------+-------------------+---------------------------------+
        | svc-e-node-010.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | eventing | 0.7724 | 31.35 GiB | 30.67 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
        | svc-d-node-002.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | kv       | 0.8870 | 31.35 GiB | 30.67 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
        | svc-d-node-001.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | kv       | 0.6311 | 31.35 GiB | 30.69 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
        | svc-i-node-009.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | index    | 0.7033 | 31.35 GiB | 30.63 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
        | svc-q-node-005.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | n1ql     | 0.4806 | 31.35 GiB | 30.63 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
        | svc-q-node-004.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | n1ql     | 0.8567 | 31.35 GiB | 30.64 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
        | svc-i-node-007.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | index    | 1.2142 | 31.35 GiB | 30.60 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
        | svc-i-node-008.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | index    | 0.9745 | 31.35 GiB | 30.63 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
        | svc-q-node-006.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | n1ql     | 0.7349 | 31.35 GiB | 30.64 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
        | svc-d-node-003.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | kv       | 0.5935 | 31.35 GiB | 30.57 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
        | svc-e-node-012.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | eventing | 1.0358 | 31.35 GiB | 30.67 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
        | svc-e-node-011.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com | None | eventing | 0.4844 | 31.35 GiB | 30.68 GiB | 0.0 Byte / 0.0 Byte | 0 / 0             | 7.1.4-3632-enterprise / default |
        +----------------------------------------------------------------------+------+----------+--------+-----------+-----------+---------------------+-------------------+---------------------------------+
        

      3. Create and deploy 5 Eventing functions:
        • bucket-op
        • curl
        • n1ql
        • sbm
        • timers
      4. Load data continuously into the source collection.

      Observation

      All 3 Eventing nodes in the cluster are repeatedly failed over due to high memory usage.

      Example:

      2023-05-05T08:07:04.520Z, auto_failover:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Node ('ns_1@svc-e-node-030.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com') was automatically failed over. Reason: The cluster manager did not respond for the duration of the auto-failover threshold. 
      2023-05-05T08:07:20.861Z, menelaus_web_alerts_srv:0:info:message(ns_1@svc-e-node-029.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - CRITICAL: On node svc-e-node-029.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com system memory use is 97.29% of total available memory, above the critical threshold of 95%.
      2023-05-05T08:07:42.578Z, failover:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Starting failing over ['ns_1@svc-e-node-028.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com']
      2023-05-05T08:07:42.578Z, ns_orchestrator:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Starting failover of nodes ['ns_1@svc-e-node-028.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com']. Operation Id = e175c8640c2db3826d64c190e190622c
      2023-05-05T08:07:42.867Z, failover:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Failed over ['ns_1@svc-e-node-028.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com']: ok
      2023-05-05T08:07:44.877Z, failover:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Deactivating failed over nodes ['ns_1@svc-e-node-028.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com']
      2023-05-05T08:07:45.034Z, ns_orchestrator:0:info:message(ns_1@svc-d-node-022.lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com) - Failover completed successfully.
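
      To see how often each node is affected, log lines like the ones above can be tallied with a short script. The following is a minimal sketch, assuming the lines keep the shape shown in the excerpt; the function name `summarize` and the regular expressions are illustrative assumptions and are not part of any Couchbase tooling. It counts only messages that say a node "was automatically failed over" and records the highest reported memory-use percentage per node.

```python
import re
from collections import Counter

# Assumed line shape, taken from the excerpt above:
# "<timestamp>, <module>:<code>:<level>:message(<reporting node>) - <text>"
LINE_RE = re.compile(
    r"^\s*(?P<ts>\S+), (?P<module>\w+):\d+:\w+:message\((?P<reporter>[^)]+)\) - (?P<text>.*)$"
)
AUTO_FAILOVER_RE = re.compile(
    r"Node \('ns_1@(?P<node>[^']+)'\) was automatically failed over"
)
MEMORY_ALERT_RE = re.compile(
    r"On node (?P<node>\S+) system memory use is (?P<pct>[\d.]+)% of total available memory"
)


def summarize(lines):
    """Return (auto-failover count per node, peak memory-use percentage per node)."""
    failovers = Counter()
    peak_memory = {}
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that do not follow the assumed shape
        text = match.group("text")
        hit = AUTO_FAILOVER_RE.search(text)
        if hit:
            failovers[hit.group("node")] += 1
        hit = MEMORY_ALERT_RE.search(text)
        if hit:
            node = hit.group("node")
            pct = float(hit.group("pct"))
            peak_memory[node] = max(peak_memory.get(node, 0.0), pct)
    return failovers, peak_memory


# Domain suffix and messages copied from the excerpt above.
DOMAIN = "lvn2q50aakurqem1.sandbox.nonprod-project-avengers.com"
SAMPLE = [
    f"2023-05-05T08:07:04.520Z, auto_failover:0:info:message(ns_1@svc-d-node-022.{DOMAIN}) - "
    f"Node ('ns_1@svc-e-node-030.{DOMAIN}') was automatically failed over. "
    "Reason: The cluster manager did not respond for the duration of the auto-failover threshold.",
    f"2023-05-05T08:07:20.861Z, menelaus_web_alerts_srv:0:info:message(ns_1@svc-e-node-029.{DOMAIN}) - "
    f"CRITICAL: On node svc-e-node-029.{DOMAIN} system memory use is 97.29% of total "
    "available memory, above the critical threshold of 95%.",
    f"2023-05-05T08:07:45.034Z, ns_orchestrator:0:info:message(ns_1@svc-d-node-022.{DOMAIN}) - "
    "Failover completed successfully.",
]

if __name__ == "__main__":
    failovers, peak_memory = summarize(SAMPLE)
    for node, count in failovers.items():
        print(f"auto-failover: {node} x{count}")
    for node, pct in peak_memory.items():
        print(f"peak memory:   {node} {pct}%")
```

      Failovers logged by the `failover` and `ns_orchestrator` modules (such as the one for svc-e-node-028 above) are deliberately not counted, because those messages do not state whether the failover was automatic.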
      

      NOTE

      I have collected and uploaded logs for only 2 of the 3 Eventing nodes, because CP currently does not support log collection from a failed-over node.

       


          People

            ankit.prabhu Ankit Prabhu
            sujay.gad Sujay Gad