Couchbase Server: MB-48512

Rebalance failure due to OOM kill on one of the nodes

Details

    • Bug
    • Resolution: Cannot Reproduce
    • Major
    • 7.1.0
    • 7.1.0
    • fts
    • Untriaged
    • 1
    • Unknown

    Description

      Build: 7.1.0-1300
      Test: -test tests/fts/cheshire-cat/test_fts_clusterops_cheshire_cat_coll_crud_freetier.yml -scope tests/fts/cheshire-cat/scope_fts_cheshire_cat_free_tier.yml

      • Cluster with 3 nodes, each running the kv, n1ql, search, and index services
      • Create 1 bucket with 100 scopes and 50 collections in each scope
      • Create 500 indexes: one index (1 partition) on each collection
      • Create 2500 GSI indexes (5 on each collection)
      • Load documents into some of the collections
      • Run the fts query workload
      • Kill fts on 172.23.106.242 and wait for 25 minutes
      • Rebalance in 172.23.107.89 with the fts service
      • Rebalance failed due to an OOM kill of the fts node on 172.23.106.242

      From 172.23.106.253:

      [ns_server:error,2021-09-17T16:12:55.608-07:00,ns_1@172.23.106.253:service_rebalancer-fts<0.26279.44>:service_rebalancer:run_rebalance_worker:130]Agent terminated during the rebalance: {'DOWN',
                                              #Ref<0.2884519677.227278849.177132>,
                                              process,<31346.30956.27>,
                                              {lost_connection,
                                               {'ns_1@172.23.106.242',shutdown}}}
      [ns_server:info,2021-09-17T16:12:55.613-07:00,ns_1@172.23.106.253:rebalance_agent<0.8298.0>:rebalance_agent:handle_down:290]Rebalancer process <0.26322.44> died (reason {service_rebalance_failed,fts,
                                                    {agent_died,<31346.30956.27>,
                                                     {lost_connection,
                                                      {'ns_1@172.23.106.242',
                                                       shutdown}}}}).
      [ns_server:error,2021-09-17T16:12:55.613-07:00,ns_1@172.23.106.253:service_agent-fts<0.9178.0>:service_agent:handle_info:281]Rebalancer <0.26279.44> died unexpectedly: {agent_died,<31346.30956.27>,
                                                  {lost_connection,
                                                   {'ns_1@172.23.106.242',
                                                    shutdown}}}
      [user:error,2021-09-17T16:12:55.617-07:00,ns_1@172.23.106.253:<0.9322.0>:ns_orchestrator:log_rebalance_completion:1412]Rebalance exited with reason {service_rebalance_failed,fts,
                                    {agent_died,<31346.30956.27>,
                                     {lost_connection,
                                      {'ns_1@172.23.106.242',shutdown}}}}.
      Rebalance Operation Id = 54061800f5f01c3d63359735e10b16d0
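
The nested Erlang failure term above can be reduced to the three facts that matter for triage: which service failed, which node dropped, and why. The following is a minimal sketch, assuming the log entry has been joined onto one line with whitespace collapsed; the regular expression and the helper name `parse_rebalance_failure` are illustrative and not part of ns_server.

```python
import re

# Entry copied from the ns_server log on 172.23.106.253, joined onto one line.
LOG_LINE = (
    "[user:error,2021-09-17T16:12:55.617-07:00,ns_1@172.23.106.253:<0.9322.0>:"
    "ns_orchestrator:log_rebalance_completion:1412]Rebalance exited with reason "
    "{service_rebalance_failed,fts, {agent_died,<31346.30956.27>, "
    "{lost_connection, {'ns_1@172.23.106.242',shutdown}}}}."
)

# Illustrative pattern: captures the failed service, then the node and reason
# inside the innermost lost_connection tuple.
PATTERN = re.compile(
    r"\{service_rebalance_failed,(?P<service>\w+),"
    r".*?\{lost_connection,\s*\{'(?P<node>[^']+)',\s*(?P<reason>\w+)\}",
    re.DOTALL,
)


def parse_rebalance_failure(text):
    """Return service, node, and reason from a failure entry, or None."""
    match = PATTERN.search(text)
    return match.groupdict() if match else None


failure = parse_rebalance_failure(LOG_LINE)
print(failure)
```

For this entry the extracted fields are the fts service, node ns_1@172.23.106.242, and reason shutdown, which matches the node where fts was killed.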
      

      From 172.23.106.242:

      2021-09-17T16:12:46.954-07:00 [INFO] app_herder: indexing over indexQuota: 4592640000, memUsed: 8862700440, preIndexingMemory: 426816, indexes: 145, waiting: 243
      2021-09-17T16:12:52.080-07:00 [INFO] app_herder: query ended, indexes: 145, waiting: 243
      2021-09-17T16:12:52.181-07:00 [INFO] app_herder: indexing over indexQuota: 4592640000, memUsed: 9077805976, preIndexingMemory: 426816, indexes: 145, waiting: 243
      2021-09-17T16:12:52.184-07:00 [INFO] app_herder: query ended, indexes: 145, waiting: 243
      2021-09-17T16:12:56.813-07:00 [INFO] main: /opt/couchbase/bin/cbft started (v0.6.0/5.5.0)
      2021-09-17T16:12:56.830-07:00 [INFO] main: file descriptor limit current: 200000 max: 200000
      2021-09-17T16:12:56.830-07:00 [INFO]   -authType="cbauth"
      2021-09-17T16:12:56.830-07:00 [INFO]   -bindGrpc="172.23.106.242:9130,0.0.0.0:9130"
      2021-09-17T16:12:56.830-07:00 [INFO]   -bindGrpcSsl="172.23.106.242:19130,0.0.0.0:19130"
      2021-09-17T16:12:56.830-07:00 [INFO]   -bindHttp="172.23.106.242:8094,0.0.0.0:8094"
      2021-09-17T16:12:56.830-07:00 [INFO]   -bindHttps=":18094"
      2021-09-17T16:12:56.830-07:00 [INFO]   -cfgConnect="metakv"
      2021-09-17T16:12:56.830-07:00 [INFO]   -container=""
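
To put the app_herder lines above in perspective, the sketch below computes how far memUsed exceeded indexQuota in each "indexing over indexQuota" entry. It only parses the two lines quoted from 172.23.106.242; the pattern and the helper name `quota_overshoot` are assumptions made for illustration, not part of cbft.

```python
import re

# The two "indexing over indexQuota" entries quoted above.
HERDER_LINES = """\
2021-09-17T16:12:46.954-07:00 [INFO] app_herder: indexing over indexQuota: 4592640000, memUsed: 8862700440, preIndexingMemory: 426816, indexes: 145, waiting: 243
2021-09-17T16:12:52.181-07:00 [INFO] app_herder: indexing over indexQuota: 4592640000, memUsed: 9077805976, preIndexingMemory: 426816, indexes: 145, waiting: 243
"""

# Illustrative pattern: timestamp, then the quota and memUsed byte counts.
PATTERN = re.compile(
    r"^(?P<ts>\S+) .*indexQuota: (?P<quota>\d+), memUsed: (?P<used>\d+)",
    re.MULTILINE,
)


def quota_overshoot(text):
    """Return (timestamp, memUsed, indexQuota, memUsed / indexQuota) per entry."""
    rows = []
    for match in PATTERN.finditer(text):
        quota = int(match["quota"])
        used = int(match["used"])
        rows.append((match["ts"], used, quota, used / quota))
    return rows


rows = quota_overshoot(HERDER_LINES)
for ts, used, quota, ratio in rows:
    print(f"{ts}  memUsed={used}  indexQuota={quota}  ratio={ratio:.2f}x")
```

Both entries show memUsed at nearly twice the quota, and the value is still rising about five seconds before cbft restarts.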
      

      Is there a reason why 172.23.106.242 is overloaded while the other nodes are fine?
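
As a rough aid for correlating the two excerpts, the sketch below orders three timestamps taken from the logs above and prints the gaps between them. It assumes the clocks on 172.23.106.242 and 172.23.106.253 are synchronized, which the logs themselves do not confirm; the event labels and the helper name `gaps_seconds` are illustrative.

```python
from datetime import datetime

# Timestamps copied from the log excerpts above; labels are illustrative.
EVENTS = [
    ("last app_herder line on 172.23.106.242", "2021-09-17T16:12:52.184-07:00"),
    ("lost_connection logged on 172.23.106.253", "2021-09-17T16:12:55.608-07:00"),
    ("cbft started on 172.23.106.242", "2021-09-17T16:12:56.813-07:00"),
]


def gaps_seconds(events):
    """Return the elapsed seconds between each pair of consecutive events."""
    times = [datetime.fromisoformat(ts) for _, ts in events]
    return [(later - earlier).total_seconds()
            for earlier, later in zip(times, times[1:])]


gaps = gaps_seconds(EVENTS)
for (label, _), gap in zip(EVENTS[1:], gaps):
    print(f"+{gap:.3f}s  {label}")
```

Under that clock assumption, the connection loss is logged a few seconds after the last over-quota message, and cbft starts again about a second later.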

      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1631924630/collectinfo-2021-09-18T002351-ns_1%40172.23.106.242.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1631924630/collectinfo-2021-09-18T002351-ns_1%40172.23.106.243.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1631924630/collectinfo-2021-09-18T002351-ns_1%40172.23.106.253.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1631924630/collectinfo-2021-09-18T002351-ns_1%40172.23.107.89.zip

          People

            girish.benakappa Girish Benakappa