  1. Couchbase Server
  2. MB-33436

FTS intermittent rebalance failure

Details

    • Untriaged
    • Unknown

    Description

      Build : 6.5.0-2647

      The following failure is seen intermittently in one of the tests. After the test completes successfully, the framework cleans up the cluster by deleting the buckets, removing all the nodes, and performing a rebalance. The rebalance that follows removal of the FTS node fails here. This has been seen only very intermittently.

      [ns_server:error,2019-03-18T22:38:09.589-07:00,ns_1@172.23.104.105:service_rebalancer-fts<0.8814.4>:service_rebalancer:run_rebalance:82]Agent terminated during the rebalance: {'DOWN',
                                              #Ref<0.541932241.3854303233.130179>,
                                              process,<22576.24369.1>,
                                              {linked_process_died,<22576.24478.1>,
                                               {timeout,
                                                {gen_server,call,
                                                 [<22576.24448.1>,
                                                  {call,
                                                   "ServiceAPI.GetCurrentTopology",
                                                   #Fun<json_rpc_connection.0.102434519>},
                                                  60000]}}}}
      [ns_server:error,2019-03-18T22:38:09.591-07:00,ns_1@172.23.104.105:service_rebalancer-fts<0.8814.4>:service_agent:process_bad_results:810]Service call unset_rebalancer (service fts) failed on some nodes:
      [{'ns_1@172.23.104.107',nack}]
      [ns_server:warn,2019-03-18T22:38:09.591-07:00,ns_1@172.23.104.105:service_rebalancer-fts<0.8814.4>:service_rebalancer:run_rebalance:91]Failed to unset rebalancer on some nodes:
      {error,{bad_nodes,fts,unset_rebalancer,[{'ns_1@172.23.104.107',nack}]}}
      [user:error,2019-03-18T22:38:09.592-07:00,ns_1@172.23.104.105:<0.2687.0>:ns_orchestrator:do_log_rebalance_completion:1206]Rebalance exited with reason {service_rebalance_failed,fts,
                                    {linked_process_died,<22576.24478.1>,
                                     {timeout,
                                      {gen_server,call,
                                       [<22576.24448.1>,
                                        {call,"ServiceAPI.GetCurrentTopology",
                                         #Fun<json_rpc_connection.0.102434519>},
                                        60000]}}}}. Operation Id = 87198007415e937d5007037fa814171e
      

      Logs attached.
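
When triaging repeated occurrences, it can help to pull the key fields out of the "Rebalance exited with reason" entry quoted above. The following is a minimal sketch, not part of ns_server or testrunner: the function name `summarize_rebalance_failure` and the regular expressions are assumptions based only on the shape of the log excerpt in this ticket, and `SAMPLE` is that excerpt with whitespace collapsed.

```python
import re

# Excerpt of the ns_orchestrator log entry from this ticket (whitespace collapsed).
SAMPLE = (
    "[user:error,2019-03-18T22:38:09.592-07:00,ns_1@172.23.104.105:<0.2687.0>:"
    "ns_orchestrator:do_log_rebalance_completion:1206]Rebalance exited with reason "
    "{service_rebalance_failed,fts, {linked_process_died,<22576.24478.1>, {timeout, "
    "{gen_server,call, [<22576.24448.1>, {call,\"ServiceAPI.GetCurrentTopology\", "
    "#Fun<json_rpc_connection.0.102434519>}, 60000]}}}}. "
    "Operation Id = 87198007415e937d5007037fa814171e"
)


def summarize_rebalance_failure(text):
    """Extract service, RPC name, timeout and operation id from a failure entry.

    Returns None when the text contains no service_rebalance_failed reason.
    The patterns are hypothetical and tuned to the excerpt above only.
    """
    service = re.search(r"service_rebalance_failed,\s*(\w+)", text)
    if service is None:
        return None
    # The RPC name is the quoted string inside the {call, "..."} tuple.
    rpc = re.search(r'\{call,\s*"([^"]+)"', text)
    # The gen_server call timeout is the integer that closes the argument list.
    timeout = re.search(r"\},\s*(\d+)\]", text)
    op_id = re.search(r"Operation Id = ([0-9a-f]+)", text)
    return {
        "service": service.group(1),
        "rpc": rpc.group(1) if rpc else None,
        "timeout_ms": int(timeout.group(1)) if timeout else None,
        "operation_id": op_id.group(1) if op_id else None,
    }


if __name__ == "__main__":
    print(summarize_rebalance_failure(SAMPLE))
```

For the entry in this ticket, the summary identifies the `fts` service, the `ServiceAPI.GetCurrentTopology` call, and the 60000 ms timeout after which the call was abandoned.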

      172.23.104.107 is the node being removed in this step. The buckets were deleted before the rebalance started:
      [2019-03-18 22:37:14,480] - [bucket_helper:143] INFO - deleting existing buckets [u'default', u'sasl_bucket_1', u'sasl_bucket_2', u'sasl_bucket_3', u'standard_bucket_1', u'standard_bucket_2', u'standard_bucket_3'] on 172.23.104.105
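
The cleanup order described in this ticket (delete the buckets, then rebalance the node out) can be sketched as a plan of REST requests. This is an illustration only: it assumes the cleanup maps onto the Couchbase REST endpoints `DELETE /pools/default/buckets/<name>` and `POST /controller/rebalance` (with `knownNodes` and `ejectedNodes`) on port 8091. The helper names are hypothetical and are not the testrunner's own code, and the node list in the usage example is assumed from the addresses that appear in the logs. The sketch only builds the requests and sends nothing.

```python
from urllib.parse import quote, urlencode


def delete_bucket_request(host, bucket, port=8091):
    """Describe (method, url, body) for deleting one bucket."""
    url = f"http://{host}:{port}/pools/default/buckets/{quote(bucket, safe='')}"
    return ("DELETE", url, None)


def rebalance_out_request(host, known_nodes, ejected_nodes, port=8091):
    """Describe (method, url, body) for a rebalance that ejects nodes.

    known_nodes and ejected_nodes are otpNode names such as
    'ns_1@172.23.104.107'; ejected nodes must also be listed as known.
    """
    body = urlencode({
        "knownNodes": ",".join(known_nodes),
        "ejectedNodes": ",".join(ejected_nodes),
    })
    return ("POST", f"http://{host}:{port}/controller/rebalance", body)


def cleanup_plan(orchestrator, buckets, known_nodes, ejected_nodes):
    """Bucket deletions first, then a single rebalance-out request."""
    plan = [delete_bucket_request(orchestrator, b) for b in buckets]
    plan.append(rebalance_out_request(orchestrator, known_nodes, ejected_nodes))
    return plan


if __name__ == "__main__":
    # Node names below are assumed from the log excerpt, for illustration.
    for step in cleanup_plan(
        "172.23.104.105",
        ["default", "sasl_bucket_1"],
        ["ns_1@172.23.104.105", "ns_1@172.23.104.107"],
        ["ns_1@172.23.104.107"],
    ):
        print(step)
```

In the failing run, the failure occurs in the last step of such a plan: the rebalance that ejects the FTS node.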

      For QE Reference :

      *Test* : ./testrunner -i /tmp/testexec.304.ini -p get-cbcollect-info=True,disable_HTP=True,index_type=upside_down,get-logs=False,stop-on-failure=False,fts_quota=750 -t fts.stable_topology_fts.StableTopFTS.create_simple_default_index,items=1000,cluster=D,F,standard_buckets=3,sasl_buckets=3,index_per_bucket=3,update=True,expires=30,memory_only=True,GROUP=P0
      *Job* : centos-fts_mem-only-indexes
      

      Attachments

        1. 172.23.104.222-20190501-1012-diag.zip
          9.57 MB
        2. 172.23.104.223-20190501-1014-diag.zip
          1.13 MB
        3. 172.23.105.201-20190708-1201-diag.zip
          14.81 MB
        4. 172.23.105.202-20190708-1204-diag.zip
          7.12 MB
        5. test_10_lat.zip
          22.51 MB
        6. test_10.zip
          20.50 MB

            People

              girish.benakappa Girish Benakappa
              mihir.kamdar Mihir Kamdar (Inactive)