  1. Couchbase Server
  2. MB-39871

FTS: Rebalance failure : service_rebalance_failed while ejecting nodes - upside_down

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • Cheshire-Cat
    • 7.0.0
    • fts
    • Untriaged
    • 1
    • Unknown

    Description

      Build: 7.0.0-2278

      Test suite: centos-fts_stabletopologyP0
      http://qa.sc.couchbase.com/job/test_suite_executor/222681/console

      Note that this is happening only with upside_down indexes and not with scorch indexes.

      Test:
      fts.stable_topology_fts.StableTopFTS:
      create_simple_default_index,items=1000,cluster=D,F,F,standard_buckets=3,sasl_buckets=3,index_per_bucket=3,GROUP=P0,cluster=D+F,disable_HTP=True,get-cbcollect-info=False,index_type=upside_down,fts_quota=750,GROUP=P0

      Steps in the test:

      • Create a cluster with n1:fts+kv+index+n1ql and n2:fts
      • Create default, sasl_bucket_1, sasl_bucket_2, sasl_bucket_3, standard_bucket_1, standard_bucket_2, standard_bucket_3
      • Create FTS indexes: default_index_1, default_index_2, default_index_3, sasl_bucket_1_index_1, sasl_bucket_1_index_2, sasl_bucket_1_index_3, sasl_bucket_2_index_1, sasl_bucket_2_index_2, sasl_bucket_2_index_3, sasl_bucket_3_index_1, sasl_bucket_3_index_2, sasl_bucket_3_index_3, standard_bucket_1_index_1, standard_bucket_1_index_2, standard_bucket_1_index_3, standard_bucket_2_index_1, standard_bucket_2_index_2, standard_bucket_2_index_3, standard_bucket_3_index_1, standard_bucket_3_index_2, standard_bucket_3_index_3
      • Load all the buckets with 1000 docs and wait for indexing to complete on all the indexes
      • Delete all the indexes one after the other and wait for each index delete to complete
      • Delete all the buckets created and wait for each bucket delete to complete
      • Rebalance the cluster to eject nodes. The rebalance fails with the error below:

        2020-06-10 18:46:03 | INFO | MainProcess | test_thread | [cluster_helper.cleanup_cluster] rebalancing all nodes in order to remove nodes
        2020-06-10 18:46:03 | INFO | MainProcess | test_thread | [rest_client.rebalance] rebalance params : {'knownNodes': 'ns_1@172.23.120.93,ns_1@172.23.120.95', 'ejectedNodes': 'ns_1@172.23.120.93', 'user': 'Administrator', 'password': 'password'}
        2020-06-10 18:46:03 | INFO | MainProcess | test_thread | [rest_client.rebalance] rebalance operation started
        2020-06-10 18:46:03 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 0.00 %
        2020-06-10 18:46:13 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:46:23 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:46:33 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:46:43 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:46:53 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:47:03 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:47:13 | ERROR | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
        2020-06-10 18:47:13 | INFO | MainProcess | test_thread | [rest_client.print_UI_logs] Latest logs from UI on 172.23.120.95:
        2020-06-10 18:47:13 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.120.95', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1591840023463, 'shortText': 'message', 'text': 'Rebalance exited with reason {service_rebalance_failed,fts,\n                              {agent_died,<0.2923.0>,\n                               {linked_process_died,<0.3707.0>,\n                                {timeout,\n                                 {gen_server,call,\n                                  [<0.2973.0>,\n                                   {call,"ServiceAPI.GetTaskList",\n                                    #Fun<json_rpc_connection.0.102434519>},\n                                   60000]}}}}}.\nRebalance Operation Id = 3dea98d7db7f7f69cc8c82ba52151df2', 'serverTime': '2020-06-10T18:47:03.463Z'}
        2020-06-10 18:47:13 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.120.95', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1591839963234, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.120.95'], EjectNodes = ['ns_1@172.23.120.93'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 3dea98d7db7f7f69cc8c82ba52151df2", 'serverTime': '2020-06-10T18:46:03.234Z'}
        2020-06-10 18:47:13 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.120.95', 'type': 'warning', 'code': 102, 'module': 'menelaus_web', 'tstamp': 1591839963227, 'shortText': 'client-side error report', 'text': 'Client-side error-report for user "Administrator" on node \'ns_1@172.23.120.95\':\nUser-Agent:Python-httplib2/0.13.1 (gzip)\nStarting rebalance from test, ejected nodes [\'ns_1@172.23.120.93\']', 'serverTime': '2020-06-10T18:46:03.227Z'}
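
      The progress lines above show the rebalance sitting at 50.00 % until it fails. As a rough illustration only, the sketch below measures how long the reported percentage stayed unchanged. The function name `stalled_seconds` and the regular expression are assumptions made for this example and are not part of the test framework; the sample lines are copied from the log above.

```python
import re
from datetime import datetime

# Matches the test-runner progress lines quoted in this report.
PROGRESS_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \| \w+ \| .*"
    r"rebalance percentage : (?P<pct>\d+(?:\.\d+)?) %"
)

SAMPLE_LOG = """\
2020-06-10 18:46:03 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 0.00 %
2020-06-10 18:46:13 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
2020-06-10 18:46:23 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
2020-06-10 18:46:33 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
2020-06-10 18:46:43 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
2020-06-10 18:46:53 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
2020-06-10 18:47:03 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
"""


def stalled_seconds(log_lines):
    """Return (percentage, seconds) for the longest run of unchanged progress."""
    best = (None, 0.0)
    run_pct, run_start = None, None
    for line in log_lines:
        match = PROGRESS_RE.match(line.strip())
        if not match:
            continue  # skip lines that are not progress reports
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        pct = float(match.group("pct"))
        if pct != run_pct:
            # Progress moved, so a new run starts at this timestamp.
            run_pct, run_start = pct, ts
        elapsed = (ts - run_start).total_seconds()
        if elapsed > best[1]:
            best = (run_pct, elapsed)
    return best


if __name__ == "__main__":
    print(stalled_seconds(SAMPLE_LOG.splitlines()))  # (50.0, 50.0)
```

      For the lines quoted here, the percentage stays at 50.00 % for 50 seconds of polling (18:46:13 to 18:47:03) before the next poll reports the failure.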
        

      Log snippet:

      Starting rebalance from test, ejected nodes ['ns_1@172.23.121.66']
      2020-06-10T19:02:18.533-07:00, ns_orchestrator:0:info:message(ns_1@172.23.121.65) - Starting rebalance, KeepNodes = ['ns_1@172.23.121.65'], EjectNodes = ['ns_1@172.23.121.66'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 29c8525859119169f864efdb41b2b4d0
      2020-06-10T19:03:18.636-07:00, ns_orchestrator:0:critical:message(ns_1@172.23.121.65) - Rebalance exited with reason {service_rebalance_failed,fts,
                                    {agent_died,<0.2128.0>,
                                     {linked_process_died,<0.2675.0>,
                                      {timeout,
                                       {gen_server,call,
                                        [<0.2162.0>,
                                         {call,"ServiceAPI.GetTaskList",
                                          #Fun<json_rpc_connection.0.102434519>},
                                         60000]}}}}}.
      Rebalance Operation Id = 29c8525859119169f864efdb41b2b4d0
      -------------------------------
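
      The two ns_orchestrator entries above are about one minute apart, and the failure reason carries a 60000 ms gen_server call timeout on "ServiceAPI.GetTaskList". The sketch below is illustrative only: it pulls the service name, the RPC name and the timeout out of the reason term and compares the timeout with the gap between the two timestamps. The function name `summarize` and the regular expressions are assumptions for this example; the inputs are copied from the log snippet. Matching numbers suggest, but do not prove, that the rebalance was aborted when that call timed out.

```python
import re
from datetime import datetime

# Timestamps and failure reason copied from the ns_orchestrator entries.
START = "2020-06-10T19:02:18.533-07:00"
FAILED = "2020-06-10T19:03:18.636-07:00"
REASON = """{service_rebalance_failed,fts,
 {agent_died,<0.2128.0>,
  {linked_process_died,<0.2675.0>,
   {timeout,
    {gen_server,call,
     [<0.2162.0>,
      {call,"ServiceAPI.GetTaskList",
       #Fun<json_rpc_connection.0.102434519>},
      60000]}}}}}"""


def summarize(reason):
    """Extract (service, rpc_name, timeout_ms) from the printed reason term."""
    service = re.search(r"\{service_rebalance_failed,(\w+),", reason).group(1)
    rpc_name = re.search(r'\{call,"([^"]+)"', reason).group(1)
    # The timeout is the last element of the gen_server call argument list.
    timeout_ms = int(re.search(r"(\d+)\]", reason).group(1))
    return service, rpc_name, timeout_ms


def elapsed_seconds(start, end):
    """Seconds between two ISO 8601 timestamps with UTC offsets."""
    delta = datetime.fromisoformat(end) - datetime.fromisoformat(start)
    return delta.total_seconds()


if __name__ == "__main__":
    service, rpc_name, timeout_ms = summarize(REASON)
    gap = elapsed_seconds(START, FAILED)
    print(service, rpc_name, timeout_ms)
    print(f"gap between start and failure: {gap:.3f} s")
```

      With these inputs the gap is 60.103 seconds, just over the 60000 ms timeout in the reason term.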
       
       
      per_node_processes('ns_1@172.23.121.65') =
           {<0.25311.13>,
            [{backtrace,
                 [<<"Program counter: 0x00007f86bcb70178 (diag_handler:'-collect_diag_per_node/1-fun-1-'/2 + 136)">>,
                  <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,<<>>,
                  <<"0x00007f86b2676048 Return addr 0x00007f8741d5fca0 (proc_lib:init_p/3 + 288)">>,
                  <<"y(0)     <0.25310.13>">>,<<>>,
                  <<"0x00007f86b2676058 Return addr 0x0000000000942608 (<terminate process normally>)">>,
                  <<"y(0)     Catch 0x00007f8741d5fcc0 (proc_lib:init_p/3 + 320)">>,
                  <<"y(1)     []">>,<<>>]},
             {messages,[]},
             {dictionary,
                 [{'$initial_call',
                      {diag_handler,'-collect_diag_per_node/1-fun-1-',0}},
                  {'$ancestors',[<0.25310.13>]}]},
             {registered_name,[]},
             {status,waiting},
             {initial_call,{proc_lib,init_p,3}},
             {error_handler,error_handler},
             {garbage_collection,
                 [{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                  {min_bin_vheap_size,46422},
                  {min_heap_size,233},
                  {fullsweep_after,512},
                  {minor_gcs,0}]},
             {garbage_collection_info,
                 [{old_heap_block_size,0},
                  {heap_block_size,233},
                  {mbuf_size,0},
                  {recent_size,0},
                  {stack_size,5},
                  {old_heap_size,0},
                  {heap_size,32},
                  {bin_vheap_size,0},
                  {bin_vheap_block_size,46422},
                  {bin_old_vheap_size,0},
                  {bin_old_vheap_block_size,46422}]},
             {links,[<0.25310.13>]},
             {monitors,[{process,<0.290.0>},{process,<0.25310.13>}]},
             {monitored_by,[]},
             {memory,2888},
             {message_queue_len,0},
             {reductions,9},
             {trap_exit,false},
             {current_location,
                 {diag_handler,'-collect_diag_per_node/1-fun-1-',2,
                     [{file,"src/diag_handler.erl"},{line,238}]}}]}
           {<0.25310.13>,
            [{backtrace,
                 [<<"Program counter: 0x00007f873a644ee0 (unknown function)"
      

            People

              abhinav Abhi Dangeti
              girish.benakappa Girish Benakappa