Couchbase Server: MB-39871

FTS: Rebalance failure : service_rebalance_failed while ejecting nodes - upside_down


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Cheshire-Cat
    • 7.0.0
    • Component: fts
    • Triage: Untriaged
    • 1
    • Unknown

    Description

      Build: 7.0.0-2278

      Test suite: centos-fts_stabletopologyP0
      http://qa.sc.couchbase.com/job/test_suite_executor/222681/console

      Note that this is happening only with upside_down indexes and not with scorch indexes.

      Test:
      fts.stable_topology_fts.StableTopFTS:
      create_simple_default_index,items=1000,cluster=D,F,F,standard_buckets=3,sasl_buckets=3,index_per_bucket=3,GROUP=P0,cluster=D+F,disable_HTP=True,get-cbcollect-info=False,index_type=upside_down,fts_quota=750,GROUP=P0

      Steps in the test:

      • Create a cluster with n1:fts+kv+index+n1ql and n2:fts
      • Create default, sasl_bucket_1, sasl_bucket_2, sasl_bucket_3, standard_bucket_1, standard_bucket_2, standard_bucket_3
      • Create fts indexes: default_index_1, default_index_2, default_index_3, sasl_bucket_1_index_1, sasl_bucket_1_index_2, sasl_bucket_1_index_3, sasl_bucket_2_index_1, sasl_bucket_2_index_2, sasl_bucket_2_index_3, sasl_bucket_3_index_1, sasl_bucket_3_index_2, sasl_bucket_3_index_3, standard_bucket_1_index_1, standard_bucket_1_index_2, standard_bucket_1_index_3, standard_bucket_2_index_1, standard_bucket_2_index_2, standard_bucket_2_index_3, standard_bucket_3_index_1, standard_bucket_3_index_2, standard_bucket_3_index_3
      • Load all the buckets with 1000 docs and wait for all the indexes to complete
      • Delete all the indexes one after the other and wait for each index delete to complete
      • Delete all the buckets created and wait for each bucket delete to complete
      • Rebalance all nodes in order to remove nodes; we see the error below (a sketch of the corresponding REST calls follows the log excerpt):

        2020-06-10 18:46:03 | INFO | MainProcess | test_thread | [cluster_helper.cleanup_cluster] rebalancing all nodes in order to remove nodes
        2020-06-10 18:46:03 | INFO | MainProcess | test_thread | [rest_client.rebalance] rebalance params : {'knownNodes': 'ns_1@172.23.120.93,ns_1@172.23.120.95', 'ejectedNodes': 'ns_1@172.23.120.93', 'user': 'Administrator', 'password': 'password'}
        2020-06-10 18:46:03 | INFO | MainProcess | test_thread | [rest_client.rebalance] rebalance operation started
        2020-06-10 18:46:03 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 0.00 %
        2020-06-10 18:46:13 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:46:23 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:46:33 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:46:43 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:46:53 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:47:03 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:47:13 | ERROR | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
        2020-06-10 18:47:13 | INFO | MainProcess | test_thread | [rest_client.print_UI_logs] Latest logs from UI on 172.23.120.95:
        2020-06-10 18:47:13 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.120.95', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1591840023463, 'shortText': 'message', 'text': 'Rebalance exited with reason {service_rebalance_failed,fts,\n                              {agent_died,<0.2923.0>,\n                               {linked_process_died,<0.3707.0>,\n                                {timeout,\n                                 {gen_server,call,\n                                  [<0.2973.0>,\n                                   {call,"ServiceAPI.GetTaskList",\n                                    #Fun<json_rpc_connection.0.102434519>},\n                                   60000]}}}}}.\nRebalance Operation Id = 3dea98d7db7f7f69cc8c82ba52151df2', 'serverTime': '2020-06-10T18:47:03.463Z'}
        2020-06-10 18:47:13 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.120.95', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1591839963234, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.120.95'], EjectNodes = ['ns_1@172.23.120.93'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 3dea98d7db7f7f69cc8c82ba52151df2", 'serverTime': '2020-06-10T18:46:03.234Z'}
        2020-06-10 18:47:13 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.120.95', 'type': 'warning', 'code': 102, 'module': 'menelaus_web', 'tstamp': 1591839963227, 'shortText': 'client-side error report', 'text': 'Client-side error-report for user "Administrator" on node \'ns_1@172.23.120.95\':\nUser-Agent:Python-httplib2/0.13.1 (gzip)\nStarting rebalance from test, ejected nodes [\'ns_1@172.23.120.93\']', 'serverTime': '2020-06-10T18:46:03.227Z'}
        
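      For reference, the sequence the test automates (define an upside_down FTS index, then rebalance the FTS node out) maps onto REST calls that are partly visible in the log above. Below is a minimal Go sketch of those calls, not the test's actual code: the hosts, the Administrator/password credentials and the 8091/8094 ports are taken from the logged rebalance params and the usual defaults, the index definition is trimmed to the fields relevant here, and the accepted sourceType value can vary by release.

        package main

        import (
            "bytes"
            "fmt"
            "net/http"
            "net/url"
            "strings"
        )

        const (
            user    = "Administrator"             // credentials as seen in the rebalance params above
            pass    = "password"
            ftsBase = "http://172.23.120.95:8094" // FTS (cbft) REST port, assumed default
            nsBase  = "http://172.23.120.95:8091" // ns_server REST port, assumed default
        )

        // do sends a request with basic auth and treats any non-2xx reply as an error.
        func do(req *http.Request) error {
            req.SetBasicAuth(user, pass)
            resp, err := http.DefaultClient.Do(req)
            if err != nil {
                return err
            }
            defer resp.Body.Close()
            if resp.StatusCode >= 300 {
                return fmt.Errorf("%s %s: %s", req.Method, req.URL, resp.Status)
            }
            return nil
        }

        // createUpsideDownIndex defines a fulltext-index on the given bucket with the
        // upside_down store type, the variant this ticket is about.
        func createUpsideDownIndex(name, bucket string) error {
            body := fmt.Sprintf(`{
                "type": "fulltext-index",
                "name": %q,
                "sourceType": "couchbase",
                "sourceName": %q,
                "params": {"store": {"indexType": "upside_down"}}
            }`, name, bucket)
            req, _ := http.NewRequest("PUT", ftsBase+"/api/index/"+name,
                bytes.NewBufferString(body))
            req.Header.Set("Content-Type", "application/json")
            return do(req)
        }

        // ejectNode starts the same rebalance the test triggers, mirroring the
        // knownNodes/ejectedNodes params in the rest_client.rebalance line above.
        func ejectNode(known, ejected []string) error {
            form := url.Values{
                "knownNodes":   {strings.Join(known, ",")},
                "ejectedNodes": {strings.Join(ejected, ",")},
            }
            req, _ := http.NewRequest("POST", nsBase+"/controller/rebalance",
                strings.NewReader(form.Encode()))
            req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
            return do(req)
        }

        func main() {
            if err := createUpsideDownIndex("default_index_1", "default"); err != nil {
                panic(err)
            }
            if err := ejectNode(
                []string{"ns_1@172.23.120.95", "ns_1@172.23.120.93"},
                []string{"ns_1@172.23.120.93"},
            ); err != nil {
                panic(err)
            }
            // The test then polls GET /pools/default/rebalanceProgress until the
            // rebalance completes or, as in this report, ns_server marks it failed.
        }
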

      Log snippet:

      Starting rebalance from test, ejected nodes ['ns_1@172.23.121.66']
      2020-06-10T19:02:18.533-07:00, ns_orchestrator:0:info:message(ns_1@172.23.121.65) - Starting rebalance, KeepNodes = ['ns_1@172.23.121.65'], EjectNodes = ['ns_1@172.23.121.66'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 29c8525859119169f864efdb41b2b4d0
      2020-06-10T19:03:18.636-07:00, ns_orchestrator:0:critical:message(ns_1@172.23.121.65) - Rebalance exited with reason {service_rebalance_failed,fts,
                                    {agent_died,<0.2128.0>,
                                     {linked_process_died,<0.2675.0>,
                                      {timeout,
                                       {gen_server,call,
                                        [<0.2162.0>,
                                         {call,"ServiceAPI.GetTaskList",
                                          #Fun<json_rpc_connection.0.102434519>},
                                         60000]}}}}}.
      Rebalance Operation Id = 29c8525859119169f864efdb41b2b4d0
      -------------------------------
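      The exit reason above is ns_server giving up on its json-rpc call into the fts service: the orchestrator's service agent issues a gen_server:call for "ServiceAPI.GetTaskList" with a 60000 ms timeout, and the service did not answer within that window. On the Go side this call lands in the service-manager interface that cbgt implements for ns_server (github.com/couchbase/cbauth/service). The sketch below is not cbgt's implementation; the type and field names are hypothetical. It only illustrates why a GetTaskList long-poll that gets stuck behind other work (for example the concurrent index deletions discussed in the comments) for more than ~60 s surfaces as exactly this service_rebalance_failed/timeout.

        package sketch

        import (
            "sync"
            "time"

            "github.com/couchbase/cbauth/service"
        )

        // taskManager is a hypothetical stand-in for the object a service registers
        // with ns_server; only the GetTaskList path matters for this ticket.
        type taskManager struct {
            mu      sync.Mutex // in the real service, shared with other control work
            tasks   *service.TaskList
            changed chan struct{} // closed whenever the task list is rebuilt
        }

        // GetTaskList is a long-poll: ns_server passes the revision it already holds
        // (haveTasksRev in the ctl/manager log lines) and expects an answer as soon
        // as the task list changes, or immediately if rev is nil. ns_server's side
        // of the RPC gives up after 60000 ms, so anything that keeps this method
        // from returning in time produces the
        // {timeout,{gen_server,call,[...,"ServiceAPI.GetTaskList",...]}} above.
        func (m *taskManager) GetTaskList(rev service.Revision,
            cancel service.Cancel) (*service.TaskList, error) {

            m.mu.Lock() // if this blocks for more than 60 s, the rebalance fails
            cur := m.tasks
            ch := m.changed
            m.mu.Unlock()

            if rev == nil || string(rev) != string(cur.Rev) {
                return cur, nil // caller is out of date: answer right away
            }
            select {
            case <-ch: // the task list changed since the caller's revision
            case <-cancel: // ns_server abandoned the call
            case <-time.After(50 * time.Second): // answer before the 60 s RPC timeout
            }
            m.mu.Lock()
            defer m.mu.Unlock()
            return m.tasks, nil
        }
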
       
       
      per_node_processes('ns_1@172.23.121.65') =
           {<0.25311.13>,
            [{backtrace,
                 [<<"Program counter: 0x00007f86bcb70178 (diag_handler:'-collect_diag_per_node/1-fun-1-'/2 + 136)">>,
                  <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,<<>>,
                  <<"0x00007f86b2676048 Return addr 0x00007f8741d5fca0 (proc_lib:init_p/3 + 288)">>,
                  <<"y(0)     <0.25310.13>">>,<<>>,
                  <<"0x00007f86b2676058 Return addr 0x0000000000942608 (<terminate process normally>)">>,
                  <<"y(0)     Catch 0x00007f8741d5fcc0 (proc_lib:init_p/3 + 320)">>,
                  <<"y(1)     []">>,<<>>]},
             {messages,[]},
             {dictionary,
                 [{'$initial_call',
                      {diag_handler,'-collect_diag_per_node/1-fun-1-',0}},
                  {'$ancestors',[<0.25310.13>]}]},
             {registered_name,[]},
             {status,waiting},
             {initial_call,{proc_lib,init_p,3}},
             {error_handler,error_handler},
             {garbage_collection,
                 [{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                  {min_bin_vheap_size,46422},
                  {min_heap_size,233},
                  {fullsweep_after,512},
                  {minor_gcs,0}]},
             {garbage_collection_info,
                 [{old_heap_block_size,0},
                  {heap_block_size,233},
                  {mbuf_size,0},
                  {recent_size,0},
                  {stack_size,5},
                  {old_heap_size,0},
                  {heap_size,32},
                  {bin_vheap_size,0},
                  {bin_vheap_block_size,46422},
                  {bin_old_vheap_size,0},
                  {bin_old_vheap_block_size,46422}]},
             {links,[<0.25310.13>]},
             {monitors,[{process,<0.290.0>},{process,<0.25310.13>}]},
             {monitored_by,[]},
             {memory,2888},
             {message_queue_len,0},
             {reductions,9},
             {trap_exit,false},
             {current_location,
                 {diag_handler,'-collect_diag_per_node/1-fun-1-',2,
                     [{file,"src/diag_handler.erl"},{line,238}]}}]}
           {<0.25310.13>,
            [{backtrace,
                 [<<"Program counter: 0x00007f873a644ee0 (unknown function)"
      


          Activity

            Sreekanth Sivasankaran added a comment (edited)

            But in parallel, index definition deletions, triggered by the bucket deletions, are also happening on node 66.

            2020-06-10T19:02:42.694-07:00 [INFO] manager_api: starting index definition deletion, indexType: fulltext-index, indexName: standard_bucket_2_index_1, indexUUID: 2475fd24b05e94a6
             
            2020-06-10T19:02:42.694-07:00 [INFO] manager_api: starting index definition deletion, indexType: fulltext-index, indexName: standard_bucket_2_index_3, indexUUID: 517bc0a49b081398
             
            2020-06-10T19:02:42.720-07:00 [INFO] manager_api: starting index definition deletion, indexType: fulltext-index, indexName: standard_bucket_3_index_2, indexUUID: 6394b7b567d8d8f4 
            
            

             

            The rebalance under scrutiny started here:

            2020-06-10T19:02:18.635-07:00 [INFO] ctl/manager: StartTopologyChange, change: {f4672824563c60c3ba8f02fe628856c0 [] topology-change-rebalance [{{3f6df0a48ee2ddc325a509990c0c290a 0 <nil>} recovery-full}] [{f02eabbbd7a602ab9ce7aeeafae37042 0 <nil>}]} 
            
            

             

            And it takes a long time to actually start the rebalance here; by that point the rebalance had already been reported as failed.

            2020-06-10T19:04:15.526-07:00 [INFO] ctl: getMovingPartitionsCount: 0
            2020-06-10T19:04:15.526-07:00 [INFO] ctl/manager: StartTopologyChange, started
            2020-06-10T19:04:15.526-07:00 [INFO] ctl/manager: GetTaskList, haveTasksRev: 7, changed, rv: &{Rev:[57] Tasks:[{Rev:[56] ID:rebalance:f4672824563c60c3ba8f02fe628856c0 Type:task-rebalance Status:task-running IsCancelable:true Progress:0 DetailedProgress:map[] Description:topology change ErrorMessage: Extra:map[topologyChange:{ID:f4672824563c60c3ba8f02fe628856c0 CurrentTopologyRev:[] Type:topology-change-rebalance KeepNodes:[{NodeInfo:{NodeID:3f6df0a48ee2ddc325a509990c0c290a Priority:0 Opaque:<nil>}  
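            (For readers less familiar with the service API: the rv printed above is the Go task list that GetTaskList hands back to ns_server. A hedged sketch of an equivalent value, using the github.com/couchbase/cbauth/service types and only the fields visible in that log line, would look roughly like this.)

            package sketch

            import "github.com/couchbase/cbauth/service"

            // rebalanceTaskList mirrors the GetTaskList reply logged by ctl/manager
            // above: one running, cancelable task-rebalance entry for topology change
            // f4672824563c60c3ba8f02fe628856c0, with no progress reported yet.
            func rebalanceTaskList() *service.TaskList {
                return &service.TaskList{
                    Rev: service.Revision([]byte{57}),
                    Tasks: []service.Task{{
                        Rev:          service.Revision([]byte{56}),
                        ID:           "rebalance:f4672824563c60c3ba8f02fe628856c0",
                        Type:         service.TaskTypeRebalance, // "task-rebalance"
                        Status:       service.TaskStatusRunning, // "task-running"
                        IsCancelable: true,
                        Progress:     0,
                        Description:  "topology change",
                    }},
                }
            }
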

            In between, we can see a lot of metakv updates coming in to node 65.

             

            One quick thought: perhaps cbgt is getting blocked on the feed/agent wait that Abhinav Dangeti mentioned earlier. If so, an early-exit fix there ought to make things settle and work fine here, since node 66's logs show that eventually there aren't any pindexes/feeds/indexes left to clean up. (A sketch of that early-exit shape follows below.)
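            To make the proposed shape concrete, an early exit in the wait path would amount to something like the hypothetical helper below. The names are not cbgt's; it only shows the idea of returning immediately when nothing is left to tear down instead of sitting in a wait loop until ns_server's 60 s ServiceAPI timeout fires.

            package sketch

            import "time"

            // waitForTeardown is a hypothetical helper: pending reports how many
            // feeds/pindexes still need to be cleaned up on the node being ejected.
            // If the bucket deletions have already removed everything (as the node 66
            // logs suggest), it returns true immediately rather than blocking the
            // topology-change path.
            func waitForTeardown(pending func() int, poll, budget time.Duration) bool {
                deadline := time.Now().Add(budget)
                for {
                    if pending() == 0 {
                        return true // early exit: nothing left to clean up
                    }
                    if time.Now().After(deadline) {
                        return false // give up within budget rather than stalling the RPC
                    }
                    time.Sleep(poll)
                }
            }
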

            Girish Benakappa, can we test this after the gocb issue mentioned above is addressed?

             

            Will check this later. Switching priority here to the file leak CBSE.


            Couchbase Build Team added a comment

            Build couchbase-server-7.0.0-2451 contains cbgt commit d6542cf with commit message:
            MB-39871: Add index and source names to logs on failing to fetch stats
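            (The commit referenced above only improves diagnosability: when cbgt fails to fetch stats, the log line now identifies which index and which source bucket the failure was for. The general shape of such a change, with hypothetical function names, is roughly the following.)

            package sketch

            import (
                "errors"
                "log"
            )

            // fetchStats stands in for whatever call actually retrieves the
            // source/bucket stats; it is not cbgt's real function.
            func fetchStats(indexName, sourceName string) (map[string]uint64, error) {
                return nil, errors.New("connection refused")
            }

            // reportStats logs stats for one index, naming the index and source on
            // failure so errors can be tied to a specific index/bucket.
            func reportStats(indexName, sourceName string) {
                stats, err := fetchStats(indexName, sourceName)
                if err != nil {
                    log.Printf("stats: failed to fetch stats, index: %s, source: %s, err: %v",
                        indexName, sourceName, err)
                    return
                }
                log.Printf("stats: index: %s, source: %s, stats: %+v",
                    indexName, sourceName, stats)
            }
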

            Abhinav Dangeti added a comment

            Hey Girish Benakappa, we've addressed the gocbcore issue GOCBC-868 and bumped cbgt's gocbcore version with http://review.couchbase.org/c/cbgt/+/132161.

            Marking this issue as resolved for now - feel free to reopen it if you still see this in builds that carry the changes mentioned above.

            Abhinav Dangeti added a comment

            I've reverted the above version bump for gocbcore within cbgt because of other issues seen. I'll close out this ticket once we're ready to bump up gocbcore (to a SHA that contains the fix for this) again.

            Girish Benakappa added a comment

            Closing, as we are not seeing this issue with the latest builds (7.0.0-3938).

            People

              Abhinav Dangeti
              Girish Benakappa