Couchbase Server / MB-39871

FTS: Rebalance failure: service_rebalance_failed while ejecting nodes - upside_down


Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Fix Version: Cheshire-Cat
    • Affects Version: 7.0.0
    • Component: fts
    • Triage: Untriaged
    • 1
    • Unknown

    Description

      Build: 7.0.0-2278

      Test suite: centos-fts_stabletopologyP0
      http://qa.sc.couchbase.com/job/test_suite_executor/222681/console

      Note that this is happening only with upside_down indexes and not with scorch indexes.

      Test:
      fts.stable_topology_fts.StableTopFTS:
      create_simple_default_index,items=1000,cluster=D,F,F,standard_buckets=3,sasl_buckets=3,index_per_bucket=3,GROUP=P0,cluster=D+F,disable_HTP=True,get-cbcollect-info=False,index_type=upside_down,fts_quota=750,GROUP=P0

      Steps in the test:

      • Create a cluster with n1:fts+kv+index+n1ql and n2:fts
      • Create default, sasl_bucket_1, sasl_bucket_2, sasl_bucket_3, standard_bucket_1, standard_bucket_2, standard_bucket_3
      • Create fts indexes : default_index_1, default_index_2, default_index_3, sasl_bucket_1_index_1, sasl_bucket_1_index_2, sasl_bucket_1_index_3, sasl_bucket_2_index_1, sasl_bucket_2_index_2, sasl_bucket_2_index_3, sasl_bucket_3_index_1, sasl_bucket_3_index_2, sasl_bucket_3_index_3,standard_bucket_1_index_1,standard_bucket_1_index_2,standard_bucket_1_index_3,standard_bucket_2_index_1,standard_bucket_2_index_2,standard_bucket_2_index_3,standard_bucket_3_index_1,standard_bucket_3_index_2,standard_bucket_3_index_3
      • Load all the buckets with 1000 docs and wait for all the indexes to complete
      • Delete all the indexes one after the other and wait for each index delete to complete
      • Delete all the buckets created and wait for each bucket delete to complete
      • Rebalance to remove the node. We see the error below (a minimal sketch of the equivalent rebalance request follows the log):

        2020-06-10 18:46:03 | INFO | MainProcess | test_thread | [cluster_helper.cleanup_cluster] rebalancing all nodes in order to remove nodes
        2020-06-10 18:46:03 | INFO | MainProcess | test_thread | [rest_client.rebalance] rebalance params : {'knownNodes': 'ns_1@172.23.120.93,ns_1@172.23.120.95', 'ejectedNodes': 'ns_1@172.23.120.93', 'user': 'Administrator', 'password': 'password'}
        2020-06-10 18:46:03 | INFO | MainProcess | test_thread | [rest_client.rebalance] rebalance operation started
        2020-06-10 18:46:03 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 0.00 %
        2020-06-10 18:46:13 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:46:23 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:46:33 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:46:43 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:46:53 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:47:03 | INFO | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] rebalance percentage : 50.00 %
        2020-06-10 18:47:13 | ERROR | MainProcess | test_thread | [rest_client._rebalance_status_and_progress] {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
        2020-06-10 18:47:13 | INFO | MainProcess | test_thread | [rest_client.print_UI_logs] Latest logs from UI on 172.23.120.95:
        2020-06-10 18:47:13 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.120.95', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1591840023463, 'shortText': 'message', 'text': 'Rebalance exited with reason {service_rebalance_failed,fts,\n                              {agent_died,<0.2923.0>,\n                               {linked_process_died,<0.3707.0>,\n                                {timeout,\n                                 {gen_server,call,\n                                  [<0.2973.0>,\n                                   {call,"ServiceAPI.GetTaskList",\n                                    #Fun<json_rpc_connection.0.102434519>},\n                                   60000]}}}}}.\nRebalance Operation Id = 3dea98d7db7f7f69cc8c82ba52151df2', 'serverTime': '2020-06-10T18:47:03.463Z'}
        2020-06-10 18:47:13 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.120.95', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1591839963234, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.120.95'], EjectNodes = ['ns_1@172.23.120.93'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 3dea98d7db7f7f69cc8c82ba52151df2", 'serverTime': '2020-06-10T18:46:03.234Z'}
        2020-06-10 18:47:13 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.120.95', 'type': 'warning', 'code': 102, 'module': 'menelaus_web', 'tstamp': 1591839963227, 'shortText': 'client-side error report', 'text': 'Client-side error-report for user "Administrator" on node \'ns_1@172.23.120.95\':\nUser-Agent:Python-httplib2/0.13.1 (gzip)\nStarting rebalance from test, ejected nodes [\'ns_1@172.23.120.93\']', 'serverTime': '2020-06-10T18:46:03.227Z'}
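      The test drives this step through ns_server's rebalance REST API (POST /controller/rebalance with knownNodes/ejectedNodes, as in the params logged above). Below is a minimal Go sketch of an equivalent request, using the host, credentials and node names from the log; it is illustrative only, not the test framework's code.

      package main

      import (
          "fmt"
          "net/http"
          "net/url"
          "strings"
      )

      func main() {
          // Same parameters as the rebalance params logged by the test.
          form := url.Values{}
          form.Set("knownNodes", "ns_1@172.23.120.93,ns_1@172.23.120.95")
          form.Set("ejectedNodes", "ns_1@172.23.120.93")

          req, err := http.NewRequest(http.MethodPost,
              "http://172.23.120.95:8091/controller/rebalance",
              strings.NewReader(form.Encode()))
          if err != nil {
              panic(err)
          }
          req.SetBasicAuth("Administrator", "password")
          req.Header.Set("Content-Type", "application/x-www-form-urlencoded")

          resp, err := http.DefaultClient.Do(req)
          if err != nil {
              panic(err)
          }
          defer resp.Body.Close()
          // Progress is then polled, e.g. via /pools/default/rebalanceProgress.
          fmt.Println("rebalance request status:", resp.Status)
      }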
        

      Log snippet:

      Starting rebalance from test, ejected nodes ['ns_1@172.23.121.66']
      2020-06-10T19:02:18.533-07:00, ns_orchestrator:0:info:message(ns_1@172.23.121.65) - Starting rebalance, KeepNodes = ['ns_1@172.23.121.65'], EjectNodes = ['ns_1@172.23.121.66'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 29c8525859119169f864efdb41b2b4d0
      2020-06-10T19:03:18.636-07:00, ns_orchestrator:0:critical:message(ns_1@172.23.121.65) - Rebalance exited with reason {service_rebalance_failed,fts,
                                    {agent_died,<0.2128.0>,
                                     {linked_process_died,<0.2675.0>,
                                      {timeout,
                                       {gen_server,call,
                                        [<0.2162.0>,
                                         {call,"ServiceAPI.GetTaskList",
                                          #Fun<json_rpc_connection.0.102434519>},
                                         60000]}}}}}.
      Rebalance Operation Id = 29c8525859119169f864efdb41b2b4d0
      -------------------------------
       
       
      per_node_processes('ns_1@172.23.121.65') =
           {<0.25311.13>,
            [{backtrace,
                 [<<"Program counter: 0x00007f86bcb70178 (diag_handler:'-collect_diag_per_node/1-fun-1-'/2 + 136)">>,
                  <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,<<>>,
                  <<"0x00007f86b2676048 Return addr 0x00007f8741d5fca0 (proc_lib:init_p/3 + 288)">>,
                  <<"y(0)     <0.25310.13>">>,<<>>,
                  <<"0x00007f86b2676058 Return addr 0x0000000000942608 (<terminate process normally>)">>,
                  <<"y(0)     Catch 0x00007f8741d5fcc0 (proc_lib:init_p/3 + 320)">>,
                  <<"y(1)     []">>,<<>>]},
             {messages,[]},
             {dictionary,
                 [{'$initial_call',
                      {diag_handler,'-collect_diag_per_node/1-fun-1-',0}},
                  {'$ancestors',[<0.25310.13>]}]},
             {registered_name,[]},
             {status,waiting},
             {initial_call,{proc_lib,init_p,3}},
             {error_handler,error_handler},
             {garbage_collection,
                 [{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                  {min_bin_vheap_size,46422},
                  {min_heap_size,233},
                  {fullsweep_after,512},
                  {minor_gcs,0}]},
             {garbage_collection_info,
                 [{old_heap_block_size,0},
                  {heap_block_size,233},
                  {mbuf_size,0},
                  {recent_size,0},
                  {stack_size,5},
                  {old_heap_size,0},
                  {heap_size,32},
                  {bin_vheap_size,0},
                  {bin_vheap_block_size,46422},
                  {bin_old_vheap_size,0},
                  {bin_old_vheap_block_size,46422}]},
             {links,[<0.25310.13>]},
             {monitors,[{process,<0.290.0>},{process,<0.25310.13>}]},
             {monitored_by,[]},
             {memory,2888},
             {message_queue_len,0},
             {reductions,9},
             {trap_exit,false},
             {current_location,
                 {diag_handler,'-collect_diag_per_node/1-fun-1-',2,
                     [{file,"src/diag_handler.erl"},{line,238}]}}]}
           {<0.25310.13>,
            [{backtrace,
                 [<<"Program counter: 0x00007f873a644ee0 (unknown function)"
      


          Activity

            Girish Benakappa added a comment - Logs: https://cb-jira.s3.us-east-2.amazonaws.com/logs/rebalance-issue/collectinfo-2020-06-11T020503-ns_1%40172.23.121.65.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/rebalance-issue/collectinfo-2020-06-11T020503-ns_1%40172.23.121.66.zip

            Sreekanth Sivasankaran added a comment - Hey Girish Benakappa, help me if I am overlooking something here.

            In the first snippet of logs, the rebalance node IPs (172.23.120.95/172.23.120.93) differ from the attached logs, which are for nodes *.65/*.66.

             2020-06-10 18:46:03 | INFO | MainProcess | test_thread | [rest_client.rebalance] rebalance params : {'knownNodes': 'ns_1@172.23.120.93,ns_1@172.23.120.95', 'ejectedNodes': 'ns_1@172.23.120.93', 'user': 'Administrator', 'password': 'password'}

            In the second snippet, the timestamps match.

            I have noted this feed error repeatedly a few times; tagging Abhi Dangeti:

            2020-06-10T19:03:45.524-07:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: unambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"unambiguous timeout"}},"OperationID":"WaitUntilReady","Opaque":"","TimeObserved":10000144898,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"","LastDispatchedFrom":"","LastConnectionID":""} 


            Abhi Dangeti added a comment - Hey Sreekanth Sivasankaran, that timeout you see can happen only if a stale index definition was still around for an index that had previously been deleted. And if there are several such timeouts (10s each), we could easily go past the GetTaskList timeout of 60s.

            That said - I'm having trouble reproducing this issue.

            Let me see if I can find anything else from the logs.
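            The reasoning above is a timing-budget argument: each stale index definition can burn roughly 10s waiting on a source bucket that no longer exists, while ns_server bounds the ServiceAPI.GetTaskList call at 60s, so a handful of stale definitions is enough to exceed it. Below is a minimal Go sketch of that arithmetic only; waitForSourceBucket and the bucket list are hypothetical stand-ins, not cbgt code.

            package main

            import (
                "context"
                "fmt"
                "time"
            )

            const (
                perIndexWait     = 10 * time.Second // per stale index, as in the 10s gocbcore timeouts above
                getTaskListLimit = 60 * time.Second // ns_server's gen_server:call timeout for ServiceAPI.GetTaskList
            )

            // waitForSourceBucket stands in for waiting on a source bucket that no
            // longer exists: it only returns once the per-index budget is spent or
            // the overall context expires.
            func waitForSourceBucket(ctx context.Context, bucket string) error {
                select {
                case <-time.After(perIndexWait):
                    return fmt.Errorf("timeout waiting for bucket %q", bucket)
                case <-ctx.Done():
                    return ctx.Err()
                }
            }

            func main() {
                staleSources := []string{
                    "default", "sasl_bucket_1", "sasl_bucket_2", "sasl_bucket_3",
                    "standard_bucket_1", "standard_bucket_2", "standard_bucket_3",
                }

                ctx, cancel := context.WithTimeout(context.Background(), getTaskListLimit)
                defer cancel()

                start := time.Now()
                for _, b := range staleSources {
                    err := waitForSourceBucket(ctx, b)
                    if ctx.Err() != nil {
                        // 7 stale sources x 10s each > 60s: the GetTaskList call on
                        // the ns_server side would already have timed out.
                        fmt.Printf("gave up after %v: %v\n", time.Since(start), err)
                        return
                    }
                    fmt.Printf("%s resolved after %v (%v)\n", b, time.Since(start), err)
                }
            }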


            Abhi Dangeti added a comment - Looking at the data directories of both nodes, it seems all the index content has been deleted:

            *.65:

            /opt/couchbase/var/lib/couchbase/data/@fts:
            total 4
             625896 drwxrwx---. 2 couchbase couchbase 22 Jun 10 19:04 .
             625908 drwxrwx---. 4 couchbase couchbase 31 Jun 10 19:02 ..
            1842696 -rw-------. 1 couchbase couchbase 32 Jun 10 18:39 cbft.uuid 

            *.66:

            /opt/couchbase/var/lib/couchbase/data/@fts:
            total 4
            35223629 drwxrwx---. 2 couchbase couchbase 22 Jun 10 19:05 .
              827106 drwxrwx---. 4 couchbase couchbase 31 Jun 10 19:02 ..
            35223632 -rw-------. 1 couchbase couchbase 32 Jun 10 18:39 cbft.uuid 

            But per the timestamps, not until 19:04 and 19:05, while the logs indicate the timeout occurred on node *.65 at 19:03:35:

            2020-06-10T19:03:35.524-07:00 [INFO] ctl: getMovingPartitionsCount, CouchbasePartitions failed, err: unambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"unambiguous timeout"}},"OperationID":"WaitUntilReady","Opaque":"","TimeObserved":10000167533,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"","LastDispatchedFrom":"","LastConnectionID":""}
            2020-06-10T19:03:35.524-07:00 [INFO] (GOCBCORE) SDK Version: gocbcore/v9.0.1 

            So it's possible that there could've been a race between a CfgSetIndexDefs(..) and the CfgGetIndexDefs(..) within ctl/ctl's getMovingPartitionsCount(..).
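            A minimal sketch of the suspected race follows, assuming a simplified CAS-versioned config store; the real cbgt Cfg interface and the CfgGetIndexDefs/CfgSetIndexDefs signatures differ, so this only illustrates how getMovingPartitionsCount could end up planning against stale index definitions.

            package main

            import (
                "fmt"
                "sync"
                "time"
            )

            // indexDefs maps index name -> source bucket; a simplified stand-in
            // for cbgt's index definitions.
            type indexDefs map[string]string

            // cfg is a simplified, CAS-versioned config store (not cbgt's Cfg).
            type cfg struct {
                mu   sync.Mutex
                defs indexDefs
                cas  uint64
            }

            func (c *cfg) getIndexDefs() (indexDefs, uint64) {
                c.mu.Lock()
                defer c.mu.Unlock()
                snapshot := indexDefs{}
                for name, src := range c.defs {
                    snapshot[name] = src
                }
                return snapshot, c.cas
            }

            func (c *cfg) setIndexDefs(defs indexDefs) {
                c.mu.Lock()
                defer c.mu.Unlock()
                c.defs = defs
                c.cas++
            }

            func main() {
                c := &cfg{
                    defs: indexDefs{"standard_bucket_2_index_1": "standard_bucket_2"},
                    cas:  1,
                }

                // Rebalance controller reads the defs to count moving partitions.
                snapshot, cas := c.getIndexDefs()

                // Concurrently, a bucket deletion removes the index definition.
                go c.setIndexDefs(indexDefs{})
                time.Sleep(10 * time.Millisecond)

                // The controller now plans against a stale snapshot: it will try to
                // reach a source bucket that no longer exists and block on timeouts.
                _, liveCAS := c.getIndexDefs()
                fmt.Printf("planning with %d stale defs at cas=%d; live cas is now %d\n",
                    len(snapshot), cas, liveCAS)
            }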


            Abhi Dangeti added a comment - I've added a logging update: http://review.couchbase.org/c/cbgt/+/130337.

            Sreekanth Sivasankaran, to me it looks like a race between CfgSetIndexDefs and CfgGetIndexDefs; since you have more recent expertise working on this, I am reassigning this ticket back to you. I'll let you know if I can think of anything else.


            Abhi Dangeti added a comment - One other thing: you shouldn't be seeing the timeout once GOCBC-868 is resolved, as it wouldn't take 10s to figure out that the bucket isn't available anymore.
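            A minimal sketch of the fail-fast behaviour being described, assuming a hypothetical bucketExists helper rather than the actual gocbcore API: once the SDK can report a missing bucket promptly, the check resolves in milliseconds instead of consuming a 10s readiness timeout.

            package main

            import (
                "context"
                "errors"
                "fmt"
                "time"
            )

            var errBucketNotFound = errors.New("bucket not found")

            // bucketExists is a hypothetical helper, not a gocbcore call: it models
            // a cluster that answers "no such bucket" almost immediately once the
            // SDK fix lands.
            func bucketExists(ctx context.Context, name string) error {
                select {
                case <-time.After(50 * time.Millisecond):
                    return errBucketNotFound
                case <-ctx.Done():
                    return ctx.Err()
                }
            }

            func main() {
                // Bound the check well below the 10s observed in the logs above.
                ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
                defer cancel()

                start := time.Now()
                err := bucketExists(ctx, "standard_bucket_2")
                fmt.Printf("resolved in %v: %v\n", time.Since(start), err)
            }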


            Sreekanth Sivasankaran added a comment - Girish Benakappa, if I look into the fts logs, I am not seeing 20+ index deletion requests; across both nodes I can only see 10 index deletion requests.

             

            Node 66
             
            2020-06-10T19:00:35.304-07:00 [INFO] manager_api: index definition deleted, indexType: fulltext-index, indexName: sasl_bucket_1_index_2, indexUUID: 906f4f8c62edd781
             
            2020-06-10T19:00:47.887-07:00 [INFO] manager_api: index definition deleted, indexType: fulltext-index, indexName: sasl_bucket_2_index_1, indexUUID: 5b84832c1a274d50
             
            2020-06-10T19:01:18.248-07:00 [INFO] manager_api: index definition deleted, indexType: fulltext-index, indexName: standard_bucket_1_index_1, indexUUID: 2b2de4ed4fb6916f
             
            Node 65
             
            2020-06-10T19:00:03.669-07:00 [INFO] manager_api: index definition deleted, indexType: fulltext-index, indexName: default_index_1, indexUUID: 5bcd46c471c352af
             
            2020-06-10T19:00:19.486-07:00 [INFO] manager_api: index definition deleted, indexType: fulltext-index, indexName: default_index_3, indexUUID: 3fe9ced477428d54
             
            2020-06-10T19:00:58.730-07:00 [INFO] manager_api: index definition deleted, indexType: fulltext-index, indexName: sasl_bucket_2_index_3, indexUUID: 584eae3ec622f5a3
             
            2020-06-10T19:01:08.757-07:00 [INFO] manager_api: index definition deleted, indexType: fulltext-index, indexName: sasl_bucket_3_index_2, indexUUID: 2d5368f0bab9eade
             
            2020-06-10T19:01:27.828-07:00 [INFO] manager_api: index definition deleted, indexType: fulltext-index, indexName: standard_bucket_1_index_3, indexUUID: 3a75f1ec61f9aedc
             
            2020-06-10T19:01:37.594-07:00 [INFO] manager_api: index definition deleted, indexType: fulltext-index, indexName: standard_bucket_2_index_2, indexUUID: 3dd4689152963e35
             
            2020-06-10T19:01:49.176-07:00 [INFO] manager_api: index definition deleted, indexType: fulltext-index, indexName: standard_bucket_3_index_1, indexUUID: 11f91a2e526fdbb3
            
            

             

            So we are only deleting 10 of the 17+ indexes here before the bucket deletions, unlike what is described in the steps above.

            Later, during the bucket deletions, we can see the remaining index deletions being handled here:

            2020-06-10T19:02:08.649-07:00 [INFO] manager_api: starting index definition deletion, indexType: fulltext-index, indexName: default_index_2, indexUUID: 5755d7f426e922b1
             
            2020-06-10T19:02:10.232-07:00 [INFO] manager_api: starting index definition deletion, indexType: fulltext-index, indexName: sasl_bucket_1_index_1, indexUUID: 15afa11e683abdaa
             
            2020-06-10T19:02:10.232-07:00 [INFO] manager_api: starting index definition deletion, indexType: fulltext-index, indexName: sasl_bucket_1_index_3, indexUUID: 39a8419e7ebd88cd
             
            2020-06-10T19:02:11.921-07:00 [INFO] manager_api: starting index definition deletion, indexType: fulltext-index, indexName: sasl_bucket_2_index_2, indexUUID: 2d841fb09302aace
             
            2020-06-10T19:02:13.224-07:00 [INFO] manager_api: starting index definition deletion, indexType: fulltext-index, indexName: sasl_bucket_3_index_1, indexUUID: 1d352562b1a2265b
             
            2020-06-10T19:02:25.092-07:00 [INFO] manager_api: starting index definition deletion, indexType: fulltext-index, indexName: standard_bucket_1_index_2, indexUUID: 336e21eaf88dc78c
             

            Eventually we can see an empty plan here:

            2020-06-10T19:02:42.806-07:00 [INFO] cfg_metakv_lean: setLeanPlan, val: {"uuid":"2fd058c68e93841e","planPIndexes":{},"implVersion":"5.5.0","warnings":{}}
            2020-06-10T19:02:42.807-07:00 [INFO] cfg_metakv: metaKVCallback, path: /fts/cbgt/cfg/curMetaKvPlanKey, key: curMetaKvPlanKey, deletion: false 

            Sreekanth Sivasankaran added a comment (edited) -

            But in parallel, index definition deletions triggered by the bucket deletions are also happening on node 66:

            2020-06-10T19:02:42.694-07:00 [INFO] manager_api: starting index definition deletion, indexType: fulltext-index, indexName: standard_bucket_2_index_1, indexUUID: 2475fd24b05e94a6
             
            2020-06-10T19:02:42.694-07:00 [INFO] manager_api: starting index definition deletion, indexType: fulltext-index, indexName: standard_bucket_2_index_3, indexUUID: 517bc0a49b081398
             
            2020-06-10T19:02:42.720-07:00 [INFO] manager_api: starting index definition deletion, indexType: fulltext-index, indexName: standard_bucket_3_index_2, indexUUID: 6394b7b567d8d8f4 
            
            

             

            The rebalance under scrutiny started here:

            2020-06-10T19:02:18.635-07:00 [INFO] ctl/manager: StartTopologyChange, change: {f4672824563c60c3ba8f02fe628856c0 [] topology-change-rebalance [{{3f6df0a48ee2ddc325a509990c0c290a 0 <nil>} recovery-full}] [{f02eabbbd7a602ab9ce7aeeafae37042 0 <nil>}]} 
            
            

             

            And it took a long time to actually start the rebalance here; by that time the rebalance had already been reported as a failure:

            2020-06-10T19:04:15.526-07:00 [INFO] ctl: getMovingPartitionsCount: 0
            2020-06-10T19:04:15.526-07:00 [INFO] ctl/manager: StartTopologyChange, started
            2020-06-10T19:04:15.526-07:00 [INFO] ctl/manager: GetTaskList, haveTasksRev: 7, changed, rv: &{Rev:[57] Tasks:[{Rev:[56] ID:rebalance:f4672824563c60c3ba8f02fe628856c0 Type:task-rebalance Status:task-running IsCancelable:true Progress:0 DetailedProgress:map[] Description:topology change ErrorMessage: Extra:map[topologyChange:{ID:f4672824563c60c3ba8f02fe628856c0 CurrentTopologyRev:[] Type:topology-change-rebalance KeepNodes:[{NodeInfo:{NodeID:3f6df0a48ee2ddc325a509990c0c290a Priority:0 Opaque:<nil>}  

            In between, we can see a lot of metakv updates arriving on node 65.

             

            One quick thought: perhaps cbgt is getting blocked on the feed/agent wait that Abhi Dangeti mentioned earlier, so with an early-exit fix there things ought to settle and work fine (as we can see from node 66's logs that eventually there aren't any pindexes/feeds/indexes left to clean up).

            Girish Benakappa, can we test this again once the gocbcore issue (GOCBC-868) mentioned above is resolved?

             

            Will check this later; switching priority here to a file-leak CBSE.


            Couchbase Build Team added a comment - Build couchbase-server-7.0.0-2451 contains cbgt commit d6542cf with commit message:
            MB-39871: Add index and source names to logs on failing to fetch stats


            Abhi Dangeti added a comment - Hey Girish Benakappa, we've addressed the gocbcore issue GOCBC-868 and bumped up cbgt's gocbcore version with http://review.couchbase.org/c/cbgt/+/132161.

            Marking this issue as resolved for now - feel free to reopen this issue if you still see it in builds that carry the changes mentioned above.
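            For context, a version bump of this kind is a go.mod dependency change in cbgt. The sketch below only shows the shape of such a change; the module path matches the "gocbcore/v9" SDK version string in the logs, while the version shown is illustrative, since the exact pinned version/SHA is only in the linked review.

            // go.mod (sketch; version shown is illustrative, see review 132161)
            module github.com/couchbase/cbgt

            go 1.13

            require github.com/couchbase/gocbcore/v9 v9.0.2 // illustrative version only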


            Abhi Dangeti added a comment - I've reverted the above version bump for gocbcore within cbgt because of other issues seen. I'll close out this ticket once we're ready to bump up gocbcore (to a SHA that contains the fix for this) again.


            Girish Benakappa added a comment - Closing, as we are not seeing this issue with the latest builds (7.0.0-3938).


            People

              Assignee: Abhi Dangeti
              Reporter: Girish Benakappa
              Votes: 0
              Watchers: 4
