  Couchbase Server / MB-45383

FTS: cluster rebalance failed after test execution - Service 'fts' exited with status 137.


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Cheshire-Cat
    • 7.0.0
    • Component: fts
    • Triage: Untriaged
    • Environment: CentOS 64-bit
    • 1
    • Unknown

    Description

      Build: 7.0.0-4797

      After execution of the following test:

      ./testrunner -i /tmp/testexec.22646.ini -p get-cbcollect-info=True,disable_HTP=True,get-logs=True,stop-on-failure=False,index_type=scorch,fts_quota=750,run_via_n1ql=True,custom_map_add_non_indexed_fields=False,GROUP=N1QL_MATCH_PHRASE,query_types=N1QL_MATCH_PHRASE -t fts.stable_topology_fts.StableTopFTS.index_query_custom_mapping,items=1000,custom_map=True,num_custom_analyzers=1,compare_es=True,cm_id=105,num_queries=100,GROUP=BUCKETS;P0;SKIP_FOR_N1QL;N1QL_MATCH_PHRASE;COLLECTIONS
      

      Job name:

      centos-fts_custom-map-n1ql-rqg-scorch-match-phrase_6.5_P1
      

      the cluster rebalance failed with the following messages:

      [2021-03-31 04:15:05,554] - [rest_client:1864] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2021-03-31 04:15:05,568] - [rest_client:3816] INFO - Latest logs from UI on 172.23.98.17:
      [2021-03-31 04:15:05,568] - [rest_client:3817] ERROR - {'node': 'ns_1@172.23.98.17', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1617189297176, 'shortText': 'message', 'text': 'Rebalance exited with reason {service_rebalance_failed,fts,\n                                 {agent_died,<0.3286.0>,\n                                     {lost_connection,shutdown}}}.\nRebalance Operation Id = 505bbfc322316c65568f2883ccff257d', 'serverTime': '2021-03-31T04:14:57.176Z'}
      [2021-03-31 04:15:05,569] - [rest_client:3817] ERROR - {'node': 'ns_1@172.23.98.17', 'type': 'info', 'code': 0, 'module': 'ns_log', 'tstamp': 1617189297162, 'shortText': 'message', 'text': "Service 'fts' exited with status 137. Restarting. Messages:\n2021-03-31T04:14:55.582-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615\n2021-03-31T04:14:55.604-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615\n2021-03-31T04:14:55.615-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615\n2021-03-31T04:14:55.625-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615\n2021-03-31T04:14:55.648-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615\n2021-03-31T04:14:51.641-07:00 [WARN] (GOCBCORE) memdClient read failure on conn `02769c42d8bd79fd/2f07c519c255cdd8` : EOF -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615\n2021-03-31T04:14:55.671-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615\n2021-03-31T04:14:55.727-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615\n", 'serverTime': '2021-03-31T04:14:57.162Z'}
      [2021-03-31 04:15:05,569] - [rest_client:3817] ERROR - {'node': 'ns_1@172.23.98.17', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1617189285401, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.98.17'], EjectNodes = ['ns_1@172.23.98.15',\n                                                                     'ns_1@172.23.98.16'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 505bbfc322316c65568f2883ccff257d", 'serverTime': '2021-03-31T04:14:45.401Z'}
      [2021-03-31 04:15:05,569] - [rest_client:3817] ERROR - {'node': 'ns_1@172.23.98.17', 'type': 'warning', 'code': 102, 'module': 'menelaus_web', 'tstamp': 1617189285397, 'shortText': 'client-side error report', 'text': 'Client-side error-report for user "Administrator" on node \'ns_1@172.23.98.17\':\nUser-Agent:Python-httplib2/0.13.1 (gzip)\nStarting rebalance from test, ejected nodes [\'ns_1@172.23.98.15\', \'ns_1@172.23.98.16\']', 'serverTime': '2021-03-31T04:14:45.397Z'}
      [2021-03-31 04:15:05,569] - [rest_client:3817] ERROR - {'node': 'ns_1@172.23.98.17', 'type': 'info', 'code': 11, 'module': 'menelaus_web', 'tstamp': 1617189285361, 'shortText': 'message', 'text': 'Deleted bucket "default"\n', 'serverTime': '2021-03-31T04:14:45.361Z'}
      [2021-03-31 04:15:05,569] - [rest_client:3817] ERROR - {'node': 'ns_1@172.23.98.17', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1617189284182, 'shortText': 'message', 'text': 'Shutting down bucket "default" on \'ns_1@172.23.98.17\' for deletion', 'serverTime': '2021-03-31T04:14:44.182Z'}
      [2021-03-31 04:15:05,569] - [rest_client:3817] ERROR - {'node': 'ns_1@172.23.98.16', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1617189284180, 'shortText': 'message', 'text': 'Shutting down bucket "default" on \'ns_1@172.23.98.16\' for deletion', 'serverTime': '2021-03-31T04:14:44.180Z'}
      [2021-03-31 04:15:05,569] - [rest_client:3817] ERROR - {'node': 'ns_1@172.23.98.17', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1617189010688, 'shortText': 'message', 'text': 'Bucket "default" loaded on node \'ns_1@172.23.98.17\' in 0 seconds.', 'serverTime': '2021-03-31T04:10:10.688Z'}
      [2021-03-31 04:15:05,569] - [rest_client:3817] ERROR - {'node': 'ns_1@172.23.98.16', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1617189003953, 'shortText': 'message', 'text': 'Bucket "default" loaded on node \'ns_1@172.23.98.16\' in 0 seconds.', 'serverTime': '2021-03-31T04:10:03.953Z'}
      [2021-03-31 04:15:05,569] - [rest_client:3817] ERROR - {'node': 'ns_1@172.23.98.17', 'type': 'info', 'code': 12, 'module': 'menelaus_web', 'tstamp': 1617189003862, 'shortText': 'message', 'text': 'Created bucket "default" of type: couchbase\n[{num_replicas,1},\n {replica_index,true},\n {ram_quota,1800404992},\n {durability_min_level,none},\n {flush_enabled,true},\n {num_threads,3},\n {eviction_policy,value_only},\n {conflict_resolution_type,seqno},\n {storage_mode,couchstore},\n {max_ttl,0},\n {compression_mode,passive}]', 'serverTime': '2021-03-31T04:10:03.862Z'}
      
      

      All cluster logs are attached.

      Attachments

        1. 172.23.98.15-20210331-0415-diag.zip
          26.61 MB
        2. 172.23.98.15-diag.txt.gz
          245 kB
        3. 172.23.98.16-20210331-0415-diag.zip
          44.89 MB
        4. 172.23.98.16-diag.txt.gz
          247 kB
        5. 172.23.98.17-20210331-0415-diag.zip
          49.45 MB
        6. 172.23.98.17-diag.txt.gz
          757 kB
        7. Screen Shot 2021-04-01 at 4.40.04 PM.png
          60 kB
        8. test.log
          293 kB

        Activity

          Abhinav Dangeti added a comment:

          Seeing several GOCBCORE warnings on the node that was OOM-killed (172.23.98.17):

          2021-03-31T01:06:55.267-07:00 [WARN] (GOCBCORE) memdClient read failure on conn `056cd3fe513b745f/efb4183906e18f66` : EOF -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T01:06:55.277-07:00 [WARN] (GOCBCORE) Failed to perform select bucket against server (bucket not found) -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T01:06:55.277-07:00 [WARN] (GOCBCORE) Pipeline Client 0xc0004a1620 failed to bootstrap: bucket not found -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T01:06:55.277-07:00 [INFO] (GOCBCORE) Stopping CCCP poller for HTTP polling takeover
          2021-03-31T01:06:55.282-07:00 [WARN] (GOCBCORE) memdClient read failure on conn `056cd3fe513b745f/0e48bad4bfdccb3c` : EOF -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T01:06:55.291-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615 

          Most warnings indicate "bucket not found" or "bad bucket" as the reason. There are several warnings within the memcached logs too, from CouchKVStore being unable to find files, etc.

          Evgeny Makarenko, could you highlight what the test does - does it delete the bucket at any point?

          Charles Dixon, take a look at the attached memory consumption graph (screenshot).

          The logs show several GOCBCORE warnings at the time when memory consumption is steadily climbing:

          2021-03-31T03:00:11.060-07:00 [WARN] (GOCBCORE) Pipeline Client 0xc00c71a2a0 failed to bootstrap: invalid argument | {"status_code":4,"document_key":"SCRAM-SHA1","bucket":"default","error_name":"EINVAL","error_description":"Invalid packet","opaque":8,"context":"Logic error: The server expects SASL STEP after returning CONTINUE from SASL START","last_dispatched_to":"172.23.98.17:11210","last_dispatched_from":"172.23.98.17:33432","last_connection_id":"315a5845aee2e2d5/e37a8898fb511394"} -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615 

          2021-03-31T03:00:11.062-07:00 [WARN] (GOCBCORE) Pipeline Client 0xc01ec8ca80 failed to bootstrap: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"CMD_SASLSTEP","Opaque":"0xb","TimeObserved":31549997,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"172.23.98.17:11210","LastDispatchedFrom":"172.23.98.17:33430","LastConnectionID":"cec51f9918cab938/f100914e5ae7dd01"} -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615 

          2021-03-31T03:00:11.722-07:00 [WARN] (GOCBCORE) Failed to connect to host. Unable to find given hostport in cbauth database: `172.23.98.16:8091' -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615 

          2021-03-31T03:00:11.867-07:00 [WARN] (GOCBCORE) Failed to close authentication client (close tcp 172.23.98.17:58080->172.23.98.16:11210: use of closed network connection) -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615 

          2021-03-31T03:00:11.909-07:00 [WARN] (GOCBCORE) Pipeline Client 0xc0232b07e0 failed to bootstrap: EOF | {"document_key":"{\"a\":\"gocbcore/v9.1.3 stats\",\"i\":\"49c981ad28aca181/58392152c2cf15c7\"}","bucket":"default","last_dispatched_to":"172.23.98.16:11210","last_dispatched_from":"172.23.98.17:58214","last_connection_id":"49c981ad28aca181/58392152c2cf15c7"} -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615 

          Even if the cluster is in a bad state (for any reason) - is there some situation where gocbcore is leaking memory, or creating far too much garbage from all these connection attempts, that would cause memory usage to spike the way it is?


          Charles Dixon added a comment:

          Hi Abhinav Dangeti, I'm not aware of any leaks or ways that bootstrapping could be causing a significant leak; even repeated connection attempts amount to a pretty low number of operations. This might be a silly question, but are you also definitely closing the agents each time you create a new one?

          The

          Failed to connect to host, bad bucket.

          error indicates that the HTTP config poller received a 404 trying to get the bucket config. There are loads of select-bucket errors with (bucket not found) too, which goes some way to suggesting that the bucket doesn't exist, or is in a weird state, much of the time. I'm also seeing select bucket sometimes fail with EACCESS.

           

          Failed to connect to host. Unable to find given hostport in cbauth database

          Only the first sentence of this error is from the HTTP config poller; the second sentence must be coming from cbauth when we send the HTTP request and try to get credentials.

           

          Around the time you've singled out, I can see there are loads of timeouts, even for commands like HELLO.

          Abhinav Dangeti added a comment (edited):

          FTS keeps a cache of gocbcore.Agent and gocbcore.DCPAgent objects per bucket, dedicated to stats routines.

          When FTS sets up an index partition covering a group of vbuckets, it first sets up a gocbcore.Agent to fetch the bucketUUID and collections manifest, and then closes that agent. It then sets up a gocbcore.DCPAgent to create a bunch of streams. This DCPAgent is also saved in a cache for use by future index partitions going to the same bucket.
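
          To make the agent management described above concrete, here is a minimal Go sketch (not the actual cbgt/FTS code) of a reference-counted per-bucket DCPAgent cache, where the last release closes the agent; the type and function names, and the caller-supplied constructor, are hypothetical:

          package ftsagents

          import (
              "sync"

              "github.com/couchbase/gocbcore/v9"
          )

          // dcpAgentCache hands out one shared *gocbcore.DCPAgent per bucket and
          // counts how many index partitions are currently using it.
          type dcpAgentCache struct {
              m      sync.Mutex
              agents map[string]*gocbcore.DCPAgent
              refs   map[string]int
          }

          func newDCPAgentCache() *dcpAgentCache {
              return &dcpAgentCache{
                  agents: map[string]*gocbcore.DCPAgent{},
                  refs:   map[string]int{},
              }
          }

          // acquire returns the cached agent for a bucket, creating one via the
          // caller-supplied constructor on first use.
          func (c *dcpAgentCache) acquire(bucket string,
              create func(bucket string) (*gocbcore.DCPAgent, error)) (*gocbcore.DCPAgent, error) {
              c.m.Lock()
              defer c.m.Unlock()
              if agent, ok := c.agents[bucket]; ok {
                  c.refs[bucket]++
                  return agent, nil
              }
              agent, err := create(bucket)
              if err != nil {
                  return nil, err
              }
              c.agents[bucket] = agent
              c.refs[bucket] = 1
              return agent, nil
          }

          // release drops one reference; the last release evicts and closes the
          // agent - the "are you definitely closing the agents" step asked about below.
          func (c *dcpAgentCache) release(bucket string) error {
              c.m.Lock()
              defer c.m.Unlock()
              agent, ok := c.agents[bucket]
              if !ok {
                  return nil
              }
              c.refs[bucket]--
              if c.refs[bucket] > 0 {
                  return nil
              }
              delete(c.agents, bucket)
              delete(c.refs, bucket)
              return agent.Close()
          }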

          So, Charles Dixon, to answer this question:

          This might be a silly question but are you also definitely closing the agents each time that you're creating a new one?

          Which agents are you referring to? I believe the description above will tell you a bit about how FTS does agent management.

          Yes, the system is under duress - if the bucket was indeed deleted, FTS would go ahead and initiate shutdown of all indexes subscribed to it, as we'd receive an EOF on the channel.

          On timeouts during select_bucket or hello, the SDK will reattempt the connection, yes? Note that the retryStrategy FTS uses is just for gocbcore.Agents and not for gocbcore.DCPAgents - do you think it'd be wise to set up a RetryAfter callback for those too?


          Charles Dixon added a comment:

          Abhinav Dangeti, I see 138 instances of "Creating new agent" and 93 of "Create new dcp agent" in the logfile - is that expected?

          On timeouts during select_bucket or hello, the SDK will reattempt the connection yes?

          That's right - the SDK will keep retrying connections until they succeed or the SDK is told to stop, either by a shutdown or by a cluster config indicating that the node is gone / doesn't have kv, etc.

          Note that the retryStrategy FTS uses for agents is just for gocbcore.Agents and not for gocbcore.DCPAgent - do you think it wise to setup a RetryAfter callback for these too?

          The retry strategy is only used for WaitUntilReady, I believe? If that's the case then I don't think it'd hurt to use it, but I also don't think it's something that would be related to this issue.


          Abhinav Dangeti added a comment:

          Yes, the large number of instances depends on the number of indexes in the system. Each DCPAgent for a bucket is shared by 6 index partitions, and Agents are set up and closed during an index partition's feed initiation.

          The retry strategy is for WaitUntilReady - that's correct, with a timeout of 10s.
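
          For reference, a minimal sketch of a custom gocbcore RetryStrategy (the RetryAfter callback mentioned above), assuming the gocbcore v9 interface where RetryAfter(RetryRequest, RetryReason) returns a RetryAction; the backoff numbers are purely illustrative, not what FTS configures:

          package ftsretry

          import (
              "time"

              "github.com/couchbase/gocbcore/v9"
          )

          // cappedBackoffRetryStrategy retries retriable operations with a capped
          // linear backoff and gives up after maxAttempts.
          type cappedBackoffRetryStrategy struct {
              maxAttempts uint32
              step        time.Duration
              maxDelay    time.Duration
          }

          func (s *cappedBackoffRetryStrategy) RetryAfter(req gocbcore.RetryRequest,
              reason gocbcore.RetryReason) gocbcore.RetryAction {
              // Don't retry non-idempotent requests unless the reason allows it.
              if !req.Idempotent() && !reason.AllowsNonIdempotentRetry() {
                  return &gocbcore.NoRetryRetryAction{}
              }
              if req.RetryAttempts() >= s.maxAttempts {
                  return &gocbcore.NoRetryRetryAction{}
              }
              delay := time.Duration(req.RetryAttempts()+1) * s.step
              if delay > s.maxDelay {
                  delay = s.maxDelay
              }
              return &gocbcore.WithDurationRetryAction{WithDuration: delay}
          }

          An instance of such a strategy could then be supplied wherever gocbcore accepts a RetryStrategy, e.g. in the options passed to WaitUntilReady.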

           

          Abhinav Dangeti added a comment (edited):

          It seems like this rebalance failure, due to FTS getting killed, occurred during test cleanup, after the index and bucket had been deleted from the system:

          [2021-03-31 04:14:44,116] - [fts_base] [140302480238336] - INFO - Validated: all index files for custom_index deleted from disk
          [2021-03-31 04:14:44,116] - [fts_base] [140302480238336] - INFO - removing nodes from cluster ...
          [2021-03-31 04:14:44,126] - [fts_base] [140302480238336] - INFO - cleanup [ip:172.23.98.17 port:8091 ssh_username:root, ip:172.23.98.16 port:8091 ssh_username:root, ip:172.23.98.15 port:8091 ssh_username:root]
          [2021-03-31 04:14:44,144] - [bucket_helper] [140302480238336] - INFO - deleting existing buckets ['default'] on 172.23.98.17
          [2021-03-31 04:14:45,360] - [bucket_helper] [140302480238336] - INFO - waiting for bucket deletion to complete....
          [2021-03-31 04:14:45,368] - [rest_client] [140302480238336] - INFO - node 172.23.98.17 existing buckets : []
          [2021-03-31 04:14:45,368] - [bucket_helper] [140302480238336] - INFO - deleted bucket : default from 172.23.98.17
          [2021-03-31 04:14:45,375] - [rest_client] [140302480238336] - INFO - -->is_ns_server_running?
          [2021-03-31 04:14:45,394] - [cluster_helper] [140302480238336] - INFO - rebalancing all nodes in order to remove nodes
          [2021-03-31 04:14:45,396] - [rest_client] [140302480238336] - INFO - rebalance params : {'knownNodes': 'ns_1@172.23.98.15,ns_1@172.23.98.16,ns_1@172.23.98.17', 'ejectedNodes': 'ns_1@172.23.98.15,ns_1@172.23.98.16', 'user': 'Administrator', 'password': 'password'}
          [2021-03-31 04:14:45,454] - [rest_client] [140302480238336] - INFO - rebalance operation started
          [2021-03-31 04:14:45,456] - [rest_client] [140302480238336] - INFO - rebalance percentage : 0.00 %
          [2021-03-31 04:14:55,542] - [rest_client] [140302480238336] - INFO - rebalance percentage : 25.00 %
          [2021-03-31 04:15:05,554] - [rest_client] [140302480238336] - ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
          [2021-03-31 04:15:05,568] - [rest_client] [140302480238336] - INFO - Latest logs from UI on 172.23.98.17:
          [2021-03-31 04:15:05,568] - [rest_client] [140302480238336] - ERROR - {'node': 'ns_1@172.23.98.17', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1617189297176, 'shortText': 'message', 'text': 'Rebalance exited with reason {service_rebalance_failed,fts,\n                                 {agent_died,<0.3286.0>,\n                                     {lost_connection,shutdown}}}.\nRebalance Operation Id = 505bbfc322316c65568f2883ccff257d', 'serverTime': '2021-03-31T04:14:57.176Z'} 

          So I think we should figure out what was going on around here:

          2021-03-31T04:14:53.172-07:00 [WARN] (GOCBCORE) Failed to perform select bucket against server (bucket not found) -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:53.172-07:00 [WARN] (GOCBCORE) Failed to perform select bucket against server (bucket not found) -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:53.281-07:00 [WARN] (GOCBCORE) Pipeline Client 0xc00b4c8180 failed to bootstrap: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"CMD_SASLAUTH","Opaque":"0x4","TimeObserved":506585485,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"172.23.98.16:11210","LastDispatchedFrom":"172.23.98.17:48420","LastConnectionID":"36a0c7bb305678da/ce711420cbb69bea"} -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:53.292-07:00 [WARN] (GOCBCORE) Failed to perform select bucket against server (bucket not found) -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:53.292-07:00 [WARN] (GOCBCORE) Pipeline Client 0xc00626d320 failed to bootstrap: bucket not found -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:53.292-07:00 [WARN] (GOCBCORE) Pipeline Client 0xc00c71a240 failed to bootstrap: bucket not found -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:53.295-07:00 [WARN] (GOCBCORE) Pipeline Client 0xc0239f3320 failed to bootstrap: bucket not found -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:53.346-07:00 [WARN] (GOCBCORE) memdClient read failure on conn `2e661e640041ea95/d4c424535200fb7b` : EOF -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.230-07:00 [WARN] (GOCBCORE) Failed to perform select bucket against server (bucket not found) -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.236-07:00 [WARN] (GOCBCORE) Failed to perform select bucket against server (bucket not found) -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.313-07:00 [WARN] (GOCBCORE) Failed to perform select bucket against server (bucket not found) -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.342-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.357-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.386-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.395-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.409-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.409-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.464-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.490-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.519-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.563-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.582-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.604-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.615-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.625-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.648-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:51.641-07:00 [WARN] (GOCBCORE) memdClient read failure on conn `02769c42d8bd79fd/2f07c519c255cdd8` : EOF -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.671-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615
          2021-03-31T04:14:55.727-07:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:615 

          Hey Evgeny Makarenko, will you be able to obtain heap profiles from all the FTS nodes after you've deleted the index and bucket, but before rebalancing out all the nodes during cleanup? This could help us identify what's holding up all the resources.
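
          In case it helps, a minimal sketch of pulling a heap profile over HTTP; it assumes the FTS process exposes the standard Go net/http/pprof handlers and that http://<node>:8094/debug/pprof/heap is reachable - the port, the path, and the credentials (taken from the test log above) are assumptions, not confirmed:

          package main

          import (
              "fmt"
              "io"
              "net/http"
              "os"
          )

          func main() {
              // Request the heap profile from one FTS node (hypothetical endpoint).
              req, err := http.NewRequest(http.MethodGet,
                  "http://172.23.98.17:8094/debug/pprof/heap", nil)
              if err != nil {
                  panic(err)
              }
              req.SetBasicAuth("Administrator", "password")

              resp, err := http.DefaultClient.Do(req)
              if err != nil {
                  panic(err)
              }
              defer resp.Body.Close()

              // Save the profile so it can be inspected with: go tool pprof fts-heap.pprof
              out, err := os.Create("fts-heap.pprof")
              if err != nil {
                  panic(err)
              }
              defer out.Close()

              n, err := io.Copy(out, resp.Body)
              if err != nil {
                  panic(err)
              }
              fmt.Printf("wrote %d bytes to fts-heap.pprof\n", n)
          }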


          People

            Assignee: Evgeny Makarenko (Inactive)
            Reporter: Evgeny Makarenko (Inactive)
            Votes: 0
            Watchers: 4
