Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-44730

Unexpected server error received during rebalance

    XMLWordPrintable

Details

    Description

      Build: 7.0.0-4554

      Unexpected server error is found during rebalance after test  execution.

      Corresponding test log fragment is here:

      [2021-02-27 03:18:36,683] - [cluster_helper] [140676780521216] - INFO - rebalancing all nodes in order to remove nodes
      [2021-02-27 03:18:36,685] - [rest_client] [140676780521216] - INFO - rebalance params : {'knownNodes': 'ns_1@172.23.106.20,ns_1@172.23.121.75,ns_1@172.23.97.222', 'ejectedNodes': 'ns_1@172.23.106.20,ns_1@172.23.97.222', 'user': 'Administrator', 'password': 'password'}
      [2021-02-27 03:18:36,713] - [rest_client] [140676780521216] - INFO - rebalance operation started
      [2021-02-27 03:18:36,717] - [rest_client] [140676780521216] - INFO - rebalance percentage : 0.00 %
      [2021-02-27 03:18:46,730] - [rest_client] [140676780521216] - INFO - rebalance percentage : 0.00 %
      [2021-02-27 03:18:56,734] - [rest_client] [140676780521216] - INFO - rebalance percentage : 0.00 %
      [2021-02-27 03:19:06,748] - [rest_client] [140676780521216] - ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2021-02-27 03:19:06,758] - [rest_client] [140676780521216] - INFO - Latest logs from UI on 172.23.121.75:
      [2021-02-27 03:19:06,758] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1614424746687, 'shortText': 'message', 'text': "Rebalance exited with reason {prepare_rebalance_failed,\n                              {error,\n                               {failed_nodes,\n                                [{'ns_1@172.23.97.222',{error,timeout}}]}}}.\nRebalance Operation Id = e19e8d25353baa670bdc6d422c8b5212", 'serverTime': '2021-02-27T03:19:06.687Z'}
      [2021-02-27 03:19:06,758] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1614424716686, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.121.75'], EjectNodes = ['ns_1@172.23.106.20',\n                                                                      'ns_1@172.23.97.222'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = e19e8d25353baa670bdc6d422c8b5212", 'serverTime': '2021-02-27T03:18:36.686Z'}
      [2021-02-27 03:19:06,758] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'warning', 'code': 102, 'module': 'menelaus_web', 'tstamp': 1614424716683, 'shortText': 'client-side error report', 'text': 'Client-side error-report for user "Administrator" on node \'ns_1@172.23.121.75\':\nUser-Agent:Python-httplib2/0.13.1 (gzip)\nStarting rebalance from test, ejected nodes [\'ns_1@172.23.106.20\', \'ns_1@172.23.97.222\']', 'serverTime': '2021-02-27T03:18:36.683Z'}
      [2021-02-27 03:19:06,759] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'info', 'code': 11, 'module': 'menelaus_web', 'tstamp': 1614424716659, 'shortText': 'message', 'text': 'Deleted bucket "default"\n', 'serverTime': '2021-02-27T03:18:36.659Z'}
      [2021-02-27 03:19:06,759] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1614424715998, 'shortText': 'message', 'text': 'Shutting down bucket "default" on \'ns_1@172.23.121.75\' for deletion', 'serverTime': '2021-02-27T03:18:35.998Z'}
      [2021-02-27 03:19:06,759] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.106.20', 'type': 'info', 'code': 1, 'module': 'menelaus_web_sup', 'tstamp': 1614424661647, 'shortText': 'web start ok', 'text': 'Couchbase Server has started on web port 8091 on node \'ns_1@172.23.106.20\'. Version: "7.0.0-4554-enterprise".', 'serverTime': '2021-02-27T03:17:41.647Z'}
      [2021-02-27 03:19:06,759] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'info', 'code': 4, 'module': 'ns_node_disco', 'tstamp': 1614424661473, 'shortText': 'node up', 'text': "Node 'ns_1@172.23.121.75' saw that node 'ns_1@172.23.106.20' came up. Tags: []", 'serverTime': '2021-02-27T03:17:41.473Z'}
      [2021-02-27 03:19:06,759] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'warning', 'code': 5, 'module': 'ns_node_disco', 'tstamp': 1614424624260, 'shortText': 'node down', 'text': "Node 'ns_1@172.23.121.75' saw that node 'ns_1@172.23.106.20' went down. Details: [{nodedown_reason,\n                                                                                   connection_closed}]", 'serverTime': '2021-02-27T03:17:04.260Z'}
      [2021-02-27 03:19:06,759] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1614424533594, 'shortText': 'message', 'text': 'Bucket "default" loaded on node \'ns_1@172.23.121.75\' in 0 seconds.', 'serverTime': '2021-02-27T03:15:33.594Z'}
      [2021-02-27 03:19:06,759] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'info', 'code': 12, 'module': 'menelaus_web', 'tstamp': 1614424533507, 'shortText': 'message', 'text': 'Created bucket "default" of type: couchbase\n[{num_replicas,1},\n {replica_index,true},\n {ram_quota,1727004672},\n {durability_min_level,none},\n {pitr_enabled,false},\n {pitr_granularity,600},\n {pitr_max_history_age,86400},\n {flush_enabled,true},\n {num_threads,3},\n {eviction_policy,value_only},\n {conflict_resolution_type,seqno},\n {storage_mode,couchstore},\n {max_ttl,0},\n {compression_mode,passive}]', 'serverTime': '2021-02-27T03:15:33.507Z'}
      [2021-02-27 03:19:06,759] - [fts_base] [140676780521216] - INFO - closing all ssh connections
      [2021-02-27 03:19:06,760] - [fts_base] [140676780521216] - INFO - closing all memcached connections
      

      Have attached collected logs, corresponding test execution logs are here:

      [2021-03-04 17:17:12,520] - [rest_client:1714] INFO - rebalance params : {'knownNodes': 'ns_1@172.23.107.35,ns_1@172.23.107.56,ns_1@172.23.107.75', 'ejectedNodes': 'ns_1@172.23.107.75', 'user': 'Administrator', 'password': 'password'}
      [2021-03-04 17:17:12,596] - [rest_client:1719] INFO - rebalance operation started
      [2021-03-04 17:17:12,643] - [rest_client:1881] INFO - rebalance percentage : 0.00 %
      [2021-03-04 17:17:12,644] - [task:844] INFO - Rebalance - status: running, progress: 0.00%
      [2021-03-04 17:17:16,677] - [fts_base:654] INFO - Killing cbft on server: ip:172.23.107.56 port:8091 ssh_username:root
      [2021-03-04 17:17:16,682] - [remote_util:298] INFO - SSH Connecting to 172.23.107.56 with username:root, attempt#1 of 5
      [2021-03-04 17:17:17,115] - [remote_util:336] INFO - SSH Connected to 172.23.107.56 as root
      [2021-03-04 17:17:17,708] - [remote_util:3567] INFO - os_distro: CentOS, os_version: centos 7, is_linux_distro: True
      [2021-03-04 17:17:18,659] - [remote_util:3723] INFO - extract_remote_info-->distribution_type: CentOS, distribution_version: centos 7
      [2021-03-04 17:17:18,661] - [remote_util:3407] INFO - running command.raw on 172.23.107.56: pkill -e cbft
      [2021-03-04 17:17:18,834] - [remote_util:3455] INFO - command executed successfully with root
      [2021-03-04 17:17:18,835] - [remote_util:3240] INFO - cbft killed (pid 10984)
      [2021-03-04 17:17:22,787] - [rest_client:1864] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2021-03-04 17:17:22,867] - [rest_client:3765] INFO - Latest logs from UI on 172.23.107.35:
      [2021-03-04 17:17:22,868] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'auto_rebalance', 'tstamp': 1614907038877, 'shortText': 'message', 'text': 'Rebalance with Id 4e2741c980fc7af4c748558d26defbc4 will be retried after 60 seconds. Retry attempt number 1', 'serverTime': '2021-03-04T17:17:18.877Z'}
      [2021-03-04 17:17:22,869] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1614907038875, 'shortText': 'message', 'text': 'Rebalance exited with reason {service_rebalance_failed,fts,\n                                 {agent_died,<27720.2739.1>,\n                                     {lost_connection,shutdown}}}.\nRebalance Operation Id = 4e2741c980fc7af4c748558d26defbc4', 'serverTime': '2021-03-04T17:17:18.875Z'}
      [2021-03-04 17:17:22,870] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.56', 'type': 'info', 'code': 0, 'module': 'ns_log', 'tstamp': 1614907038868, 'shortText': 'message', 'text': "Service 'fts' exited with status 143. Restarting. Messages:\n2021-03-04T17:17:13.455-08:00 [INFO] (GOCBCORE) Creating new agent: &{MemdAddrs:[] HTTPAddrs:[127.0.0.1:8091] BucketName:default UserAgent:default_index_1_2465dc3f9bb14079_13aa53f3 UseTLS:false NetworkType: Auth:0x1d09b50 TLSRootCAProvider:<nil> UseMutationTokens:false UseCompression:false UseDurations:false DisableDecompression:false UseOutOfOrderResponses:false DisableXErrors:false DisableJSONHello:false DisableSyncReplicationHello:false UseCollections:true CompressionMinSize:0 CompressionMinRatio:0 HTTPRedialPeriod:0s HTTPRetryDelay:0s CccpMaxWait:0s CccpPollPeriod:0s ConnectTimeout:1m0s KVConnectTimeout:7s KvPoolSize:0 MaxQueueSize:0 HTTPMaxIdleConns:0 HTTPMaxIdleConnsPerHost:0 HTTPIdleConnectionTimeout:0s Tracer:<nil> NoRootTraceSpans:false DefaultRetryStrategy:<nil> CircuitBreakerConfig:{Enabled:false VolumeThreshold:0 ErrorThresholdPercentage:0 SleepWindow:0s RollingWindow:0s CompletionCallback:<nil> CanaryTimeout:0s} UseZombieLogger:false ZombieLoggerInterval:0s ZombieLoggerSampleSize:0 AuthMechanisms:[]}\n2021-03-04T17:17:13.525-08:00 [INFO] feed_dcp_gocbcore: newGocbcoreDCPFeed, name: default_index_1_2465dc3f9bb14079_13aa53f3, indexName: default_index_1, server: http://127.0.0.1:8091, bucketName: default, bucketUUID: abf3fb490b940a4c88f93e6031db8e39\n2021-03-04T17:17:13.525-08:00 [INFO] feed_dcp_gocbcore: Start, name: default_index_1_2465dc3f9bb14079_13aa53f3, num streams: 171, manifestUID: 0, streamOptions: {FilterOptions: &{ScopeID:0 CollectionIDs:[0]}, StreamOptions: &{StreamID:3}}\n2021-03-04T17:17:13.528-08:00 [INFO] janitor: awakes, op: kick, msg: cfg changed, key: curMetaKvPlanKey\n2021-03-04T17:17:13.559-08:00 [INFO] janitor: pindexes to remove: 0\n2021-03-04T17:17:13.559-08:00 [INFO] janitor: pindexes to add: 0\n2021-03-04T17:17:13.559-08:00 [INFO] janitor: pindexes to restart: 0\n2021-03-04T17:17:13.560-08:00 [INFO] janitor: feeds to remove: 0\n2021-03-04T17:17:13.560-08:00 [INFO] janitor: feeds to add: 0\n", 'serverTime': '2021-03-04T17:17:18.868Z'}
      [2021-03-04 17:17:22,871] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'ns_vbucket_mover', 'tstamp': 1614907033048, 'shortText': 'message', 'text': 'Bucket "default" rebalance appears to be swap rebalance', 'serverTime': '2021-03-04T17:17:13.048Z'}
      [2021-03-04 17:17:22,871] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'ns_rebalancer', 'tstamp': 1614907032874, 'shortText': 'message', 'text': 'Started rebalancing bucket default', 'serverTime': '2021-03-04T17:17:12.874Z'}
      [2021-03-04 17:17:22,872] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1614907032559, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.107.35','ns_1@172.23.107.56'], EjectNodes = ['ns_1@172.23.107.75'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 4e2741c980fc7af4c748558d26defbc4", 'serverTime': '2021-03-04T17:17:12.559Z'}
      [2021-03-04 17:17:22,873] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1614906881874, 'shortText': 'message', 'text': 'Bucket "default" loaded on node \'ns_1@172.23.107.35\' in 0 seconds.', 'serverTime': '2021-03-04T17:14:41.874Z'}
      [2021-03-04 17:17:22,874] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 12, 'module': 'menelaus_web', 'tstamp': 1614906881762, 'shortText': 'message', 'text': 'Created bucket "default" of type: couchbase\n[{num_replicas,1},\n {replica_index,true},\n {ram_quota,209715200},\n {durability_min_level,none},\n {pitr_enabled,false},\n {pitr_granularity,600},\n {pitr_max_history_age,86400},\n {flush_enabled,true},\n {num_threads,3},\n {eviction_policy,value_only},\n {conflict_resolution_type,seqno},\n {storage_mode,couchstore},\n {max_ttl,0},\n {compression_mode,passive}]', 'serverTime': '2021-03-04T17:14:41.762Z'}
      [2021-03-04 17:17:22,874] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1614906869020, 'shortText': 'message', 'text': 'Rebalance completed successfully.\nRebalance Operation Id = 0f35eb393a53117d0e179e70e256930c', 'serverTime': '2021-03-04T17:14:29.020Z'}
      [2021-03-04 17:17:22,875] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1614906868130, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.107.35','ns_1@172.23.107.56',\n                                 'ns_1@172.23.107.75'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 0f35eb393a53117d0e179e70e256930c", 'serverTime': '2021-03-04T17:14:28.130Z'}
      

       

      Collected logs can be found here:
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/Evgeny+Makarenko/collectinfo-2021-03-05T012013-ns_1%40172.23.107.35.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/Evgeny+Makarenko/collectinfo-2021-03-05T012013-ns_1%40172.23.107.56.zip

      Attachments

        1. 172.23.105.49-diag.txt.gz
          163 kB
        2. 172.23.106.20-diag.txt.gz
          507 kB
        3. 172.23.121.75-diag.txt.gz
          520 kB
        4. 172.23.97.222-diag.txt.gz
          164 kB
        5. 172.23.97.65-diag.txt.gz
          163 kB
        6. test.log
          137 kB
        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          girish.benakappa Girish Benakappa added a comment - - edited

          We have seen this failure with 7.0.0-4617 as well. The test is to supposed to test retry rebalance.

           ./testrunner -i /tmp/testexec.29773.ini -p get-cbcollect-info=False,disable_HTP=True,get-logs=False,stop-on-failure=False,GROUP=P1,index_type=scorch,fts_quota=750 -t fts.moving_topology_fts.MovingTopFTS.retry_rebalance_in_during_querying,items=100000,cluster=D,F,num_queries=50,GROUP=P1,retry_time=60,num_retries=1

          Steps in the test:

          • Create a cluster with kv and fts
          • Retry Rebalance settings changed to : {'enabled': True, 'afterTimePeriod': 60, 'maxAttempts': 1}
          • Create below index and load docs and wait for indexing to complete

          {"type": "fulltext-index", "name": "default_index", "uuid": "", "params": {"store": {"kvStoreName": "mossStore", "mossStoreOptions": {}, "indexType": "scorch"}}, "sourceType": "couchbase", "sourceName": "default", "sourceUUID": "", "planParams": {"numReplicas": 0, "maxPartitionsPerPIndex": 171}, "sourceParams": {}} 

          * Start rebalance in fts node and query fts index

          • after 5 secs, reboot fts node 
          • So rebalance in operation is expected and fail and retry rebalance should kick in
          • Wait for querying fts index to finish
          • Start teardown where to remove all nodes from the cluster which is failing with:

          [2021-03-07 14:08:01,390] - [cluster_helper:262] INFO - rebalancing all nodes in order to remove nodes
          [2021-03-07 14:08:01,392] - [rest_client:1714] INFO - rebalance params : {'knownNodes': 'ns_1@172.23.122.130,ns_1@172.23.123.57,ns_1@172.23.123.58', 'ejectedNodes': 'ns_1@172.23.122.130,ns_1@172.23.123.57', 'user': 'Administrator', 'password': 'password'}
          [2021-03-07 14:08:01,421] - [rest_client:1719] INFO - rebalance operation started
          [2021-03-07 14:08:01,424] - [rest_client:1881] INFO - rebalance percentage : 0.00 %
          [2021-03-07 14:08:11,428] - [rest_client:1881] INFO - rebalance percentage : 0.00 %
          [2021-03-07 14:08:21,441] - [rest_client:1881] INFO - rebalance percentage : 0.00 %
          [2021-03-07 14:08:31,455] - [rest_client:1864] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
          [2021-03-07 14:08:31,467] - [rest_client:3765] INFO - Latest logs from UI on 172.23.123.58:
          [2021-03-07 14:08:31,467] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.123.58', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1615154911400, 'shortText': 'message', 'text': "Rebalance exited with reason {prepare_rebalance_failed,\n                              {error,\n                               {failed_nodes,\n                                [{'ns_1@172.23.122.130',{error,timeout}}]}}}.\nRebalance Operation Id = 119ad6d09c43e4f09322705ab14514ab", 'serverTime': '2021-03-07T14:08:31.400Z'}
          [2021-03-07 14:08:31,467] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.123.58', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1615154881398, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.123.58'], EjectNodes = ['ns_1@172.23.122.130',\n                                                                      'ns_1@172.23.123.57'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 119ad6d09c43e4f09322705ab14514ab", 'serverTime': '2021-03-07T14:08:01.398Z'}
          [2021-03-07 14:08:31,467] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.123.58', 'type': 'warning', 'code': 102, 'module': 'menelaus_web', 'tstamp': 1615154881395, 'shortText': 'client-side error report', 'text': 'Client-side error-report for user "Administrator" on node \'ns_1@172.23.123.58\':\nUser-Agent:Python-httplib2/0.13.1 (gzip)\nStarting rebalance from test, ejected nodes [\'ns_1@172.23.122.130\', \'ns_1@172.23.123.57\']', 'serverTime': '2021-03-07T14:08:01.395Z'}
          [2021-03-07 14:08:31,467] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.123.58', 'type': 'info', 'code': 11, 'module': 'menelaus_web', 'tstamp': 1615154881371, 'shortText': 'message', 'text': 'Deleted bucket "default"\n', 'serverTime': '2021-03-07T14:08:01.371Z'}
          [2021-03-07 14:08:31,467] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.123.58', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1615154880724, 'shortText': 'message', 'text': 'Shutting down bucket "default" on \'ns_1@172.23.123.58\' for deletion', 'serverTime': '2021-03-07T14:08:00.724Z'}
          [2021-03-07 14:08:31,467] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.123.58', 'type': 'info', 'code': 4, 'module': 'ns_node_disco', 'tstamp': 1615154826965, 'shortText': 'node up', 'text': "Node 'ns_1@172.23.123.58' saw that node 'ns_1@172.23.123.57' came up. Tags: []", 'serverTime': '2021-03-07T14:07:06.965Z'}
          [2021-03-07 14:08:31,468] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.123.57', 'type': 'info', 'code': 4, 'module': 'ns_node_disco', 'tstamp': 1615154826961, 'shortText': 'node up', 'text': "Node 'ns_1@172.23.123.57' saw that node 'ns_1@172.23.123.58' came up. Tags: []", 'serverTime': '2021-03-07T14:07:06.961Z'}
          [2021-03-07 14:08:31,468] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.123.58', 'type': 'warning', 'code': 5, 'module': 'ns_node_disco', 'tstamp': 1615154791356, 'shortText': 'node down', 'text': "Node 'ns_1@172.23.123.58' saw that node 'ns_1@172.23.123.57' went down. Details: [{nodedown_reason,\n                                                                                   connection_closed}]", 'serverTime': '2021-03-07T14:06:31.356Z'}
          [2021-03-07 14:08:31,468] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.123.58', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1615154705902, 'shortText': 'message', 'text': 'Bucket "default" loaded on node \'ns_1@172.23.123.58\' in 0 seconds.', 'serverTime': '2021-03-07T14:05:05.902Z'}
          [2021-03-07 14:08:31,468] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.123.58', 'type': 'info', 'code': 12, 'module': 'menelaus_web', 'tstamp': 1615154705340, 'shortText': 'message', 'text': 'Created bucket "default" of type: couchbase\n[{num_replicas,1},\n {replica_index,true},\n {ram_quota,1727004672},\n {durability_min_level,none},\n {pitr_enabled,false},\n {pitr_granularity,600},\n {pitr_max_history_age,86400},\n {flush_enabled,true},\n {num_threads,3},\n {eviction_policy,value_only},\n {conflict_resolution_type,seqno},\n {storage_mode,couchstore},\n {max_ttl,0},\n {compression_mode,passive}]', 'serverTime': '2021-03-07T14:05:05.340Z'}
          Cluster instance shutdown with force 

          girish.benakappa Girish Benakappa added a comment - - edited We have seen this failure with 7.0.0-4617 as well. The test is to supposed to test retry rebalance. ./testrunner -i /tmp/testexec. 29773 .ini -p get-cbcollect-info=False,disable_HTP=True,get-logs=False,stop-on-failure=False,GROUP=P1,index_type=scorch,fts_quota= 750 -t fts.moving_topology_fts.MovingTopFTS.retry_rebalance_in_during_querying,items= 100000 ,cluster=D,F,num_queries= 50 ,GROUP=P1,retry_time= 60 ,num_retries= 1 Steps in the test: Create a cluster with kv and fts Retry Rebalance settings changed to : {'enabled': True, 'afterTimePeriod': 60, 'maxAttempts': 1} Create below index and load docs and wait for indexing to complete { "type" : "fulltext-index" , "name" : "default_index" , "uuid" : "" , "params" : { "store" : { "kvStoreName" : "mossStore" , "mossStoreOptions" : {}, "indexType" : "scorch" }}, "sourceType" : "couchbase" , "sourceName" : "default" , "sourceUUID" : "" , "planParams" : { "numReplicas" : 0 , "maxPartitionsPerPIndex" : 171 }, "sourceParams" : {}} * Start rebalance in fts node and query fts index after 5 secs, reboot fts node  So rebalance in operation is expected and fail and retry rebalance should kick in Wait for querying fts index to finish Start teardown where to remove all nodes from the cluster which is failing with: [ 2021 - 03 - 07 14 : 08 : 01 , 390 ] - [cluster_helper: 262 ] INFO - rebalancing all nodes in order to remove nodes [ 2021 - 03 - 07 14 : 08 : 01 , 392 ] - [rest_client: 1714 ] INFO - rebalance params : { 'knownNodes' : 'ns_1@172.23.122.130,ns_1@172.23.123.57,ns_1@172.23.123.58' , 'ejectedNodes' : 'ns_1@172.23.122.130,ns_1@172.23.123.57' , 'user' : 'Administrator' , 'password' : 'password' } [ 2021 - 03 - 07 14 : 08 : 01 , 421 ] - [rest_client: 1719 ] INFO - rebalance operation started [ 2021 - 03 - 07 14 : 08 : 01 , 424 ] - [rest_client: 1881 ] INFO - rebalance percentage : 0.00 % [ 2021 - 03 - 07 14 : 08 : 11 , 428 ] - [rest_client: 1881 ] INFO - rebalance percentage : 0.00 % [ 2021 - 03 - 07 14 : 08 : 21 , 441 ] - [rest_client: 1881 ] INFO - rebalance percentage : 0.00 % [ 2021 - 03 - 07 14 : 08 : 31 , 455 ] - [rest_client: 1864 ] ERROR - { 'status' : 'none' , 'errorMessage' : 'Rebalance failed. See logs for detailed reason. You can try again.' } - rebalance failed [ 2021 - 03 - 07 14 : 08 : 31 , 467 ] - [rest_client: 3765 ] INFO - Latest logs from UI on 172.23 . 123.58 : [ 2021 - 03 - 07 14 : 08 : 31 , 467 ] - [rest_client: 3766 ] ERROR - { 'node' : 'ns_1@172.23.123.58' , 'type' : 'critical' , 'code' : 0 , 'module' : 'ns_orchestrator' , 'tstamp' : 1615154911400 , 'shortText' : 'message' , 'text' : "Rebalance exited with reason {prepare_rebalance_failed,\n {error,\n {failed_nodes,\n [{'ns_1@172.23.122.130',{error,timeout}}]}}}.\nRebalance Operation Id = 119ad6d09c43e4f09322705ab14514ab" , 'serverTime' : '2021-03-07T14:08:31.400Z' } [ 2021 - 03 - 07 14 : 08 : 31 , 467 ] - [rest_client: 3766 ] ERROR - { 'node' : 'ns_1@172.23.123.58' , 'type' : 'info' , 'code' : 0 , 'module' : 'ns_orchestrator' , 'tstamp' : 1615154881398 , 'shortText' : 'message' , 'text' : "Starting rebalance, KeepNodes = ['ns_1@172.23.123.58'], EjectNodes = ['ns_1@172.23.122.130',\n 'ns_1@172.23.123.57'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 119ad6d09c43e4f09322705ab14514ab" , 'serverTime' : '2021-03-07T14:08:01.398Z' } [ 2021 - 03 - 07 14 : 08 : 31 , 467 ] - [rest_client: 3766 ] ERROR - { 'node' : 'ns_1@172.23.123.58' , 'type' : 'warning' , 'code' : 102 , 'module' : 'menelaus_web' , 'tstamp' : 1615154881395 , 'shortText' : 'client-side error report' , 'text' : 'Client-side error-report for user "Administrator" on node \'ns_1@172.23.123.58\':\nUser-Agent:Python-httplib2/0.13.1 (gzip)\nStarting rebalance from test, ejected nodes [\'ns_1@172.23.122.130\', \'ns_1@172.23.123.57\']' , 'serverTime' : '2021-03-07T14:08:01.395Z' } [ 2021 - 03 - 07 14 : 08 : 31 , 467 ] - [rest_client: 3766 ] ERROR - { 'node' : 'ns_1@172.23.123.58' , 'type' : 'info' , 'code' : 11 , 'module' : 'menelaus_web' , 'tstamp' : 1615154881371 , 'shortText' : 'message' , 'text' : 'Deleted bucket "default"\n' , 'serverTime' : '2021-03-07T14:08:01.371Z' } [ 2021 - 03 - 07 14 : 08 : 31 , 467 ] - [rest_client: 3766 ] ERROR - { 'node' : 'ns_1@172.23.123.58' , 'type' : 'info' , 'code' : 0 , 'module' : 'ns_memcached' , 'tstamp' : 1615154880724 , 'shortText' : 'message' , 'text' : 'Shutting down bucket "default" on \'ns_1@172.23.123.58\' for deletion' , 'serverTime' : '2021-03-07T14:08:00.724Z' } [ 2021 - 03 - 07 14 : 08 : 31 , 467 ] - [rest_client: 3766 ] ERROR - { 'node' : 'ns_1@172.23.123.58' , 'type' : 'info' , 'code' : 4 , 'module' : 'ns_node_disco' , 'tstamp' : 1615154826965 , 'shortText' : 'node up' , 'text' : "Node 'ns_1@172.23.123.58' saw that node 'ns_1@172.23.123.57' came up. Tags: []" , 'serverTime' : '2021-03-07T14:07:06.965Z' } [ 2021 - 03 - 07 14 : 08 : 31 , 468 ] - [rest_client: 3766 ] ERROR - { 'node' : 'ns_1@172.23.123.57' , 'type' : 'info' , 'code' : 4 , 'module' : 'ns_node_disco' , 'tstamp' : 1615154826961 , 'shortText' : 'node up' , 'text' : "Node 'ns_1@172.23.123.57' saw that node 'ns_1@172.23.123.58' came up. Tags: []" , 'serverTime' : '2021-03-07T14:07:06.961Z' } [ 2021 - 03 - 07 14 : 08 : 31 , 468 ] - [rest_client: 3766 ] ERROR - { 'node' : 'ns_1@172.23.123.58' , 'type' : 'warning' , 'code' : 5 , 'module' : 'ns_node_disco' , 'tstamp' : 1615154791356 , 'shortText' : 'node down' , 'text' : "Node 'ns_1@172.23.123.58' saw that node 'ns_1@172.23.123.57' went down. Details: [{nodedown_reason,\n connection_closed}]" , 'serverTime' : '2021-03-07T14:06:31.356Z' } [ 2021 - 03 - 07 14 : 08 : 31 , 468 ] - [rest_client: 3766 ] ERROR - { 'node' : 'ns_1@172.23.123.58' , 'type' : 'info' , 'code' : 0 , 'module' : 'ns_memcached' , 'tstamp' : 1615154705902 , 'shortText' : 'message' , 'text' : 'Bucket "default" loaded on node \'ns_1@172.23.123.58\' in 0 seconds.' , 'serverTime' : '2021-03-07T14:05:05.902Z' } [ 2021 - 03 - 07 14 : 08 : 31 , 468 ] - [rest_client: 3766 ] ERROR - { 'node' : 'ns_1@172.23.123.58' , 'type' : 'info' , 'code' : 12 , 'module' : 'menelaus_web' , 'tstamp' : 1615154705340 , 'shortText' : 'message' , 'text' : 'Created bucket "default" of type: couchbase\n[{num_replicas,1},\n {replica_index,true},\n {ram_quota,1727004672},\n {durability_min_level,none},\n {pitr_enabled,false},\n {pitr_granularity,600},\n {pitr_max_history_age,86400},\n {flush_enabled,true},\n {num_threads,3},\n {eviction_policy,value_only},\n {conflict_resolution_type,seqno},\n {storage_mode,couchstore},\n {max_ttl,0},\n {compression_mode,passive}]' , 'serverTime' : '2021-03-07T14:05:05.340Z' } Cluster instance shutdown with force
          Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited

          Thanks Girish!

          But the test timestamps /logs aren't covering the FTS events in the logs. Your recent logs covers the final nodes out rebalance, but what happened before that remains unknown.

          If I look into the logs from you then

          Rebalance exited with reason {prepare_rebalance_failed,\n                              {error,\n                               {failed_nodes,\n                                [{'ns_1@172.23.122.130' 

          Is this node 130 an fts node? We don't have logs for checking the prepare api call failure.

          Similarly, we don't have similar test/cbcollected log stamps for the earlier attachments too.

          Evgeny Makarenko test logs were pointing to the beginning of the test but not towards the end.

           

          So, it would be great if you can update both test and cbcollect logs for this test alone.

          With the current set of logs, there is nothing wrong is observed in the logs.

           

           

          Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited Thanks Girish! But the test timestamps /logs aren't covering the FTS events in the logs. Your recent logs covers the final nodes out rebalance, but what happened before that remains unknown. If I look into the logs from you then Rebalance exited with reason {prepare_rebalance_failed,\n {error,\n {failed_nodes,\n [{ 'ns_1@172.23.122.130' Is this node 130 an fts node? We don't have logs for checking the prepare api call failure. Similarly, we don't have similar test/cbcollected log stamps for the earlier attachments too. Evgeny Makarenko  test logs were pointing to the beginning of the test but not towards the end.   So, it would be great if you can update both test and cbcollect logs for this test alone. With the current set of logs, there is nothing wrong is observed in the logs.    

          I have reproduced this problem locally for build 7.0.0-4617.

          Cluster config is this:

          [servers]
          1:_1
          2:_2
          3:_3
          4:_4
           
          [_1]
          ip:172.23.107.35
          [_2]
          ip:172.23.107.56
          [_3]
          ip:172.23.107.75
          [_4]
          ip:172.23.107.77

          Critical test.log fragment is here:

          2021-03-16 15:17:35 | INFO | MainProcess | test_thread | [rest_client.print_UI_logs] Latest logs from UI on 172.23.107.35:
          2021-03-16 15:17:35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.107.35', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1615933055024, 'shortText': 'message', 'text': "Rebalance exited with reason {prepare_rebalance_failed,\n                              {error,\n                               {failed_nodes,\n                                [{'ns_1@172.23.107.77',{error,timeout}}]}}}.\nRebalance Operation Id = 2a363d2f3a4492f65e0dcc99c0a40baf", 'serverTime': '2021-03-16T15:17:35.024Z'}
          2021-03-16 15:17:35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1615933025021, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.107.35'], EjectNodes = ['ns_1@172.23.107.56',\n                                                                      'ns_1@172.23.107.77'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 2a363d2f3a4492f65e0dcc99c0a40baf", 'serverTime': '2021-03-16T15:17:05.021Z'}
          2021-03-16 15:17:35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.107.35', 'type': 'warning', 'code': 102, 'module': 'menelaus_web', 'tstamp': 1615933024977, 'shortText': 'client-side error report', 'text': 'Client-side error-report for user "Administrator" on node \'ns_1@172.23.107.35\':\nUser-Agent:Python-httplib2/0.13.1 (gzip)\nStarting rebalance from test, ejected nodes [\'ns_1@172.23.107.56\', \'ns_1@172.23.107.77\']', 'serverTime': '2021-03-16T15:17:04.977Z'}
          2021-03-16 15:17:35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 11, 'module': 'menelaus_web', 'tstamp': 1615933024655, 'shortText': 'message', 'text': 'Deleted bucket "default"\n', 'serverTime': '2021-03-16T15:17:04.655Z'}
          2021-03-16 15:17:35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1615933024101, 'shortText': 'message', 'text': 'Shutting down bucket "default" on \'ns_1@172.23.107.35\' for deletion', 'serverTime': '2021-03-16T15:17:04.101Z'}
          2021-03-16 15:17:35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.107.56', 'type': 'info', 'code': 1, 'module': 'menelaus_web_sup', 'tstamp': 1615932898099, 'shortText': 'web start ok', 'text': 'Couchbase Server has started on web port 8091 on node \'ns_1@172.23.107.56\'. Version: "7.0.0-4617-enterprise".', 'serverTime': '2021-03-16T15:14:58.099Z'}
          2021-03-16 15:17:35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 4, 'module': 'ns_node_disco', 'tstamp': 1615932897584, 'shortText': 'node up', 'text': "Node 'ns_1@172.23.107.35' saw that node 'ns_1@172.23.107.56' came up. Tags: []", 'serverTime': '2021-03-16T15:14:57.584Z'}
          2021-03-16 15:17:35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.107.56', 'type': 'info', 'code': 4, 'module': 'ns_node_disco', 'tstamp': 1615932897577, 'shortText': 'node up', 'text': "Node 'ns_1@172.23.107.56' saw that node 'ns_1@172.23.107.35' came up. Tags: []", 'serverTime': '2021-03-16T15:14:57.577Z'}
          2021-03-16 15:17:35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.107.56', 'type': 'info', 'code': 2, 'module': 'ns_cookie_manager', 'tstamp': 1615932897545, 'shortText': 'cookie update', 'text': 'Node \'ns_1@172.23.107.56\' synchronized otp cookie {sanitized,\n                                                   <<"kh9CnWP47Kka4ppGmaM3BjBAJbug2lmLq23zwHPGK2I=">>} from cluster', 'serverTime': '2021-03-16T15:14:57.545Z'}
          2021-03-16 15:17:35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] {'node': 'ns_1@172.23.107.35', 'type': 'warning', 'code': 5, 'module': 'ns_node_disco', 'tstamp': 1615932835079, 'shortText': 'node down', 'text': "Node 'ns_1@172.23.107.35' saw that node 'ns_1@172.23.107.56' went down. Details: [{nodedown_reason,\n                                                                                   connection_closed}]", 'serverTime': '2021-03-16T15:13:55.079Z'}

          Full test log is attached (test.log)

          Fresh logs are collected:

          https://cb-jira.s3.us-east-2.amazonaws.com/logs/Evgeny+Makarenko/collectinfo-2021-03-16T222349-ns_1%40172.23.107.77.zip

          https://cb-jira.s3.us-east-2.amazonaws.com/logs/Evgeny+Makarenko/collectinfo-2021-03-16T222258-ns_1%40cb.local.zip
          https://cb-jira.s3.us-east-2.amazonaws.com/logs/Evgeny+Makarenko/collectinfo-2021-03-16T222208-ns_1%40172.23.107.35.zip
          https://cb-jira.s3.us-east-2.amazonaws.com/logs/Evgeny+Makarenko/collectinfo-2021-03-16T222208-ns_1%40172.23.107.56.zip

          evgeny.makarenko Evgeny Makarenko (Inactive) added a comment - I have reproduced this problem locally for build 7.0.0-4617. Cluster config is this: [servers] 1 :_1 2 :_2 3 :_3 4 :_4   [_1] ip: 172.23 . 107.35 [_2] ip: 172.23 . 107.56 [_3] ip: 172.23 . 107.75 [_4] ip: 172.23 . 107.77 Critical test.log fragment is here: 2021 - 03 - 16 15 : 17 : 35 | INFO | MainProcess | test_thread | [rest_client.print_UI_logs] Latest logs from UI on 172.23 . 107.35 : 2021 - 03 - 16 15 : 17 : 35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] { 'node' : 'ns_1@172.23.107.35' , 'type' : 'critical' , 'code' : 0 , 'module' : 'ns_orchestrator' , 'tstamp' : 1615933055024 , 'shortText' : 'message' , 'text' : "Rebalance exited with reason {prepare_rebalance_failed,\n {error,\n {failed_nodes,\n [{'ns_1@172.23.107.77',{error,timeout}}]}}}.\nRebalance Operation Id = 2a363d2f3a4492f65e0dcc99c0a40baf" , 'serverTime' : '2021-03-16T15:17:35.024Z' } 2021 - 03 - 16 15 : 17 : 35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] { 'node' : 'ns_1@172.23.107.35' , 'type' : 'info' , 'code' : 0 , 'module' : 'ns_orchestrator' , 'tstamp' : 1615933025021 , 'shortText' : 'message' , 'text' : "Starting rebalance, KeepNodes = ['ns_1@172.23.107.35'], EjectNodes = ['ns_1@172.23.107.56',\n 'ns_1@172.23.107.77'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 2a363d2f3a4492f65e0dcc99c0a40baf" , 'serverTime' : '2021-03-16T15:17:05.021Z' } 2021 - 03 - 16 15 : 17 : 35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] { 'node' : 'ns_1@172.23.107.35' , 'type' : 'warning' , 'code' : 102 , 'module' : 'menelaus_web' , 'tstamp' : 1615933024977 , 'shortText' : 'client-side error report' , 'text' : 'Client-side error-report for user "Administrator" on node \'ns_1@172.23.107.35\':\nUser-Agent:Python-httplib2/0.13.1 (gzip)\nStarting rebalance from test, ejected nodes [\'ns_1@172.23.107.56\', \'ns_1@172.23.107.77\']' , 'serverTime' : '2021-03-16T15:17:04.977Z' } 2021 - 03 - 16 15 : 17 : 35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] { 'node' : 'ns_1@172.23.107.35' , 'type' : 'info' , 'code' : 11 , 'module' : 'menelaus_web' , 'tstamp' : 1615933024655 , 'shortText' : 'message' , 'text' : 'Deleted bucket "default"\n' , 'serverTime' : '2021-03-16T15:17:04.655Z' } 2021 - 03 - 16 15 : 17 : 35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] { 'node' : 'ns_1@172.23.107.35' , 'type' : 'info' , 'code' : 0 , 'module' : 'ns_memcached' , 'tstamp' : 1615933024101 , 'shortText' : 'message' , 'text' : 'Shutting down bucket "default" on \'ns_1@172.23.107.35\' for deletion' , 'serverTime' : '2021-03-16T15:17:04.101Z' } 2021 - 03 - 16 15 : 17 : 35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] { 'node' : 'ns_1@172.23.107.56' , 'type' : 'info' , 'code' : 1 , 'module' : 'menelaus_web_sup' , 'tstamp' : 1615932898099 , 'shortText' : 'web start ok' , 'text' : 'Couchbase Server has started on web port 8091 on node \'ns_1@172.23.107.56\'. Version: "7.0.0-4617-enterprise".' , 'serverTime' : '2021-03-16T15:14:58.099Z' } 2021 - 03 - 16 15 : 17 : 35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] { 'node' : 'ns_1@172.23.107.35' , 'type' : 'info' , 'code' : 4 , 'module' : 'ns_node_disco' , 'tstamp' : 1615932897584 , 'shortText' : 'node up' , 'text' : "Node 'ns_1@172.23.107.35' saw that node 'ns_1@172.23.107.56' came up. Tags: []" , 'serverTime' : '2021-03-16T15:14:57.584Z' } 2021 - 03 - 16 15 : 17 : 35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] { 'node' : 'ns_1@172.23.107.56' , 'type' : 'info' , 'code' : 4 , 'module' : 'ns_node_disco' , 'tstamp' : 1615932897577 , 'shortText' : 'node up' , 'text' : "Node 'ns_1@172.23.107.56' saw that node 'ns_1@172.23.107.35' came up. Tags: []" , 'serverTime' : '2021-03-16T15:14:57.577Z' } 2021 - 03 - 16 15 : 17 : 35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] { 'node' : 'ns_1@172.23.107.56' , 'type' : 'info' , 'code' : 2 , 'module' : 'ns_cookie_manager' , 'tstamp' : 1615932897545 , 'shortText' : 'cookie update' , 'text' : 'Node \'ns_1@172.23.107.56\' synchronized otp cookie {sanitized,\n <<"kh9CnWP47Kka4ppGmaM3BjBAJbug2lmLq23zwHPGK2I=">>} from cluster' , 'serverTime' : '2021-03-16T15:14:57.545Z' } 2021 - 03 - 16 15 : 17 : 35 | ERROR | MainProcess | test_thread | [rest_client.print_UI_logs] { 'node' : 'ns_1@172.23.107.35' , 'type' : 'warning' , 'code' : 5 , 'module' : 'ns_node_disco' , 'tstamp' : 1615932835079 , 'shortText' : 'node down' , 'text' : "Node 'ns_1@172.23.107.35' saw that node 'ns_1@172.23.107.56' went down. Details: [{nodedown_reason,\n connection_closed}]" , 'serverTime' : '2021-03-16T15:13:55.079Z' } Full test log is attached (test.log) Fresh logs are collected: https://cb-jira.s3.us-east-2.amazonaws.com/logs/Evgeny+Makarenko/collectinfo-2021-03-16T222349-ns_1%40172.23.107.77.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/Evgeny+Makarenko/collectinfo-2021-03-16T222258-ns_1%40cb.local.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/Evgeny+Makarenko/collectinfo-2021-03-16T222208-ns_1%40172.23.107.35.zip https://cb-jira.s3.us-east-2.amazonaws.com/logs/Evgeny+Makarenko/collectinfo-2021-03-16T222208-ns_1%40172.23.107.56.zip
          Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited

          Hey Evgeny Makarenko,  If you look into the attachments here, there are no log contents in the FTS logs of nodes 77.

          The above test logs refers to the error "prepare_rebalance_failed" with node  'ns_1@172.23.107.77',{error,timeout}}.

          And without any fts logs, we can't say anything here.

          And if you look into the fts logs of 56, there are no startTopology/prepareTopology call traces (rebalance start indicators) around the timestamp 2021-03-16 15:17:00 onwards.

           

          My suggestion would be, you please ensure that the fts logs are also captured in the attachments. Also, if the environment is still there, please get it for these nodes.

          Sreekanth Sivasankaran Sreekanth Sivasankaran added a comment - - edited Hey Evgeny Makarenko ,  If you look into the attachments here, there are no log contents in the FTS logs of nodes 77. The above test logs refers to the error "prepare_rebalance_failed" with node  'ns_1@172.23.107.77',{error,timeout}}. And without any fts logs, we can't say anything here. And if you look into the fts logs of 56, there are no startTopology/prepareTopology call traces (rebalance start indicators) around the timestamp 2021-03-16 15:17:00 onwards.   My suggestion would be, you please ensure that the fts logs are also captured in the attachments. Also, if the environment is still there, please get it for these nodes.

          After updating the test to not reboot the node to attain a failed rebalance, the test is now passing and we do not see this issue any more.

          mihir.kamdar Mihir Kamdar (Inactive) added a comment - After updating the test to not reboot the node to attain a failed rebalance, the test is now passing and we do not see this issue any more.

          People

            evgeny.makarenko Evgeny Makarenko (Inactive)
            evgeny.makarenko Evgeny Makarenko (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty