Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-44730

Unexpected server error received during rebalance

    XMLWordPrintable

Details

    Description

      Build: 7.0.0-4554

      Unexpected server error is found during rebalance after test  execution.

      Corresponding test log fragment is here:

      [2021-02-27 03:18:36,683] - [cluster_helper] [140676780521216] - INFO - rebalancing all nodes in order to remove nodes
      [2021-02-27 03:18:36,685] - [rest_client] [140676780521216] - INFO - rebalance params : {'knownNodes': 'ns_1@172.23.106.20,ns_1@172.23.121.75,ns_1@172.23.97.222', 'ejectedNodes': 'ns_1@172.23.106.20,ns_1@172.23.97.222', 'user': 'Administrator', 'password': 'password'}
      [2021-02-27 03:18:36,713] - [rest_client] [140676780521216] - INFO - rebalance operation started
      [2021-02-27 03:18:36,717] - [rest_client] [140676780521216] - INFO - rebalance percentage : 0.00 %
      [2021-02-27 03:18:46,730] - [rest_client] [140676780521216] - INFO - rebalance percentage : 0.00 %
      [2021-02-27 03:18:56,734] - [rest_client] [140676780521216] - INFO - rebalance percentage : 0.00 %
      [2021-02-27 03:19:06,748] - [rest_client] [140676780521216] - ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2021-02-27 03:19:06,758] - [rest_client] [140676780521216] - INFO - Latest logs from UI on 172.23.121.75:
      [2021-02-27 03:19:06,758] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1614424746687, 'shortText': 'message', 'text': "Rebalance exited with reason {prepare_rebalance_failed,\n                              {error,\n                               {failed_nodes,\n                                [{'ns_1@172.23.97.222',{error,timeout}}]}}}.\nRebalance Operation Id = e19e8d25353baa670bdc6d422c8b5212", 'serverTime': '2021-02-27T03:19:06.687Z'}
      [2021-02-27 03:19:06,758] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1614424716686, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.121.75'], EjectNodes = ['ns_1@172.23.106.20',\n                                                                      'ns_1@172.23.97.222'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = e19e8d25353baa670bdc6d422c8b5212", 'serverTime': '2021-02-27T03:18:36.686Z'}
      [2021-02-27 03:19:06,758] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'warning', 'code': 102, 'module': 'menelaus_web', 'tstamp': 1614424716683, 'shortText': 'client-side error report', 'text': 'Client-side error-report for user "Administrator" on node \'ns_1@172.23.121.75\':\nUser-Agent:Python-httplib2/0.13.1 (gzip)\nStarting rebalance from test, ejected nodes [\'ns_1@172.23.106.20\', \'ns_1@172.23.97.222\']', 'serverTime': '2021-02-27T03:18:36.683Z'}
      [2021-02-27 03:19:06,759] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'info', 'code': 11, 'module': 'menelaus_web', 'tstamp': 1614424716659, 'shortText': 'message', 'text': 'Deleted bucket "default"\n', 'serverTime': '2021-02-27T03:18:36.659Z'}
      [2021-02-27 03:19:06,759] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1614424715998, 'shortText': 'message', 'text': 'Shutting down bucket "default" on \'ns_1@172.23.121.75\' for deletion', 'serverTime': '2021-02-27T03:18:35.998Z'}
      [2021-02-27 03:19:06,759] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.106.20', 'type': 'info', 'code': 1, 'module': 'menelaus_web_sup', 'tstamp': 1614424661647, 'shortText': 'web start ok', 'text': 'Couchbase Server has started on web port 8091 on node \'ns_1@172.23.106.20\'. Version: "7.0.0-4554-enterprise".', 'serverTime': '2021-02-27T03:17:41.647Z'}
      [2021-02-27 03:19:06,759] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'info', 'code': 4, 'module': 'ns_node_disco', 'tstamp': 1614424661473, 'shortText': 'node up', 'text': "Node 'ns_1@172.23.121.75' saw that node 'ns_1@172.23.106.20' came up. Tags: []", 'serverTime': '2021-02-27T03:17:41.473Z'}
      [2021-02-27 03:19:06,759] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'warning', 'code': 5, 'module': 'ns_node_disco', 'tstamp': 1614424624260, 'shortText': 'node down', 'text': "Node 'ns_1@172.23.121.75' saw that node 'ns_1@172.23.106.20' went down. Details: [{nodedown_reason,\n                                                                                   connection_closed}]", 'serverTime': '2021-02-27T03:17:04.260Z'}
      [2021-02-27 03:19:06,759] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1614424533594, 'shortText': 'message', 'text': 'Bucket "default" loaded on node \'ns_1@172.23.121.75\' in 0 seconds.', 'serverTime': '2021-02-27T03:15:33.594Z'}
      [2021-02-27 03:19:06,759] - [rest_client] [140676780521216] - ERROR - {'node': 'ns_1@172.23.121.75', 'type': 'info', 'code': 12, 'module': 'menelaus_web', 'tstamp': 1614424533507, 'shortText': 'message', 'text': 'Created bucket "default" of type: couchbase\n[{num_replicas,1},\n {replica_index,true},\n {ram_quota,1727004672},\n {durability_min_level,none},\n {pitr_enabled,false},\n {pitr_granularity,600},\n {pitr_max_history_age,86400},\n {flush_enabled,true},\n {num_threads,3},\n {eviction_policy,value_only},\n {conflict_resolution_type,seqno},\n {storage_mode,couchstore},\n {max_ttl,0},\n {compression_mode,passive}]', 'serverTime': '2021-02-27T03:15:33.507Z'}
      [2021-02-27 03:19:06,759] - [fts_base] [140676780521216] - INFO - closing all ssh connections
      [2021-02-27 03:19:06,760] - [fts_base] [140676780521216] - INFO - closing all memcached connections
      

      Have attached collected logs, corresponding test execution logs are here:

      [2021-03-04 17:17:12,520] - [rest_client:1714] INFO - rebalance params : {'knownNodes': 'ns_1@172.23.107.35,ns_1@172.23.107.56,ns_1@172.23.107.75', 'ejectedNodes': 'ns_1@172.23.107.75', 'user': 'Administrator', 'password': 'password'}
      [2021-03-04 17:17:12,596] - [rest_client:1719] INFO - rebalance operation started
      [2021-03-04 17:17:12,643] - [rest_client:1881] INFO - rebalance percentage : 0.00 %
      [2021-03-04 17:17:12,644] - [task:844] INFO - Rebalance - status: running, progress: 0.00%
      [2021-03-04 17:17:16,677] - [fts_base:654] INFO - Killing cbft on server: ip:172.23.107.56 port:8091 ssh_username:root
      [2021-03-04 17:17:16,682] - [remote_util:298] INFO - SSH Connecting to 172.23.107.56 with username:root, attempt#1 of 5
      [2021-03-04 17:17:17,115] - [remote_util:336] INFO - SSH Connected to 172.23.107.56 as root
      [2021-03-04 17:17:17,708] - [remote_util:3567] INFO - os_distro: CentOS, os_version: centos 7, is_linux_distro: True
      [2021-03-04 17:17:18,659] - [remote_util:3723] INFO - extract_remote_info-->distribution_type: CentOS, distribution_version: centos 7
      [2021-03-04 17:17:18,661] - [remote_util:3407] INFO - running command.raw on 172.23.107.56: pkill -e cbft
      [2021-03-04 17:17:18,834] - [remote_util:3455] INFO - command executed successfully with root
      [2021-03-04 17:17:18,835] - [remote_util:3240] INFO - cbft killed (pid 10984)
      [2021-03-04 17:17:22,787] - [rest_client:1864] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2021-03-04 17:17:22,867] - [rest_client:3765] INFO - Latest logs from UI on 172.23.107.35:
      [2021-03-04 17:17:22,868] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'auto_rebalance', 'tstamp': 1614907038877, 'shortText': 'message', 'text': 'Rebalance with Id 4e2741c980fc7af4c748558d26defbc4 will be retried after 60 seconds. Retry attempt number 1', 'serverTime': '2021-03-04T17:17:18.877Z'}
      [2021-03-04 17:17:22,869] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1614907038875, 'shortText': 'message', 'text': 'Rebalance exited with reason {service_rebalance_failed,fts,\n                                 {agent_died,<27720.2739.1>,\n                                     {lost_connection,shutdown}}}.\nRebalance Operation Id = 4e2741c980fc7af4c748558d26defbc4', 'serverTime': '2021-03-04T17:17:18.875Z'}
      [2021-03-04 17:17:22,870] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.56', 'type': 'info', 'code': 0, 'module': 'ns_log', 'tstamp': 1614907038868, 'shortText': 'message', 'text': "Service 'fts' exited with status 143. Restarting. Messages:\n2021-03-04T17:17:13.455-08:00 [INFO] (GOCBCORE) Creating new agent: &{MemdAddrs:[] HTTPAddrs:[127.0.0.1:8091] BucketName:default UserAgent:default_index_1_2465dc3f9bb14079_13aa53f3 UseTLS:false NetworkType: Auth:0x1d09b50 TLSRootCAProvider:<nil> UseMutationTokens:false UseCompression:false UseDurations:false DisableDecompression:false UseOutOfOrderResponses:false DisableXErrors:false DisableJSONHello:false DisableSyncReplicationHello:false UseCollections:true CompressionMinSize:0 CompressionMinRatio:0 HTTPRedialPeriod:0s HTTPRetryDelay:0s CccpMaxWait:0s CccpPollPeriod:0s ConnectTimeout:1m0s KVConnectTimeout:7s KvPoolSize:0 MaxQueueSize:0 HTTPMaxIdleConns:0 HTTPMaxIdleConnsPerHost:0 HTTPIdleConnectionTimeout:0s Tracer:<nil> NoRootTraceSpans:false DefaultRetryStrategy:<nil> CircuitBreakerConfig:{Enabled:false VolumeThreshold:0 ErrorThresholdPercentage:0 SleepWindow:0s RollingWindow:0s CompletionCallback:<nil> CanaryTimeout:0s} UseZombieLogger:false ZombieLoggerInterval:0s ZombieLoggerSampleSize:0 AuthMechanisms:[]}\n2021-03-04T17:17:13.525-08:00 [INFO] feed_dcp_gocbcore: newGocbcoreDCPFeed, name: default_index_1_2465dc3f9bb14079_13aa53f3, indexName: default_index_1, server: http://127.0.0.1:8091, bucketName: default, bucketUUID: abf3fb490b940a4c88f93e6031db8e39\n2021-03-04T17:17:13.525-08:00 [INFO] feed_dcp_gocbcore: Start, name: default_index_1_2465dc3f9bb14079_13aa53f3, num streams: 171, manifestUID: 0, streamOptions: {FilterOptions: &{ScopeID:0 CollectionIDs:[0]}, StreamOptions: &{StreamID:3}}\n2021-03-04T17:17:13.528-08:00 [INFO] janitor: awakes, op: kick, msg: cfg changed, key: curMetaKvPlanKey\n2021-03-04T17:17:13.559-08:00 [INFO] janitor: pindexes to remove: 0\n2021-03-04T17:17:13.559-08:00 [INFO] janitor: pindexes to add: 0\n2021-03-04T17:17:13.559-08:00 [INFO] janitor: pindexes to restart: 0\n2021-03-04T17:17:13.560-08:00 [INFO] janitor: feeds to remove: 0\n2021-03-04T17:17:13.560-08:00 [INFO] janitor: feeds to add: 0\n", 'serverTime': '2021-03-04T17:17:18.868Z'}
      [2021-03-04 17:17:22,871] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'ns_vbucket_mover', 'tstamp': 1614907033048, 'shortText': 'message', 'text': 'Bucket "default" rebalance appears to be swap rebalance', 'serverTime': '2021-03-04T17:17:13.048Z'}
      [2021-03-04 17:17:22,871] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'ns_rebalancer', 'tstamp': 1614907032874, 'shortText': 'message', 'text': 'Started rebalancing bucket default', 'serverTime': '2021-03-04T17:17:12.874Z'}
      [2021-03-04 17:17:22,872] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1614907032559, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.107.35','ns_1@172.23.107.56'], EjectNodes = ['ns_1@172.23.107.75'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 4e2741c980fc7af4c748558d26defbc4", 'serverTime': '2021-03-04T17:17:12.559Z'}
      [2021-03-04 17:17:22,873] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'ns_memcached', 'tstamp': 1614906881874, 'shortText': 'message', 'text': 'Bucket "default" loaded on node \'ns_1@172.23.107.35\' in 0 seconds.', 'serverTime': '2021-03-04T17:14:41.874Z'}
      [2021-03-04 17:17:22,874] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 12, 'module': 'menelaus_web', 'tstamp': 1614906881762, 'shortText': 'message', 'text': 'Created bucket "default" of type: couchbase\n[{num_replicas,1},\n {replica_index,true},\n {ram_quota,209715200},\n {durability_min_level,none},\n {pitr_enabled,false},\n {pitr_granularity,600},\n {pitr_max_history_age,86400},\n {flush_enabled,true},\n {num_threads,3},\n {eviction_policy,value_only},\n {conflict_resolution_type,seqno},\n {storage_mode,couchstore},\n {max_ttl,0},\n {compression_mode,passive}]', 'serverTime': '2021-03-04T17:14:41.762Z'}
      [2021-03-04 17:17:22,874] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1614906869020, 'shortText': 'message', 'text': 'Rebalance completed successfully.\nRebalance Operation Id = 0f35eb393a53117d0e179e70e256930c', 'serverTime': '2021-03-04T17:14:29.020Z'}
      [2021-03-04 17:17:22,875] - [rest_client:3766] ERROR - {'node': 'ns_1@172.23.107.35', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1614906868130, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.107.35','ns_1@172.23.107.56',\n                                 'ns_1@172.23.107.75'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 0f35eb393a53117d0e179e70e256930c", 'serverTime': '2021-03-04T17:14:28.130Z'}
      

       

      Collected logs can be found here:
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/Evgeny+Makarenko/collectinfo-2021-03-05T012013-ns_1%40172.23.107.35.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/Evgeny+Makarenko/collectinfo-2021-03-05T012013-ns_1%40172.23.107.56.zip

      Attachments

        1. 172.23.105.49-diag.txt.gz
          163 kB
        2. 172.23.106.20-diag.txt.gz
          507 kB
        3. 172.23.121.75-diag.txt.gz
          520 kB
        4. 172.23.97.222-diag.txt.gz
          164 kB
        5. 172.23.97.65-diag.txt.gz
          163 kB
        6. test.log
          137 kB
        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            evgeny.makarenko Evgeny Makarenko (Inactive)
            evgeny.makarenko Evgeny Makarenko (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            8 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty