Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-60227

FTS - Rebalance failed, {Status:\\"503 Service Unavailable}

    XMLWordPrintable

Details

    Description

      7.2.4-7065 

      Test Steps:

      • Load data
      • create fts indexes
      • async perform update delete
        • Test passes
      • Run same test with different no of fts indexes and config
      • All the test running after this fail in cluster setup itself with service unavailable while adding node to the cluster
      • Issue is reproducible consistently

      This issue wasn't seen in 7.2.3 but is seen from 7.2.4-6966 (oldest build on which fts regression was run)

      The changeLog can be seen here - change-log

      This is mostly likely because of change related to https://issues.couchbase.com/browse/MB-59043
      commit - https://github.com/couchbase/cbgt/commit/5a48540989b62f781b426a39733b137477cafd24https://review.couchbase.org/c/cbgt/+/198788  – CC: Aditi Ahuja 

      Server logs:

      2023-12-25T12:37:14.979+00:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:705
      2023-12-25T12:37:14.991+00:00 [WARN] (GOCBCORE) Failed to connect to host, bad bucket. -- cbgt.GocbcoreLogger.Log() at gocbcore_utils.go:705
      2023-12-25T12:37:15.085+00:00 [INFO] rebalance: waitAssignPIndexDone, awaiting a stats sample grab for pindex standard_bucket_1_index_1_731546a5505f8525_4c1c5584, partition 0, formerPrimaryNode 050efb0c7c5705d756e91a54e5453567
      2023-12-25T12:37:15.091+00:00 [INFO] rebalance: runMonitor, ignoring the s.Error: &errors.errorString{s:"nodes: sample res.StatusCode not 200, res: &http.Response{Status:\"503 Service Unavailable\", StatusCode:503, Proto:\"HTTP/1.1\", ProtoMajor:1, ProtoMinor:1, Header:http.Header{\"Content-Length\":[]string{\"50\"}, \"Content-Type\":[]string{\"text/plain; charset=utf-8\"}, \"Date\":[]string{\"Mon, 25 Dec 2023 12:37:15 GMT\"}}, Body:(*http.bodyEOFSignal)(0xc014038540), ContentLength:50, TransferEncoding:[]string(nil), Close:false, Uncompressed:false, Trailer:http.Header(nil), Request:(*http.Request)(0xc0005f1300), TLS:(*tls.ConnectionState)(0xc0003a1290)}, urlUUID: monitor.UrlUUID{Url:\"https://172.23.121.251:18094\", UUID:\"050efb0c7c5705d756e91a54e5453567\"}, kind: /api/stats?partitions=true, err: <nil>"}, for node: 050efb0c7c5705d756e91a54e5453567, for 1 time
      2023-12-25T12:37:15.091+00:00 [WARN] feed_dcp_gocbcore: Request has timed out, canceling op -- cbgt.waitForResponse() at feed_dcp_gocbcore.go:362
      2023-12-25T12:37:15.091+00:00 [WARN] writePartitionStatsJSON: couldn't obtain source stats for pindex: `standard_bucket_5_index_1_3b6996b19d825407_4c1c5584`, err: gocbcore_utils: CBPartitionSeqs, GetCollectionManifest failed, err: operation has timed out -- rest.writePartitionStatsJSON() at rest_manage.go:274


      Job Logs:

      [2023-12-25 04:35:47,256] - [rest_client:1619] INFO - adding remote node @172.23.121.249:18091 to this cluster @172.23.121.251:8091
      [2023-12-25 04:35:57,277] - [rest_client:1957] INFO - rebalance progress took 10.02 seconds 
      [2023-12-25 04:35:57,277] - [rest_client:1958] INFO - sleep for 10 seconds after rebalance...
      [2023-12-25 04:36:11,237] - [rest_client:2627] INFO - Node 172.23.121.249 not part of cluster inactiveAdded
      [2023-12-25 04:36:11,246] - [rest_client:1851] INFO - rebalance params : {'knownNodes': 'ns_1@172.23.121.249,ns_1@172.23.121.251', 'ejectedNodes': '', 'user': 'Administrator', 'password': 'password'}
      [2023-12-25 04:36:11,296] - [rest_client:1856] INFO - rebalance operation started
      [2023-12-25 04:36:21,302] - [rest_client:2018] INFO - rebalance percentage : 25.00 %
      [2023-12-25 04:36:21,302] - [task:845] INFO - Rebalance - status: running, progress: 25.00%
      [2023-12-25 04:36:41,334] - [rest_client:2018] INFO - rebalance percentage : 25.00 %
      [2023-12-25 04:36:41,334] - [task:845] INFO - Rebalance - status: running, progress: 25.00%
      [2023-12-25 04:37:01,357] - [rest_client:2018] INFO - rebalance percentage : 25.00 %
      [2023-12-25 04:37:01,357] - [task:845] INFO - Rebalance - status: running, progress: 25.00%
      [2023-12-25 04:37:21,389] - [rest_client:2018] INFO - rebalance percentage : 25.00 %
      [2023-12-25 04:37:21,389] - [task:845] INFO - Rebalance - status: running, progress: 25.00%
      [2023-12-25 04:37:41,416] - [rest_client:2018] INFO - rebalance percentage : 25.00 %
      [2023-12-25 04:37:41,416] - [task:845] INFO - Rebalance - status: running, progress: 25.00%
      [2023-12-25 04:38:01,450] - [rest_client:2018] INFO - rebalance percentage : 25.00 %
      [2023-12-25 04:38:01,450] - [task:845] INFO - Rebalance - status: running, progress: 25.00%
      [2023-12-25 04:38:21,485] - [rest_client:2018] INFO - rebalance percentage : 25.00 %
      [2023-12-25 04:38:21,485] - [task:845] INFO - Rebalance - status: running, progress: 25.00%
      [2023-12-25 04:38:41,520] - [rest_client:2018] INFO - rebalance percentage : 25.00 %
      [2023-12-25 04:38:41,520] - [task:845] INFO - Rebalance - status: running, progress: 25.00%
      [2023-12-25 04:39:01,545] - [rest_client:2018] INFO - rebalance percentage : 25.00 %
      [2023-12-25 04:39:01,546] - [task:845] INFO - Rebalance - status: running, progress: 25.00%
      [2023-12-25 04:39:21,581] - [rest_client:2001] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2023-12-25 04:39:21,591] - [rest_client:3998] INFO - Latest logs from UI on 172.23.121.251:
      [2023-12-25 04:39:21,591] - [rest_client:3999] ERROR - {'node': 'ns_1@172.23.121.251', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1703507955358, 'shortText': 'message', 'text': 'Rebalance exited with reason {service_rebalance_failed,fts,\n                              {worker_died,\n                               {\'EXIT\',<0.5918.2>,\n                                {rebalance_failed,\n                                 {service_error,\n                                  <<"nodes: sample res.StatusCode not 200, res: &http.Response{Status:\\"503 Service Unavailable\\", StatusCode:503, Proto:\\"HTTP/1.1\\", ProtoMajor:1, ProtoMinor:1, Header:http.Header{\\"Content-Length\\":[]string{\\"50\\"}, \\"Content-Type\\":[]string{\\"text/plain; charset=utf-8\\"}, \\"Date\\":[]string{\\"Mon, 25 Dec 2023 12:39:15 GMT\\"}}, Body:(*http.bodyEOFSignal)(0xc0004b0740), ContentLength:50, TransferEncoding:[]string(nil), Close:false, Uncompressed:false, Trailer:http.Header(nil), Request:(*http.Request)(0xc0148c6200), TLS:(*tls.ConnectionState)(0xc00054e210)}, urlUUID: monitor.UrlUUID{Url:\\"https://172.23.121.251:18094\\", UUID:\\"050efb0c7c5705d756e91a54e5453567\\"}, kind: /api/stats?partitions=true, err: <nil>">>}}}}}.\nRebalance Operation Id = 25aee874f682f479df7a057ad1aa6b9b', 'serverTime': '2023-12-25T12:39:15.358Z'}
      [2023-12-25 04:39:21,591] - [rest_client:3999] ERROR - {'node': 'ns_1@172.23.121.249', 'type': 'info', 'code': 0, 'module': 'memcached_config_mgr', 'tstamp': 1703507771436, 'shortText': 'message', 'text': 'Hot-reloaded memcached.json for config change of the following keys: [<<"scramsha_fallback_salt">>]', 'serverTime': '2023-12-25T04:36:11.436Z'}
      [2023-12-25 04:39:21,592] - [rest_client:3999] ERROR - {'node': 'ns_1@172.23.121.251', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1703507771250, 'shortText': 'message', 'text': "Starting rebalance, KeepNodes = ['ns_1@172.23.121.249','ns_1@172.23.121.251'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 25aee874f682f479df7a057ad1aa6b9b", 'serverTime': '2023-12-25T12:36:11.250Z'}
      [2023-12-25 04:39:21,592] - [rest_client:3999] ERROR - {'node': 'ns_1@172.23.121.249', 'type': 'info', 'code': 3, 'module': 'ns_cluster', 'tstamp': 1703507771227, 'shortText': 'message', 'text': 'Node ns_1@172.23.121.249 joined cluster', 'serverTime': '2023-12-25T04:36:11.227Z'}
      [2023-12-25 04:39:21,592] - [rest_client:3999] ERROR - {'node': 'ns_1@172.23.121.249', 'type': 'info', 'code': 1, 'module': 'menelaus_web_sup', 'tstamp': 1703507771202, 'shortText': 'web start ok', 'text': 'Couchbase Server has started on web port 8091 on node \'ns_1@172.23.121.249\'. Version: "7.2.4-7065-enterprise".', 'serverTime': '2023-12-25T04:36:11.202Z'}
      [2023-12-25 04:39:21,593] - [rest_client:3999] ERROR - {'node': 'ns_1@172.23.121.251', 'type': 'info', 'code': 4, 'module': 'ns_node_disco', 'tstamp': 1703507768007, 'shortText': 'node up', 'text': "Node 'ns_1@172.23.121.251' saw that node 'ns_1@172.23.121.249' came up. Tags: []", 'serverTime': '2023-12-25T12:36:08.007Z'}
      [2023-12-25 04:39:21,593] - [rest_client:3999] ERROR - {'node': 'ns_1@172.23.121.251', 'type': 'info', 'code': 0, 'module': 'ns_cluster', 'tstamp': 1703507767851, 'shortText': 'message', 'text': "Started node add transaction by adding node 'ns_1@172.23.121.249' to nodes_wanted (group: undefined)", 'serverTime': '2023-12-25T12:36:07.851Z'}
      [2023-12-25 04:39:21,594] - [rest_client:3999] ERROR - {'node': 'ns_1@172.23.121.251', 'type': 'info', 'code': 0, 'module': 'ns_cluster', 'tstamp': 1703507767293, 'shortText': 'message', 'text': 'Change of address to "172.23.121.251" is requested.', 'serverTime': '2023-12-25T12:36:07.293Z'}
      [2023-12-25 04:39:21,594] - [rest_client:3999] ERROR - {'node': 'ns_1@172.23.121.251', 'type': 'warning', 'code': 5, 'module': 'ns_node_disco', 'tstamp': 1703507700408, 'shortText': 'node down', 'text': "Node 'ns_1@172.23.121.251' saw that node 'ns_1@172.23.121.249' went down. Details: [{nodedown_reason,\n                                                                                     connection_closed}]", 'serverTime': '2023-12-25T12:35:00.408Z'}
      [2023-12-25 04:39:21,594] - [rest_client:3999] ERROR - {'node': 'ns_1@172.23.121.251', 'type': 'info', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1703507700341, 'shortText': 'message', 'text': 'Rebalance completed successfully.\nRebalance Operation Id = e32c0ac7db4a93222e8f4a00d5b6629a', 'serverTime': '2023-12-25T12:35:00.341Z'}
      downloading 172.23.121.251 .................................................downloading 172.23.121.249 ...............................................[<FrameSummary file /usr/local/lib/python3.7/threading.py, line 890 in _bootstrap>, <FrameSummary file /usr/local/lib/python3.7/threading.py, line 926 in _bootstrap_inner>, <FrameSummary file testrunner.py, line 553 in run>, <FrameSummary file /usr/local/lib/python3.7/unittest/runner.py, line 176 in run>, <FrameSummary file /usr/local/lib/python3.7/unittest/case.py, line 676 in __call__>, <FrameSummary file /usr/local/lib/python3.7/unittest/case.py, line 624 in run>, <FrameSummary file pytests/fts/stable_topology_fts.py, line 27 in setUp>, <FrameSummary file pytests/fts/fts_base.py, line 3771 in setUp>, <FrameSummary file pytests/fts/fts_base.py, line 4066 in __setup_for_test>, <FrameSummary file pytests/fts/fts_base.py, line 2264 in init_cluster>, <FrameSummary file lib/tasks/future.py, line 160 in result>, <FrameSummary file lib/tasks/future.py, line 111 in __get_result>]
      [<FrameSummary file /usr/local/lib/python3.7/threading.py, line 890 in _bootstrap>, <FrameSummary file /usr/local/lib/python3.7/threading.py, line 926 in _bootstrap_inner>, <FrameSummary file lib/tasks/taskmanager.py, line 34 in run>, <FrameSummary file lib/tasks/task.py, line 108 in step>, <FrameSummary file lib/tasks/task.py, line 856 in check>, <FrameSummary file lib/tasks/future.py, line 265 in set_exception>] 

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              sarthak.dua Sarthak Dua
              sarthak.dua Sarthak Dua
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty