Couchbase Server / MB-46255

Rebalance failures observed in custom-map-n1ql-rqg-scorch-match-phrase


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version: 7.0.0
    • Fix Version: Cheshire-Cat
    • Component: fts
    • Triage: Untriaged
    • 1
    • Is this a Regression?: Unknown

    Description

      Build: 7.0.0-5127
      Testsuite: centos-fts_custom-map-n1ql-rqg-scorch-match-phrase_6.5_P1

      Seeing a number of rebalance failures in this job during test setUp. Every time we run, these rebalance failures show up after around 20 tests or so (not at the same test every time).

      One of the runs: http://qa.sc.couchbase.com/job/test_suite_executor/344023/consoleFull
      First instance of rebalance failure:

      Test: ./testrunner -i /tmp/testexec.14754.ini -p get-cbcollect-info=True,disable_HTP=True,get-logs=True,stop-on-failure=False,cluster=D+F,index_type=scorch,fts_quota=1000,run_via_n1ql=True,custom_map_add_non_indexed_fields=False,GROUP=N1QL_MATCH_PHRASE,query_types=N1QL_MATCH_PHRASE -t fts.stable_topology_fts.StableTopFTS.index_query_custom_mapping,items=1000,custom_map=True,num_custom_analyzers=1,compare_es=True,cm_id=24,num_queries=100,GROUP=BUCKETS;P0;SKIP_FOR_N1QL;N1QL_MATCH_PHRASE;COLLECTIONS

      In the setUp of each test, if a cluster is already initialized, we remove all nodes from it and re-initialize it, adding 172.23.120.115 to the cluster at 172.23.120.92:8091 with the (kv, fts) services. We have seen a number of rebalance failures both while removing all nodes and while initializing the cluster. Note that all 29 preceding tests passed and use the same setUp call.
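      For orientation, the re-initialization amounts to two ns_server REST calls: an addNode followed by a rebalance. Below is a minimal Go sketch of those calls, with hostnames and credentials taken from the logs; the post helper is hypothetical, and this is a simplified sketch rather than the testrunner's actual rest_client code.

      // Minimal sketch of the two REST calls behind the cluster re-init
      // (hypothetical helper, not the actual testrunner rest_client code).
      package main

      import (
          "fmt"
          "net/http"
          "net/url"
          "strings"
      )

      const orchestrator = "http://172.23.120.92:8091" // cluster node from the logs

      // post sends an authenticated form POST to the ns_server admin REST API
      // and fails on any non-200 response.
      func post(path string, form url.Values) error {
          req, err := http.NewRequest("POST", orchestrator+path, strings.NewReader(form.Encode()))
          if err != nil {
              return err
          }
          req.SetBasicAuth("Administrator", "password")
          req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
          resp, err := http.DefaultClient.Do(req)
          if err != nil {
              return err
          }
          defer resp.Body.Close()
          if resp.StatusCode != http.StatusOK {
              return fmt.Errorf("%s returned HTTP %d", path, resp.StatusCode)
          }
          return nil
      }

      func main() {
          // Add the second node with the kv and fts services.
          if err := post("/controller/addNode", url.Values{
              "hostname": {"172.23.120.115"},
              "user":     {"Administrator"},
              "password": {"password"},
              "services": {"kv,fts"},
          }); err != nil {
              panic(err)
          }
          // Start the rebalance across both nodes (no ejections),
          // mirroring the 'rebalance params' line in the log below.
          if err := post("/controller/rebalance", url.Values{
              "knownNodes":   {"ns_1@172.23.120.115,ns_1@172.23.120.92"},
              "ejectedNodes": {""},
          }); err != nil {
              panic(err)
          }
      }

      This test failed during cluster init with the following: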

      [2021-05-10 13:06:22,514] - [fts_base:2224] INFO - 172.23.120.115 will be configured with services kv,fts
      [2021-05-10 13:06:23,515] - [task:769] INFO - adding node 172.23.120.115:8091 to cluster
      [2021-05-10 13:06:23,516] - [rest_client:1500] INFO - adding remote node @172.23.120.115:8091 to this cluster @172.23.120.92:8091
      [2021-05-10 13:06:33,536] - [rest_client:1833] INFO - rebalance progress took 10.02 seconds 
      [2021-05-10 13:06:33,537] - [rest_client:1834] INFO - sleep for 10 seconds after rebalance...
      [2021-05-10 13:06:54,757] - [rest_client:1727] INFO - rebalance params : {'knownNodes': 'ns_1@172.23.120.115,ns_1@172.23.120.92', 'ejectedNodes': '', 'user': 'Administrator', 'password': 'password'}
      [2021-05-10 13:06:57,667] - [rest_client:1732] INFO - rebalance operation started
      [2021-05-10 13:07:08,331] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
      [2021-05-10 13:07:08,331] - [task:839] INFO - Rebalance - status: running, progress: 37.00%
      [2021-05-10 13:07:28,622] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
      [2021-05-10 13:07:28,622] - [task:839] INFO - Rebalance - status: running, progress: 37.00%
      [2021-05-10 13:07:49,690] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
      [2021-05-10 13:07:49,690] - [task:839] INFO - Rebalance - status: running, progress: 37.00%
      [2021-05-10 13:08:10,195] - [rest_client:1877] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2021-05-10 13:08:10,259] - [rest_client:3810] INFO - Latest logs from UI on 172.23.120.92:
      [2021-05-10 13:08:10,259] - [rest_client:3811] ERROR - {'node': 'ns_1@172.23.120.92', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1620677271186, 'shortText': 'message', 'text': 'Rebalance exited with reason {service_rebalance_failed,fts,\n                              {agent_died,<0.3947.0>,\n                               {linked_process_died,<0.127.12>,\n                                {\'ns_1@172.23.120.92\',\n                                 {timeout,\n                                  {gen_server,call,\n                                   [<0.3990.0>,\n                                    {call,"ServiceAPI.GetTaskList",\n                                     #Fun<json_rpc_connection.0.77329884>},\n                                    60000]}}}}}}.\nRebalance Operation Id = 07abcc20f925b6db8de5493c03feeac6', 'serverTime': '2021-05-10T13:07:51.186Z'}
      

      Snippet from the logs:

      2021-05-10T13:06:54.815-07:00, memcached_config_mgr:0:info:message(ns_1@172.23.120.115) - Hot-reloaded memcached.json for config change of the following keys: [<<"scramsha_fallback_salt">>]
      2021-05-10T13:06:55.294-07:00, ns_orchestrator:0:info:message(ns_1@172.23.120.92) - Starting rebalance, KeepNodes = ['ns_1@172.23.120.115','ns_1@172.23.120.92'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 07abcc20f925b6db8de5493c03feeac6
      2021-05-10T13:07:51.186-07:00, ns_orchestrator:0:critical:message(ns_1@172.23.120.92) - Rebalance exited with reason {service_rebalance_failed,fts,
                                    {agent_died,<0.3947.0>,
                                     {linked_process_died,<0.127.12>,
                                      {'ns_1@172.23.120.92',
                                       {timeout,
                                        {gen_server,call,
                                         [<0.3990.0>,
                                          {call,"ServiceAPI.GetTaskList",
                                           #Fun<json_rpc_connection.0.77329884>},
                                          60000]}}}}}}.
      Rebalance Operation Id = 07abcc20f925b6db8de5493c03feeac6
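      The ServiceAPI.* names in this error are the json-rpc methods that ns_server's per-service agent invokes on the fts service manager during rebalance, each wrapped in a gen_server:call with a 60000 ms timeout; if the service blocks past that, the agent dies and the rebalance aborts with service_rebalance_failed, as seen above. For orientation, this is roughly the shape of that interface, paraphrased from github.com/couchbase/cbauth/service; exact names and signatures may differ, so treat it as a sketch only.

      // Rough shape of the interface a Couchbase service (fts, index, ...)
      // exposes to ns_server, paraphrased from github.com/couchbase/cbauth/service.
      package service

      type Revision []byte        // opaque version of the task list / topology
      type Cancel <-chan struct{} // closed by ns_server to abandon a long-poll

      type Task struct {
          ID       string
          Type     string
          Status   string
          Progress float64
      }

      type TaskList struct {
          Rev   Revision
          Tasks []Task
      }

      type Topology struct {
          Rev   Revision
          Nodes []string
      }

      type TopologyChange struct {
          ID string
          // keep/eject node details elided
      }

      // ns_server invokes these over json-rpc as ServiceAPI.GetTaskList,
      // ServiceAPI.GetCurrentTopology, ServiceAPI.StartTopologyChange, etc.,
      // each bounded by the 60 s gen_server timeout seen in the error above.
      type Manager interface {
          GetTaskList(rev Revision, cancel Cancel) (*TaskList, error)
          GetCurrentTopology(rev Revision, cancel Cancel) (*Topology, error)
          PrepareTopologyChange(change TopologyChange) error
          StartTopologyChange(change TopologyChange) error
      }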
      

      Logs:

      test_30.zip

      After this test, 3 tests passed, and then test_34 failed during setUp with the following:

      [2021-05-10 13:38:04,694] - [rest_client:1500] INFO - adding remote node @172.23.120.115:8091 to this cluster @172.23.120.92:8091
      [2021-05-10 13:38:14,719] - [rest_client:1833] INFO - rebalance progress took 10.02 seconds 
      [2021-05-10 13:38:14,719] - [rest_client:1834] INFO - sleep for 10 seconds after rebalance...
      [2021-05-10 13:38:45,632] - [rest_client:1727] INFO - rebalance params : {'knownNodes': 'ns_1@172.23.120.115,ns_1@172.23.120.92', 'ejectedNodes': '', 'user': 'Administrator', 'password': 'password'}
      [2021-05-10 13:38:46,557] - [rest_client:1732] INFO - rebalance operation started
      [2021-05-10 13:38:56,700] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
      [2021-05-10 13:38:56,700] - [task:839] INFO - Rebalance - status: running, progress: 37.00%
      [2021-05-10 13:39:17,221] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
      [2021-05-10 13:39:17,222] - [task:839] INFO - Rebalance - status: running, progress: 37.00%
      [2021-05-10 13:39:37,301] - [rest_client:1877] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2021-05-10 13:39:37,386] - [rest_client:3810] INFO - Latest logs from UI on 172.23.120.92:
      [2021-05-10 13:39:37,386] - [rest_client:3811] ERROR - {'node': 'ns_1@172.23.120.92', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1620679162803, 'shortText': 'message', 'text': 'Rebalance exited with reason {service_rebalance_failed,fts,\n                              {agent_died,<0.8315.14>,\n                               {linked_process_died,<0.28839.14>,\n                                {\'ns_1@172.23.120.92\',\n                                 {timeout,\n                                  {gen_server,call,\n                                   [<0.8551.14>,\n                                    {call,"ServiceAPI.GetCurrentTopology",\n                                     #Fun<json_rpc_connection.0.77329884>},\n                                    60000]}}}}}}.\nRebalance Operation Id = f95eb48ec82f65df688154a0dbfc522e', 'serverTime': '2021-05-10T13:39:22.803Z'}
      

      After this, test_39's init-cluster rebalance failed with a timeout calling ServiceAPI.GetCurrentTopology.
      After this, test_43's init-cluster rebalance failed with a timeout calling ServiceAPI.GetTaskList.
      After this, test_45's init-cluster rebalance failed with a timeout calling ServiceAPI.GetCurrentTopology.
      After this, test_53/test_54's remove-all-nodes rebalance failed with a timeout calling ServiceAPI.GetTaskList.
      After this, test_55's init-cluster rebalance failed with a timeout calling ServiceAPI.StartTopologyChange.
      After this, test_56's remove-all-nodes rebalance failed with a timeout calling ServiceAPI.GetTaskList.
      After this, test_57's init-cluster rebalance failed because of an fts crash; its log follows the progress-polling sketch below.
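      The rest_client lines in all of these logs come from polling the rebalance progress endpoint until the rebalance completes or fails. A hypothetical, simplified Go sketch of that loop (not the testrunner's actual code):

      // Sketch of the progress-polling loop behind the rest_client log lines
      // ('rebalance percentage : 37.00 %' ... 'rebalance failed').
      package main

      import (
          "encoding/json"
          "fmt"
          "net/http"
          "time"
      )

      // Only the two fields the failure path cares about; a running rebalance
      // also reports per-node progress, which is ignored here.
      type rebalanceProgress struct {
          Status       string `json:"status"`       // "running" while in flight, "none" when idle
          ErrorMessage string `json:"errorMessage"` // set once the rebalance has failed
      }

      func waitForRebalance(adminURL, user, pass string) error {
          for {
              req, err := http.NewRequest("GET", adminURL+"/pools/default/rebalanceProgress", nil)
              if err != nil {
                  return err
              }
              req.SetBasicAuth(user, pass)
              resp, err := http.DefaultClient.Do(req)
              if err != nil {
                  return err
              }
              var p rebalanceProgress
              err = json.NewDecoder(resp.Body).Decode(&p)
              resp.Body.Close()
              if err != nil {
                  return err
              }
              if p.ErrorMessage != "" {
                  // Matches the {'status': 'none', 'errorMessage': ...} errors above.
                  return fmt.Errorf("rebalance failed: %s", p.ErrorMessage)
              }
              if p.Status == "none" {
                  return nil // no rebalance running and no error: it completed
              }
              time.Sleep(10 * time.Second)
          }
      }

      func main() {
          if err := waitForRebalance("http://172.23.120.92:8091", "Administrator", "password"); err != nil {
              fmt.Println(err)
          }
      }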

      [2021-05-10 17:06:08,477] - [cluster_helper:262] INFO - rebalancing all nodes in order to remove nodes
      [2021-05-10 17:06:08,517] - [rest_client:1727] INFO - rebalance params : {'knownNodes': 'ns_1@172.23.120.115,ns_1@172.23.120.92', 'ejectedNodes': 'ns_1@172.23.120.115', 'user': 'Administrator', 'password': 'password'}
      [2021-05-10 17:06:08,928] - [rest_client:1732] INFO - rebalance operation started
      [2021-05-10 17:06:08,933] - [rest_client:1894] INFO - rebalance percentage : 0.00 %
      [2021-05-10 17:06:18,959] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
      [2021-05-10 17:06:29,027] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
      [2021-05-10 17:06:39,173] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
      [2021-05-10 17:06:49,243] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
      [2021-05-10 17:06:59,258] - [rest_client:1877] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
      [2021-05-10 17:06:59,296] - [rest_client:3810] INFO - Latest logs from UI on 172.23.120.92:
      [2021-05-10 17:06:59,296] - [rest_client:3811] ERROR - {'node': 'ns_1@172.23.120.92', 'type': 'info', 'code': 0, 'module': 'ns_log', 'tstamp': 1620691616882, 'shortText': 'message', 'text': "Service 'fts' exited with status 137. Restarting. Messages:\n\t/tmp/workspace/couchbase-server-unix/server_build/gopkg/go-1.13.7/pkg/mod/github.com/couchbase/gocbcore/v9@v9.1.4/memdpipeline.go:77 +0xa0\n\ngoroutine 2126920 [runnable]:\nbufio.NewReaderSize(...)\n\t/home/couchbase/.cbdepscache/exploded/x86_64/go-1.13.7/go/src/bufio/bufio.go:56\ngithub.com/couchbase/gocbcore/v9/memd.NewConn(...)\n\t/tmp/workspace/couchbase-server-unix/server_build/gopkg/go-1.13.7/pkg/mod/github.com/couchbase/gocbcore/v9@v9.1.4/memd/conn.go:55\ngithub.com/couchbase/gocbcore/v9.dialMemdConn(0x1355420, 0xc07d8e1fc0, 0xc006e29b80, 0x14, 0x0, 0xc01e908863a4e190, 0x133b945791ef, 0x1d0d9a0, 0x2, 0x2, ...)\n\t/tmp/workspace/couchbase-server-unix/server_build/gopkg/go-1.13.7/pkg/mod/github.com/couchbase/gocbcore/v9@v9.1.4/memdconn.go:91 +0x59e\ngithub.com/couchbase/gocbcore/v9.(*memdClientDialerComponent).dialMemdClient(0xc001050a20, 0xc00840d980, 0xc006e29b80, 0x14, 0xc01e908863a4e190, 0x133b945791ef, 0x1d0d9a0, 0xc00802cb20, 0xc0130e7df0, 0x441ddc, ...)\n\t/tmp/workspace/couchbase-server-unix/server_build/gopkg/go-1.13.7/pkg/mod/github.com/couchbase/gocbcore/v9@v9.1.4/memdclientdialer_component.go:145 +0x12a\n", 'serverTime': '2021-05-10T17:06:56.882Z'}
      

      Logs:

      test_57.zip

      Let me know if you need a separate issue filed for the fts crash in the last failure above (exit status 137 typically means the process was killed with SIGKILL, e.g. by the kernel OOM killer).
      After the crash, the remaining 25 tests all passed.

      Attachments

        1. consoleText
          16.66 MB
        2. test_30.zip
          39.45 MB
        3. test_57.zip
          62.39 MB


      People

        Assignee: Girish Benakappa
        Reporter: Girish Benakappa
        Votes: 0
        Watchers: 5


      Gerrit Reviews

        There are no open Gerrit changes
