Details
-
Bug
-
Resolution: Fixed
-
Critical
-
Cheshire-Cat
-
Untriaged
-
1
-
Unknown
Description
Build: 7.0.0-5127
Testsuite: centos-fts_custom-map-n1ql-rqg-scorch-match-phrase_6.5_P1
Seeing number of Rebalance failures in this job, during test setUp. Every time we run, these rebalance failures are seen after around 20 tests or so (not at the same test everytime).
One of the run: http://qa.sc.couchbase.com/job/test_suite_executor/344023/consoleFull
First instance of rebalance failure:
Test: ./testrunner -i /tmp/testexec.14754.ini -p get-cbcollect-info=True,disable_HTP=True,get-logs=True,stop-on-failure=False,cluster=D+F,index_type=scorch,fts_quota=1000,run_via_n1ql=True,custom_map_add_non_indexed_fields=False,GROUP=N1QL_MATCH_PHRASE,query_types=N1QL_MATCH_PHRASE -t fts.stable_topology_fts.StableTopFTS.index_query_custom_mapping,items=1000,custom_map=True,num_custom_analyzers=1,compare_es=True,cm_id=24,num_queries=100,GROUP=BUCKETS;P0;SKIP_FOR_N1QL;N1QL_MATCH_PHRASE;COLLECTIONS
In the setUp of the test, if there is already cluster initialized, we remove all the nodes from the cluster and we re-initialize the cluster where we try to add 172.23.120.115 to the cluster 172.23.120.92:8091 with (kv, fts) services. We have seen number of rebalance failures during removing all nodes or during init cluster. Note that all 29 tests before are passed and uses same setUp call. This test failed while init cluster with below.
[2021-05-10 13:06:22,514] - [fts_base:2224] INFO - 172.23.120.115 will be configured with services kv,fts
|
[2021-05-10 13:06:23,515] - [task:769] INFO - adding node 172.23.120.115:8091 to cluster
|
[2021-05-10 13:06:23,516] - [rest_client:1500] INFO - adding remote node @172.23.120.115:8091 to this cluster @172.23.120.92:8091
|
[2021-05-10 13:06:33,536] - [rest_client:1833] INFO - rebalance progress took 10.02 seconds
|
[2021-05-10 13:06:33,537] - [rest_client:1834] INFO - sleep for 10 seconds after rebalance...
|
[2021-05-10 13:06:54,757] - [rest_client:1727] INFO - rebalance params : {'knownNodes': 'ns_1@172.23.120.115,ns_1@172.23.120.92', 'ejectedNodes': '', 'user': 'Administrator', 'password': 'password'}
|
[2021-05-10 13:06:57,667] - [rest_client:1732] INFO - rebalance operation started
|
[2021-05-10 13:07:08,331] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
|
[2021-05-10 13:07:08,331] - [task:839] INFO - Rebalance - status: running, progress: 37.00%
|
[2021-05-10 13:07:28,622] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
|
[2021-05-10 13:07:28,622] - [task:839] INFO - Rebalance - status: running, progress: 37.00%
|
[2021-05-10 13:07:49,690] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
|
[2021-05-10 13:07:49,690] - [task:839] INFO - Rebalance - status: running, progress: 37.00%
|
[2021-05-10 13:08:10,195] - [rest_client:1877] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
|
[2021-05-10 13:08:10,259] - [rest_client:3810] INFO - Latest logs from UI on 172.23.120.92:
|
[2021-05-10 13:08:10,259] - [rest_client:3811] ERROR - {'node': 'ns_1@172.23.120.92', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1620677271186, 'shortText': 'message', 'text': 'Rebalance exited with reason {service_rebalance_failed,fts,\n {agent_died,<0.3947.0>,\n {linked_process_died,<0.127.12>,\n {\'ns_1@172.23.120.92\',\n {timeout,\n {gen_server,call,\n [<0.3990.0>,\n {call,"ServiceAPI.GetTaskList",\n #Fun<json_rpc_connection.0.77329884>},\n 60000]}}}}}}.\nRebalance Operation Id = 07abcc20f925b6db8de5493c03feeac6', 'serverTime': '2021-05-10T13:07:51.186Z'}
|
snippet from logs:
2021-05-10T13:06:54.815-07:00, memcached_config_mgr:0:info:message(ns_1@172.23.120.115) - Hot-reloaded memcached.json for config change of the following keys: [<<"scramsha_fallback_salt">>]
|
2021-05-10T13:06:55.294-07:00, ns_orchestrator:0:info:message(ns_1@172.23.120.92) - Starting rebalance, KeepNodes = ['ns_1@172.23.120.115','ns_1@172.23.120.92'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 07abcc20f925b6db8de5493c03feeac6
|
2021-05-10T13:07:51.186-07:00, ns_orchestrator:0:critical:message(ns_1@172.23.120.92) - Rebalance exited with reason {service_rebalance_failed,fts,
|
{agent_died,<0.3947.0>,
|
{linked_process_died,<0.127.12>,
|
{'ns_1@172.23.120.92',
|
{timeout,
|
{gen_server,call,
|
[<0.3990.0>,
|
{call,"ServiceAPI.GetTaskList",
|
#Fun<json_rpc_connection.0.77329884>},
|
60000]}}}}}}.
|
Rebalance Operation Id = 07abcc20f925b6db8de5493c03feeac6
|
Logs:
After this test, 3 tests passed and test_34 failed with below during setup:
[2021-05-10 13:38:04,694] - [rest_client:1500] INFO - adding remote node @172.23.120.115:8091 to this cluster @172.23.120.92:8091
|
[2021-05-10 13:38:14,719] - [rest_client:1833] INFO - rebalance progress took 10.02 seconds
|
[2021-05-10 13:38:14,719] - [rest_client:1834] INFO - sleep for 10 seconds after rebalance...
|
[2021-05-10 13:38:45,632] - [rest_client:1727] INFO - rebalance params : {'knownNodes': 'ns_1@172.23.120.115,ns_1@172.23.120.92', 'ejectedNodes': '', 'user': 'Administrator', 'password': 'password'}
|
[2021-05-10 13:38:46,557] - [rest_client:1732] INFO - rebalance operation started
|
[2021-05-10 13:38:56,700] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
|
[2021-05-10 13:38:56,700] - [task:839] INFO - Rebalance - status: running, progress: 37.00%
|
[2021-05-10 13:39:17,221] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
|
[2021-05-10 13:39:17,222] - [task:839] INFO - Rebalance - status: running, progress: 37.00%
|
[2021-05-10 13:39:37,301] - [rest_client:1877] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
|
[2021-05-10 13:39:37,386] - [rest_client:3810] INFO - Latest logs from UI on 172.23.120.92:
|
[2021-05-10 13:39:37,386] - [rest_client:3811] ERROR - {'node': 'ns_1@172.23.120.92', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1620679162803, 'shortText': 'message', 'text': 'Rebalance exited with reason {service_rebalance_failed,fts,\n {agent_died,<0.8315.14>,\n {linked_process_died,<0.28839.14>,\n {\'ns_1@172.23.120.92\',\n {timeout,\n {gen_server,call,\n [<0.8551.14>,\n {call,"ServiceAPI.GetCurrentTopology",\n #Fun<json_rpc_connection.0.77329884>},\n 60000]}}}}}}.\nRebalance Operation Id = f95eb48ec82f65df688154a0dbfc522e', 'serverTime': '2021-05-10T13:39:22.803Z'}
|
After this test_39 init cluster rebalance failed because of timeout accessing ServiceAPI.GetCurrentTopology
After this test_43 init cluster rebalance failed because of timeout accessing ServiceAPI.GetTaskList
After this test_45 init cluster rebalance failed because of timeout accessing ServiceAPI.GetCurrentTopology
After this test_53/test_54 remove all nodes rebalance failed because of timeout accessing ServiceAPI.GetTaskList
After this test_55 init cluster rebalance failed because of timeout accessing ServiceAPI.StartTopologyChange
After this test_56 remove all nodes rebalance failed because of timeout accessing ServiceAPI.GetTaskList
After this test_57 init cluster rebalance failed because fts crash
[2021-05-10 17:06:08,477] - [cluster_helper:262] INFO - rebalancing all nodes in order to remove nodes
|
[2021-05-10 17:06:08,517] - [rest_client:1727] INFO - rebalance params : {'knownNodes': 'ns_1@172.23.120.115,ns_1@172.23.120.92', 'ejectedNodes': 'ns_1@172.23.120.115', 'user': 'Administrator', 'password': 'password'}
|
[2021-05-10 17:06:08,928] - [rest_client:1732] INFO - rebalance operation started
|
[2021-05-10 17:06:08,933] - [rest_client:1894] INFO - rebalance percentage : 0.00 %
|
[2021-05-10 17:06:18,959] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
|
[2021-05-10 17:06:29,027] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
|
[2021-05-10 17:06:39,173] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
|
[2021-05-10 17:06:49,243] - [rest_client:1894] INFO - rebalance percentage : 37.00 %
|
[2021-05-10 17:06:59,258] - [rest_client:1877] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed
|
[2021-05-10 17:06:59,296] - [rest_client:3810] INFO - Latest logs from UI on 172.23.120.92:
|
[2021-05-10 17:06:59,296] - [rest_client:3811] ERROR - {'node': 'ns_1@172.23.120.92', 'type': 'info', 'code': 0, 'module': 'ns_log', 'tstamp': 1620691616882, 'shortText': 'message', 'text': "Service 'fts' exited with status 137. Restarting. Messages:\n\t/tmp/workspace/couchbase-server-unix/server_build/gopkg/go-1.13.7/pkg/mod/github.com/couchbase/gocbcore/v9@v9.1.4/memdpipeline.go:77 +0xa0\n\ngoroutine 2126920 [runnable]:\nbufio.NewReaderSize(...)\n\t/home/couchbase/.cbdepscache/exploded/x86_64/go-1.13.7/go/src/bufio/bufio.go:56\ngithub.com/couchbase/gocbcore/v9/memd.NewConn(...)\n\t/tmp/workspace/couchbase-server-unix/server_build/gopkg/go-1.13.7/pkg/mod/github.com/couchbase/gocbcore/v9@v9.1.4/memd/conn.go:55\ngithub.com/couchbase/gocbcore/v9.dialMemdConn(0x1355420, 0xc07d8e1fc0, 0xc006e29b80, 0x14, 0x0, 0xc01e908863a4e190, 0x133b945791ef, 0x1d0d9a0, 0x2, 0x2, ...)\n\t/tmp/workspace/couchbase-server-unix/server_build/gopkg/go-1.13.7/pkg/mod/github.com/couchbase/gocbcore/v9@v9.1.4/memdconn.go:91 +0x59e\ngithub.com/couchbase/gocbcore/v9.(*memdClientDialerComponent).dialMemdClient(0xc001050a20, 0xc00840d980, 0xc006e29b80, 0x14, 0xc01e908863a4e190, 0x133b945791ef, 0x1d0d9a0, 0xc00802cb20, 0xc0130e7df0, 0x441ddc, ...)\n\t/tmp/workspace/couchbase-server-unix/server_build/gopkg/go-1.13.7/pkg/mod/github.com/couchbase/gocbcore/v9@v9.1.4/memdclientdialer_component.go:145 +0x12a\n", 'serverTime': '2021-05-10T17:06:56.882Z'}
|
Logs:
Let me know if you need different issue for fts crash in the last failure above.
After the crash, all 25 tests passed.
Attachments
Issue Links
- is duplicated by
-
MB-46228 FTS is failing to shut down "stats" agents on nodes other than where an "index deletion" is received
- Closed
For Gerrit Dashboard: MB-46255 | ||||||
---|---|---|---|---|---|---|
# | Subject | Branch | Project | Status | CR | V |
153498,5 | MB-46255: Close stats' KV connections on all nodes when index is deleted | master | cbgt | Status: MERGED | +2 | +1 |
153499,3 | MB-46255: Clear out source specs for indexes that are deleted | master | cbft | Status: MERGED | +2 | +1 |