Loading...

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 7.0.0
Affects Version/s: Cheshire-Cat
Component/s: fts
Labels:
- functional-test

Triage:
Untriaged
Story Points:
1
Is this a Regression?:
Unknown

Description

Build: 7.0.0-5127
Testsuite: centos-fts_custom-map-n1ql-rqg-scorch-match-phrase_6.5_P1

Seeing number of Rebalance failures in this job, during test setUp. Every time we run, these rebalance failures are seen after around 20 tests or so (not at the same test everytime).

One of the run: http://qa.sc.couchbase.com/job/test_suite_executor/344023/consoleFull
First instance of rebalance failure:

Test: ./testrunner -i /tmp/testexec.14754.ini -p get-cbcollect-info=True,disable_HTP=True,get-logs=True,stop-on-failure=False,cluster=D+F,index_type=scorch,fts_quota=1000,run_via_n1ql=True,custom_map_add_non_indexed_fields=False,GROUP=N1QL_MATCH_PHRASE,query_types=N1QL_MATCH_PHRASE -t fts.stable_topology_fts.StableTopFTS.index_query_custom_mapping,items=1000,custom_map=True,num_custom_analyzers=1,compare_es=True,cm_id=24,num_queries=100,GROUP=BUCKETS;P0;SKIP_FOR_N1QL;N1QL_MATCH_PHRASE;COLLECTIONS

In the setUp of the test, if there is already cluster initialized, we remove all the nodes from the cluster and we re-initialize the cluster where we try to add 172.23.120.115 to the cluster 172.23.120.92:8091 with (kv, fts) services. We have seen number of rebalance failures during removing all nodes or during init cluster. Note that all 29 tests before are passed and uses same setUp call. This test failed while init cluster with below.

[2021-05-10 13:06:22,514] - [fts_base:2224] INFO - 172.23.120.115 will be configured with services kv,fts

[2021-05-10 13:06:23,515] - [task:769] INFO - adding node 172.23.120.115:8091 to cluster

[2021-05-10 13:06:23,516] - [rest_client:1500] INFO - adding remote node @172.23.120.115:8091 to this cluster @172.23.120.92:8091

[2021-05-10 13:06:33,536] - [rest_client:1833] INFO - rebalance progress took 10.02 seconds

[2021-05-10 13:06:33,537] - [rest_client:1834] INFO - sleep for 10 seconds after rebalance...

[2021-05-10 13:06:54,757] - [rest_client:1727] INFO - rebalance params : {'knownNodes': 'ns_1@172.23.120.115,ns_1@172.23.120.92', 'ejectedNodes': '', 'user': 'Administrator', 'password': 'password'}

[2021-05-10 13:06:57,667] - [rest_client:1732] INFO - rebalance operation started

[2021-05-10 13:07:08,331] - [rest_client:1894] INFO - rebalance percentage : 37.00 %

[2021-05-10 13:07:08,331] - [task:839] INFO - Rebalance - status: running, progress: 37.00%

[2021-05-10 13:07:28,622] - [rest_client:1894] INFO - rebalance percentage : 37.00 %

[2021-05-10 13:07:28,622] - [task:839] INFO - Rebalance - status: running, progress: 37.00%

[2021-05-10 13:07:49,690] - [rest_client:1894] INFO - rebalance percentage : 37.00 %

[2021-05-10 13:07:49,690] - [task:839] INFO - Rebalance - status: running, progress: 37.00%

[2021-05-10 13:08:10,195] - [rest_client:1877] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed

[2021-05-10 13:08:10,259] - [rest_client:3810] INFO - Latest logs from UI on 172.23.120.92:

[2021-05-10 13:08:10,259] - [rest_client:3811] ERROR - {'node': 'ns_1@172.23.120.92', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1620677271186, 'shortText': 'message', 'text': 'Rebalance exited with reason {service_rebalance_failed,fts,\n                              {agent_died,<0.3947.0>,\n                               {linked_process_died,<0.127.12>,\n                                {\'ns_1@172.23.120.92\',\n                                 {timeout,\n                                  {gen_server,call,\n                                   [<0.3990.0>,\n                                    {call,"ServiceAPI.GetTaskList",\n                                     #Fun<json_rpc_connection.0.77329884>},\n                                    60000]}}}}}}.\nRebalance Operation Id = 07abcc20f925b6db8de5493c03feeac6', 'serverTime': '2021-05-10T13:07:51.186Z'}

snippet from logs:

2021-05-10T13:06:54.815-07:00, memcached_config_mgr:0:info:message(ns_1@172.23.120.115) - Hot-reloaded memcached.json for config change of the following keys: [<<"scramsha_fallback_salt">>]

2021-05-10T13:06:55.294-07:00, ns_orchestrator:0:info:message(ns_1@172.23.120.92) - Starting rebalance, KeepNodes = ['ns_1@172.23.120.115','ns_1@172.23.120.92'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 07abcc20f925b6db8de5493c03feeac6

2021-05-10T13:07:51.186-07:00, ns_orchestrator:0:critical:message(ns_1@172.23.120.92) - Rebalance exited with reason {service_rebalance_failed,fts,

                              {agent_died,<0.3947.0>,

                               {linked_process_died,<0.127.12>,

                                {'ns_1@172.23.120.92',

                                 {timeout,

                                  {gen_server,call,

                                   [<0.3990.0>,

                                    {call,"ServiceAPI.GetTaskList",

                                     #Fun<json_rpc_connection.0.77329884>},

                                    60000]}}}}}}.

Rebalance Operation Id = 07abcc20f925b6db8de5493c03feeac6

Logs:

test_30.zip

After this test, 3 tests passed and test_34 failed with below during setup:

[2021-05-10 13:38:04,694] - [rest_client:1500] INFO - adding remote node @172.23.120.115:8091 to this cluster @172.23.120.92:8091

[2021-05-10 13:38:14,719] - [rest_client:1833] INFO - rebalance progress took 10.02 seconds

[2021-05-10 13:38:14,719] - [rest_client:1834] INFO - sleep for 10 seconds after rebalance...

[2021-05-10 13:38:45,632] - [rest_client:1727] INFO - rebalance params : {'knownNodes': 'ns_1@172.23.120.115,ns_1@172.23.120.92', 'ejectedNodes': '', 'user': 'Administrator', 'password': 'password'}

[2021-05-10 13:38:46,557] - [rest_client:1732] INFO - rebalance operation started

[2021-05-10 13:38:56,700] - [rest_client:1894] INFO - rebalance percentage : 37.00 %

[2021-05-10 13:38:56,700] - [task:839] INFO - Rebalance - status: running, progress: 37.00%

[2021-05-10 13:39:17,221] - [rest_client:1894] INFO - rebalance percentage : 37.00 %

[2021-05-10 13:39:17,222] - [task:839] INFO - Rebalance - status: running, progress: 37.00%

[2021-05-10 13:39:37,301] - [rest_client:1877] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed

[2021-05-10 13:39:37,386] - [rest_client:3810] INFO - Latest logs from UI on 172.23.120.92:

[2021-05-10 13:39:37,386] - [rest_client:3811] ERROR - {'node': 'ns_1@172.23.120.92', 'type': 'critical', 'code': 0, 'module': 'ns_orchestrator', 'tstamp': 1620679162803, 'shortText': 'message', 'text': 'Rebalance exited with reason {service_rebalance_failed,fts,\n                              {agent_died,<0.8315.14>,\n                               {linked_process_died,<0.28839.14>,\n                                {\'ns_1@172.23.120.92\',\n                                 {timeout,\n                                  {gen_server,call,\n                                   [<0.8551.14>,\n                                    {call,"ServiceAPI.GetCurrentTopology",\n                                     #Fun<json_rpc_connection.0.77329884>},\n                                    60000]}}}}}}.\nRebalance Operation Id = f95eb48ec82f65df688154a0dbfc522e', 'serverTime': '2021-05-10T13:39:22.803Z'}

After this test_39 init cluster rebalance failed because of timeout accessing ServiceAPI.GetCurrentTopology
After this test_43 init cluster rebalance failed because of timeout accessing ServiceAPI.GetTaskList
After this test_45 init cluster rebalance failed because of timeout accessing ServiceAPI.GetCurrentTopology
After this test_53/test_54 remove all nodes rebalance failed because of timeout accessing ServiceAPI.GetTaskList
After this test_55 init cluster rebalance failed because of timeout accessing ServiceAPI.StartTopologyChange
After this test_56 remove all nodes rebalance failed because of timeout accessing ServiceAPI.GetTaskList
After this test_57 init cluster rebalance failed because fts crash

[2021-05-10 17:06:08,477] - [cluster_helper:262] INFO - rebalancing all nodes in order to remove nodes

[2021-05-10 17:06:08,517] - [rest_client:1727] INFO - rebalance params : {'knownNodes': 'ns_1@172.23.120.115,ns_1@172.23.120.92', 'ejectedNodes': 'ns_1@172.23.120.115', 'user': 'Administrator', 'password': 'password'}

[2021-05-10 17:06:08,928] - [rest_client:1732] INFO - rebalance operation started

[2021-05-10 17:06:08,933] - [rest_client:1894] INFO - rebalance percentage : 0.00 %

[2021-05-10 17:06:18,959] - [rest_client:1894] INFO - rebalance percentage : 37.00 %

[2021-05-10 17:06:29,027] - [rest_client:1894] INFO - rebalance percentage : 37.00 %

[2021-05-10 17:06:39,173] - [rest_client:1894] INFO - rebalance percentage : 37.00 %

[2021-05-10 17:06:49,243] - [rest_client:1894] INFO - rebalance percentage : 37.00 %

[2021-05-10 17:06:59,258] - [rest_client:1877] ERROR - {'status': 'none', 'errorMessage': 'Rebalance failed. See logs for detailed reason. You can try again.'} - rebalance failed

[2021-05-10 17:06:59,296] - [rest_client:3810] INFO - Latest logs from UI on 172.23.120.92:

[2021-05-10 17:06:59,296] - [rest_client:3811] ERROR - {'node': 'ns_1@172.23.120.92', 'type': 'info', 'code': 0, 'module': 'ns_log', 'tstamp': 1620691616882, 'shortText': 'message', 'text': "Service 'fts' exited with status 137. Restarting. Messages:\n\t/tmp/workspace/couchbase-server-unix/server_build/gopkg/go-1.13.7/pkg/mod/github.com/couchbase/gocbcore/v9@v9.1.4/memdpipeline.go:77 +0xa0\n\ngoroutine 2126920 [runnable]:\nbufio.NewReaderSize(...)\n\t/home/couchbase/.cbdepscache/exploded/x86_64/go-1.13.7/go/src/bufio/bufio.go:56\ngithub.com/couchbase/gocbcore/v9/memd.NewConn(...)\n\t/tmp/workspace/couchbase-server-unix/server_build/gopkg/go-1.13.7/pkg/mod/github.com/couchbase/gocbcore/v9@v9.1.4/memd/conn.go:55\ngithub.com/couchbase/gocbcore/v9.dialMemdConn(0x1355420, 0xc07d8e1fc0, 0xc006e29b80, 0x14, 0x0, 0xc01e908863a4e190, 0x133b945791ef, 0x1d0d9a0, 0x2, 0x2, ...)\n\t/tmp/workspace/couchbase-server-unix/server_build/gopkg/go-1.13.7/pkg/mod/github.com/couchbase/gocbcore/v9@v9.1.4/memdconn.go:91 +0x59e\ngithub.com/couchbase/gocbcore/v9.(*memdClientDialerComponent).dialMemdClient(0xc001050a20, 0xc00840d980, 0xc006e29b80, 0x14, 0xc01e908863a4e190, 0x133b945791ef, 0x1d0d9a0, 0xc00802cb20, 0xc0130e7df0, 0x441ddc, ...)\n\t/tmp/workspace/couchbase-server-unix/server_build/gopkg/go-1.13.7/pkg/mod/github.com/couchbase/gocbcore/v9@v9.1.4/memdclientdialer_component.go:145 +0x12a\n", 'serverTime': '2021-05-10T17:06:56.882Z'}

Logs:

test_57.zip

Let me know if you need different issue for fts crash in the last failure above.
After the crash, all 25 tests passed.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

consoleText
16.66 MB
11/May/21 7:21 PM
test_30.zip
39.45 MB
11/May/21 6:55 PM
test_57.zip
62.39 MB
11/May/21 7:29 PM

Issue Links

is duplicated by

MB-46228 FTS is failing to shut down "stats" agents on nodes other than where an "index deletion" is received

Closed

Gerrit Reviews

- Issue Only
- Show All Reviews
- Show Open Reviews
- Show All Issues
- Show Open Issues

For Gerrit Dashboard: MB-46255
#	Subject	Branch	Project	Status	CR	V
153498,5	MB-46255: Close stats' KV connections on all nodes when index is deleted	master	cbgt	Status: MERGED	+2	+1
153499,3	MB-46255: Clear out source specs for indexes that are deleted	master	cbft	Status: MERGED	+2	+1

Rebalance failures observed in custom-map-n1ql-rqg-scorch-match-phrase

Details

Description

Attachments

Attachments

Issue Links

Gerrit Reviews

Activity

People

Dates

Gerrit Reviews

PagerDuty