Details
Bug
Resolution: Fixed
Critical
7.0.2
Untriaged
Centos 64-bit
1
No
CX Sprint 263
Description
Steps to Repro
1. Run the following longevity script on 6.6.3 for 5 days.
./sequoia -client 172.23.104.254:2375 -provider file:centos_second_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.3-9808 -skip_setup=true -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
At this point the cluster should have 27 nodes (9 KV, 6 index, 3 analytics, 3 FTS, 3 eventing, and 3 N1QL).
2. Create 10k metakv tombstones. This has been part of our testing since MB-44838 was fixed. We used to have a total of around 25k for CC; here it has been reduced to around 12k.
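#!/bin/bash
# Create and then delete keys key0..key10000 in metakv; each delete leaves a tombstone.
# Brace expansion ({0..10000}) requires bash, not plain sh.
for i in {0..10000}
do
  curl -X PUT -u Administrator:password "http://localhost:8091/_metakv/key${i}" -d 'value=foo1'
  curl -X DELETE -v -u Administrator:password "http://localhost:8091/_metakv/key${i}"
done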
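The tombstone loop in step 2 can also be written as a small function so that the host, key count, and credentials are parameters rather than constants. This is a minimal sketch and not part of the original test: the function names `metakv_key_url` and `create_tombstones` are hypothetical, and only the `/_metakv/key<N>` endpoint and the `value=foo1` payload are taken from the script above.

```sh
# Sketch only: helper names are hypothetical, not part of the original test.

# Print the metakv URL for key number $2 on host:port $1.
metakv_key_url() {
    printf 'http://%s/_metakv/key%s\n' "$1" "$2"
}

# PUT and then DELETE keys key0..key(N-1), as the script in step 2 does.
# $1 = host:port, $2 = N, $3 = user:password.
# Stops at the first curl transport failure.
create_tombstones() {
    i=0
    while [ "$i" -lt "$2" ]; do
        url=$(metakv_key_url "$1" "$i")
        curl -sS -X PUT -u "$3" "$url" -d 'value=foo1' || return 1
        curl -sS -X DELETE -u "$3" "$url" || return 1
        i=$((i + 1))
    done
}
```

A call such as `create_tombstones localhost:8091 10000 Administrator:password` would then correspond to the 10k tombstones described in step 2.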
3. Swap-rebalance 6 nodes, 1 of each service, with 7.0.2 nodes. The rebalance completes successfully.
4. Fail over 6 of the 6.6.3 nodes, 1 of each service (graceful failover for KV). Upgrade these nodes to 7.0.2, recover all 6 nodes (delta recovery for KV), and rebalance.
5. Repeat step 4 until all nodes in the cluster are upgraded to 7.0.2.
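The per-node part of step 4 could be scripted roughly as follows. This is a dry-run sketch that only prints commands; the ticket does not say how the failover and recovery were actually driven. The function name `print_node_upgrade_plan` is hypothetical, the use of hard failover and full recovery for non-KV services is an assumption (the ticket only states graceful failover and delta recovery for KV), and the `couchbase-cli` flags should be checked against the documentation for the version in use.

```sh
# Sketch only: the function name is hypothetical, and the credentials are
# the ones used in the curl script in step 2.
# Print (do not run) the couchbase-cli calls for one node in step 4.
# $1 = cluster host:port, $2 = node host:port, $3 = service name
print_node_upgrade_plan() {
    cli="/opt/couchbase/bin/couchbase-cli"
    auth="-u Administrator -p password"
    if [ "$3" = "kv" ]; then
        # KV nodes: graceful failover, then delta recovery (as in step 4).
        echo "$cli failover -c $1 $auth --server-failover $2"
        recovery="delta"
    else
        # Assumption: other services use hard failover and full recovery.
        echo "$cli failover -c $1 $auth --server-failover $2 --hard"
        recovery="full"
    fi
    echo "# upgrade $2 to 7.0.2 before recovering it"
    echo "$cli recovery -c $1 $auth --server-recovery $2 --recovery-type $recovery"
}
```

Once all six nodes of a round have been recovered, a single `couchbase-cli rebalance` against the cluster would complete the round.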
6. Now run the following commands to enable IPv4-only mode and set the cluster encryption level to strict:
/opt/couchbase/bin/couchbase-cli ip-family -c http://localhost:8091 -u Administrator -p password --set --ipv4only
/opt/couchbase/bin/couchbase-cli node-to-node-encryption -c http://localhost:8091 -u Administrator -p password --enable
/opt/couchbase/bin/couchbase-cli setting-security -c http://localhost:8091 -u Administrator -p password --set --cluster-encryption-level strict
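The three commands in step 6 are order-dependent, so when scripting them it can help to stop at the first failure. A minimal sketch follows; the function name `enable_strict_encryption` and the `CB_CLI` override are hypothetical, and the subcommands and flags are exactly those listed above.

```sh
# Sketch only: the function name and the CB_CLI override are hypothetical.
# CB_CLI defaults to the couchbase-cli path used in step 6.
: "${CB_CLI:=/opt/couchbase/bin/couchbase-cli}"

# Run the three commands from step 6 in order; stop at the first failure.
# $1 = cluster URL, $2 = user, $3 = password
enable_strict_encryption() {
    $CB_CLI ip-family -c "$1" -u "$2" -p "$3" --set --ipv4only &&
    $CB_CLI node-to-node-encryption -c "$1" -u "$2" -p "$3" --enable &&
    $CB_CLI setting-security -c "$1" -u "$2" -p "$3" --set --cluster-encryption-level strict
}
```

Setting `CB_CLI=echo` before calling `enable_strict_encryption http://localhost:8091 Administrator password` prints the three argument lists without touching the cluster, which is a cheap way to review them first.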
7. Add new 7.0.2 nodes, remove a few 7.0.2 nodes, and start a rebalance (Operation Id: 015dc7f6b30f1864adf4611a37435014). This rebalance had to be stopped and restarted due to an unrelated issue (see MB-48449). The retried rebalance (Operation Id: 2535978d0ed7e241b4a93065d1fcf79e) failed as shown below.
ns_1@172.23.106.136 2:11:41 AM 15 Sep, 2021
Starting rebalance, KeepNodes = ['ns_1@172.23.106.134','ns_1@172.23.106.136', 'ns_1@172.23.106.137','ns_1@172.23.106.138', 'ns_1@172.23.120.58','ns_1@172.23.120.73', 'ns_1@172.23.120.74','ns_1@172.23.120.75', 'ns_1@172.23.120.77','ns_1@172.23.120.81', 'ns_1@172.23.120.86','ns_1@172.23.121.118', 'ns_1@172.23.121.77','ns_1@172.23.123.24', 'ns_1@172.23.123.25','ns_1@172.23.123.26', 'ns_1@172.23.123.31','ns_1@172.23.123.32', 'ns_1@172.23.123.33','ns_1@172.23.96.122', 'ns_1@172.23.96.14','ns_1@172.23.96.243', 'ns_1@172.23.96.254','ns_1@172.23.96.48', 'ns_1@172.23.97.105','ns_1@172.23.97.110', 'ns_1@172.23.97.112','ns_1@172.23.97.148', 'ns_1@172.23.97.149','ns_1@172.23.97.150', 'ns_1@172.23.97.151','ns_1@172.23.97.241', 'ns_1@172.23.97.74'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 2535978d0ed7e241b4a93065d1fcf79e
ns_1@172.23.97.241 2:18:22 AM 15 Sep, 2021
Analytics Service unable to successfully rebalance d41b688310a12c6cf599bee64c6afde6 due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [79b50a33da8ff241d7aae2df002048d6], state: ACTIVE)'; see analytics_info.log for details
ns_1@172.23.106.136 2:18:22 AM 15 Sep, 2021
Rebalance exited with reason {service_rebalance_failed,cbas, {worker_died, {'EXIT',<0.14871.1636>, {rebalance_failed, {service_error, <<"Rebalance d41b688310a12c6cf599bee64c6afde6 failed: timed out waiting for all nodes to join & cluster active (missing nodes: [172.23.123.32:8091 (79b50a33da8ff241d7aae2df002048d6)], state: ACTIVE)">>}}}}}. Rebalance Operation Id = 2535978d0ed7e241b4a93065d1fcf79e
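When triaging failures like the one above, the node the Analytics service was waiting for can be pulled out of the message text. A minimal sketch, assuming the message keeps the `missing nodes: [...]` shape shown in these log entries; the function name `extract_missing_nodes` is hypothetical.

```sh
# Sketch only: the function name is hypothetical.
# Read log lines on stdin and print the contents of the
# "missing nodes: [...]" list for each line that has one.
extract_missing_nodes() {
    sed -n 's/.*missing nodes: \[\([^]]*\)].*/\1/p'
}
```

Feeding a saved copy of the log text through `extract_missing_nodes | sort | uniq -c` gives a quick count of how often each node was reported missing.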
cbcollect_info is attached. This is the first time we are running this system-test upgrade on 7.0.2, so there is no baseline and no last working build.
Attachments
For Gerrit Dashboard: MB-48468

| # | Subject | Branch | Project | Status | CR | V |
|---|---|---|---|---|---|---|
| 161723,2 | MB-48468: Keep http server running | cheshire-cat | cbas | MERGED | +2 | +1 |