Details
-
Bug
-
Status: Closed
-
Critical
-
Resolution: Duplicate
-
7.0.2
-
6.6.3-9808 -> 7.0.2-6668
-
Untriaged
-
Centos 64-bit
-
1
-
No
Description
Steps to Repro
1. Run the following longevity script on 6.6.3 for 5 days.
./sequoia -client 172.23.104.254:2375 -provider file:centos_second_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.3-9808 -skip_setup=true -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
|
At this point it should have a 27 node cluster ( 9 Kv, 6 Index, 3 analytics, 3 fts, 3 eventing and 3 n1ql)
2. Create 10k metakv tombstones. This has been part of our testing since MB-44838 was fixed. We used to have a total of around 25k for CC, have reduced it here to around 12k.
#!/bin/sh
|
for i in {0..10000}
|
do
|
`curl -X PUT -u Administrator:password http://localhost:8091/_metakv/key{$i} -d 'value=foo1'`
|
`curl -X DELETE -v -u Administrator:password http://localhost:8091/_metakv/key{$i}`
|
done
|
3. Swap rebalance 6 nodes , 1 of each service with that of 7.0.2 nodes. Rebalance goes through successfully.
4. Failover 6 nodes(6.6.3 nodes)1 of each service(kv is graceful failover), Upgrade these nodes to 7.0.2, do a recovery of all the 6 node(kv is delta recovery) and rebalance.
5. Repeat step no 4 until all the nodes in cluster are upgraded to 7.0.2.
After upgrade I validated that all the metakv tombstones are purged and enabled IPv4 only + enforce tls using the below commands.
[root@localhost logs]# grep 'ns_config tombstone' debug.log
|
[ns_server:debug,2021-09-14T04:26:34.722-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 11869 ns_config tombstone(s) up to timestamp 63798837690. Tombstones:
|
[ns_server:debug,2021-09-14T04:30:38.296-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 1 ns_config tombstone(s) up to timestamp 63798837934. Tombstones:
|
[ns_server:debug,2021-09-14T04:31:41.731-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 127 ns_config tombstone(s) up to timestamp 63798837998. Tombstones:
|
[ns_server:debug,2021-09-14T04:39:42.419-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 1 ns_config tombstone(s) up to timestamp 63798838481. Tombstones:
|
[ns_server:debug,2021-09-14T04:40:43.040-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 1 ns_config tombstone(s) up to timestamp 63798838542. Tombstones:
|
[root@localhost logs]#
|
[root@localhost logs]# /opt/couchbase/bin/couchbase-cli ip-family -c http://localhost:8091 -u Administrator -p password --set --ipv4only
|
Switched IP family for node: http://172.23.106.134:8091
|
Switched IP family for node: http://172.23.106.136:8091
|
Switched IP family for node: http://172.23.106.137:8091
|
Switched IP family for node: http://172.23.106.138:8091
|
Switched IP family for node: http://172.23.120.58:8091
|
Switched IP family for node: http://172.23.120.73:8091
|
Switched IP family for node: http://172.23.120.74:8091
|
Switched IP family for node: http://172.23.120.75:8091
|
Switched IP family for node: http://172.23.120.77:8091
|
Switched IP family for node: http://172.23.120.81:8091
|
Switched IP family for node: http://172.23.120.86:8091
|
Switched IP family for node: http://172.23.121.118:8091
|
Switched IP family for node: http://172.23.121.77:8091
|
Switched IP family for node: http://172.23.123.24:8091
|
Switched IP family for node: http://172.23.123.25:8091
|
Switched IP family for node: http://172.23.123.26:8091
|
Switched IP family for node: http://172.23.123.31:8091
|
Switched IP family for node: http://172.23.123.32:8091
|
Switched IP family for node: http://172.23.123.33:8091
|
Switched IP family for node: http://172.23.96.122:8091
|
Switched IP family for node: http://172.23.96.14:8091
|
Switched IP family for node: http://172.23.96.243:8091
|
Switched IP family for node: http://172.23.97.105:8091
|
Switched IP family for node: http://172.23.97.148:8091
|
Switched IP family for node: http://172.23.97.149:8091
|
Switched IP family for node: http://172.23.97.150:8091
|
Switched IP family for node: http://172.23.97.151:8091
|
SUCCESS: Switched IP family of the cluster
|
[root@localhost logs]# /opt/couchbase/bin/couchbase-cli node-to-node-encryption -c http://localhost:8091 -u Administrator -p password --enable
|
Turned on encryption for node: http://172.23.106.134:8091
|
Turned on encryption for node: http://172.23.106.136:8091
|
Turned on encryption for node: http://172.23.106.137:8091
|
Turned on encryption for node: http://172.23.106.138:8091
|
Turned on encryption for node: http://172.23.120.58:8091
|
Turned on encryption for node: http://172.23.120.73:8091
|
Turned on encryption for node: http://172.23.120.74:8091
|
Turned on encryption for node: http://172.23.120.75:8091
|
Turned on encryption for node: http://172.23.120.77:8091
|
Turned on encryption for node: http://172.23.120.81:8091
|
Turned on encryption for node: http://172.23.120.86:8091
|
Turned on encryption for node: http://172.23.121.118:8091
|
Turned on encryption for node: http://172.23.121.77:8091
|
Turned on encryption for node: http://172.23.123.24:8091
|
Turned on encryption for node: http://172.23.123.25:8091
|
Turned on encryption for node: http://172.23.123.26:8091
|
Turned on encryption for node: http://172.23.123.31:8091
|
Turned on encryption for node: http://172.23.123.32:8091
|
Turned on encryption for node: http://172.23.123.33:8091
|
Turned on encryption for node: http://172.23.96.122:8091
|
Turned on encryption for node: http://172.23.96.14:8091
|
Turned on encryption for node: http://172.23.96.243:8091
|
Turned on encryption for node: http://172.23.97.105:8091
|
Turned on encryption for node: http://172.23.97.148:8091
|
Turned on encryption for node: http://172.23.97.149:8091
|
Turned on encryption for node: http://172.23.97.150:8091
|
Turned on encryption for node: http://172.23.97.151:8091
|
SUCCESS: Switched node-to-node encryption on
|
[root@localhost logs]# /opt/couchbase/bin/couchbase-cli setting-security -c http://localhost:8091 -u Administrator -p password --set --cluster-encryption-level strict
|
SUCCESS: Security settings updated
|
[root@localhost logs]#
|
At this point I noticed that Rebalance button was enabled. Not sure why this was enabled as we had nothing to rebalance afaik(this is tracked through MB-48448). However, when I did rebalance it failed as shown below.
ns_1@172.23.106.136 5:47:32 AM 14 Sep, 2021
Rebalance exited with reason {service_rebalance_failed,eventing,
|
{worker_died,
|
{'EXIT',<0.29362.951>,
|
{rebalance_failed,
|
{service_error,
|
<<"eventing rebalance hasn't made progress for past 1200 secs">>}}}}}.
|
Rebalance Operation Id = f38dda25a6ccfe31114529f5d88b0753
|
cbcollect_info attached. This the first time we are running this system test upgrade on 7.0.2, hence there is no baseline as such and no last working build.
Noticed errors like these on the eventing log
172.23.96.122
2021-09-14T07:15:56.942-07:00 [Error] Consumer::getOpCallback [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1550114910.sock:9871] Bucket fetch failed for key: <ud>eventing::1550114910::timer_op::vb::797</ud>, err: unambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"unambiguous timeout"}},"OperationID":"Get","Opaque":"0x0","TimeObserved":2500111201,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"","LastDispatchedFrom":"","LastConnectionID":""}
2021-09-14T07:15:56.942-07:00 [Error] Consumer::getOpCallback [worker_timer_op_0:/tmp/127.0.0.1:8091_0_1550114910.sock:9887] Bucket fetch failed for key: <ud>eventing::1550114910::timer_op::vb::683</ud>, err: unambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"unambiguous timeout"}},"OperationID":"Get","Opaque":"0x0","TimeObserved":2500136088,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"","LastDispatchedFrom":"","LastConnectionID":""}
2021-09-14T07:15:56.942-07:00 [Error] Consumer::addOwnershipHistorySECallback [worker_bucket_op_curl_2:/tmp/127.0.0.1:8091_2_1030245617.sock:9793] Key: eventing::1030245617::bucket_op_curl::vb::944, subdoc operation failed while performing ownership entry app post STREAMEND, err: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"MutateIn","Opaque":"0x0","TimeObserved":2500086758,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"","LastDispatchedFrom":"","LastConnectionID":""}