Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 7.0.2
Affects Version/s: 7.0.2
Component/s: analytics
Labels:

Triage: Untriaged
Operating System: CentOS 64-bit
Story Points: 1
Is this a Regression?: No
Sprint: CX Sprint 263

Description

Steps to Repro
1. Run the following longevity script on 6.6.3 for 5 days.

./sequoia -client 172.23.104.254:2375 -provider file:centos_second_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.3-9808 -skip_setup=true -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true

At this point the cluster should have 27 nodes (9 KV, 6 index, 3 analytics, 3 FTS, 3 eventing, and 3 N1QL).
2. Create 10k metakv tombstones. This has been part of our testing since MB-44838 was fixed. We used to create around 25k in total for CC; this has been reduced to around 12k here.

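 #!/bin/bash
 # Brace expansion ({0..10000}) needs bash, not plain sh.
 # Each key is created and then deleted so that metakv keeps a tombstone for it.

 for i in {0..10000}
 do
         curl -X PUT -u Administrator:password "http://localhost:8091/_metakv/key${i}" -d 'value=foo1'
         curl -X DELETE -v -u Administrator:password "http://localhost:8091/_metakv/key${i}"
 done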

3. Swap rebalance 6 nodes, 1 of each service, with 7.0.2 nodes. The rebalance completes successfully.
4. Fail over 6 nodes (6.6.3 nodes), 1 of each service (graceful failover for KV), upgrade these nodes to 7.0.2, recover all 6 nodes (delta recovery for KV), and rebalance.
5. Repeat step 4 until all nodes in the cluster are upgraded to 7.0.2.
6. Run the following commands to enable IPv4 only and set the encryption level to strict:

 /opt/couchbase/bin/couchbase-cli ip-family -c http://localhost:8091 -u Administrator -p password --set --ipv4only

 /opt/couchbase/bin/couchbase-cli node-to-node-encryption -c http://localhost:8091 -u Administrator -p password --enable

 /opt/couchbase/bin/couchbase-cli setting-security -c http://localhost:8091 -u Administrator -p password --set --cluster-encryption-level strict

7. Add new 7.0.2 nodes, remove a few 7.0.2 nodes, and start a rebalance (Operation Id: 015dc7f6b30f1864adf4611a37435014). This rebalance had to be stopped and restarted due to an unrelated issue (see MB-48449). The retried rebalance (Operation Id: 2535978d0ed7e241b4a93065d1fcf79e) failed as shown below.

ns_1@172.23.106.136 2:11:41 AM 15 Sep, 2021

Starting rebalance, KeepNodes = ['ns_1@172.23.106.134','ns_1@172.23.106.136', 'ns_1@172.23.106.137','ns_1@172.23.106.138', 'ns_1@172.23.120.58','ns_1@172.23.120.73', 'ns_1@172.23.120.74','ns_1@172.23.120.75', 'ns_1@172.23.120.77','ns_1@172.23.120.81', 'ns_1@172.23.120.86','ns_1@172.23.121.118', 'ns_1@172.23.121.77','ns_1@172.23.123.24', 'ns_1@172.23.123.25','ns_1@172.23.123.26', 'ns_1@172.23.123.31','ns_1@172.23.123.32', 'ns_1@172.23.123.33','ns_1@172.23.96.122', 'ns_1@172.23.96.14','ns_1@172.23.96.243', 'ns_1@172.23.96.254','ns_1@172.23.96.48', 'ns_1@172.23.97.105','ns_1@172.23.97.110', 'ns_1@172.23.97.112','ns_1@172.23.97.148', 'ns_1@172.23.97.149','ns_1@172.23.97.150', 'ns_1@172.23.97.151','ns_1@172.23.97.241', 'ns_1@172.23.97.74'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 2535978d0ed7e241b4a93065d1fcf79e

ns_1@172.23.97.241 2:18:22 AM 15 Sep, 2021

Analytics Service unable to successfully rebalance d41b688310a12c6cf599bee64c6afde6 due to 'java.lang.IllegalStateException: timed out waiting for all nodes to join & cluster active (missing nodes: [79b50a33da8ff241d7aae2df002048d6], state: ACTIVE)'; see analytics_info.log for details

ns_1@172.23.106.136 2:18:22 AM 15 Sep, 2021

Rebalance exited with reason {service_rebalance_failed,cbas, {worker_died, {'EXIT',<0.14871.1636>, {rebalance_failed, {service_error, <<"Rebalance d41b688310a12c6cf599bee64c6afde6 failed: timed out waiting for all nodes to join & cluster active (missing nodes: [172.23.123.32:8091 (79b50a33da8ff241d7aae2df002048d6)], state: ACTIVE)">>}}}}}. Rebalance Operation Id = 2535978d0ed7e241b4a93065d1fcf79e
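
When triaging a failure like the one above, it can help to pull the missing node UUIDs out of the error text so they can be matched against the node list. The following is a minimal sketch, assuming the message keeps the "missing nodes: [...]" shape seen in this report; the function name missing_nodes is hypothetical and not part of any Couchbase tooling.

```sh
#!/bin/sh
# Hypothetical helper: read log text on stdin and print each missing
# node UUID (32 hex characters) on its own line.
missing_nodes() {
    # Keep only what sits inside "missing nodes: [...]", so that other
    # 32-character ids on the line (e.g. the rebalance id) are ignored.
    sed -n 's/.*missing nodes: \[\([^]]*\)\].*/\1/p' |
        grep -o '[0-9a-f]\{32\}'
}
```

It could be fed from a saved log, for example `missing_nodes < analytics_info.log`, where the exact log location depends on the installation.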

cbcollect_info attached. This is the first time we are running this system test upgrade on 7.0.2, so there is no baseline and no last working build.
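
For anyone re-running the repro, the tombstone loop from step 2 could be wrapped in a small function that takes the host and the key count as arguments and stops at the first failed request. This is only a sketch under assumptions: the function name make_tombstones and the CURL override variable are hypothetical, while the endpoint, credentials, and payload are the ones used in step 2.

```sh
#!/bin/sh
# CURL can be overridden (e.g. CURL=echo) to print the requests
# instead of sending them.
CURL="${CURL:-curl}"

# Hypothetical helper: make_tombstones HOST COUNT
# PUTs and then DELETEs key1..keyCOUNT under /_metakv on HOST:8091,
# returning non-zero as soon as one request fails.
make_tombstones() {
    host="$1"
    count="$2"
    i=1
    while [ "$i" -le "$count" ]; do
        url="http://${host}:8091/_metakv/key${i}"
        "$CURL" -X PUT -u Administrator:password "$url" -d 'value=foo1' || return 1
        "$CURL" -X DELETE -u Administrator:password "$url" || return 1
        i=$((i + 1))
    done
}
```

A call such as `make_tombstones localhost 10000` would then correspond to step 2, and setting `CURL=echo` first gives a dry run that only lists the requests.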

Attachments

rebalanceReport.json
3.33 MB
15/Sep/21 2:43 AM

Gerrit Reviews

For Gerrit Dashboard: MB-48468
#	Subject	Branch	Project	Status	CR	V
161723,2	MB-48468: Keep http server running	cheshire-cat	cbas	Status: MERGED	+2	+1

People

Assignee: Umang
Reporter: Balakumaran Gopal
Votes: 0
Watchers: 4

Dates

Created: 15/Sep/21 2:40 AM
Updated: 21/Sep/21 4:58 PM
Resolved: 17/Sep/21 10:37 AM

[System test upgrade] : Post upgrade analytics rebalance fails with "Rebalance failed: timed out waiting for all nodes to join & cluster active (missing nodes:"
