Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: 6.6.3
Affects Version/s: 6.6.3
Component/s: None
Labels:
- system_test_upgrade
- upgrade
Environment:
6.6.2-9588 --> 6.6.3-9796

Triage:
Untriaged
Operating System:
Centos 64-bit
Story Points:
1
Is this a Regression?:
No

Description

Steps to Repro
1. Run the following 6.6.2 longevity test for 4 days.

./sequoia -client 172.23.96.162:2375 -provider file:centos_third_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.2-9588 -skip_setup=true -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true

2. At this point we have a 27-node cluster (3 analytics, 3 index, 3 fts, 3 query, 6 index, 9 data).
3. Do a swap rebalance of 6 nodes (1 of each service type). This worked fine.
4. Fail over 6 nodes (1 of each service type; graceful failover for KV). Upgrade them, recover them (delta recovery for KV) and rebalance. This worked fine.
5. Fail over 6 nodes (1 of each service type; graceful failover for KV). The graceful failover of KV node 172.23.105.164 failed with the following error:

"completionMessage": "Graceful failover exited with reason {mover_crashed,\n                                      {unexpected_exit,\n                                       {'EXIT',<0.1591.25>,\n                                        {failed_to_update_vbucket_map,\n                                         \"NEW_ORDER\",977,\n                                         {error,\n                                          [{'ns_1@172.23.106.54',\n                                            {exit,\n                                             {{nodedown,'ns_1@172.23.106.54'},\n                                              {gen_server,call,\n                                               [{ns_config_rep,\n                                                 'ns_1@172.23.106.54'},\n                                                synchronize_everything,\n                                                infinity]}}}}]}}}}}."
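To make the crash reason easier to scan, here is a hypothetical triage helper (summarize_failover_reason is our own name, not part of Couchbase tooling) that pulls the bucket, vBucket and down nodes out of a completion message like the one above. It assumes the {failed_to_update_vbucket_map, Bucket, VBucket, Error} layout, i.e. that "NEW_ORDER" is the bucket name and 977 the vBucket ID; that reading is an assumption, not something the logs confirm. It accepts either the raw JSON line (with \n and \" escapes) or the already pretty-printed term.

```shell
#!/usr/bin/env bash
# Hypothetical triage helper (not Couchbase tooling): summarize a failover
# completionMessage read from stdin. Assumes the crash reason contains
# {failed_to_update_vbucket_map, Bucket, VBucket, Error} and {nodedown, Node}.
summarize_failover_reason() {
  local msg norm
  msg=$(cat)   # whole message from stdin

  # Undo JSON escapes (\n -> space, \" -> ") and join everything onto one line.
  norm=$(printf '%s\n' "$msg" | sed -e 's/\\n/ /g' -e 's/\\"/"/g' | tr '\n' ' ')

  # Bucket and vBucket: the two fields after failed_to_update_vbucket_map.
  printf '%s\n' "$norm" \
    | grep -oE 'failed_to_update_vbucket_map,[[:space:]]*"[^"]*",[[:space:]]*[0-9]+' \
    | sed -E 's/.*"([^"]*)",[[:space:]]*([0-9]+)$/bucket=\1 vbucket=\2/'

  # Every node reported as {nodedown, Node}, de-duplicated.
  printf '%s\n' "$norm" | grep -oE "nodedown,'[^']+'" \
    | sed -E "s/nodedown,'([^']+)'/nodedown=\1/" | sort -u
}
```

For example, piping the completionMessage line above into summarize_failover_reason should print bucket=NEW_ORDER vbucket=977 followed by nodedown=ns_1@172.23.106.54, which points at 172.23.106.54 (not the node being failed over) as the node that dropped out during the vBucket map update.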

This reminds me of the bugs I hit during the 6.6.2 -> 7.0.0 upgrade because of bloated metakv tombstones, namely MB-46778 and MB-46787. I'm not sure whether this is the same issue, though.

Some important things to note:
1. This is the first time we are running the system upgrade test from 6.6.2 -> 6.6.3, so there is no baseline to compare against. This test was first run in 7.0.0, using a 6.6.2 -> 7.0.0 upgrade.
2. Number of metakv tombstones:

[root@localhost ~]#  curl --silent -u Administrator:password http://localhost:8091/diag/eval -d 'ns_config:get()' | grep '_deleted' | wc -l

[root@localhost ~]#

Please note that these tombstones were created organically, unlike the ones we generated in 7.0.0 with a shell script for metakv purge testing during the system test upgrade. No changes were made to the longevity test, which was written during the Mad Hatter (MH) time frame, possibly around 2 years ago.
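To compare tombstone counts across nodes (for example the failed-over KV node and the node reported as down), the count above could be repeated per host. This is a sketch under assumptions: count_tombstones, CB_USER and CB_PASS are hypothetical names, it assumes /diag/eval accepts requests from the machine running the script (otherwise run the original command locally on each node), and, like the original command, it only approximates the tombstone count by counting output lines that contain '_deleted'.

```shell
#!/usr/bin/env bash
# Sketch: repeat the '_deleted' line count on several nodes.
# Assumes /diag/eval is reachable remotely; CB_USER/CB_PASS are hypothetical
# env vars defaulting to the credentials used in the command above.
count_tombstones() {
  local host=$1 out
  # --fail makes HTTP errors a non-zero exit so they are not reported as "0".
  out=$(curl --silent --fail -u "${CB_USER:-Administrator}:${CB_PASS:-password}" \
        "http://${host}:8091/diag/eval" -d 'ns_config:get()') \
    || { echo "unreachable"; return 1; }
  # Approximate tombstone count: lines of the config dump mentioning '_deleted'.
  printf '%s\n' "$out" | grep -c '_deleted'
}

# Usage: ./count_tombstones.sh 172.23.105.164 172.23.106.54
for host in "$@"; do
  printf '%s %s\n' "$host" "$(count_tombstones "$host")"
done
```

Printing "unreachable" rather than 0 for a failed request keeps a down node from looking like a node with no tombstones.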

cbcollect_info attached.

Attachments

rebalanceReport.json
815 kB
19/Jul/21 5:43 AM
cpu-usage-214.png
128 kB
23/Jul/21 10:33 AM

People

Assignee: Balakumaran Gopal

Reporter: Balakumaran Gopal

Votes: 0

Watchers: 6

Dates

Created: 19/Jul/21 5:34 AM

Updated: 25/Jul/21 12:17 AM

Resolved: 25/Jul/21 12:17 AM

[System Test] - Graceful failover done during upgrade fails with "Graceful failover exited with reason {mover_crashed,{unexpected_exit,{'EXIT',<0.1591.25>,{failed_to_update_vbucket_map,"
