Details

Type: Improvement
Resolution: Unresolved
Priority: Major
Fix Version/s: Morpheus
Affects Version/s: 6.6.3, 7.0.0
Component/s: ns_server
Labels:
- approved-for-trinity
- pm-fast-failover

Epic Link: Failover Improvements
Story Points: 1

Description

As seen in CBSE-10704

We promise a 5-second auto-failover timeout, but it takes 9 seconds for a shut-down node to be failed over.

A node is considered down if its latest heartbeat was received 2 seconds ago. So if the node went down immediately after sending a heartbeat, there is a 2-second lag before it is detected as down.

Then, unfortunately, we have a series of internal monitors that each refresh their status once a second by requesting status from other monitors, which also refresh their status once a second. An unfortunate alignment of these refreshes can account for another 2 seconds of delay.
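To make the alignment effect concrete, here is a minimal sketch of how a status change propagates through chained periodic monitors. It is not ns_server code: the function names, the two-layer chain, and the tick phases are assumptions chosen for illustration; only the 1-second refresh period comes from the description above.

```python
import math


def next_tick(t, phase, period=1.0):
    """Return the first tick time >= t for a timer firing at phase + k * period."""
    k = math.ceil((t - phase) / period)
    return phase + k * period


def propagation_delay(event_time, phases, period=1.0):
    """Delay until the last monitor in the chain observes the event.

    Each monitor only sees the new status on its own next tick, after the
    monitor below it has already refreshed.
    """
    t = event_time
    for phase in phases:
        t = next_tick(t, phase, period)
    return t - event_time


# Two chained monitors with unlucky phases (hypothetical values).
chained = propagation_delay(0.125, [0.0, 0.875])
# A single monitor gathering everything in one tick.
single = propagation_delay(0.125, [0.0])
print(chained, single)
```

With these phases the chained monitors take 1.75 seconds to surface the change, against 0.875 seconds for a single tick. In general each extra polling layer can add up to one more period.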

Only after the top-level monitor discovers that the node is down do we begin the countdown for the configured auto-failover timeout.
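The observed 9 seconds can be reconstructed from the three delays described above. The sketch below just adds them up. The function and parameter names are hypothetical, and modelling the monitor delay as two layers of one 1-second period each is an assumption consistent with the description, not a statement about the actual monitor topology.

```python
def worst_case_failover_seconds(configured_timeout, heartbeat_staleness,
                                monitor_layers, monitor_period):
    """Sum the worst-case delays between a node dying and its failover."""
    # Time before the last heartbeat is old enough to count as missing.
    detection_lag = heartbeat_staleness
    # Time for the "down" status to reach the top-level monitor.
    propagation_lag = monitor_layers * monitor_period
    # The configured countdown only starts after both of the above.
    return detection_lag + propagation_lag + configured_timeout


total = worst_case_failover_seconds(
    configured_timeout=5.0,
    heartbeat_staleness=2.0,
    monitor_layers=2,
    monitor_period=1.0,
)
print(total)  # 9.0
```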

What can be improved:
1. Reduce asynchronicity in the health monitors. Do all the information gathering in one tick instead of running independent ticks on each monitor.
2. Account for the 2 seconds that have already passed since the last heartbeat when we start the countdown to auto failover.
3. Consider using ns_node_disco info as a backup way to see whether another node is down.
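Item 2 could look roughly like the sketch below: when the countdown starts, the time already elapsed since the last heartbeat is credited against the configured timeout. This is an illustration under assumed names (`remaining_countdown`, `last_heartbeat_at`), not the ns_server implementation.

```python
def remaining_countdown(configured_timeout, now, last_heartbeat_at):
    """Seconds left before failover, crediting time since the last heartbeat."""
    # Guard against clock skew producing a negative elapsed time.
    elapsed = max(0.0, now - last_heartbeat_at)
    # Never return a negative countdown; 0.0 means "fail over now".
    return max(0.0, configured_timeout - elapsed)


# Last heartbeat at t=10 s, node declared down at t=12 s, 5 s timeout:
# only 3 s of countdown remain instead of a fresh 5 s.
print(remaining_countdown(5.0, now=12.0, last_heartbeat_at=10.0))  # 3.0
```

Anchoring the countdown to the last heartbeat rather than to the moment of detection keeps the total closer to the configured value regardless of how long detection took.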

Attachments

- Screenshot 2023-09-29 at 17.25.04.png (152 kB, 29/Sep/23 9:25 AM)
- Screenshot 2023-09-29 at 17.19.33.png (145 kB, 29/Sep/23 9:22 AM)
- Screen Shot 2021-10-06 at 12.24.23 PM.png (15 kB, 06/Oct/21 12:26 PM)

Issue Links

relates to

MB-58264 Addition of disk read/write failure timeout to auto-failover timeout is unintuitive

Open

Gerrit Reviews

For Gerrit Dashboard: MB-48412
#	Subject	Branch	Project	Status	CR	V
190251,14	MB-48412: wip: notify unhealthy node	master	ns_server	Status: NEW	0	0
192184,9	MB-48412: Add ability to forcefully tick auto_failover	master	ns_server	Status: NEW	0	0

Activity

People

Assignee: Ben Huddleston

Reporter: Artem Stemkovski

Votes: 0

Watchers: 14

Dates

Created: 10/Sep/21 4:57 PM

Updated: 18/Jan/24 11:40 PM

Shorten the time between observing the missing heartbeat and the auto failover
