Couchbase Server / MB-49795

MultiNodeFailure: Failover is attempted when it can't failover 'N' nodes at the same time


Details

    Description

      Build: 7.1.0-1787

      Scenario:

      • 7-node cluster
      • Couchbase bucket with replicas=3
      • Set auto-failover with max_events=1 and timeout=5 (see the configuration sketch below)
      • Stop memcached on 2 nodes (172.23.100.13 and 172.23.100.14)
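
      For reference, here is a minimal sketch of how this auto-failover configuration can be applied through the cluster REST API (/settings/autoFailover). This is illustrative Python, not part of the original report; the host address and credentials are placeholders.

      import requests

      # Placeholder cluster address and credentials; replace with real values.
      CLUSTER = "http://172.23.105.155:8091"
      AUTH = ("Administrator", "password")

      # Enable auto-failover with a 5 second timeout and a maximum of 1 node
      # (the setting this ticket refers to as max_events).
      resp = requests.post(
          f"{CLUSTER}/settings/autoFailover",
          auth=AUTH,
          data={"enabled": "true", "timeout": 5, "maxCount": 1},
      )
      resp.raise_for_status()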

      Observation:

      Auto-failover gets triggered, fails with the following reason, and is retried continuously:

      Could not auto-failover more nodes (['ns_1@172.23.100.14']). Maximum number of auto-failover events (1) has been reached

      Expected behavior:

      Auto-failover should never be attempted if the configured max_events is less than the number of failed nodes in the cluster.


        Activity

          ashwin.govindarajulu Ashwin Govindarajulu created issue -
          meni.hillel Meni Hillel (Inactive) made changes -
          Field Original Value New Value
          Assignee Meni Hillel [ JIRAUSER25407 ] Hareen Kancharla [ JIRAUSER25304 ]

          hareen.kancharla Hareen Kancharla added a comment -

          This seems to be unrelated to the System Events feature. Meni Hillel: I am not sure, but probably Artem should take a look since it relates to MultiNodeFailure?

          meni.hillel Meni Hillel (Inactive) added a comment -

          Failover should and will be attempted, and reported to the user to document that failover was attempted and did not take place due to the max limit. I am not seeing what the issue is.
          meni.hillel Meni Hillel (Inactive) made changes -
          Assignee Hareen Kancharla [ JIRAUSER25304 ] Ashwin Govindarajulu [ ashwin.govindarajulu ]
          meni.hillel Meni Hillel (Inactive) made changes -
          Resolution Not a Bug [ 10200 ]
          Status Open [ 1 ] Resolved [ 5 ]
          ashwin.govindarajulu Ashwin Govindarajulu added a comment - - edited

          Meni Hillel, I agree the failover attempt is okay from the cluster.

          But the problem is that it runs the actual failover procedure when 2 nodes are down and the configured maximum number of auto-failover events is only one.

          That is the bug here.
          ashwin.govindarajulu Ashwin Govindarajulu made changes -
          Assignee Ashwin Govindarajulu [ ashwin.govindarajulu ] Meni Hillel [ JIRAUSER25407 ]
          Resolution Not a Bug [ 10200 ]
          Status Resolved [ 5 ] Reopened [ 4 ]
          meni.hillel Meni Hillel (Inactive) made changes -
          Assignee Meni Hillel [ JIRAUSER25407 ] Artem Stemkovski [ artem ]

          artem Artem Stemkovski added a comment -

          Here's what's happening:

          Node 'ns_1@172.23.100.14' becomes unhealthy.
          Slightly later, node 'ns_1@172.23.100.13' becomes unhealthy.

          Node 'ns_1@172.23.100.14' reaches nearly_down status first.
          Then, when 'ns_1@172.23.100.13' reaches nearly_down status, the following warning is issued:

          [ns_server:debug,2021-11-26T02:30:52.779-08:00,ns_1@172.23.105.155:<0.16691.0>:auto_failover_logic:process_frame:285]List of candidates changed from [{'ns_1@172.23.100.14',
                                               <<"c10446c156c3aad490ef0e812e5ff96e">>}] to [{'ns_1@172.23.100.13',
                                                                                             <<"ed89ade40233610b8fafdfd1a03377dd">>},
                                                                                            {'ns_1@172.23.100.14',
                                                                                             <<"c10446c156c3aad490ef0e812e5ff96e">>}]. Resetting counter
           
          [ns_server:debug,2021-11-26T02:30:52.779-08:00,ns_1@172.23.105.155:<0.16691.0>:auto_failover_logic:process_frame:324]Decided on following actions: [{mail_down_warning_multi_node,
                                             {'ns_1@172.23.100.14',
                                                 <<"c10446c156c3aad490ef0e812e5ff96e">>}}]
           
          2021-11-26T02:30:52.779-08:00, auto_failover:0:info:message(ns_1@172.23.105.155) - Could not auto-failover node ('ns_1@172.23.100.14'). The list of nodes being down has changed.
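
          To make the log above easier to follow, here is a toy model (illustrative Python, not ns_server's Erlang implementation) of the behaviour it describes: if the set of down candidates changes while the failover countdown is in progress, the countdown restarts and only a warning is emitted.

          # Illustrative only: simplified model of the "candidates changed,
          # resetting counter" behaviour seen in auto_failover_logic above.
          def tick(prev_candidates, candidates, counter, ticks_needed):
              if set(candidates) != set(prev_candidates):
                  # Down-node set changed mid-countdown: reset and warn
                  # (the mail_down_warning_multi_node action in the log).
                  return 0, ("mail_down_warning", sorted(candidates))
              counter += 1
              if counter >= ticks_needed and candidates:
                  return counter, ("failover", sorted(candidates))
              return counter, None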
          

          Then both nodes reach failover status:

          [ns_server:debug,2021-11-26T02:30:54.782-08:00,ns_1@172.23.105.155:<0.16691.0>:auto_failover_logic:process_frame:324]Decided on following actions: [{failover,
                                          [{'ns_1@172.23.100.13',
                                            <<"ed89ade40233610b8fafdfd1a03377dd">>},
                                           {'ns_1@172.23.100.14',
                                            <<"c10446c156c3aad490ef0e812e5ff96e">>}]}]
          

          Then, because the auto-failover counter allows failing over just one node, the list is trimmed from ['ns_1@172.23.100.13', 'ns_1@172.23.100.14'] to ['ns_1@172.23.100.13'].
          The warning is issued:

          [user:info,2021-11-26T02:30:54.783-08:00,ns_1@172.23.105.155:<0.16691.0>:auto_failover:maybe_report_max_node_reached:512]Could not auto-failover more nodes (['ns_1@172.23.100.14']). Maximum number of auto-failover events (1) has been reached.
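
          A minimal sketch (illustrative Python, not the actual Erlang code) of the trimming step described above: the number of nodes that may still be auto-failed over is the configured maximum minus the events already used, and the candidate list is cut down to that quota.

          # Illustrative only: trim failover candidates to the remaining quota.
          def trim_candidates(candidates, max_count, events_used):
              remaining = max(max_count - events_used, 0)
              to_failover, skipped = candidates[:remaining], candidates[remaining:]
              if skipped:
                  print(f"Could not auto-failover more nodes ({skipped}). "
                        f"Maximum number of auto-failover nodes ({max_count}) has been reached.")
              return to_failover

          # trim_candidates(['ns_1@172.23.100.13', 'ns_1@172.23.100.14'], 1, 0)
          #   -> ['ns_1@172.23.100.13']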
          

          Failover of just one node is started:

          [ns_server:debug,2021-11-26T02:30:59.783-08:00,ns_1@172.23.105.155:<0.16689.0>:failover:start:35]Starting failover with Nodes = ['ns_1@172.23.100.13'], Options = #{allow_unsafe =>
                                                                              false,
                                                                             auto =>
                                                                              true,
                                                                             failover_reasons =>
                                                                              [{'ns_1@172.23.100.13',
                                                                                "The data service did not respond for the duration of the auto-failover threshold. Either none of the buckets have warmed up or there is an issue with the data service. "}]}
          

          Failover then fails because the max replica numbers could not be queried from ns_1@172.23.100.14, on which memcached is still down:

          [ns_server:error,2021-11-26T02:31:09.806-08:00,ns_1@172.23.105.155:<0.9411.19>:failover:failover_buckets:261]Caught failover exception: "Failed to get failover info for bucket \"default\": ['ns_1@172.23.100.14']"
          

          This is unfortunately not pretty, but I don't know how we can improve this.

          Abhijeeth.Nuthan Abhijeeth Nuthan added a comment - - edited

          Artem Stemkovski asked me to take a look at this bug and give my opinion.

          Ashwin Govindarajulu: My assumption is that QE deems this a bug because we attempt failover when 2 nodes are down, which is more than the auto-failover max count, and that didn't happen prior to Neo? I think the key point here is that we have changed the behavior of auto-failover, and this change needs to be documented appropriately.

          New in Neo:

          1. We attempt to fail over a subset of the nodes deemed down, limited by the auto-failover max count; in this case we attempt failover of 1 node.
          2. Index failover.

          I believe we do the right thing in attempting the failover of 1 node, since different services can be down on different nodes. Consider the hypothetical scenario where node13 has its data service down and node14 has its index service down. We should (and we do) try to fail over node13, which runs the data service, and that failover would succeed. We prioritize KV node failover. Also, we would still be under the max_count; note that we are not failing over more nodes than max_count.

          Note: In pre-Neo, node14 wouldn't be deemed down.

           

          Artem Stemkovski: We could perhaps change the auto-failover behavior to identify which monitors believe each node is down, and only attempt a partial failover if the overlap of down monitors is zero. That is, the nodes we want to fail over do not have the same services down as the nodes we do not wish to fail over. Thoughts on this?
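
          A rough sketch of this proposal (illustrative Python; the function and argument names are hypothetical, not ns_server APIs): attempt a partial failover only when the services that are down on the nodes being failed over do not overlap with the services that are down on the nodes left behind.

          # Illustrative only: decide whether a partial failover is worth attempting.
          def partial_failover_is_safe(to_failover, left_down, down_services):
              """down_services maps a node to the set of services its monitors
              report as down, e.g. {'node13': {'kv'}, 'node14': {'index'}}."""
              failing_over = set().union(*(down_services[n] for n in to_failover))
              staying_down = set().union(*(down_services[n] for n in left_down))
              # Zero overlap means the nodes we fail over do not share a down
              # service with the nodes we leave behind, so the partial failover
              # should not depend on the unreachable nodes.
              return failing_over.isdisjoint(staying_down)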


          shivani.gupta Shivani Gupta added a comment -

          There was some offline discussion with Artem on this as he reached out to me for PM input:

          Hi Shivani,
           
          Here https://issues.couchbase.com/browse/MB-49795 2 KV nodes are down and it is safe to fail them over. But the autofailover limit allows to fail over just one node.
           
          Therefore, the current logic picks one node of 2 and tries to fail it over. The safety check passes, but after we try to fetch the max replica numbers from the remaining KV nodes (to ensure durability) and fail because one of the KV nodes is down.
           
          That results in repeated failover error until one of the nodes goes up.
           
          I thought about the whole situation and came to the conclusion that the simplest thing we can do is not even try to auto fail over the partial list of KV nodes.
           
          So if we have a group of KV nodes that we consider unhealthy but there's not enough limit left to fail it over as a whole, we just notify the administrator and do nothing.
           
          Do you agree?
           
          Thanks,
          Artem

          This was my initial response:

          Hi Artem,

          I don’t think I agree.

          We should failover as many nodes (as long as it is safe) until the max count.

          What I don’t understand is the following:

          >>The safety check passes, but after we try to fetch the max replica numbers from the remaining KV nodes (to ensure durability) and fail because one of the KV nodes is down.

           

          Why is the durability factor checked for failing over? That should never be the case. We know majority may not be achievable after failover. As long as there is one data copy for all vbuckets we should failover (which is probably what the safety check is). Is this durability check something new you have added?

           

          Also, we should fix the parameter, UI and error messages to say ‘max number of auto-failover nodes’ rather than ‘max number of auto-failover events’. Let me know if you would like me to file this bug.

           

          Thanks

          --Shivani

          But then Artem explained further (and we also discussed in a meeting):

          Hi Shivani,
           
          For durability aware failover we need to promote replicas with the highest seq no. To do that we need to find out which replicas have the highest seqno. So, for each chain that lost its master partition we need to query seqno's of replicas from the replica nodes. If one of such nodes is down, the failover will fail. In most of the cases durability aware failover needs all other KV nodes to be available and responding to succeed, which is not the case if we fail over just a portion of down KV nodes.
           
          Ticket for the autofailover limit label: https://issues.couchbase.com/browse/MB-49563. The change should be already in place.
           
          Thanks,
          Artem 
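
          To make the constraint concrete, here is an illustrative sketch (Python, not the ns_server implementation) of why durability-aware failover needs the surviving replica nodes to respond: for every vBucket chain that lost its active copy, the replica with the highest sequence number must be identified, which requires querying each replica's node.

          # Illustrative only: pick the replica to promote for each affected vbucket.
          def pick_promotions(chains, seqno_of, reachable):
              """chains: {vbucket: [replica_node, ...]}; seqno_of(node, vb) returns
              that replica's high seqno; reachable: nodes currently responding."""
              promotions = {}
              for vb, replicas in chains.items():
                  if not all(node in reachable for node in replicas):
                      # A replica node is down, so its seqno cannot be compared and
                      # the safest replica cannot be identified: fail the failover
                      # rather than risk losing acknowledged durable writes.
                      raise RuntimeError(f"Failed to get failover info for vbucket {vb}")
                  promotions[vb] = max(replicas, key=lambda node: seqno_of(node, vb))
              return promotions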

          So basically there is a risk of losing previously completed durable writes if we pick one of the nodes to fail over and leave the other one as is. Hence, the decision we came to is the following:

          If failing over all of the down nodes would exceed the auto-failover node limit, then do not fail over any of them. In other words, all-or-nothing behavior for multi-node concurrent failover.
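
          The agreed behaviour, as a short illustrative sketch (Python; names are hypothetical, not ns_server code):

          # Illustrative only: all-or-nothing auto-failover for a group of down KV nodes.
          def plan_kv_autofailover(down_kv_nodes, max_count, events_used):
              remaining = max_count - events_used
              if len(down_kv_nodes) > remaining:
                  # A partial failover could not complete safely (see the durability
                  # discussion above), so notify the administrator and do nothing.
                  return [], (f"Cannot auto-failover {len(down_kv_nodes)} nodes: "
                              f"only {remaining} more allowed")
              return list(down_kv_nodes), None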

          Artem Stemkovski was also going to double-check with Aliaksey Artamonau and Dave Finlay.

          Additionally I made the following request:

          As for the autofailover limit label, we should fix it in all places (not just the UI). For example, the error message says the following:

           

          Could not auto-failover more nodes (['ns_1@172.23.100.14']). Maximum number of auto-failover events (1) has been reached

          The error message should say 'maximum number of auto-failover nodes has been reached' and not use the word 'events'. I did not file a bug for fixing the error messages, as Artem said he will take care of them. Let me know if a bug should be filed.

          I did file a DOC bug for the same: DOC-9489

          artem Artem Stemkovski made changes -
          Resolution Fixed [ 1 ]
          Status Reopened [ 4 ] Resolved [ 5 ]

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.1.0-1931 contains ns_server commit 2c3701f with commit message:
          MB-49795 Better message when multi-node kv autofailover couldn't complete

          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          Validated the same scenario using 7.1.0-2073.
          ashwin.govindarajulu Ashwin Govindarajulu made changes -
          VERIFICATION STEPS: Seeing the following log:
          Could not auto-failover more nodes (['ns_1@172.23.105.212']). Maximum number of auto-failover nodes (1) has been reached.
          Assignee Artem Stemkovski [ artem ] Ashwin Govindarajulu [ ashwin.govindarajulu ]
          Status Resolved [ 5 ] Closed [ 6 ]

          People

            Assignee: Ashwin Govindarajulu
            Reporter: Ashwin Govindarajulu
            Votes: 0
            Watchers: 8
