Couchbase Server / MB-4906

Autofailover may fail over two nodes automatically within 1 minute if the master node is failed over and the old master node is then elected master again

    Details

      Description

      This issue was reported by a user whose cluster triggered autofailover twice instead of once.

      The root cause is still under investigation by Aliaksey.


        Activity

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Here is what happened:

        • Node .238, the master running the autofailover service, develops networking issues and is split from the rest of the cluster.
        • The cluster elects .240 as the new master.
        • The autofailover service on the new master fails over .238.
        • The network problems on .238 are somehow resolved, and its connection to the rest of the cluster is restored.
        • The cluster briefly has two masters, .238 and .240, until .240 surrenders mastership to .238.
        • Now .238 is the only master and things are fine, except that its autofailover service is not aware of the automatic failover that happened while .238 was disconnected.
        • When some other node later has problems, .238 fails it over automatically.

        So the fix is to make sure the autofailover service always uses the latest autofailover count stored in config.
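
The sequence above can be replayed with a small Python model. This is only an illustrative sketch, not the code in src/auto_failover.erl: the class names, the `max_count=1` limit, the shared `ClusterConfig` object (standing in for the config as it looks once .238 has reconnected), and the node name ".241" are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class ClusterConfig:
    """Hypothetical stand-in for the cluster config seen by every node."""
    auto_failover_count: int = 0
    failed_over: list = field(default_factory=list)


class CachedCountService:
    """Models the buggy behaviour: the count lives in the service's own state."""

    def __init__(self, config, max_count=1):
        self.config = config
        self.max_count = max_count
        self.count = config.auto_failover_count  # snapshot that can go stale

    def maybe_failover(self, node):
        if self.count >= self.max_count:
            return False
        self.count += 1
        self.config.auto_failover_count += 1
        self.config.failed_over.append(node)
        return True


class ConfigCountService:
    """Models the fix: the count is read from config on every decision."""

    def __init__(self, config, max_count=1):
        self.config = config
        self.max_count = max_count

    def maybe_failover(self, node):
        if self.config.auto_failover_count >= self.max_count:
            return False
        self.config.auto_failover_count += 1
        self.config.failed_over.append(node)
        return True


def replay(service_cls):
    """Replay the incident and report whether a second failover happened."""
    config = ClusterConfig()
    service_238 = service_cls(config)  # .238 is master, count is 0
    service_240 = service_cls(config)  # .240 is elected during the split
    service_240.maybe_failover(".238")  # first automatic failover
    # .238 regains mastership and reacts to a problem on another node.
    second = service_238.maybe_failover(".241")
    return second, config.auto_failover_count


if __name__ == "__main__":
    print("cached count:", replay(CachedCountService))
    print("count from config:", replay(ConfigCountService))
```

In this model the cached variant performs a second failover because .238's service still holds a count of 0, while the variant that reads the count from config refuses it.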

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Done: http://review.couchbase.org/14411

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Somehow we did this for 1.8.2 instead of 1.8.1. Will backport.

        thuan Thuan Nguyen added a comment -

        Integrated in github-ns-server-2-0 #328 (See http://qa.hq.northscale.net/job/github-ns-server-2-0/328/)
        MB-4906 Always fetch autofailover count from config. (Revision a7e289d8b9f4e25ef1b4bf06956ad2074f50f0ea)
        bp: MB-4906 Always fetch autofailover count from config. (Revision 79739e41a10fda00c537501e1b547baf5ac2c5d6)

        Result = SUCCESS
        Aliaksey Kandratsenka :
        Files :

        • src/auto_failover.erl

        Aliaksey Kandratsenka :
        Files :

        • src/auto_failover.erl

          People

          • Assignee: Aliaksey Artamonau
          • Reporter: Farshid Ghods (Inactive)
          • Votes: 0
          • Watchers: 2
