  1. Couchbase Server
  2. MB-24986

Autofailover of a node takes more than 8 secs when the failure is a memcached failure.

Details

    Description

      1. Create a cluster with 3 nodes and at least 1 bucket.
      2. Enable autofailover and set the timeout to 5 secs.
      3. On any one of the nodes, stop the memcached process (the tests do this by sending SIGSTOP to the memcached process). Note the time at which the failure was injected.
      4. Wait for the autofailover of the node to be completed. Note the time when autofailover was initiated.
      We expect the failover to be initiated within 8 secs (5 secs is ideal, but we allow a 3 sec buffer for initiation). However, the failover is initiated after around 9-10 secs.
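The expectation above is simple arithmetic: the configured timeout plus the allowed buffer. A minimal sketch of that check follows. The function and constant names are hypothetical and are not taken from the testrunner code; only the 5 sec timeout and 3 sec buffer come from this report.

```python
# Values from the report: 5 sec autofailover timeout, 3 sec allowed buffer.
AUTOFAILOVER_TIMEOUT_SECS = 5
INITIATION_BUFFER_SECS = 3


def failover_within_window(injected_at, initiated_at,
                           timeout=AUTOFAILOVER_TIMEOUT_SECS,
                           buffer=INITIATION_BUFFER_SECS):
    """Return True if failover started within timeout + buffer of the injection.

    Both timestamps are seconds on the same clock (hypothetical helper).
    """
    elapsed = initiated_at - injected_at
    return elapsed <= timeout + buffer


if __name__ == "__main__":
    # An 8 sec delay is accepted; the 9-10 sec delays seen here are not.
    for delay in (8.0, 9.0, 10.0):
        print(delay, failover_within_window(0.0, delay))
```

With these values the pass threshold is 8 secs, so the observed 9-10 sec initiations fail the check.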
      This is a regression compared to last week's build. The memcached failure tests were passing up to last week's build (5.0.0-3088) but now fail because autofailover is initiated after the expected time.
      The tests can be found here: http://qa.sc.couchbase.com/view/nserver/job/cen006-nserv-autofailover-memcached/35/consoleFull
      test_1, test_3, test_10, test_12 and test_13 all failed due to this issue.
      The issue can be reproduced by running the following test:
      ./testrunner -i <ini file> -t failover.AutoFailoverTests.AutoFailoverTests.test_autofailover,timeout=5,num_node_failures=1,failover_action=stop_memcached,nodes_init=3

      Attaching the cluster logs for test_1 from the run mentioned above.
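For readers unfamiliar with the failure injection in step 3, the sketch below shows the signal mechanics only. It is an assumption-laden local illustration: the helper names are hypothetical, the real tests act on the memcached process of a cluster node, and here a throwaway child process stands in for memcached. SIGSTOP suspends the process without killing it, so it stops responding while still appearing alive.

```python
import os
import signal
import subprocess
import sys
import time


def suspend_process(pid):
    """Send SIGSTOP to `pid`; return the wall-clock time taken just before the signal."""
    injected_at = time.time()
    os.kill(pid, signal.SIGSTOP)
    return injected_at


def resume_process(pid):
    """Send SIGCONT so the suspended process continues running."""
    os.kill(pid, signal.SIGCONT)


if __name__ == "__main__":
    # Stand-in child process; a real run would target memcached's pid on the node.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    injected_at = suspend_process(child.pid)
    print("failure injected at", injected_at)
    resume_process(child.pid)
    child.terminate()
    child.wait()
```

Recording the timestamp immediately before the signal gives the "time when the failure was injected" that the failover initiation time is later compared against.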

        Activity

          bharath.gp Bharath G P added a comment - - edited

          Created CBQE-4225 to change the timeout we track for memcached failure scenarios.

          poonam Poonam Dhavale added a comment

          Hi Bharath,

           I tried the “memcached stop” test with the latest Spock on my setup. A couple of times it took 8s, and a couple of times 9-10s.

           I checked my logs. The extra delay of ~1s or so depends on various timings, e.g. when exactly the failure occurred with respect to the KV monitor run, when the various monitors run with respect to one another, and so on.

           Before going on break, I had enhanced the code to reduce the failure detection time for the “memcached stop” case as much as possible, e.g. by reducing the timeout for how long ns-server (KV monitor) waits for memcached to respond. More aggressive timeout values or health monitor refresh intervals can reduce the failure detection time, but they can also lead to unnecessary failovers due to transient issues.

           Also, in real life, it is likely that even if memcached is slow to respond to ns-server requests, it is still serving data/DCP traffic. As long as DCP traffic is in progress, there will be no failover.

           We cannot make any more enhancements for the “memcached stop” case. Please update the QA test to expect the additional delay. If we start seeing issues in the field because of the extra couple of seconds of delay, then we can revisit this.

           Poonam
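The timing dependence described in the comment above can be illustrated with a toy model. This is a sketch under stated assumptions, not ns-server's actual algorithm: it assumes a single monitor that polls at a fixed interval, waits a fixed time for memcached to respond, and then starts the autofailover countdown. The poll interval (1 sec) and response timeout (2 secs) are hypothetical values chosen for illustration; only the 5 sec autofailover timeout comes from this report.

```python
import math


def initiation_delay(failure_at, poll_interval, response_timeout, autofailover_timeout):
    """Seconds from failure to failover initiation in a simplified polling model.

    Assumed model (hypothetical): polls happen at multiples of `poll_interval`;
    the first poll at or after the failure waits `response_timeout` before
    marking the node unhealthy; failover starts `autofailover_timeout` later.
    """
    next_poll = math.ceil(failure_at / poll_interval) * poll_interval
    detected_at = next_poll + response_timeout
    return detected_at + autofailover_timeout - failure_at


if __name__ == "__main__":
    # Vary when the failure lands relative to the polling schedule.
    for failure_at in (0.0, 0.1, 0.5, 0.9):
        delay = initiation_delay(failure_at, poll_interval=1.0,
                                 response_timeout=2.0, autofailover_timeout=5.0)
        print(f"failure at t={failure_at:.1f}s -> initiated after {delay:.1f}s")
```

In this model the delay varies by up to one poll interval depending on when the failure lands, which is the kind of ~1s spread the comment describes. Shrinking the hypothetical `response_timeout` or `poll_interval` lowers the delay, at the cost of reacting to shorter transient stalls.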

          People

            bharath.gp Bharath G P
            bharath.gp Bharath G P