Couchbase Server / MB-49119

Query Engine Failover Connections and CPU


Details


    Description

      Running initial perf tests for the query engine failover and have a couple of observations:

      1: Query nodes are failed over via the hard failover REST endpoint, but the concept of hard failover is at odds with the query engine waiting for existing queries to complete before failing over. From the docs: "Hard: The ability to drop a node from the cluster reactively, because the node has become unavailable", whereas query failover is used on responsive nodes; and "Hard failover should not be used on a responsive node, since this may disrupt ongoing operations", whereas query failover now does not disrupt ongoing operations. Given this, it makes more sense for query failover to be initiated as a graceful failover.

       

      2: Running two tests to see the perf impact of calling failover. Both tests are Q3 Range Scan with Plasma, use the same number of client machines and client threads, and differ only in topology: testA starts with 6 nodes (4 kv, 1 query, 1 index) and testB starts with 7 nodes (4 kv, 2 query, 1 index). All services are on dedicated nodes in both cases. testB fails over a single query node after 25% (5 min) of the access phase time (20 min) has elapsed; failover is initiated by a call to /controller/failOver (a minimal sketch of the call is below). From that point on, both tests have the same set of nodes and should be identical. However, I see a few things of note:

      a) memcached CPU utilization in testB is 10-20% higher throughout the entire test, even after failover.

      b) in testB, established connections to the indexer and to the remaining query node are significantly higher after failover than the baseline in testA; the failed-over query node appears to be holding onto ~70 connections.

      c) cbq RSS jumps to ~10% higher than baseline after failover.

      d) CPU utilization across all cores in the cluster jumps to 90% after failover, whereas in the baseline testA the steady state is around 80% utilization.
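
      For reference, a minimal sketch of how the failover call is made (the host, credentials, and otpNode value here are placeholders, not the actual test environment values); the commented-out alternative is the graceful-failover endpoint that observation 1 suggests would be a better fit:

          import requests

          BASE = "http://cluster-manager:8091"      # placeholder host
          AUTH = ("Administrator", "password")      # placeholder credentials
          OTP_NODE = "ns_1@query-node-1"            # query node to fail over

          # Hard failover, as used in testB today.
          resp = requests.post(f"{BASE}/controller/failOver",
                               auth=AUTH, data={"otpNode": OTP_NODE})
          resp.raise_for_status()

          # Graceful alternative suggested in observation 1:
          # requests.post(f"{BASE}/controller/startGracefulFailover",
          #               auth=AUTH, data={"otpNode": OTP_NODE})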

       

      3: Perfrunner grabs the query throughput by calling admin/stats on a single query node and reading the request count (see the sketch below). I am not sure whether this stat reflects only that query node or all query nodes, but the value returned in testB is lower than in testA despite testB having more query nodes for 25% of the access time.
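
      A minimal sketch of that stats call (the host and credentials are placeholders, and the exact stat key is an assumption; perfrunner's own code is authoritative):

          import requests

          QUERY_NODE = "http://query-node-1:8093"   # placeholder host
          AUTH = ("Administrator", "password")      # placeholder credentials

          stats = requests.get(f"{QUERY_NODE}/admin/stats", auth=AUTH).json()

          # Perfrunner derives throughput from the request count; the key name
          # below is assumed and may differ between server versions.
          print(stats.get("request.count"))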

       

      testA: http://perf.jenkins.couchbase.com/job/iris-multi-client/12081/ - 3916.0

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_710-1345_access_a7b6

      testB (with failover): http://perf.jenkins.couchbase.com/job/iris-multi-client/12520/ - 3525.0

      http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_710-1345_failover_8a92

      comparison: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=iris_710-1345_access_a7b6&snapshot=iris_710-1345_failover_8a92


        Activity

          Korrigan Clark added a comment:

          Ok so I think I get it now... the feature applies to the following 2 cases:

          1 - Marking a query node for removal and then rebalancing out the node

          2 - Applying graceful failover when query is co-resident with a data node and the bucket has at least 1 replica (a subsequent rebalance out is not necessary to initiate graceful shutdown of query, right?).

           

          If you insist on using this endpoint without ensuring the node is unresponsive, then you need to treat it like you would a graceful failover and not assert that it has to operate like a hard failover.

          Still a bit confused by the /failOver endpoint. Are you saying that if I hit that endpoint to failover query node when query node is still responsive, then query node will behave like a graceful failover was initiated and proceed to gracefully shutdown?


          Donald Haggart added a comment:

          > (subsequent rebalance out is not necessary to initiate graceful shutdown of query, right?)

          No, in the case of a failover it is initiated directly.  Only in the case of a removal does it take place on rebalance (see below).

          > Still a bit confused by the /failOver endpoint.

          Yes. You hit that endpoint and tell ns_server to fail over the node "hard". ns_server then notifies the controlling node for the service of the failover (note, only "failover", not "hard failover"). The controlling node is then responsible for reacting to the notification and informing ns_server when it is complete.

          (Other services are notified in an identical manner and perform their necessary steps similarly.)

          The query service only reacts when it gets the failover notification. If it doesn't get one it doesn't react; if it does, it always reacts the same way: the controlling node that receives the notification attempts to instruct the failed-over node to shut down.

          If the node isn't there or isn't responsive it doesn't care and simply reports back to ns_server immediately (that it is done). Otherwise the failed-over node stops accepting new connections and permits existing connections (and transactions) to continue to completion (which may be until timeout). When the last one is done, its status changes so the controlling query node can report back to the waiting ns_server that all tasks are complete, and ns_server can then proceed with the rest of the failover.
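
          A rough sketch of the flow described above, as illustrative pseudocode only (the actual query service is written in Go and every name here is invented for illustration):

              # Illustrative pseudocode of the flow described above; every name is
              # invented for illustration and this is not the actual (Go) implementation.

              def react_to_failover(node, ns_server):
                  """Controlling query node's reaction to a failover notification."""
                  if not node.is_reachable():
                      # Node is gone or unresponsive: nothing to wait for.
                      ns_server.report_done()
                      return

                  # Responsive node: refuse new connections, let existing requests and
                  # transactions run to completion (possibly until they time out).
                  node.stop_new_connections()
                  node.wait_for_active_requests()

                  # Only then report back so ns_server can finish the failover.
                  ns_server.report_done()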

           

          We call this action "graceful shutdown", but it isn't in any way bound to the cluster/ns_server "graceful failover" (other than that it happens then too). Having both use the adjective "graceful" is probably confusing things somewhat.

           

          If you rebalance-remove a server, the same basic actions take place - ns_server informs the controlling query node that the removal is taking place during its rebalance operation and waits until the query service reports that there are no outstanding tasks before moving on.   The query nodes perform the exact same operations as in reaction to a failover - exactly the same code runs.
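
          For completeness, a sketch of triggering that rebalance-remove path (placeholder hosts and credentials); per the comment above, the query service runs the same graceful-shutdown code in this case:

              import requests

              BASE = "http://cluster-manager:8091"      # placeholder host
              AUTH = ("Administrator", "password")      # placeholder credentials

              # knownNodes must list every node currently in the cluster;
              # ejectedNodes lists the node(s) being removed.
              requests.post(
                  f"{BASE}/controller/rebalance",
                  auth=AUTH,
                  data={
                      "knownNodes": "ns_1@kv-1,ns_1@kv-2,ns_1@query-node-1,ns_1@index-1",
                      "ejectedNodes": "ns_1@query-node-1",
                  },
              ).raise_for_status()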


          Donald Haggart added a comment:

          Hi Korrigan Clark,

          Are we clear enough with the expected operations for this feature now?  If so, shall I resolve this issue?


          Korrigan Clark added a comment:

          I think so. I have added tests for the outlined scenarios here: http://showfast.sc.couchbase.com/#/timeline/Linux/n1ql/aggregation/Latency

          We can close this ticket and I will analyze the test runs and open a separate ticket if I see anything. Thanks

          Donald Haggart added a comment: Thanks Korrigan Clark
