I am running initial perf tests for query engine failover and have a few observations:
1: Query failover is triggered through the hard failover REST endpoint, but the concept of hard failover conflicts with the query engine waiting for existing queries to complete before failing over. From the docs: "Hard: The ability to drop a node from the cluster reactively, because the node has become unavailable", whereas query failover is used on responsive nodes; and "Hard failover should not be used on a responsive node, since this may disrupt ongoing operations", whereas query failover will now not disrupt ongoing operations. Given this, it makes more sense for query failover to be initiated as a graceful failover.
2: I ran two tests to see the perf impact of calling failover. Both tests are Q3 Range Scan with Plasma, and both use the same number of client machines and client threads. The only difference is that testA starts with 6 nodes (4 kv, 1 query, 1 index) and testB starts with 7 nodes (4 kv, 2 query, 1 index). All services are on dedicated nodes in both cases. testB fails over a single query node after 25% (5 min) of the access phase time (20 min) has elapsed. Failover is initiated by making a call to /controller/failOver. From that point on, both tests have the same set of nodes and should behave identically. However, I see a few things of note: a) memcached CPU utilization in testB is 10-20% higher throughout the entire test, even after failover; b) in testB, established connections to the indexer and the remaining query node are significantly higher after failover than the baseline in testA, and it appears that the failed-over query node is holding onto ~70 connections; c) cbq RSS jumps to ~10% above baseline after failover; d) CPU utilization across all cores in the cluster jumps to 90% after failover, whereas the steady state in the baseline testA is around 80%.
3: Perfrunner measures query throughput by calling admin/stats and reading the request count. This is done on a single query node. I am not sure whether this stat covers only that query node or all query nodes, but the value returned in testB is lower than in testA, despite testB having more query nodes for 25% of the access time.
testB (with failover): http://perf.jenkins.couchbase.com/job/iris-multi-client/12520/ - 3525.0
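
For reference, here is a rough Python sketch of the two mechanics described in items 2 and 3: triggering the failover 25% of the way into the access phase, and computing throughput from request counts. It is not perfrunner's actual code. The helper names, the otpNode value format, ports 8091/8093, and the "requests.count" key are assumptions for illustration. It also assumes the admin/stats counter is local to each query node, which is exactly the open question in item 3; if that holds, per-node deltas would need to be summed, with the failed-over node sampled just before failover.

```python
import base64
import json
import urllib.parse
import urllib.request


def _basic_auth(user, password):
    # Hypothetical helper: build an HTTP Basic Authorization header value.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return f"Basic {token}"


def failover_delay_sec(access_phase_sec, fraction=0.25):
    # Seconds into the access phase at which failover is triggered.
    return access_phase_sec * fraction


def build_failover_request(cluster_host, otp_node, user, password):
    # Build (but do not send) the POST to /controller/failOver.
    # Port 8091 and the otpNode form field are assumptions.
    url = f"http://{cluster_host}:8091/controller/failOver"
    body = urllib.parse.urlencode({"otpNode": otp_node}).encode()
    req = urllib.request.Request(url, data=body, method="POST")
    req.add_header("Authorization", _basic_auth(user, password))
    return req


def fetch_request_count(query_host, user, password):
    # Read the request counter from ONE query node.
    # Assumption: the counter is node-local and keyed "requests.count".
    req = urllib.request.Request(f"http://{query_host}:8093/admin/stats")
    req.add_header("Authorization", _basic_auth(user, password))
    with urllib.request.urlopen(req, timeout=10) as resp:
        return json.load(resp)["requests.count"]


def cluster_throughput(samples, elapsed_sec):
    # samples maps query host -> (count_at_start, last_readable_count).
    # Sums per-node deltas so a failed-over node's work is not lost.
    total = sum(end - start for start, end in samples.values())
    return total / elapsed_sec
```

With a 20 min access phase, failover_delay_sec(20 * 60) gives the 5 min mark used in testB. If the counter turns out to be cluster-wide instead, reading a single node would already be sufficient and the summation would double count.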