Details
- Type: Bug
- Priority: Major
- Resolution: Unresolved
Description
(Filed as a JCBC issue because there doesn't seem to be a separate project for the MultiClusterClient; please move this if there's somewhere more appropriate.)
Query connection pools (in fact, the pools of all HTTP-based services) are dynamically scaled based on use and idle time, so at times there may be no open connections to a query node.
It seems that if the connection pool has been scaled down to 0 and a query is then run, the NodeHealthFailureDetector mistakenly identifies the node as 'down'.
This can be seen in the following log extract (the full log is attached as mca.log):
5362 [cb-computations-4] DEBUG com.couchbase.client.core.service.Service - [10.142.184.103][QueryService]: Endpoint com.couchbase.client.core.endpoint.query.QueryEndpoint@687ec060 idle for longer than 2s, disconnecting.
5362 [cb-computations-4] TRACE com.couchbase.client.core.state.Stateful - State (EndpointStateZipper) CONNECTED -> IDLE
5362 [cb-computations-4] TRACE com.couchbase.client.core.state.Stateful - State (QueryService) CONNECTED -> IDLE
5362 [cb-computations-4] TRACE com.couchbase.client.core.state.Stateful - State (ServiceStateZipper) CONNECTED -> IDLE
5362 [cb-computations-4] TRACE com.couchbase.client.core.state.Stateful - State (CouchbaseNode) CONNECTED -> IDLE
5362 [cb-computations-4] DEBUG com.couchbase.client.core.service.Service - [10.142.184.103][QueryService]: New number of endpoints is 0
...
6509 [cb-core-3-2] DEBUG com.couchbase.client.core.service.Service - [10.142.184.103][QueryService]: Need to open a new Endpoint (current size 0)
6510 [cb-core-3-2] DEBUG com.couchbase.client.core.endpoint.Endpoint - Using a connectCallbackGracePeriod of 2000 on Endpoint 10.142.184.103:8093
6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful - State (EndpointStateZipper) IDLE -> DISCONNECTED
6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful - State (QueryService) IDLE -> DISCONNECTED
6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful - State (ServiceStateZipper) IDLE -> DISCONNECTED
6510 [cb-core-3-2] INFO com.couchbase.client.core.node.Node - Disconnected from Node 10.142.184.103/10.142.184.103
6510 [cb-core-3-2] DEBUG com.couchbase.client.core.node.Node - [10.142.184.103/10.142.184.103]: Disconnected (IDLE) from Node
6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful - State (CouchbaseNode) IDLE -> DISCONNECTED
6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful - State (QueryEndpoint) DISCONNECTED -> CONNECTING
6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful - State (EndpointStateZipper) DISCONNECTED -> CONNECTING
6511 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful - State (QueryService) DISCONNECTED -> CONNECTING
6511 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful - State (ServiceStateZipper) DISCONNECTED -> CONNECTING
6511 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful - State (CouchbaseNode) DISCONNECTED -> CONNECTING
6512 [cb-computations-1] DEBUG com.couchbase.client.mc.detection.NodeHealthFailureDetector - Legit NodeDisconnectedEvent, node /10.142.184.103 is still part of config.
6512 [cb-computations-1] INFO com.couchbase.client.mc.detection.NodeHealthFailureDetector - Detected NodeDisconnected from Node /10.142.184.103
6512 [cb-computations-1] TRACE com.couchbase.client.core.state.Stateful - State (NodeHealthFailureDetector) GREEN -> RED
6512 [cb-computations-1] INFO com.couchbase.client.mc.detection.NodeHealthFailureDetector - minFailedNodes threshold of 1/1 reached, switching into RED state and signaling.
6512 [cb-computations-1] DEBUG com.couchbase.client.mc.detection.NodeHealthFailureDetector - Signaling node failure for /10.142.184.103 to coordinator
6513 [cb-computations-1] DEBUG com.couchbase.client.mc.coordination.IsolatedCoordinator - Set node unavailable 10.142.184.103 for topology entry DefaultTopologyEntry{serviceTypes=[QUERY, BINARY], identifier='matt1', nodes=[10.142.184.101], priority=2, active=[QUERY, BINARY], unavailableNodes=[]}
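From the trace, the detector appears to treat any NodeDisconnectedEvent for a node still in the config as a failure, without distinguishing an idle-scaling disconnect from a real outage. A hypothetical sketch of that suspected check follows; these names are illustrative only and are not the actual MCA client sources:

public class SuspectedDetectorLogic {

    enum State { CONNECTED, IDLE, DISCONNECTED, CONNECTING }

    // Invoked when a NodeDisconnectedEvent is observed for a node.
    static boolean isNodeFailure(State previousState, boolean nodeStillInConfig) {
        // Suspected bug: previousState is never consulted. A node whose pool
        // was scaled to zero transitions CONNECTED -> IDLE -> DISCONNECTED
        // (the trace above even logs "Disconnected (IDLE) from Node"), yet it
        // is treated exactly like a node that genuinely went down.
        return nodeStillInConfig;
    }

    public static void main(String[] args) {
        // An idle scale-down is misclassified as a failure:
        System.out.println(isNodeFailure(State.IDLE, true)); // true -> GREEN becomes RED
    }
}

A detector that ignored disconnects arriving from the IDLE state (or actively probed the node before signaling) would avoid this false positive.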
A workaround is to set the service configs so that they have a minimum of one connection, but the multi-cluster client should either override this to enforce that at least one connection to each node is always open, or use different logic to determine whether a node is 'healthy'.
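For illustration, a minimal sketch of that workaround using the plain Java SDK 2.x environment builder (this assumes the MCA client is bootstrapped with a standard CouchbaseEnvironment; the max of 12 endpoints is just an example value):

import com.couchbase.client.java.env.CouchbaseEnvironment;
import com.couchbase.client.java.env.DefaultCouchbaseEnvironment;
import com.couchbase.client.java.env.QueryServiceConfig;

public class MinOneQueryEndpoint {
    public static void main(String[] args) {
        // Keep at least one query endpoint open so the dynamic pool can
        // never be scaled down to zero connections.
        CouchbaseEnvironment env = DefaultCouchbaseEnvironment.builder()
                .queryServiceConfig(QueryServiceConfig.create(1, 12)) // min 1, max 12 (example)
                .build();
        // Pass 'env' into the client bootstrap as usual.
    }
}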
Steps To Reproduce
- Set up two clusters, each with 1 data node and 1 query node
- Connect to both clusters using the MCA client
- Run a query
- Wait for the query connection idle time to pass (300 seconds by default)
- Run another query
- Observe that the query node is (incorrectly) marked as unhealthy
I have attached main.java, which performs all of the steps above; you just need to plug in the correct cluster addresses, bucket names, and user credentials.
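For reference, the query side of those steps looks roughly like the following. This is the plain SDK 2.x API with placeholder hosts and bucket names; the MCA bootstrap itself is as in the attached main.java, and the failure detection only kicks in when running under the MCA client:

import com.couchbase.client.java.Bucket;
import com.couchbase.client.java.CouchbaseCluster;
import com.couchbase.client.java.query.N1qlQuery;

public class Repro {
    public static void main(String[] args) throws InterruptedException {
        // Placeholder connection details; substitute your own clusters.
        CouchbaseCluster cluster = CouchbaseCluster.create("10.142.184.103");
        Bucket bucket = cluster.openBucket("default");

        // 1) Run a query: the pool opens a query endpoint.
        bucket.query(N1qlQuery.simple("SELECT 1"));

        // 2) Wait past the idle timeout so the pool scales back to zero
        //    (300s by default; the attached log used a 2s idle time).
        Thread.sleep(310_000);

        // 3) Run another query: the endpoint bounces through DISCONNECTED and,
        //    under the MCA client, the node is incorrectly flagged as down.
        bucket.query(N1qlQuery.simple("SELECT 1"));

        cluster.disconnect();
    }
}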