Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None

Story Points: 1

Description

(Filed as a JCBC because there doesn't seem to be a separate project for the MultiClusterClient; please move it if there's somewhere more appropriate.)

Connection pools for Query (and, indeed, all HTTP-based services) are scaled dynamically based on use and idle time, so at times there may be no open connections to a query node.
It seems that if the connection pool is scaled to 0 and a query is then run, the NodeHealthDetector mistakenly identifies the node as 'down'.
This can be seen in the following log extract (the whole log is available at mca.log):

5362 [cb-computations-4] DEBUG com.couchbase.client.core.service.Service  - [10.142.184.103][QueryService]: Endpoint com.couchbase.client.core.endpoint.query.QueryEndpoint@687ec060 idle for longer than 2s, disconnecting.

5362 [cb-computations-4] TRACE com.couchbase.client.core.state.Stateful  - State (EndpointStateZipper) CONNECTED -> IDLE

5362 [cb-computations-4] TRACE com.couchbase.client.core.state.Stateful  - State (QueryService) CONNECTED -> IDLE

5362 [cb-computations-4] TRACE com.couchbase.client.core.state.Stateful  - State (ServiceStateZipper) CONNECTED -> IDLE

5362 [cb-computations-4] TRACE com.couchbase.client.core.state.Stateful  - State (CouchbaseNode) CONNECTED -> IDLE

5362 [cb-computations-4] DEBUG com.couchbase.client.core.service.Service  - [10.142.184.103][QueryService]: New number of endpoints is 0

...

6509 [cb-core-3-2] DEBUG com.couchbase.client.core.service.Service  - [10.142.184.103][QueryService]: Need to open a new Endpoint (current size 0)

6510 [cb-core-3-2] DEBUG com.couchbase.client.core.endpoint.Endpoint  - Using a connectCallbackGracePeriod of 2000 on Endpoint 10.142.184.103:8093

6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (EndpointStateZipper) IDLE -> DISCONNECTED

6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (QueryService) IDLE -> DISCONNECTED

6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (ServiceStateZipper) IDLE -> DISCONNECTED

6510 [cb-core-3-2] INFO  com.couchbase.client.core.node.Node  - Disconnected from Node 10.142.184.103/10.142.184.103

6510 [cb-core-3-2] DEBUG com.couchbase.client.core.node.Node  - [10.142.184.103/10.142.184.103]: Disconnected (IDLE) from Node

6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (CouchbaseNode) IDLE -> DISCONNECTED

6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (QueryEndpoint) DISCONNECTED -> CONNECTING

6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (EndpointStateZipper) DISCONNECTED -> CONNECTING

6511 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (QueryService) DISCONNECTED -> CONNECTING

6511 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (ServiceStateZipper) DISCONNECTED -> CONNECTING

6511 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (CouchbaseNode) DISCONNECTED -> CONNECTING

6512 [cb-computations-1] DEBUG com.couchbase.client.mc.detection.NodeHealthFailureDetector  - Legit NodeDisconnectedEvent, node /10.142.184.103 is still part of config.

6512 [cb-computations-1] INFO  com.couchbase.client.mc.detection.NodeHealthFailureDetector  - Detected NodeDisconnected from Node /10.142.184.103

6512 [cb-computations-1] TRACE com.couchbase.client.core.state.Stateful  - State (NodeHealthFailureDetector) GREEN -> RED

6512 [cb-computations-1] INFO  com.couchbase.client.mc.detection.NodeHealthFailureDetector  - minFailedNodes threshold of 1/1 reached, switching into RED state and signaling.

6512 [cb-computations-1] DEBUG com.couchbase.client.mc.detection.NodeHealthFailureDetector  - Signaling node failure for /10.142.184.103 to coordinator

6513 [cb-computations-1] DEBUG com.couchbase.client.mc.coordination.IsolatedCoordinator  - Set node unavailable 10.142.184.103 for topology entry DefaultTopologyEntry{serviceTypes=[QUERY, BINARY], identifier='matt1', nodes=[10.142.184.101], priority=2, active=[QUERY, BINARY], unavailableNodes=[]}
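
To make the sequence above easier to follow, the TRACE lines can be reduced to their state transitions. The following is a minimal sketch of such a filter; `StateTransitionParser` and `Transition` are hypothetical names introduced here for illustration and are not part of any Couchbase library. It assumes the line layout shown in the extract (elapsed milliseconds, thread name in brackets, level, logger, then `State (<component>) <from> -> <to>`).

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading the log extract; not part of any Couchbase library. */
final class StateTransitionParser {

    /** One "State (<component>) <from> -> <to>" entry taken from a TRACE line. */
    static final class Transition {
        final long millis;
        final String thread;
        final String component;
        final String from;
        final String to;

        Transition(long millis, String thread, String component, String from, String to) {
            this.millis = millis;
            this.thread = thread;
            this.component = component;
            this.from = from;
            this.to = to;
        }

        @Override
        public String toString() {
            return millis + " " + component + ": " + from + " -> " + to;
        }
    }

    // Assumed layout, copied from the extract:
    // 6510 [cb-core-3-2] TRACE <logger>  - State (CouchbaseNode) IDLE -> DISCONNECTED
    private static final Pattern LINE = Pattern.compile(
            "^(\\d+) \\[([^\\]]+)\\] TRACE \\S+\\s+- State \\((\\w+)\\) (\\w+) -> (\\w+)$");

    /** Returns the transition described by the line, or empty for any other kind of line. */
    static Optional<Transition> parse(String line) {
        Matcher m = LINE.matcher(line.trim());
        if (!m.matches()) {
            return Optional.empty();
        }
        return Optional.of(new Transition(
                Long.parseLong(m.group(1)), m.group(2), m.group(3), m.group(4), m.group(5)));
    }
}
```

Filtering the extract for the `CouchbaseNode` component this way leaves three transitions: CONNECTED -> IDLE at 5362, IDLE -> DISCONNECTED at 6510 and DISCONNECTED -> CONNECTING at 6511. The detector reports the node as failed at 6512.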

A workaround is to set the service configs so that they have a minimum of 1 connection, but the multi-cluster client should either override this to ensure that at least 1 connection to each node is always open, or use different logic to determine whether a node is 'healthy'.
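
As one illustration of what "different logic" might look like, here is a minimal, self-contained sketch that ignores a disconnect when the node was IDLE immediately beforehand. Everything in it (`IdleAwareHealthSketch`, `Detector`, `onTransition`) is hypothetical and only models the behaviour seen in the log. The real detector reacts to a NodeDisconnectedEvent, and whether the previous state is available to it at that point is an assumption.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the health check; not the multi-cluster client's API. */
final class IdleAwareHealthSketch {

    /** The node states that appear in the log extract. */
    enum NodeState { CONNECTED, IDLE, DISCONNECTED, CONNECTING }

    static final class Detector {
        private final boolean idleAware;
        private final Set<String> failed = new HashSet<>();

        /** @param idleAware false models the current behaviour, true the proposed one. */
        Detector(boolean idleAware) {
            this.idleAware = idleAware;
        }

        /** Feed one node-level state transition into the detector. */
        void onTransition(String node, NodeState from, NodeState to) {
            if (to == NodeState.CONNECTED) {
                failed.remove(node); // node is reachable again
                return;
            }
            if (to != NodeState.DISCONNECTED) {
                return; // only disconnects can mark a node as failed
            }
            // Assumption: IDLE -> DISCONNECTED means the pool had scaled to zero
            // and is being reopened, so it is not treated as a failure.
            if (idleAware && from == NodeState.IDLE) {
                return;
            }
            failed.add(node);
        }

        boolean isFailed(String node) {
            return failed.contains(node);
        }
    }
}
```

Replaying the node transitions from the log (CONNECTED -> IDLE, IDLE -> DISCONNECTED, DISCONNECTED -> CONNECTING) marks the node as failed when `idleAware` is false and leaves it healthy when it is true, while a direct CONNECTED -> DISCONNECTED is still flagged in both modes. The trade-off of this approach is that a node which goes down while idle would only be noticed once a reconnect attempt fails, so a real fix would probably also need to watch the outcome of the CONNECTING phase.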

Steps To Reproduce

Set up two clusters, each with 1 data node and 1 query node
Connect to both clusters using the MCA client
Run a query
Wait for the query connection idle time to pass (by default 300 seconds)
Run another query
Observe that the query node is (incorrectly) marked as unhealthy

I have attached main.java, which performs all of the steps above; you just need to plug in the correct clusters, bucket names and user credentials.

Attachments

main.java (4 kB, 09/Jul/19 10:07 AM)
mca.log (384 kB, 09/Jul/19 10:11 AM)

People

Assignee: Michael Nitschinger

Reporter: Matt Carabine (Inactive)

Votes: 0

Watchers: 2

Dates

Created: 09/Jul/19 10:14 AM

Updated: 24/Apr/20 1:55 PM

Gerrit Reviews

There are no open Gerrit changes

Multi-Cluster Client - NodeHealthDetector doesn't work correctly for non-Data nodes
