Couchbase Java Client / JCBC-1372

Multi-Cluster Client - NodeHealthDetector doesn't work correctly for non-Data nodes

Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major

    Description

      (Filed as a JCBC because there doesn't seem to be a separate project for the MultiClusterClient; please move it if there's somewhere more appropriate.)

      Connection pools for Query (and all other HTTP-based services) are scaled dynamically based on use and idle time, so at times there may be no open connections to a query node.
      It seems that if the connection pool has scaled down to 0 and a query is then run, the NodeHealthFailureDetector mistakenly identifies the node as 'down'.
      This can be seen in the following log extract (the whole log is in the attached mca.log):

      5362 [cb-computations-4] DEBUG com.couchbase.client.core.service.Service  - [10.142.184.103][QueryService]: Endpoint com.couchbase.client.core.endpoint.query.QueryEndpoint@687ec060 idle for longer than 2s, disconnecting.
      5362 [cb-computations-4] TRACE com.couchbase.client.core.state.Stateful  - State (EndpointStateZipper) CONNECTED -> IDLE
      5362 [cb-computations-4] TRACE com.couchbase.client.core.state.Stateful  - State (QueryService) CONNECTED -> IDLE
      5362 [cb-computations-4] TRACE com.couchbase.client.core.state.Stateful  - State (ServiceStateZipper) CONNECTED -> IDLE
      5362 [cb-computations-4] TRACE com.couchbase.client.core.state.Stateful  - State (CouchbaseNode) CONNECTED -> IDLE
      5362 [cb-computations-4] DEBUG com.couchbase.client.core.service.Service  - [10.142.184.103][QueryService]: New number of endpoints is 0
       
      ...
       
      6509 [cb-core-3-2] DEBUG com.couchbase.client.core.service.Service  - [10.142.184.103][QueryService]: Need to open a new Endpoint (current size 0)
      6510 [cb-core-3-2] DEBUG com.couchbase.client.core.endpoint.Endpoint  - Using a connectCallbackGracePeriod of 2000 on Endpoint 10.142.184.103:8093
      6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (EndpointStateZipper) IDLE -> DISCONNECTED
      6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (QueryService) IDLE -> DISCONNECTED
      6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (ServiceStateZipper) IDLE -> DISCONNECTED
      6510 [cb-core-3-2] INFO  com.couchbase.client.core.node.Node  - Disconnected from Node 10.142.184.103/10.142.184.103
      6510 [cb-core-3-2] DEBUG com.couchbase.client.core.node.Node  - [10.142.184.103/10.142.184.103]: Disconnected (IDLE) from Node
      6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (CouchbaseNode) IDLE -> DISCONNECTED
      6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (QueryEndpoint) DISCONNECTED -> CONNECTING
      6510 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (EndpointStateZipper) DISCONNECTED -> CONNECTING
      6511 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (QueryService) DISCONNECTED -> CONNECTING
      6511 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (ServiceStateZipper) DISCONNECTED -> CONNECTING
      6511 [cb-core-3-2] TRACE com.couchbase.client.core.state.Stateful  - State (CouchbaseNode) DISCONNECTED -> CONNECTING
      6512 [cb-computations-1] DEBUG com.couchbase.client.mc.detection.NodeHealthFailureDetector  - Legit NodeDisconnectedEvent, node /10.142.184.103 is still part of config.
      6512 [cb-computations-1] INFO  com.couchbase.client.mc.detection.NodeHealthFailureDetector  - Detected NodeDisconnected from Node /10.142.184.103
      6512 [cb-computations-1] TRACE com.couchbase.client.core.state.Stateful  - State (NodeHealthFailureDetector) GREEN -> RED
      6512 [cb-computations-1] INFO  com.couchbase.client.mc.detection.NodeHealthFailureDetector  - minFailedNodes threshold of 1/1 reached, switching into RED state and signaling.
      6512 [cb-computations-1] DEBUG com.couchbase.client.mc.detection.NodeHealthFailureDetector  - Signaling node failure for /10.142.184.103 to coordinator
      6513 [cb-computations-1] DEBUG com.couchbase.client.mc.coordination.IsolatedCoordinator  - Set node unavailable 10.142.184.103 for topology entry DefaultTopologyEntry{serviceTypes=[QUERY, BINARY], identifier='matt1', nodes=[10.142.184.101], priority=2, active=[QUERY, BINARY], unavailableNodes=[]}
      

      A workaround is to configure each service with a minimum of 1 connection. However, the multi-cluster client should either override this setting so that at least 1 connection to each node is always open, or use different logic to determine whether a node is 'healthy'.
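
As one illustration of the "different logic" option, the sketch below counts a DISCONNECTED transition as a failure only when the node was not IDLE immediately beforehand. The log extract shows an idle pool passing through IDLE -> DISCONNECTED -> CONNECTING when it reopens an endpoint. This is a standalone model, not the MultiClusterClient's code: the class IdleAwareHealthSketch, its methods and the State enum are hypothetical names, and only the state names are taken from the log.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical model only; not the NodeHealthFailureDetector implementation. */
public class IdleAwareHealthSketch {

    /** Node states as they appear in the log extract. */
    public enum State { CONNECTED, IDLE, DISCONNECTED, CONNECTING }

    private final Map<String, State> lastState = new HashMap<>();
    private final Set<String> failedNodes = new HashSet<>();

    /** Records a transition; flags the node only if it disconnects from a non-idle state. */
    public void onTransition(String node, State to) {
        State from = lastState.put(node, to);
        if (to == State.CONNECTED) {
            failedNodes.remove(node);   // node recovered
        } else if (to == State.DISCONNECTED && from != null && from != State.IDLE) {
            failedNodes.add(node);      // dropped while in use or while connecting
        }
        // IDLE -> DISCONNECTED is the pool reopening an endpoint, so it is ignored.
    }

    public boolean isFailed(String node) {
        return failedNodes.contains(node);
    }

    public static void main(String[] args) {
        IdleAwareHealthSketch detector = new IdleAwareHealthSketch();
        String node = "10.142.184.103";

        // Sequence from the log: the idle pool scales to 0, then a query reopens it.
        detector.onTransition(node, State.CONNECTED);
        detector.onTransition(node, State.IDLE);
        detector.onTransition(node, State.DISCONNECTED);
        detector.onTransition(node, State.CONNECTING);
        System.out.println("after idle reconnect, failed = " + detector.isFailed(node));

        // A drop straight from CONNECTED is still reported.
        detector.onTransition(node, State.CONNECTED);
        detector.onTransition(node, State.DISCONNECTED);
        System.out.println("after live disconnect, failed = " + detector.isFailed(node));
    }
}
```

In this model a node that really is down while idle would still be flagged, because the reconnect attempt would fall back from CONNECTING to DISCONNECTED. Whether the real client exposes the previous state to the detector is an assumption that would need checking against its event API.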

      Steps To Reproduce

      1. Set up two clusters, each with 1 data node and 1 query node
      2. Connect to both clusters using the MCA client
      3. Run a query
      4. Wait for the query connection idle time to pass (300 seconds by default)
      5. Run another query
      6. Observe that the query node is (incorrectly) marked as unhealthy

      I have attached main.java, which performs all of the steps above; you just need to plug in the correct clusters, bucket names and user credentials.

      Attachments

        1. main.java
          4 kB
        2. mca.log
          384 kB
          People

            daschl Michael Nitschinger
            matt.carabine Matt Carabine (Inactive)