Couchbase C client library libcouchbase
CCBC-779

CCCP subsystem hangs when current source node fails


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.1, 2.7.4
    • Fix Version/s: 2.7.5
    • Component/s: None
    • Labels: None
    • Environment: Couchnode 2.3.2/libcouchbase 2.7.4
      Couchbase 4.5.1 Cluster
      CentOS 7

    Description

      Before the swap rebalance, the Couchbase cluster consists of nodes 101 and 105.
      After the swap rebalance, the Couchbase cluster consists of nodes 101 and 102.

      A Couchnode 2.3.2 client is running N1QL queries against Couchbase 4.5.1, a two-node cluster.
      A swap rebalance is done where a node is removed (105), and another is added (102).
      Right as the rebalance finishes, a query is issued whose index was hosted on node 105.
      The connection to port 8093 on node 105 fails and triggers a cluster map refresh.
      A 'Hello' request is sent to both nodes 101 and 105 on port 11210.
      Both nodes 105 and 101 respond.
      Node 105 responds first, and both connections go through the SASL auth process.
      At this point no more requests are sent to node 101.
      Node 105, I assume, has shut down or stopped replying on port 11210 by this point.
      Couchnode (or libcouchbase) keeps retrying node 105 over and over, and a new cluster map is never downloaded.

      I have included the tcpdump output from this transaction. The first connection reset to node 105 on port 8093 happens at 14:45:42.557965. From there you can follow the events.
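
      For context, below is a minimal bootstrap sketch against libcouchbase 2.x (the hostnames are placeholders for nodes 101 and 105, and the timeout value is illustrative). Tightening config_node_timeout makes the client give up on a single unresponsive configuration source sooner; this is a mitigation sketch under those assumptions, not the fix for the stuck CCCP state itself.

        #include <stdio.h>
        #include <string.h>
        #include <libcouchbase/couchbase.h>

        int main(void)
        {
            struct lcb_create_st options;
            lcb_t instance;
            lcb_error_t rc;

            memset(&options, 0, sizeof options);
            options.version = 3;
            /* Placeholder hostnames standing in for nodes 101 and 105 */
            options.v.v3.connstr = "couchbase://node101.example.com,node105.example.com/default";

            rc = lcb_create(&instance, &options);
            if (rc != LCB_SUCCESS) {
                fprintf(stderr, "create failed: %s\n", lcb_strerror(NULL, rc));
                return 1;
            }

            /* Abandon a single config source after 2s instead of waiting out
             * the full bootstrap timeout on one silent node. */
            lcb_cntl_string(instance, "config_node_timeout", "2.0");

            lcb_connect(instance);
            lcb_wait(instance);
            rc = lcb_get_bootstrap_status(instance);
            printf("bootstrap: %s\n", lcb_strerror(instance, rc));
            lcb_destroy(instance);
            return 0;
        }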

      Attachments


        Activity

          mnunberg Mark Nunberg (Inactive) added a comment - Erik, can you see if this patch also fixes the issue?

          erik.manor Erik Manor (Inactive) added a comment - Yes, it appears it did; CCCP is behaving more like the HTTP provider now during the failover.

          ingenthr Matt Ingenthron added a comment - Mark Nunberg: can you comment on what we might be able to add to the test suite to verify/test for this situation? Thanks!

          mnunberg Mark Nunberg (Inactive) added a comment - I believe this should be reproducible with a non-KV workload, where the removed node's memcached process is still active.
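
          As a sketch of such a non-KV workload, the following is a bare N1QL query helper for libcouchbase 2.x (the statement and bucket name are placeholders); calling run_query() in a loop across the swap rebalance would exercise the 8093/query path without any KV traffic.

            #include <stdio.h>
            #include <string.h>
            #include <libcouchbase/couchbase.h>
            #include <libcouchbase/n1ql.h>

            /* Print each row; report the final status when the query completes. */
            static void row_callback(lcb_t instance, int cbtype, const lcb_RESPN1QL *resp)
            {
                if (resp->rflags & LCB_RESP_F_FINAL) {
                    printf("query done: %s\n", lcb_strerror(instance, resp->rc));
                } else {
                    printf("row: %.*s\n", (int)resp->nrow, resp->row);
                }
            }

            /* Issue one N1QL query and block until it completes. */
            static lcb_error_t run_query(lcb_t instance)
            {
                lcb_N1QLPARAMS *params = lcb_n1p_new();
                lcb_CMDN1QL cmd;
                lcb_error_t rc;

                memset(&cmd, 0, sizeof cmd);
                lcb_n1p_setstmtz(params, "SELECT COUNT(*) FROM `default`"); /* placeholder statement */
                cmd.callback = row_callback;
                lcb_n1p_mkcmd(params, &cmd);
                rc = lcb_n1ql_query(instance, NULL, &cmd);
                lcb_n1p_free(params);
                if (rc == LCB_SUCCESS) {
                    lcb_wait(instance);
                }
                return rc;
            }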

          ingenthr Matt Ingenthron added a comment - Notes: the root of the issue was that a CCCP request would time out. The bug in the lcb handler was that the state machinery would get stuck in an indefinite fetch state. The thought is that we do an MDS remove/rebalance or swap rebalance with N1QL on the nodes to verify the behavior in QE. However, it's all timing-based, so there's no guarantee that we'd trigger the issue. Also, the low node count of 2 affected things here; if the node count had been higher, it'd be far less likely.
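
          To make the described failure concrete, here is a schematic sketch (hypothetical names and structure; not the code from the patch) of the behavior the fix restores: on a config-request timeout, the CCCP provider should clear its in-progress state and rotate to the next candidate node, as the HTTP provider already does, instead of staying pinned to the dead node.

            /* Hypothetical, simplified model of the provider state; not lcb source. */
            typedef struct {
                int cur;          /* index of the node currently being polled */
                int nnodes;       /* number of candidate config nodes */
                int in_progress;  /* is a config fetch outstanding? */
            } cccp_state;

            /* The bug: the timeout handler left in_progress set and cur unchanged,
             * so every retry hit the same dead node and a new cluster map was
             * never fetched. The fix amounts to the following on timeout: */
            static void on_config_timeout(cccp_state *st)
            {
                st->in_progress = 0;                   /* leave the stuck fetch state */
                st->cur = (st->cur + 1) % st->nnodes;  /* rotate, like the HTTP provider */
                /* schedule_fetch(st->cur); -- hypothetical retry against the next node */
            }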

          People

            Assignee: mnunberg Mark Nunberg (Inactive)
            Reporter: erik.manor Erik Manor (Inactive)
            Votes: 0
            Watchers: 5
