  Couchbase C client library libcouchbase / CCBC-779

CCCP subsystem hangs when current source node fails


    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.7.1, 2.7.4
    • Fix Version/s: 2.7.5
    • Component/s: None
    • Labels:
      None
    • Environment:
      Couchnode 2.3.2/libcouchbase 2.7.4
      Couchbase 4.5.1 Cluster
      CentOS 7

      Description

      Before the swap rebalance, the Couchbase cluster consists of nodes 101 and 105.
      After the swap rebalance, it consists of nodes 101 and 102.

      A Couchnode 2.3.2 client is running N1QL queries against Couchbase 4.5.1, a two-node cluster.
      A swap rebalance is done in which one node is removed (105) and another is added (102).
      Right as the rebalance finishes, a query is issued whose index was located on node 105.
      The connection to port 8093 on node 105 fails, which triggers a cluster map refresh.
      A 'Hello' request is sent to both node 101 and node 105 on port 11210.
      Both nodes respond; node 105 responds first, and both go through the SASL auth process.
      At this point no more requests are sent to node 101.
      Node 105, I assume, has shut down or stopped replying on port 11210 by this point.
      Couchnode (or libcouchbase) keeps retrying node 105 over and over, and a new cluster map is never downloaded.

      I have included the tcpdump output from this transaction. The first connection reset to node 105 on port 8093 happens at 14:45:42.557965. From there you can follow the events.
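
      The Couchnode client here sits on top of libcouchbase, so a rough C-level equivalent of the workload is sketched below: bootstrap against both nodes and keep issuing N1QL queries while the swap rebalance runs. This is only a sketch against the libcouchbase 2.7 C API; the host names, bucket, and statement are placeholders, and console_log_level=5 is there purely so the CCCP bootstrap/retry activity shows up in the client log.

{code:c}
#include <stdio.h>
#include <string.h>
#include <libcouchbase/couchbase.h>
#include <libcouchbase/n1ql.h>

/* N1QL callback: only report the final status of each query. */
static void row_callback(lcb_t instance, int type, const lcb_RESPN1QL *resp)
{
    (void)type;
    if (resp->rflags & LCB_RESP_F_FINAL) {
        printf("query done: rc=%s\n", lcb_strerror(instance, resp->rc));
    }
}

int main(void)
{
    struct lcb_create_st crst;
    lcb_t instance;
    lcb_error_t rc;
    int i;

    memset(&crst, 0, sizeof crst);
    crst.version = 3;
    /* console_log_level=5 makes the CCCP bootstrap/retry activity visible */
    crst.v.v3.connstr = "couchbase://node101,node105/default?console_log_level=5";

    if (lcb_create(&instance, &crst) != LCB_SUCCESS) {
        return 1;
    }
    lcb_connect(instance);
    lcb_wait(instance);
    if (lcb_get_bootstrap_status(instance) != LCB_SUCCESS) {
        lcb_destroy(instance);
        return 1;
    }

    /* Keep issuing queries while the swap rebalance runs on the cluster. */
    for (i = 0; i < 1000; ++i) {
        lcb_CMDN1QL cmd;
        lcb_N1QLPARAMS *params = lcb_n1p_new();

        memset(&cmd, 0, sizeof cmd);
        lcb_n1p_setstmtz(params, "SELECT COUNT(*) AS c FROM `default`");
        lcb_n1p_mkcmd(params, &cmd);
        cmd.callback = row_callback;

        rc = lcb_n1ql_query(instance, NULL, &cmd);
        lcb_n1p_free(params);
        if (rc == LCB_SUCCESS) {
            lcb_wait(instance);
        } else {
            printf("could not schedule query: %s\n", lcb_strerror(instance, rc));
        }
    }

    lcb_destroy(instance);
    return 0;
}
{code}

      During the hang described above, that loop would show the Hello/SASL exchange against node 105 repeating indefinitely with no new configuration ever arriving.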

        Attachments


          Activity

          mnunberg Mark Nunberg (Inactive) added a comment -

          Erik, can you see if this patch also fixes the issue?

          erik.manor Erik Manor (Inactive) added a comment -

          Yes, it appears it did; CCCP is behaving more like the HTTP provider now during the failover.

          ingenthr Matt Ingenthron added a comment -

          Mark Nunberg: can you comment on what we might be able to add to the test suite to verify/test for this situation? Thanks!

          mnunberg Mark Nunberg (Inactive) added a comment -

          I believe this should be reproducible with a non-KV workload, where the removed node's memcached process is still active.

          ingenthr Matt Ingenthron added a comment -

          Notes: the root of the issue was that a CCCP request would time out. The bug in the lcb handler was that the state machinery would get stuck in an indefinite fetch state.

          The thought is that QE does an MDS remove/rebalance or a swap rebalance with N1QL on the nodes to verify the behavior; a sketch of the intended retry behavior follows below. However, it's all timing-based, so there's no guarantee we'd trigger the issue. The low node count of two also affected things here; with a higher node count it would have been far less likely.
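
          To make the stuck fetch state concrete, the standalone sketch below illustrates the intended behaviour rather than the actual patch: when the config fetch against the current node times out, the provider should advance to the next candidate node (101 here) instead of re-arming against the same dead node (105), so a fresh cluster map can still be obtained. The structure and field names are illustrative, not libcouchbase internals.

{code:c}
/* Hedged illustration only -- this is not the actual libcouchbase patch.
 * It shows the intended behaviour: when a CCCP fetch against the current
 * node times out, advance to the next candidate node instead of re-arming
 * against the same one forever. */
#include <stdio.h>
#include <stddef.h>

#define NUM_NODES 2

/* Hypothetical provider state; the field names are illustrative. */
struct cccp_state {
    const char *nodes[NUM_NODES];
    size_t cur; /* index of the node currently being polled */
};

/* Simulated fetch: always fails, standing in for the removed node 105
 * that accepts the connection but never returns a config. */
static int fetch_config(const char *node)
{
    printf("requesting config from %s:11210\n", node);
    return -1;
}

/* The essence of the fix: a timeout advances the cursor. */
static void on_timeout(struct cccp_state *st)
{
    st->cur = (st->cur + 1) % NUM_NODES;
}

int main(void)
{
    struct cccp_state st = { { "node105", "node101" }, 0 };
    int attempts;

    for (attempts = 0; attempts < 4; ++attempts) {
        if (fetch_config(st.nodes[st.cur]) == 0) {
            printf("got new cluster map from %s\n", st.nodes[st.cur]);
            break;
        }
        on_timeout(&st);
    }
    return 0;
}
{code}

          Without the cursor advance in on_timeout(), the loop would request a config from node105 forever, which is essentially the behaviour captured in the attached tcpdump.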


            People

            • Assignee: mnunberg Mark Nunberg (Inactive)
            • Reporter: erik.manor Erik Manor (Inactive)
            • Votes: 0
            • Watchers: 5


