Couchbase Java Client / JCBC-287

Failover + Readd of Streaming Node against 1.8.1 fails

    Details

    • Type: Task
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 1.1.5
    • Fix Version/s: .backlog1.x
    • Component/s: Core
    • Security Level: Public
    • Labels:
      None
    • Environment:
      Any Couchbase SDK against a 1.8 cluster

      Activity

      daschl Michael Nitschinger added a comment - edited

      Let me describe the observed behaviour in greater detail to clarify the issue.
      I just verified against 2.0 that the behaviour is the same, but I don't know why we didn't observe it there this way.

      When someone clicks failover in the UI on the streaming node, our streaming connection gets closed. The client code is implemented such that it goes back to the node list passed in and iterates over it. In the observed case, the first node in the list is the one we failed over, but since it has not been removed yet, the client is able to connect (and even retrieve the bucket information!).

      Now comes the part I don't fully understand. Without the changeset applied, the client reconnects to the streaming connection on the failed-over node, but somehow it doesn't get the information fast enough when the same node gets added back. Maybe because when the node gets rebalanced out, the streaming connection gets closed again and we lag a little bit behind in connecting to the second node in the list.

      That said, WITH the changeset applied we will not connect to the failed-over node immediately, because we recognize it is in the cluster but not part of the node list (which kinda sucks from a logical perspective in my opinion, because why do we close the streaming connection on failover AND allow it to be re-established afterwards?). So what will happen is that the SDK immediately connects to the second one and observes the upcoming operations (rebalance out, rebalance in) correctly.

      Does this make more sense now?
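      The selection behaviour described above can be sketched roughly as follows. This is a simplified illustration only, not the actual client code; the class, record, and method names (`StreamingNodeSelector`, `NodeInfo`, `selectStreamingNode`) are made up for this sketch, which assumes we already know for each configured node whether it is still part of the bucket's node list:

```java
import java.util.List;
import java.util.Optional;

public class StreamingNodeSelector {

    // Minimal stand-in for a node as seen in the cluster configuration.
    record NodeInfo(String hostname, boolean inBucketNodeList) {}

    /**
     * Pick the next node for the streaming connection: iterate the
     * configured node list in order, skipping nodes that are still
     * reachable in the cluster but no longer part of the bucket's
     * node list (i.e. failed over).
     */
    static Optional<NodeInfo> selectStreamingNode(List<NodeInfo> configured) {
        for (NodeInfo node : configured) {
            if (node.inBucketNodeList()) {
                return Optional.of(node);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // First node failed over: the sketch skips it and picks the second,
        // mirroring the behaviour described with the changeset applied.
        List<NodeInfo> nodes = List.of(
            new NodeInfo("192.168.56.101:8091", false), // failed over
            new NodeInfo("192.168.56.102:8091", true)
        );
        System.out.println(selectStreamingNode(nodes)
            .map(NodeInfo::hostname)
            .orElse("none")); // prints 192.168.56.102:8091
    }
}
```

      Without such a check (the pre-changeset behaviour), the iteration would simply take the first node that accepts a connection, which is exactly how the client ends up back on the failed-over node.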

      ingenthr Matt Ingenthron added a comment -

      Not 100%, no. Do we have a log that shows responses from nodes by chance?

      daschl Michael Nitschinger added a comment -

      You mean the JSON responses from the nodes, or the netty logs on the SDK side?

      daschl Michael Nitschinger added a comment -

      So if you set "failover" on one of the nodes, all HTTP REST resources are still 100% accessible, even if the streaming connection was closed by the server before. We can even establish it again.

      The main difference is that the node shows up in /pools, but not in the bucket node list. It also shows a failed-over state, so we could also try to identify it at that point.

      {
          "systemStats": {
              "cpu_utilization_rate": 1,
              "swap_total": 1069543424,
              "swap_used": 0
          },
          "interestingStats": {},
          "uptime": "226",
          "memoryTotal": 1572306944,
          "memoryFree": 761069568,
          "mcdMemoryReserved": 1199,
          "mcdMemoryAllocated": 1199,
          "couchApiBase": "http://192.168.56.101:8092/",
          "clusterMembership": "inactiveFailed",
          "status": "healthy",
          "thisNode": true,
          "hostname": "192.168.56.101:8091",
          "clusterCompatibility": 131072,
          "version": "2.0.1-170-rel-enterprise",
          "os": "x86_64-unknown-linux-gnu",
          "ports": {
              "proxy": 11211,
              "direct": 11210
          }
      }

      (This is from /pools.) The node is not in the node list from the bucket info; see the clusterMembership field.

      I can get you SDK debug logs as well, but they are already in the CBSE from Mark.
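      Based on that payload, a failed-over node could be identified from the /pools response before any streaming connection attempt. A minimal sketch, assuming the per-node JSON fields have already been parsed into a map (the class and method names `FailoverCheck` and `isFailedOver` are hypothetical, not part of the SDK):

```java
import java.util.Map;

public class FailoverCheck {

    /**
     * A node is usable for the streaming connection only while its
     * clusterMembership in /pools is "active". A failed-over node
     * reports "inactiveFailed" even though its status may still be
     * "healthy" and its REST resources remain reachable.
     */
    static boolean isFailedOver(Map<String, String> nodeFields) {
        return "inactiveFailed".equals(nodeFields.get("clusterMembership"));
    }

    public static void main(String[] args) {
        // Fields taken from the /pools excerpt above.
        Map<String, String> node = Map.of(
            "hostname", "192.168.56.101:8091",
            "clusterMembership", "inactiveFailed",
            "status", "healthy"
        );
        System.out.println(isFailedOver(node)); // prints true
    }
}
```

      Checking clusterMembership rather than plain reachability matters here precisely because, as noted above, a failed-over node still answers HTTP requests and even hands out bucket information.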


        People

        • Assignee: daschl Michael Nitschinger
        • Reporter: daschl Michael Nitschinger
        • Votes: 0
        • Watchers: 2

          Dates

          • Created:
            Updated:

            Gerrit Reviews

            There are no open Gerrit changes