Java Couchbase JVM Core / JVMCBC-346

Keep reconnecting downed cluster nodes during bootstrap

Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 1.3.2
    • Component/s: Core
    • Labels: None

    Description

      Motivation

      Once a connection has been established and then goes down, the code
      has always retried until the node came back online or was removed
      from the cluster configuration.

      During bootstrap, however, if the socket could not be opened in the
      first place, the node was deemed down and kept that way. As a result,
      when one node in the cluster was down and the client bootstrapped
      from another, the downed node was not picked up when it came back
      online.

      Modifications

      This changeset makes the bootstrap process more resilient to this
      kind of failure. The boot observable is still notified of the
      failed attempt, but a reconnect is now issued in the background so
      that the node can be picked up as soon as it comes online.
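
      The behaviour described above can be sketched roughly as follows. This
      is a minimal illustration, not the actual changeset: the class
      ReconnectingEndpoint, its methods, and the fixed retry delay are all
      hypothetical, and a CompletableFuture stands in for the boot observable.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

/** Hypothetical endpoint used for illustration; not a class from the real code base. */
class ReconnectingEndpoint {
    // Assumption: one connect attempt is modelled as a supplier that
    // returns true when the socket could be opened.
    private final BooleanSupplier connectAttempt;
    private final ScheduledExecutorService scheduler;
    private final long retryDelayMillis;
    private final AtomicBoolean connected = new AtomicBoolean(false);
    private final AtomicBoolean removed = new AtomicBoolean(false);

    ReconnectingEndpoint(BooleanSupplier connectAttempt,
                         ScheduledExecutorService scheduler,
                         long retryDelayMillis) {
        this.connectAttempt = connectAttempt;
        this.scheduler = scheduler;
        this.retryDelayMillis = retryDelayMillis;
    }

    /**
     * Performs the initial (bootstrap) connect attempt. If it fails, the
     * returned future is failed right away so the caller sees the failed
     * attempt, and reconnect attempts continue in the background.
     */
    CompletableFuture<Void> connect() {
        CompletableFuture<Void> boot = new CompletableFuture<>();
        if (tryConnect()) {
            boot.complete(null);
        } else {
            boot.completeExceptionally(
                new IllegalStateException("initial connect attempt failed"));
            scheduleRetry();
        }
        return boot;
    }

    /** Stops further retries, e.g. when the node leaves the cluster configuration. */
    void remove() {
        removed.set(true);
    }

    boolean isConnected() {
        return connected.get();
    }

    private boolean tryConnect() {
        try {
            boolean ok = connectAttempt.getAsBoolean();
            if (ok) {
                connected.set(true);
            }
            return ok;
        } catch (RuntimeException e) {
            // Treat any failure as "still down" so the retry loop keeps going.
            return false;
        }
    }

    private void scheduleRetry() {
        if (removed.get()) {
            return;
        }
        scheduler.schedule(() -> {
            if (removed.get()) {
                return;
            }
            if (!tryConnect()) {
                scheduleRetry();
            }
        }, retryDelayMillis, TimeUnit.MILLISECONDS);
    }
}
```

      In this sketch the caller can bootstrap from another node when the
      returned future fails, while the downed node keeps being retried until
      it connects or remove() is called. A real implementation would likely
      use a backoff rather than a fixed delay.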

      Result

      The code is now more resilient to partial node failures during
      client bootstrap: a node that is down at bootstrap is retried in the
      background, in line with how established connections that go down
      were already handled.

        Activity

          daschl Michael Nitschinger created issue -
          daschl Michael Nitschinger made changes -
          Field        Original Value    New Value
          Resolution                     Fixed [ 1 ]
          Status       Open [ 1 ]        Resolved [ 5 ]
          brett19 Brett Lawson made changes -
          Workflow     classic default workflow [ 62067 ]    Couchbase SDK Workflow [ 67680 ]
          oliver.downard Oliver Downard (Inactive) made changes -
          Description  Section headings changed from dashed underlines to +Motivation+, +Modifications+, +Result+; body text unchanged
          oliver.downard Oliver Downard (Inactive) made changes -
          Link This issue is triggering CBSE-3297 [ CBSE-3297 ]

          People

            Assignee: daschl Michael Nitschinger
            Reporter: daschl Michael Nitschinger