Couchbase Server / MB-56845

[BP 7.2.1] - XDCR Direct Nebula SRV record refresh handling support


    Description

      When a remote cluster reference is created using an SRV record,
      XDCR’s DNS SRV support works by performing an SRV lookup on the entered FQDN.
      XDCR then picks one of the returned A records and contacts that record’s target on port 8091/18091 for bootstrap.
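
      The lookup step above can be sketched roughly as follows. This is an illustrative sketch only, not goxdcr's actual code: the names srvResolver, netResolver, and bootstrapAddrs are hypothetical, and the SRV service names "couchbase"/"couchbases" are assumed from the usual Couchbase DNS SRV convention. The resolver is injected so the logic can be exercised without real DNS.

```go
package main

import (
	"fmt"
	"net"
	"strings"
)

// srvResolver returns the target host names behind an SRV name.
// Hypothetical hook, introduced so that tests can supply a fake resolver.
type srvResolver func(fqdn string) ([]string, error)

// netResolver builds a resolver backed by net.LookupSRV.
// Assumption: the service is "couchbases" for TLS and "couchbase" otherwise.
func netResolver(secure bool) srvResolver {
	service := "couchbase"
	if secure {
		service = "couchbases"
	}
	return func(fqdn string) ([]string, error) {
		_, records, err := net.LookupSRV(service, "tcp", fqdn)
		if err != nil {
			return nil, err
		}
		hosts := make([]string, 0, len(records))
		for _, r := range records {
			// SRV targets are returned with a trailing dot; strip it.
			hosts = append(hosts, strings.TrimSuffix(r.Target, "."))
		}
		return hosts, nil
	}
}

// bootstrapAddrs resolves the SRV name and pairs every target with the
// admin port used for bootstrap (8091, or 18091 when secure is true).
func bootstrapAddrs(resolve srvResolver, fqdn string, secure bool) ([]string, error) {
	hosts, err := resolve(fqdn)
	if err != nil {
		return nil, fmt.Errorf("SRV lookup for %s failed: %w", fqdn, err)
	}
	if len(hosts) == 0 {
		return nil, fmt.Errorf("SRV lookup for %s returned no targets", fqdn)
	}
	port := "8091"
	if secure {
		port = "18091"
	}
	addrs := make([]string, 0, len(hosts))
	for _, h := range hosts {
		addrs = append(addrs, net.JoinHostPort(h, port))
	}
	return addrs, nil
}
```

      A caller would pass netResolver(true) together with the entered FQDN and then contact one of the returned addresses; which address gets picked is left out of this sketch.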

      The list of nodes returned by the A record’s target (in this case, ns_server) is kept in XDCR memory. On each refresh cycle, XDCR picks one of the nodes in this list to refresh from. The originally entered FQDN for the SRV record is never used for bootstrap again until the next cold boot-up (i.e. an XDCR process restart).

      For Direct Nebula (DN), the SRV record will be set up as follows:

      SRV record: server-X-A.elixir-internal.net
      1. (A record) - DirectNebula-1.elixir.net
      2. (A record) - DirectNebula-2.elixir.net
      3. (A record) - DirectNebula-3.elixir.net
      

      Where DN1, DN2, and DN3 are individual DNs that are not aware of each other’s existence.

      If XDCR picks DN1 at bootstrap time, it will receive the following vbucket map and node lists:
      DirectNebula-1.elixir.net:11207 -> [0, 1, 2 … 1023]

      XDCR will not be aware of DirectNebula-2 or DirectNebula-3, because they are part of neither the “translated” map nor the node list pulled from pools/default, even though DN2 and DN3 are functionally equivalent to DN1.

      Even though XDCR does periodically check the SRV record (server-X-A.elixir-internal.net) and its targets (A records) to ensure that the original SRV record is still valid, XDCR in its current state will never pick DN2 or DN3 should DN1 have any issues. This is because XDCR assumes that DN2 and DN3 are part of the node list that DN1 returns.

      This assumption holds under normal circumstances, but it no longer holds with Direct Nebula.

      In the error scenario where DN1 is completely out of the picture because of downtime of some sort, XDCR at this stage does not have the ability to “re-bootstrap” from the SRV record (server-X-A.elixir-internal.net).
      It will get stuck because none of the nodes returned by DN1 are reachable.

      This MB should address that situation: when none of the cached nodes are reachable, XDCR should try to re-bootstrap from the top-level SRV record.
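
      A minimal sketch of the requested behaviour is shown below, under stated assumptions. The type remoteRef and the hooks srvResolver and nodeLister are hypothetical and do not reflect goxdcr's real structures. For simplicity, the sketch tries the cached nodes in order rather than picking one per cycle. nodeLister stands in for the pools/default call that returns a node list.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical hooks: srvResolver returns the SRV targets for a name,
// nodeLister asks one node for the node list it knows about.
type srvResolver func(fqdn string) ([]string, error)
type nodeLister func(node string) ([]string, error)

// remoteRef models a remote cluster reference created from an SRV name.
type remoteRef struct {
	srvName string   // originally entered FQDN, empty if no SRV record was used
	nodes   []string // node list cached from the last successful refresh
}

var errAllUnreachable = errors.New("no node is reachable")

// refresh models the current behaviour: only cached nodes are consulted.
func (r *remoteRef) refresh(list nodeLister) error {
	for _, n := range r.nodes {
		if nodes, err := list(n); err == nil && len(nodes) > 0 {
			r.nodes = nodes
			return nil
		}
	}
	return errAllUnreachable
}

// refreshWithRebootstrap models the requested behaviour: when every cached
// node fails, resolve the SRV name again and bootstrap from any live target.
func (r *remoteRef) refreshWithRebootstrap(resolve srvResolver, list nodeLister) error {
	err := r.refresh(list)
	if err == nil || r.srvName == "" {
		return err
	}
	targets, rerr := resolve(r.srvName)
	if rerr != nil {
		return fmt.Errorf("re-bootstrap from %s failed: %w", r.srvName, rerr)
	}
	for _, t := range targets {
		if nodes, lerr := list(t); lerr == nil && len(nodes) > 0 {
			r.nodes = nodes // replace the stale list with the new target's view
			return nil
		}
	}
	return fmt.Errorf("re-bootstrap from %s: %w", r.srvName, errAllUnreachable)
}
```

      With DN1 down and each DN reporting only itself, refresh stays stuck on the cached DN1 entry, while refreshWithRebootstrap falls through to the SRV targets and replaces the cached list with the first reachable DN's view.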

      Without the fix from this MB, however, there is a brute-force workaround: simply do kill -9 goxdcr and let it bootstrap from the top again.


          People

            neil.huang Neil Huang