Couchbase Server
MB-30957

Possible "500 Internal Server Error" Returned When Using the `/node/controller/rename` REST API Endpoint


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: backlog
    • Affects Version/s: 5.1.0
    • Component/s: ns_server
    • Labels: None
    • Triage: Untriaged
    • Environment: Centos 64-bit
    • Is this a Regression?: Unknown
    Description

      When renaming a node via the `/node/controller/rename` REST API endpoint, it is possible for a "500 Internal Server Error" to be returned. During the rename, the following actions occur:
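For reference, the rename is triggered by POSTing a `hostname` form parameter to the endpoint; the address, port, and credentials below are illustrative placeholders, not values from this ticket:

```shell
# Rename the node. "Administrator:password" and both hostnames are
# placeholders -- substitute the cluster's admin credentials and the
# node's new FQDN or IP address.
curl -X POST -u Administrator:password \
  http://192.0.2.10:8091/node/controller/rename \
  -d hostname=node1.example.com
```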

      • The network stack is stopped. This disconnects the following VMs, which are connected via Erlang distribution:
      1. 'ns_server'
      2. 'babysitter'
      3. 'couchdb'
      • The network stack is restarted, using the new IP address/Fully Qualified Domain Name (FQDN).
      • A marker is written to the local file system to indicate that a node rename is in progress.
      • The 'babysitter' VM is reconnected.
      • The 'couchdb' VM is reconnected. This interaction differs in that information from the 'ns_server' VM is passed to 'couchdb' as an environment variable. To achieve this, 'ns_server' does the following:
      1. Tries to reconnect to the 'couchdb' VM.
      2. Updates the config of 'couchdb' with the new 'ns_server' node name.
      3. When the 'couchdb' VM notices that the networking stack of the 'ns_server' VM is down, it begins attempting to reconnect to the 'ns_server' VM, using the 'ns_server' node name fetched from its environment variable. This step can race with step 2 above.
      • The 'ns_server' VM replaces the previous node name with the new one in the cluster configuration.
      • The marker on the file system is deleted.
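The ordering of the steps above can be sketched as follows; the function and step names are illustrative labels for this ticket's description, not actual ns_server code:

```python
def rename_node(new_address):
    """Illustrative ordering of the node-rename steps described above."""
    log = []
    log.append("stop_network_stack")              # disconnects ns_server, babysitter, couchdb
    log.append("restart_network_stack:" + new_address)
    log.append("write_rename_marker")             # marker file: rename in progress
    log.append("reconnect_babysitter")
    log.append("reconnect_couchdb")               # can race with couchdb's own reconnect loop
    log.append("update_cluster_config")           # replace old node name with the new one
    log.append("delete_rename_marker")            # rename complete
    return log
```

The marker brackets the whole sequence, so a crash mid-rename leaves it behind as evidence that the rename did not complete.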

      If there are scheduling delays, the attempt by the 'ns_server' VM to update the 'couchdb' VM with the new node name (step 2) can be delayed. When this happens, the 'couchdb' VM starts its own internal reconnection attempts (step 3), still using the previous 'ns_server' node name. These attempts are guaranteed to fail, as no VM with that name exists any more (the 'ns_server' VM has already been renamed). As a result of these connection failures, the 'couchdb' VM exits. When 'ns_server' subsequently gets CPU time and attempts to connect to the 'couchdb' VM, the connection fails because the 'couchdb' VM has exited, producing the following error message:

      {error,wait_for_node_failed}
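The failure mode can be modelled as a race between the config update (step 2) and couchdb's reconnect loop (step 3). The sketch below is a simplified illustration of that race; the node names and flow are hypothetical, not the actual Erlang implementation:

```python
def rename_outcome(update_delayed):
    """Simulate the race: couchdb reconnects with a stale node name
    while ns_server's config update (step 2) is delayed."""
    ns_server_name = "ns_1@new-host"    # ns_server has already been renamed
    couchdb_env_name = "ns_1@old-host"  # stale value in couchdb's environment

    if update_delayed:
        # Step 3 runs first: couchdb retries with the stale name. No node
        # with that name exists any more, so every attempt fails and the
        # couchdb VM exits.
        couchdb_alive = (couchdb_env_name == ns_server_name)  # False
    else:
        # Step 2 runs first: ns_server pushes the new name into couchdb's
        # config before the reconnect loop kicks in.
        couchdb_env_name = ns_server_name
        couchdb_alive = (couchdb_env_name == ns_server_name)  # True

    # ns_server then tries to connect to couchdb; if couchdb has already
    # exited, the rename surfaces the error as a 500 response.
    return "ok" if couchdb_alive else "{error,wait_for_node_failed}"
```

Because the outcome depends purely on which side of the race wins the scheduler, the 500 is intermittent and hard to reproduce on a lightly loaded machine.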

      Attachments


        Activity

          People

            dfinlay Dave Finlay
            stewart.peters Stewart Peters (Inactive)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:

              Gerrit Reviews

                There are no open Gerrit changes
