Couchbase Server
MB-30957

Possible "500 Internal Server Error" Returned When Using the `/node/controller/rename` REST API Endpoint


Details

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major
    • Fix Version/s: backlog
    • Affects Version/s: 5.1.0
    • Component/s: ns_server
    • Labels: None
    • Triage: Untriaged
    • Environment: Centos 64-bit
    • Is this a Regression?: Unknown
    Description

      When renaming a node via the `/node/controller/rename` REST API endpoint, it is possible for a "500 Internal Server Error" to be returned. During the rename, the following actions occur:
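For reference, the rename is triggered by POSTing a `hostname` form parameter to the endpoint; the address, port, and credentials below are illustrative placeholders, not values from this ticket:

```shell
# Rename the node. "Administrator:password" and both hostnames are
# placeholders -- substitute the cluster's admin credentials and the
# node's new FQDN or IP address.
curl -X POST -u Administrator:password \
  http://192.0.2.10:8091/node/controller/rename \
  -d hostname=node1.example.com
```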

      • The network stack is stopped. This disconnects the following VMs, which are connected via Erlang distribution:
      1. 'ns_server'
      2. 'babysitter'
      3. 'couchdb'
      • The network stack is restarted, using the new IP address/Fully Qualified Domain Name (FQDN).
      • A marker is written to the local file system to indicate that a node rename is in progress.
      • The 'babysitter' VM is reconnected.
      • The 'couchdb' VM is reconnected. This interaction differs in that information from the 'ns_server' VM is passed to 'couchdb' as an environment variable. To achieve this, 'ns_server' does the following:
      1. Tries to reconnect to the 'couchdb' VM.
      2. Updates the config of 'couchdb' with the new 'ns_server' node name.
      3. When the 'couchdb' VM notices that the networking stack of the 'ns_server' VM is down, it begins attempting to reconnect to the 'ns_server' VM, using the 'ns_server' node name fetched from its environment variable. This step can race with step 2 above.
      • The 'ns_server' VM replaces the previous node name with the new one in the cluster configuration.
      • The marker on the file system is deleted.
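The ordering of the steps above can be sketched as follows; the function and step names are illustrative labels for this ticket's description, not actual ns_server code:

```python
def rename_node(new_address):
    """Illustrative ordering of the node-rename steps described above."""
    log = []
    log.append("stop_network_stack")              # disconnects ns_server, babysitter, couchdb
    log.append("restart_network_stack:" + new_address)
    log.append("write_rename_marker")             # marker file: rename in progress
    log.append("reconnect_babysitter")
    log.append("reconnect_couchdb")               # can race with couchdb's own reconnect loop
    log.append("update_cluster_config")           # replace old node name with the new one
    log.append("delete_rename_marker")            # rename complete
    return log
```

The marker brackets the whole sequence, so a crash mid-rename leaves it behind as evidence that the rename did not complete.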

      If there are scheduling delays, the attempt by the 'ns_server' VM to update the 'couchdb' VM with the new node name (step 2) can be delayed. When this happens, the 'couchdb' VM starts its own internal reconnection attempts (step 3), still using the previous 'ns_server' node name. These attempts are guaranteed to fail, as no VM with that name exists any more (the 'ns_server' VM has already been renamed). As a result of these connection failures, the 'couchdb' VM exits. When 'ns_server' subsequently gets CPU time and attempts to connect to the 'couchdb' VM, the connection fails because the 'couchdb' VM has exited, producing the following error message:

      {error,wait_for_node_failed}
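The failure mode can be modelled as a race between the config update (step 2) and couchdb's reconnect loop (step 3). The sketch below is a simplified illustration of that race; the node names and flow are hypothetical, not the actual Erlang implementation:

```python
def rename_outcome(update_delayed):
    """Simulate the race: couchdb reconnects with a stale node name
    while ns_server's config update (step 2) is delayed."""
    ns_server_name = "ns_1@new-host"    # ns_server has already been renamed
    couchdb_env_name = "ns_1@old-host"  # stale value in couchdb's environment

    if update_delayed:
        # Step 3 runs first: couchdb retries with the stale name. No node
        # with that name exists any more, so every attempt fails and the
        # couchdb VM exits.
        couchdb_alive = (couchdb_env_name == ns_server_name)  # False
    else:
        # Step 2 runs first: ns_server pushes the new name into couchdb's
        # config before the reconnect loop kicks in.
        couchdb_env_name = ns_server_name
        couchdb_alive = (couchdb_env_name == ns_server_name)  # True

    # ns_server then tries to connect to couchdb; if couchdb has already
    # exited, the rename surfaces the error as a 500 response.
    return "ok" if couchdb_alive else "{error,wait_for_node_failed}"
```

Because the outcome depends purely on which side of the race wins the scheduler, the 500 is intermittent and hard to reproduce on a lightly loaded machine.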

      Attachments


        Activity

          People

            dfinlay Dave Finlay
            stewart.peters Stewart Peters (Inactive)
            Votes: 0
            Watchers: 4

            Dates

              Created:
              Updated:

              Gerrit Reviews

                There are no open Gerrit changes
