Couchbase Server / MB-58630

Node rename fails intermittently when encryption is on

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 7.2.4
    • 7.6.0, 7.0.0, 7.1.0, 7.2.0
    • ns_server
    • None
    • Untriaged
    • 0
    • Unknown

    Description

      When node-to-node encryption is enabled, node rename can fail with the error shown below. The failure usually causes node addition to fail, and it is hard to reproduce.

       {function_clause,
           [{dist_manager,decode_status,
                [{error,
                     {{shutdown,
                          {failed_to_start_child,ssl_dist_sup,
                              {already_started,<0.218.0>}}},
                      {child,undefined,net_sup_dynamic,
                          {erl_distribution,start_link,
                              [#{clean_halt => false,name => 'n_0@127.0.0.1',
                                 name_domain => longnames,net_tickintensity => 4,
                                 net_ticktime => 60,supervisor => net_sup_dynamic}]},
                          permanent,false,1000,supervisor,
                          [erl_distribution]}}}],
                [{file,"src/dist_manager.erl"},{line,220}]},
            {dist_manager,bringup,2,[{file,"src/dist_manager.erl"},{line,251}]},
            {dist_manager,do_adjust_address,4,
                [{file,"src/dist_manager.erl"},{line,357}]},
            {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,199}]}]}
      

      In short, ssl_dist_sup cannot be started because it is reported as already started. In reality this is the old instance, which has not finished stopping yet. During a rename we stop distribution before starting it again. That stop is synchronous and is supposed to wait for all dist processes to exit (it is actually a stop of the net_sup_dynamic supervisor).
      The problem is that the shutdown timeout for net_sup_dynamic is only 1 second. If the whole net_kernel:stop() procedure takes longer than that, the supervisor is simply killed while its children are still running. They all stop eventually, but if we try to restart distribution immediately, we get already_started.
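
      The race described above can be illustrated outside of ns_server with a small, self-contained sketch. Everything in it is hypothetical (the module name, the `inner` and `slow` child ids, the `slow_worker` registered name, and the timings); it only mimics the shape of the problem: a supervisor child with a short shutdown timeout whose own child needs longer than that to terminate.

```erlang
-module(shutdown_race_sketch).
-export([demo/0, start_inner/0]).
-export([init/1, handle_call/3, handle_cast/2, terminate/2]).

%% Hypothetical sketch; none of this is ns_server or OTP code.
%% One module provides both the supervisor and the gen_server callbacks
%% (dispatching on the init/1 argument) to keep the example in one file.

demo() ->
    {ok, Outer} = supervisor:start_link(?MODULE, outer),
    %% Like net_sup_dynamic: a supervisor child with a short shutdown timeout.
    Spec = #{id => inner,
             start => {?MODULE, start_inner, []},
             restart => temporary,
             shutdown => 100,
             type => supervisor},
    {ok, _} = supervisor:start_child(Outer, Spec),
    %% Returns after roughly 100 ms: the inner supervisor gets killed while
    %% its worker is still busy in terminate/2.
    ok = supervisor:terminate_child(Outer, inner),
    Immediate = supervisor:start_child(Outer, Spec),
    %% Give the lingering worker time to exit, then try again.
    timer:sleep(1000),
    Later = supervisor:start_child(Outer, Spec),
    %% Clean up without taking the calling process down.
    unlink(Outer),
    exit(Outer, shutdown),
    {Immediate, Later}.

start_inner() ->
    supervisor:start_link(?MODULE, inner).

%% Supervisor callbacks.
init(outer) ->
    {ok, {#{strategy => one_for_one}, []}};
init(inner) ->
    Worker = #{id => slow,
               start => {gen_server, start_link,
                         [{local, slow_worker}, ?MODULE, worker, []]},
               shutdown => 5000},
    {ok, {#{strategy => one_for_one}, [Worker]}};
%% gen_server callback for the slow worker.
init(worker) ->
    process_flag(trap_exit, true),
    {ok, nostate}.

handle_call(_Request, _From, State) -> {reply, ok, State}.
handle_cast(_Msg, State) -> {noreply, State}.

%% Simulates a child that needs longer to stop than its grandparent allows.
terminate(_Reason, _State) ->
    timer:sleep(500).
```

      Under these assumed timings, `shutdown_race_sketch:demo()` should return a first element of the form `{error, {{shutdown, {failed_to_start_child, slow, {already_started, _}}}, _}}`, which has the same shape as the error in the trace, and a second element of the form `{ok, _}` once the old worker has exited.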

      net_sup_dynamic spec:

       start(Opts) ->
           C = #{id => net_sup_dynamic,
                 start => {?MODULE,start_link,[Opts#{clean_halt => false,
                                                     supervisor => net_sup_dynamic}]},
                 restart => permanent,
                 shutdown => 1000,
                 type => supervisor,
                 modules => [erl_distribution]},
           supervisor:start_child(kernel_sup, C).
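
      Since the 1-second timeout lives in the OTP child spec, one conceivable mitigation on the caller's side is to wait until the old ssl_dist_sup process is really gone before starting distribution again. The following is a minimal sketch of that idea, not necessarily the fix that was merged for this issue. The module and function names are hypothetical, and it assumes ssl_dist_sup is registered locally under that name.

```erlang
-module(dist_wait_sketch).
-export([wait_for_exit/2]).

%% Hypothetical helper: block until the process registered as Name has
%% exited, or until TimeoutMs milliseconds have passed.
-spec wait_for_exit(atom(), timeout()) -> ok | {error, timeout}.
wait_for_exit(Name, TimeoutMs) ->
    case whereis(Name) of
        undefined ->
            %% Nothing is registered under Name, so there is nothing to wait for.
            ok;
        Pid ->
            %% If Pid has already died, the monitor delivers 'DOWN' immediately.
            MRef = erlang:monitor(process, Pid),
            receive
                {'DOWN', MRef, process, Pid, _Reason} ->
                    ok
            after TimeoutMs ->
                erlang:demonitor(MRef, [flush]),
                {error, timeout}
            end
    end.
```

      A caller could run `net_kernel:stop()`, then `dist_wait_sketch:wait_for_exit(ssl_dist_sup, 10000)` (the 10-second value is arbitrary), and only then start distribution again. A monitor is used instead of polling `whereis/1` so that the caller resumes as soon as the old process exits.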
      

            People

              shaazin.sheikh Shaazin Sheikh
              timofey.barmin Timofey Barmin