  1. Couchbase Server
  2. MB-51209

[upgrade] online upgrade using swap rebalance from CE to EE fails

Details

    • Untriaged
    • 1
    • Unknown

    Description

      Online upgrade from 7.1.0-2383 CE to 7.1.0-2383 EE.

      Steps:
      1. Create a 2-node cluster with 7.1.0-2383 CE.
      2. Install 7.1.0-2383 EE on a 3rd node.
      3. Add the 3rd node to the CE cluster. This fails with the error:

      "Attention: Join completion call failed. Failed to start ns_server cluster processes back. Logs might have more details."

      The UI of the EE node also becomes inaccessible.

      Attachments

        1. EE_logs.zip
          0.2 kB
        2. EE_node_cbcollect.zip
          5.94 MB
        3. node1.zip
          12.59 MB
        4. node2.zip
          9.62 MB
        5. node3.zip
          104.96 MB
        6. Screenshot 2022-02-25 at 3.15.26 PM.png
          148 kB
        Activity

          From Anitha's latest reproduction (cbcollect_info_ns_1@10.112.211.103_20220301-052414), here is the relevant entry from ns_server.debug.log:

          [cluster:error,2022-03-01T05:18:14.419Z,ns_1@10.112.211.103:ns_cluster<0.252.0>:ns_cluster:perform_actual_join:1547]Failed to join cluster because of: {error,
                                              {shutdown,
                                               {failed_to_start_child,ns_server_sup,
                                                {shutdown,
                                                 {failed_to_start_child,
                                                  index_settings_manager,
                                                  {badarg,
                                                   [{dict,fetch,
                                                     [<<"indexer.settings.rebalance.redistribute_indexes">>,
                                                      {dict,13,16,16,8,80,48,
                                                       {[],[],[],[],[],[],[],[],[],[],
                                                        [],[],[],[],[],[]},
                                                       {{[],
                                                         [[<<"indexer.settings.compaction.days_of_week">>|
                                                           <<"Sunday,Monday,Tuesday,Wednesday,Thursday,Friday,Saturday">>]],
                                                         [[<<"indexer.settings.compaction.interval">>|
                                                           <<"00:00,00:00">>]],
                                                         [[<<"indexer.settings.compaction.compaction_mode">>|
                                                           <<"circular">>],
                                                          [<<"indexer.settings.log_level">>|
                                                           <<"info">>],
                                                          [<<"indexer.settings.persisted_snapshot.interval">>|
                                                           5000]],
                                                         [[<<"indexer.settings.compaction.min_frag">>|
                                                           30]],
                                                         [[<<"indexer.settings.inmemory_snapshot.interval">>|
                                                           200]],
                                                         [],
                                                         [[<<"indexer.settings.max_cpu_percent">>|
                                                           0]],
                                                         [[<<"indexer.settings.storage_mode">>|
                                                           <<>>]],
                                                         [],
                                                         [[<<"indexer.settings.recovery.max_rollbacks">>|
                                                           5]],
                                                         [[<<"indexer.settings.num_replica">>|
                                                           0],
                                                          [<<"indexer.settings.memory_quota">>|
                                                           536870912]],
                                                         [],
                                                         [[<<"indexer.settings.compaction.abort_exceed_interval">>|
                                                           false]],
                                                         [],[]}}}],
                                                     [{file,"dict.erl"},{line,132}]},
                                                    {json_settings_manager,
                                                     '-lens_get_many/2-lc$^0/1-0-',2,
                                                     [{file,
                                                       "src/json_settings_manager.erl"},
                                                      {line,186}]},
                                                    {json_settings_manager,
                                                     '-lens_get_many/2-lc$^0/1-0-',2,
                                                     [{file,
                                                       "src/json_settings_manager.erl"},
                                                      {line,186}]},
                                                    {json_settings_manager,
                                                     '-lens_get_many/2-lc$^0/1-0-',2,
                                                     [{file,
                                                       "src/json_settings_manager.erl"},
                                                      {line,186}]},
                                                    {json_settings_manager,
                                                     '-lens_get_many/2-lc$^0/1-0-',2,
                                                     [{file,
                                                       "src/json_settings_manager.erl"},
                                                      {line,186}]},
                                                    {json_settings_manager,
                                                     do_populate_ets_table,3,
                                                     [{file,
                                                       "src/json_settings_manager.erl"},
                                                      {line,169}]},
                                                    {work_queue,init,1,
                                                     [{file,"src/work_queue.erl"},
                                                      {line,47}]},
                                                    {gen_server,init_it,2,
                                                     [{file,"gen_server.erl"},
                                                      {line,423}]}]}}}}}} 

          This matches the intermittent failure I see on my laptop. It occurs when index_settings_manager on the EE node runs its init. populate_ets_table gets the settings from ns_config:latest(), and those 13 items (all that are defined on CE) are what appear in the dict above.

          init(M) ->
              ets:new(M, [named_table, set, protected]),
              ModCfgKey = M:cfg_key(),
          <snip>
              populate_ets_table(M).
           
          populate_ets_table(M) ->
              JSON = fetch_settings_json(M:cfg_key()),
              populate_ets_table(M, JSON).
           
          fetch_settings_json(CfgKey) ->
              fetch_settings_json(ns_config:latest(), CfgKey).
           
          fetch_settings_json(Config, CfgKey) ->
              ns_config:search(Config, CfgKey, <<"{}">>).
          

          After fetching the settings from ns_config, populate_ets_table goes on to populate the ETS table. At this point JSON contains the 13 settings from ns_config. The "M:known_settings()" call below resolves to index_settings_manager:known_settings(). Because this node is running Enterprise Edition, it returns all the settings, including indexer.settings.rebalance.redistribute_indexes.

          populate_ets_table(M, JSON) ->
              case not M:is_enabled()
                  orelse erlang:get(prev_json) =:= JSON of
                  true ->
                      ok;
                  false ->
                      do_populate_ets_table(M, JSON, M:known_settings())
              end.
          

          At this point "JSON" contains the 13 settings from ns_config and "Settings" contains all of the settings known to Enterprise Edition. "Dict" is decoded from "JSON", so it holds just the 13 settings. When lens_get_many iterates through "Settings", it looks up indexer.settings.rebalance.redistribute_indexes in "Dict"; the key is absent, so dict:fetch raises badarg and the process crashes.

          do_populate_ets_table(M, JSON, Settings) ->
              Dict = decode_settings_json(JSON),
              NotFound = make_ref(),
           
              lists:foreach(
                fun ({Key, NewValue}) ->
                        OldValue = json_settings_manager:get(M, Key, NotFound),
                        case OldValue =:= NewValue of
                            true ->
                                ok;
                            false ->
                                ets:insert(M, {Key, NewValue}),
                                M:on_update(Key, NewValue)
                        end
                end, lens_get_many(Settings, Dict)),
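
The failure mode can be reproduced outside ns_server with a minimal sketch. The module and function names below (lens_demo, strict_get_many, lenient_get_many) are hypothetical and are not the ns_server implementation. The sketch only assumes what the stack trace shows: the lookup ends in dict:fetch/2, which raises badarg for a missing key. A lookup based on dict:find/2 is included for contrast, not as the fix that was applied to this issue.

```erlang
-module(lens_demo).
-export([strict_get_many/2, lenient_get_many/3, demo/0]).

%% Strict lookup (hypothetical stand-in for the crashing path):
%% dict:fetch/2 raises error:badarg when a key is absent.
strict_get_many(Keys, Dict) ->
    [{K, dict:fetch(K, Dict)} || K <- Keys].

%% Lenient lookup: dict:find/2 returns 'error' for an absent key,
%% so a default can be substituted instead of crashing.
lenient_get_many(Keys, Dict, Default) ->
    [case dict:find(K, Dict) of
         {ok, V} -> {K, V};
         error -> {K, Default}
     end || K <- Keys].

demo() ->
    %% Two of the 13 CE settings taken from the log above.
    Dict = dict:from_list([{<<"indexer.settings.log_level">>, <<"info">>},
                           {<<"indexer.settings.num_replica">>, 0}]),
    %% The EE-side key list includes a setting the CE config lacks.
    Keys = [<<"indexer.settings.log_level">>,
            <<"indexer.settings.rebalance.redistribute_indexes">>],
    Strict = try strict_get_many(Keys, Dict) of
                 _ -> ok
             catch
                 error:badarg -> crashed
             end,
    {Strict, lenient_get_many(Keys, Dict, undefined)}.
```

Calling lens_demo:demo() runs both variants against the same dict, which makes it easy to see that only the strict lookup depends on every expected key already being present in the stored settings.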
          

          The root of this problem is that metakv is a global property, and there is currently no way for some nodes (e.g. Community) to have one view of the settings while other nodes (e.g. Enterprise) have a different view. We'd need something conceptually similar to the cluster compat mode, except that Enterprise Edition nodes would behave at the level of Community Edition until all nodes have been upgraded to Enterprise Edition.
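
As a rough illustration of that idea only, the sketch below gates the Enterprise-only keys on the edition of every node in the cluster. Everything in it is an assumption made for the example: the module edition_gate, the function effective_settings/3, and the representation of node editions as a list of atoms do not exist in ns_server, and this is not how the issue was resolved.

```erlang
-module(edition_gate).
-export([effective_settings/3]).

%% Hypothetical helper: decide which setting keys a node should expect.
%%   CommonKeys     - keys defined on both editions
%%   EnterpriseKeys - keys defined only on Enterprise Edition
%%   NodeEditions   - 'community' | 'enterprise' for every cluster node
%% Enterprise-only keys are expected only once every node is enterprise.
%% Note: lists:all/2 returns true for an empty list, so an empty
%% NodeEditions is treated as "all enterprise" in this sketch.
effective_settings(CommonKeys, EnterpriseKeys, NodeEditions) ->
    AllEnterprise = lists:all(fun (E) -> E =:= enterprise end,
                              NodeEditions),
    case AllEnterprise of
        true -> CommonKeys ++ EnterpriseKeys;
        false -> CommonKeys
    end.
```

With a mixed cluster such as [community, community, enterprise], only the common keys are returned, so a joining EE node would not look for settings that the CE nodes never wrote.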

          steve.watanabe Steve Watanabe added a comment

          This is intermittent because the indexer eventually does a metakv update that adds indexer.settings.rebalance.redistribute_indexes. That update seems to occur within a minute of the rebalance that joins the first two CE nodes into a two-node CE cluster.

          steve.watanabe Steve Watanabe added a comment
          I've reverted the changes for MB-51112 and no longer see the issue in my testing. Thank you for discovering this issue and working with me to gather the information needed to triage it.

          steve.watanabe Steve Watanabe added a comment - edited

          Resolved by reverting the change that introduced this issue.

          steve.watanabe Steve Watanabe added a comment

          Successfully upgraded the CE cluster to an EE cluster using build 7.1.0-2433.

          anitha.kuberan Anitha Kuberan added a comment

          People

            anitha.kuberan Anitha Kuberan