Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-47087

Multi node graceful failover + rebalance out fails with "Rebalance exited with reason {badarg,[{io_lib,format,["Rebalancing bucket ~p with config ~p"

    XMLWordPrintable

Details

    • Untriaged
    • Centos 64-bit
    • 1
    • No

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops.ini rerun=False,get-cbcollect-info=True,quota_percent=95,crash_warning=True,bucket_storage=magma,enable_dp=True,retry_get_process_num=200 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_graceful_failover_rebalance_out,nodes_init=5,nodes_failover=2,bucket_spec=multi_bucket.buckets_for_rebalance_tests,data_load_stage=after,scrape_interval=2,rebalance_moves_per_node=64,skip_validations=False,GROUP=P0_failover_and_rebalance_out'
      

      Steps to Repro
      1. Create a 5 node cluster
      2021-06-23 21:09:04,476 | test | INFO | pool-5-thread-6 | [table_view:display:72] Rebalance Overview
      ----------------------------------------------------------------------

      Nodes Services Version CPU Status

      ----------------------------------------------------------------------

      172.23.98.196 kv 7.1.0-1036-enterprise 5.39450466347 Cluster node
      172.23.98.195 None     <--- IN —
      172.23.121.10 None     <--- IN —
      172.23.104.186 None     <--- IN —
      172.23.120.201 None     <--- IN —

      ----------------------------------------------------------------------

      2. Create bucket/scopes/collections/data
      2021-06-23 21:14:21,836 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------

      Bucket Type Replicas Durability TTL Items RAM Quota RAM Used Disk Used

      -----------------------------------------------------------------------------------------------------------------------------------------------------------------

      4yZIQSEJJD6IzNnFF4kWkPgVRxWd0tC-1DBmRsF05LD1MVBbaPNwgg2vNjO8SCRd_syOC9QCrPVlNpP_x-15-414000 couchbase 2 none 0 3000000 10485760000 2613283152 8185163251

      -----------------------------------------------------------------------------------------------------------------------------------------------------------------

      3. Change few setting and graceful failover 2 nodes(172.23.104.186 and 172.23.120.201)

      2021-06-23 21:14:29,920 | test  | INFO    | MainThread | [collections_rebalance:setUp:59] Changing scrape interval to 2
      2021-06-23 21:14:32,463 | test  | INFO    | MainThread | [cluster_ready_functions:set_rebalance_moves_per_nodes:129] Changed Rebalance settings: {u'rebalanceMovesPerNode': 64}
      2021-06-23 21:14:32,464 | test  | INFO    | MainThread | [collections_rebalance:load_collections_with_rebalance:932] Doing collection data load after graceful_failover_rebalance_out
      2021-06-23 21:14:32,466 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:390] Starting rebalance operation of type : graceful_failover_rebalance_out
      2021-06-23 21:14:32,467 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:602] failing over nodes [ip:172.23.104.186 port:8091 ssh_username:root, ip:172.23.120.201 port:8091 ssh_username:root]
      

      4. Wait for graceful failover to complete

      2021-06-23 21:21:03,559 | test  | WARNING | MainThread | [rest_client:get_nodes:1756] 172.23.104.186 - Node not part of cluster inactiveFailed
      2021-06-23 21:21:03,561 | test  | WARNING | MainThread | [rest_client:get_nodes:1756] 172.23.120.201 - Node not part of cluster inactiveFailed
      

      5. Rebalance out the node
      2021-06-23 21:21:59,946 | test | INFO | pool-5-thread-11 | [table_view:display:72] Rebalance Overview
      -----------------------------------------------------------------------

      Nodes Services Version CPU Status

      -----------------------------------------------------------------------

      172.23.98.196 kv 7.1.0-1036-enterprise 10.5732484076 Cluster node
      172.23.98.195 kv 7.1.0-1036-enterprise 24.7787610619 Cluster node
      172.23.104.186 kv 7.1.0-1036-enterprise 1.00755667506 — OUT --->
      172.23.120.201 kv 7.1.0-1036-enterprise 0.752823086575 — OUT --->
      172.23.121.10 kv 7.1.0-1036-enterprise 12.2474747475 Cluster node

      -----------------------------------------------------------------------

      6. Rebalance fails as in rebalance_failure.txt.

      cbcollect_info attached.

      Attachments

        1. bucket_conf
          68 kB
        2. MB-47087_repro.log
          106 kB
        3. rebalance_failure.txt
          65 kB
        4. test.log
          108 kB

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            Balakumaran Gopal, logs for node 172.23.121.10 are not complete (or part of the snapshot). If you still have them, please re-upload.

            Never seen this rebalance error before (linked in description) and jira searches aren't turning up anything useful for me. Meni Hillel, could somebody from ns_server please take a look and provide info on what sort of error this is (bad_arg).

            ben.huddleston Ben Huddleston added a comment - Balakumaran Gopal , logs for node 172.23.121.10 are not complete (or part of the snapshot). If you still have them, please re-upload. Never seen this rebalance error before (linked in description) and jira searches aren't turning up anything useful for me. Meni Hillel , could somebody from ns_server please take a look and provide info on what sort of error this is (bad_arg).

            Looking at the ns_server.debug.log for 172.23.98.196 we see this

            [ns_server:debug,2021-07-13T10:35:47.042-07:00,ns_1@172.23.98.196:leader_activities<0.658.0>:leader_activities:handle_activity_down:505]Activity terminated with reason {shutdown,
                                             {async_died,
                                              {raised,
                                               {error,badarg,
                                                [{io_lib,format,
                                                  ["Rebalancing bucket ~p with config ~p",
                                                   ["fW7UdP9XFESEaeyN9jyZh-P95aEjty7iOj4IOQSKlHTT0RORo4%og42zG37TitH7%-NP_51-34-501000",
                                                    [{auth_type,sasl},
                                                     {num_replicas,2},
                                                     {replica_index,true},
                                                     {ram_quota,2097152000},
                                                     {durability_min_level,none},
                                                     {frag_percent,50},
                                                     {pitr_enabled,false},
                                                     {pitr_granularity,600},
                                                     {pitr_max_history_age,86400},
                                                     {autocompaction,false},
            <snip>
                                                        'ns_1@172.23.98.195'],
                                                       ['ns_1@172.23.120.201',
                                                        'ns_1@172.23.98.196'|...],
                                                       ['ns_1@172.23.120.201'|...],
                                                       [...]|...]},
                                                     {fastForwardMap,undefined}]]],
                                                  [{file,"io_lib.erl"},{line,202}]},
                                                 {'ale_logger-rebalance',info,5,
                                                  [{file,[]},{line,39}]},
                                                 {ns_rebalancer,rebalance_bucket,6,
                                                  [{file,"src/ns_rebalancer.erl"},
                                                   {line,622}]},
                                                 {lists,foreach,2,
                                                  [{file,"lists.erl"},{line,1338}]},
                                                 {ns_rebalancer,rebalance_kv,4,
                                                  [{file,"src/ns_rebalancer.erl"},
                                                   {line,605}]},
                                                 {ns_rebalancer,rebalance_body,5,
                                                  [{file,"src/ns_rebalancer.erl"},
                                                   {line,556}]},
                                                 {async,'-async_init/4-fun-1-',...}]}}}}. Activity:
            ...
            

            The code in io_lib.erl

             197 format(Format, Args, Options) ->
             198     try io_lib_format:fwrite(Format, Args, Options)
             199     catch
             200         C:R:S ->
             201             test_modules_loaded(C, R, S),
             202             erlang:error(badarg, [Format, Args])
             203     end.
            

            is being called from ns_rebalancer

             619 rebalance_bucket(BucketName, BucketConfig, ProgressFun,
             620                  KeepKVNodes, EjectNodes, ForcedMap) ->
             621     ale:info(?USER_LOGGER, "Started rebalancing bucket ~s", [BucketName]),
             622     ?rebalance_info("Rebalancing bucket ~p with config ~p",
             623                     [BucketName, sanitize(BucketConfig)]),
            

            I'm not seeing what is causing the badarg. Prior to this crash there is similar code, also in ns_rebalancer that successfully logs the same sanitized bucket config

             590 rebalance_kv(KeepNodes, EjectNodes, BucketConfigs, DeltaRecoveryBuckets) ->
             591     NumBuckets = length(BucketConfigs),
             592     ?rebalance_debug("BucketConfigs = ~p", [sanitize(BucketConfigs)]),
            

            Here's that log from ns_server.debug.log

            [rebalance:debug,2021-07-13T10:35:46.759-07:00,ns_1@172.23.98.196:<0.8060.2>:ns_rebalancer:rebalance_kv:592]BucketConfigs = [{"fW7UdP9XFESEaeyN9jyZh-P95aEjty7iOj4IOQSKlHTT0RORo4%og42zG37TitH7%-NP_51-34-501000",
                              [{auth_type,sasl},
                               {num_replicas,2},
                               {replica_index,true},
                               {ram_quota,2097152000},
                               {durability_min_level,none},
                               {frag_percent,50},
                               {pitr_enabled,false},
                               {pitr_granularity,600},
                               {pitr_max_history_age,86400},
                               {autocompaction,false},
                               {purge_interval,undefined},
            <snip>
                                     ['ns_1@172.23.98.196','ns_1@172.23.120.201',
                                      undefined],
                                     ['ns_1@172.23.98.196','ns_1@172.23.120.201'|...],
                                     ['ns_1@172.23.98.196'|...],
                                     [...]|...]},
                               {fastForwardMap,...}]}]
            [user:info,2021-07-13T10:35:46.917-07:00,ns_1@172.23.98.196:<0.8060.2>:ns_rebalancer:rebalance_bucket:621]Started rebalancing bucket fW7UdP9XFESEaeyN9jyZh-P95aEjty7iOj4IOQSKlHTT0RORo4%og42zG37TitH7%-NP_51-34-501000
            

            steve.watanabe Steve Watanabe added a comment - Looking at the ns_server.debug.log for 172.23.98.196 we see this [ns_server:debug,2021-07-13T10:35:47.042-07:00,ns_1@172.23.98.196:leader_activities<0.658.0>:leader_activities:handle_activity_down:505]Activity terminated with reason {shutdown, {async_died, {raised, {error,badarg, [{io_lib,format, ["Rebalancing bucket ~p with config ~p", ["fW7UdP9XFESEaeyN9jyZh-P95aEjty7iOj4IOQSKlHTT0RORo4%og42zG37TitH7%-NP_51-34-501000", [{auth_type,sasl}, {num_replicas,2}, {replica_index,true}, {ram_quota,2097152000}, {durability_min_level,none}, {frag_percent,50}, {pitr_enabled,false}, {pitr_granularity,600}, {pitr_max_history_age,86400}, {autocompaction,false}, <snip> 'ns_1@172.23.98.195'], ['ns_1@172.23.120.201', 'ns_1@172.23.98.196'|...], ['ns_1@172.23.120.201'|...], [...]|...]}, {fastForwardMap,undefined}]]], [{file,"io_lib.erl"},{line,202}]}, {'ale_logger-rebalance',info,5, [{file,[]},{line,39}]}, {ns_rebalancer,rebalance_bucket,6, [{file,"src/ns_rebalancer.erl"}, {line,622}]}, {lists,foreach,2, [{file,"lists.erl"},{line,1338}]}, {ns_rebalancer,rebalance_kv,4, [{file,"src/ns_rebalancer.erl"}, {line,605}]}, {ns_rebalancer,rebalance_body,5, [{file,"src/ns_rebalancer.erl"}, {line,556}]}, {async,'-async_init/4-fun-1-',...}]}}}}. Activity: ... The code in io_lib.erl 197 format( Format , Args , Options ) -> 198 try io_lib_format:fwrite ( Format , Args , Options ) 199 catch 200 C:R:S -> 201 test_modules_loaded(C, R, S), 202 erlang:error (badarg, [ Format , Args ]) 203 end . is being called from ns_rebalancer 619 rebalance_bucket( BucketName , BucketConfig , ProgressFun , 620 KeepKVNodes , EjectNodes , ForcedMap ) -> 621 ale:info ( ?USER_LOGGER , "Started rebalancing bucket ~s" , [ BucketName ]), 622 ?rebalance_info ( "Rebalancing bucket ~p with config ~p" , 623 [ BucketName , sanitize( BucketConfig )]), I'm not seeing what is causing the badarg. Prior to this crash there is similar code, also in ns_rebalancer that successfully logs the same sanitized bucket config 590 rebalance_kv( KeepNodes , EjectNodes , BucketConfigs , DeltaRecoveryBuckets ) -> 591 NumBuckets = length( BucketConfigs ), 592 ?rebalance_debug ( "BucketConfigs = ~p" , [sanitize( BucketConfigs )]), Here's that log from ns_server.debug.log [rebalance:debug,2021-07-13T10:35:46.759-07:00,ns_1@172.23.98.196:<0.8060.2>:ns_rebalancer:rebalance_kv:592]BucketConfigs = [{"fW7UdP9XFESEaeyN9jyZh-P95aEjty7iOj4IOQSKlHTT0RORo4%og42zG37TitH7%-NP_51-34-501000", [{auth_type,sasl}, {num_replicas,2}, {replica_index,true}, {ram_quota,2097152000}, {durability_min_level,none}, {frag_percent,50}, {pitr_enabled,false}, {pitr_granularity,600}, {pitr_max_history_age,86400}, {autocompaction,false}, {purge_interval,undefined}, <snip> ['ns_1@172.23.98.196','ns_1@172.23.120.201', undefined], ['ns_1@172.23.98.196','ns_1@172.23.120.201'|...], ['ns_1@172.23.98.196'|...], [...]|...]}, {fastForwardMap,...}]}] [user:info,2021-07-13T10:35:46.917-07:00,ns_1@172.23.98.196:<0.8060.2>:ns_rebalancer:rebalance_bucket:621]Started rebalancing bucket fW7UdP9XFESEaeyN9jyZh-P95aEjty7iOj4IOQSKlHTT0RORo4%og42zG37TitH7%-NP_51-34-501000

            On my mac I created a magma bucket named "fW7UdP9XFESEaeyN9jyZh-P95aEjty7iOj4IOQSKlHTT0RORo4og42zG37TitH7%-NP_51-34-501000" and did failovers/delta recover and see the logs generated without issue. I had 1024 vbuckets for it to help rule out size of bucket config issues.

            steve.watanabe Steve Watanabe added a comment - On my mac I created a magma bucket named "fW7UdP9XFESEaeyN9jyZh-P95aEjty7iOj4IOQSKlHTT0RORo4og42zG37TitH7%-NP_51-34-501000" and did failovers/delta recover and see the logs generated without issue. I had 1024 vbuckets for it to help rule out size of bucket config issues.

            Aliaksei was able to reproduce this:

            4> io_lib_format:fwrite("Rebalancing bucket ~p with config ~p", [Name, Conf], [{chars_limit, 50000}]).
            ** exception error: no function clause matching io_lib_pretty:pp_tag_tuple({dots,3,3,#Fun<io_lib_pretty.9.27497820>},
                                                                                       117,80,49889,1,
                                                                                       [32,32,
                                                                                        [32,
                                                                                         [[[32,"   ",32,32,32],32,"   ",32,32,32],
                                                                                          [32,"   ",32,32,32],
                                                                                          32,"   ",32,32,32],
                                                                                         [[32,"   ",32,32,32],32,"   ",32,32,32],
                                                                                         [32,"   ",32,32,32],
                                                                                         32,"   ",32,32,32],
                                                                                        32,
                                                                                        [[[32,"   ",32,32,32],32,"   ",32,32,32],
                                                                                         [32,"   ",32,32,32],
                                                                                         32,"   ",32,32,32],
                                                                                        [[32,"   ",32,32,32],32,"   ",32,32,32],
                                                                                        [32,"   ",32,32,32],
                                                                                        32,"   ",32,32,32],
                                                                                       1,1) (io_lib_pretty.erl, line 188)
                 in function  io_lib_pretty:pp/8 (io_lib_pretty.erl, line 173)
                 in call from io_lib_pretty:pp_element/8 (io_lib_pretty.erl, line 341)
                 in call from io_lib_pretty:pp_tail/10 (io_lib_pretty.erl, line 323)
                 in call from io_lib_pretty:pp_tail/10 (io_lib_pretty.erl, line 325)
                 in call from io_lib_pretty:pp_list/9 (io_lib_pretty.erl, line 310)
                 in call from io_lib_pretty:pp/8 (io_lib_pretty.erl, line 171)
                 in call from io_lib_format:build_limited/5 (io_lib_format.erl, line 282)
            

            I'll attach his test program which he says

            Here's the file with data. Use as follows:
            3> {ok, [Name, Conf]} = file:consult("bucket_conf"). 
             
            Changing chars_limit to a different value (unsurprisingly) avoids the issue.
            And it's also reproducible on erlang 24.
            Even just removing "Rebalancing bucket" from the format string makes it stop failing.
            

            bucket_conf

            steve.watanabe Steve Watanabe added a comment - Aliaksei was able to reproduce this: 4> io_lib_format:fwrite("Rebalancing bucket ~p with config ~p", [Name, Conf], [{chars_limit, 50000}]). ** exception error: no function clause matching io_lib_pretty:pp_tag_tuple({dots,3,3,#Fun<io_lib_pretty.9.27497820>}, 117,80,49889,1, [32,32, [32, [[[32," ",32,32,32],32," ",32,32,32], [32," ",32,32,32], 32," ",32,32,32], [[32," ",32,32,32],32," ",32,32,32], [32," ",32,32,32], 32," ",32,32,32], 32, [[[32," ",32,32,32],32," ",32,32,32], [32," ",32,32,32], 32," ",32,32,32], [[32," ",32,32,32],32," ",32,32,32], [32," ",32,32,32], 32," ",32,32,32], 1,1) (io_lib_pretty.erl, line 188) in function io_lib_pretty:pp/8 (io_lib_pretty.erl, line 173) in call from io_lib_pretty:pp_element/8 (io_lib_pretty.erl, line 341) in call from io_lib_pretty:pp_tail/10 (io_lib_pretty.erl, line 323) in call from io_lib_pretty:pp_tail/10 (io_lib_pretty.erl, line 325) in call from io_lib_pretty:pp_list/9 (io_lib_pretty.erl, line 310) in call from io_lib_pretty:pp/8 (io_lib_pretty.erl, line 171) in call from io_lib_format:build_limited/5 (io_lib_format.erl, line 282) I'll attach his test program which he says Here's the file with data. Use as follows: 3> {ok, [Name, Conf]} = file:consult("bucket_conf").   Changing chars_limit to a different value (unsurprisingly) avoids the issue. And it's also reproducible on erlang 24. Even just removing "Rebalancing bucket" from the format string makes it stop failing. bucket_conf

            Reassign to Hareen Kancharla to examine Erlang issue as discovered by Aliaksey Artamonau

            meni.hillel Meni Hillel (Inactive) added a comment - Reassign to Hareen Kancharla to examine Erlang issue as discovered by Aliaksey Artamonau

            The issue was due to a bug in Erlang io_lib:format/3. An upstream bug was opened and has been fixed by the OTP folks -> https://github.com/erlang/otp/issues/5053

            hareen.kancharla Hareen Kancharla added a comment - The issue was due to a bug in Erlang io_lib:format/3. An upstream bug was opened and has been fixed by the OTP folks -> https://github.com/erlang/otp/issues/5053

            Log truncation was introduced to address resource consumption (mem/cpu) in cases where log line is very large (>50K). As this issue clearly points out to a regression, that may occur at any random place, when truncation kicks in. We've thus disabled truncation in 7.0.1 for now and we'll need to re-activation once Erlang issue is fixed, as well as come up with testing strategy to achieve more comprehensive coverage.

            meni.hillel Meni Hillel (Inactive) added a comment - Log truncation was introduced to address resource consumption (mem/cpu) in cases where log line is very large (>50K). As this issue clearly points out to a regression, that may occur at any random place, when truncation kicks in. We've thus disabled truncation in 7.0.1 for now and we'll need to re-activation once Erlang issue is fixed, as well as come up with testing strategy to achieve more comprehensive coverage.

            Build couchbase-server-7.0.1-5942 contains ns_server commit cded5f4 with commit message:
            MB-47087 Change chars_limit default from 50K to -1

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.1-5942 contains ns_server commit cded5f4 with commit message: MB-47087 Change chars_limit default from 50K to -1

            Build couchbase-server-7.1.0-1082 contains ns_server commit cded5f4 with commit message:
            MB-47087 Change chars_limit default from 50K to -1

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1082 contains ns_server commit cded5f4 with commit message: MB-47087 Change chars_limit default from 50K to -1

            In case someone is interested in assessing the blast radius - the circumstances under which this can happen can be quite random. First, a log line needs to exceed 50K, the chance of that happening is during rebalance, where we print vbmap, which is usually large, especially with large bucket names, and if there are/or large number nodes. Then, the log line truncation needs to happen right on an element of "map" data type. I'll guestimate that number of customers upgrading to 7.0.0 in production with these kind of setups is not high.

            meni.hillel Meni Hillel (Inactive) added a comment - In case someone is interested in assessing the blast radius - the circumstances under which this can happen can be quite random. First, a log line needs to exceed 50K, the chance of that happening is during rebalance, where we print vbmap, which is usually large, especially with large bucket names, and if there are/or large number nodes. Then, the log line truncation needs to happen right on an element of "map" data type. I'll guestimate that number of customers upgrading to 7.0.0 in production with these kind of setups is not high.

            Build couchbase-server-7.0.1-6004 contains ns_server commit 2a578df with commit message:
            MB-47087 Change chars_limit default from 50K to -1

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.1-6004 contains ns_server commit 2a578df with commit message: MB-47087 Change chars_limit default from 50K to -1

            Build couchbase-server-7.1.0-1132 contains ns_server commit 2a578df with commit message:
            MB-47087 Change chars_limit default from 50K to -1

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1132 contains ns_server commit 2a578df with commit message: MB-47087 Change chars_limit default from 50K to -1

            Validated this on 7.0.1-6102.

            Balakumaran.Gopal Balakumaran Gopal added a comment - Validated this on 7.0.1-6102.

            Validated this on 7.0.2-6535 as well.

            Balakumaran.Gopal Balakumaran Gopal added a comment - Validated this on 7.0.2-6535 as well.

            People

              hareen.kancharla Hareen Kancharla
              Balakumaran.Gopal Balakumaran Gopal
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  PagerDuty