Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-47087

Multi node graceful failover + rebalance out fails with "Rebalance exited with reason {badarg,[{io_lib,format,["Rebalancing bucket ~p with config ~p"

    XMLWordPrintable

Details

    • Untriaged
    • Centos 64-bit
    • 1
    • No

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/win10-bucket-ops.ini rerun=False,get-cbcollect-info=True,quota_percent=95,crash_warning=True,bucket_storage=magma,enable_dp=True,retry_get_process_num=200 -t bucket_collections.collections_rebalance.CollectionsRebalance.test_data_load_collections_with_graceful_failover_rebalance_out,nodes_init=5,nodes_failover=2,bucket_spec=multi_bucket.buckets_for_rebalance_tests,data_load_stage=after,scrape_interval=2,rebalance_moves_per_node=64,skip_validations=False,GROUP=P0_failover_and_rebalance_out'
      

      Steps to Repro
      1. Create a 5 node cluster
      2021-06-23 21:09:04,476 | test | INFO | pool-5-thread-6 | [table_view:display:72] Rebalance Overview
      ----------------------------------------------------------------------

      Nodes Services Version CPU Status

      ----------------------------------------------------------------------

      172.23.98.196 kv 7.1.0-1036-enterprise 5.39450466347 Cluster node
      172.23.98.195 None     <--- IN —
      172.23.121.10 None     <--- IN —
      172.23.104.186 None     <--- IN —
      172.23.120.201 None     <--- IN —

      ----------------------------------------------------------------------

      2. Create bucket/scopes/collections/data
      2021-06-23 21:14:21,836 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
      -----------------------------------------------------------------------------------------------------------------------------------------------------------------

      Bucket Type Replicas Durability TTL Items RAM Quota RAM Used Disk Used

      -----------------------------------------------------------------------------------------------------------------------------------------------------------------

      4yZIQSEJJD6IzNnFF4kWkPgVRxWd0tC-1DBmRsF05LD1MVBbaPNwgg2vNjO8SCRd_syOC9QCrPVlNpP_x-15-414000 couchbase 2 none 0 3000000 10485760000 2613283152 8185163251

      -----------------------------------------------------------------------------------------------------------------------------------------------------------------

      3. Change few setting and graceful failover 2 nodes(172.23.104.186 and 172.23.120.201)

      2021-06-23 21:14:29,920 | test  | INFO    | MainThread | [collections_rebalance:setUp:59] Changing scrape interval to 2
      2021-06-23 21:14:32,463 | test  | INFO    | MainThread | [cluster_ready_functions:set_rebalance_moves_per_nodes:129] Changed Rebalance settings: {u'rebalanceMovesPerNode': 64}
      2021-06-23 21:14:32,464 | test  | INFO    | MainThread | [collections_rebalance:load_collections_with_rebalance:932] Doing collection data load after graceful_failover_rebalance_out
      2021-06-23 21:14:32,466 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:390] Starting rebalance operation of type : graceful_failover_rebalance_out
      2021-06-23 21:14:32,467 | test  | INFO    | MainThread | [collections_rebalance:rebalance_operation:602] failing over nodes [ip:172.23.104.186 port:8091 ssh_username:root, ip:172.23.120.201 port:8091 ssh_username:root]
      

      4. Wait for graceful failover to complete

      2021-06-23 21:21:03,559 | test  | WARNING | MainThread | [rest_client:get_nodes:1756] 172.23.104.186 - Node not part of cluster inactiveFailed
      2021-06-23 21:21:03,561 | test  | WARNING | MainThread | [rest_client:get_nodes:1756] 172.23.120.201 - Node not part of cluster inactiveFailed
      

      5. Rebalance out the node
      2021-06-23 21:21:59,946 | test | INFO | pool-5-thread-11 | [table_view:display:72] Rebalance Overview
      -----------------------------------------------------------------------

      Nodes Services Version CPU Status

      -----------------------------------------------------------------------

      172.23.98.196 kv 7.1.0-1036-enterprise 10.5732484076 Cluster node
      172.23.98.195 kv 7.1.0-1036-enterprise 24.7787610619 Cluster node
      172.23.104.186 kv 7.1.0-1036-enterprise 1.00755667506 — OUT --->
      172.23.120.201 kv 7.1.0-1036-enterprise 0.752823086575 — OUT --->
      172.23.121.10 kv 7.1.0-1036-enterprise 12.2474747475 Cluster node

      -----------------------------------------------------------------------

      6. Rebalance fails as in rebalance_failure.txt.

      cbcollect_info attached.

      Attachments

        1. test.log
          108 kB
        2. rebalance_failure.txt
          65 kB
        3. MB-47087_repro.log
          106 kB
        4. bucket_conf
          68 kB

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            In case someone is interested in assessing the blast radius - the circumstances under which this can happen can be quite random. First, a log line needs to exceed 50K, the chance of that happening is during rebalance, where we print vbmap, which is usually large, especially with large bucket names, and if there are/or large number nodes. Then, the log line truncation needs to happen right on an element of "map" data type. I'll guestimate that number of customers upgrading to 7.0.0 in production with these kind of setups is not high.

            meni.hillel Meni Hillel (Inactive) added a comment - In case someone is interested in assessing the blast radius - the circumstances under which this can happen can be quite random. First, a log line needs to exceed 50K, the chance of that happening is during rebalance, where we print vbmap, which is usually large, especially with large bucket names, and if there are/or large number nodes. Then, the log line truncation needs to happen right on an element of "map" data type. I'll guestimate that number of customers upgrading to 7.0.0 in production with these kind of setups is not high.

            Build couchbase-server-7.0.1-6004 contains ns_server commit 2a578df with commit message:
            MB-47087 Change chars_limit default from 50K to -1

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.1-6004 contains ns_server commit 2a578df with commit message: MB-47087 Change chars_limit default from 50K to -1

            Build couchbase-server-7.1.0-1132 contains ns_server commit 2a578df with commit message:
            MB-47087 Change chars_limit default from 50K to -1

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1132 contains ns_server commit 2a578df with commit message: MB-47087 Change chars_limit default from 50K to -1

            Validated this on 7.0.1-6102.

            Balakumaran.Gopal Balakumaran Gopal added a comment - Validated this on 7.0.1-6102.

            Validated this on 7.0.2-6535 as well.

            Balakumaran.Gopal Balakumaran Gopal added a comment - Validated this on 7.0.2-6535 as well.

            People

              hareen.kancharla Hareen Kancharla
              Balakumaran.Gopal Balakumaran Gopal
              Votes:
              0 Vote for this issue
              Watchers:
              11 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  PagerDuty