Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-48772

[System Test] Rebalance exited with reason not_all_nodes_are_ready_yet

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 7.1.0
    • 7.1.0
    • XDCR
    • Enterprise Edition 7.1.0 build 1416

    Description

      Test -

      -test tests/integration/neo/test_neo_magma_wo_gsi_n1ql.yml -scope tests/integration/neo/scope_neo_magma_wo_gsi_n1ql.yml
      

      Day - 2
      Scale - 3

      Test step -
      Rebalance in data node.

      On 172.23.120.74
      ns_server.debug.log

      [ns_server:debug,2021-10-05T17:58:13.542-07:00,ns_1@172.23.120.74:leader_activities<0.11706.0>:leader_activities:handle_activity_down:505]Activity terminated with reason {shutdown,
                                       {async_died,
                                        {raised,
                                         {exit,
                                          {not_all_nodes_are_ready_yet,
                                           ['ns_1@172.23.121.77']},
                                          [{ns_rebalancer,rebalance_membase_bucket,
                                            6,
                                            [{file,"src/ns_rebalancer.erl"},
                                             {line,654}]},
                                           {lists,foreach,2,
                                            [{file,"lists.erl"},{line,1338}]},
                                           {ns_rebalancer,rebalance_kv,4,
                                            [{file,"src/ns_rebalancer.erl"},
                                             {line,605}]},
                                           {ns_rebalancer,rebalance_body,5,
                                            [{file,"src/ns_rebalancer.erl"},
                                             {line,556}]},
                                           {async,'-async_init/4-fun-1-',3,
                                            [{file,"src/async.erl"},
                                             {line,191}]}]}}}}. Activity:
      {activity,<0.466.216>,#Ref<0.4050249434.2986344455.178469>,default,
                <<"acb85b4b67c10af2ffb786ffc74ab13b">>,
                [rebalance],
                majority,[]}
      [error_logger:error,2021-10-05T17:58:13.543-07:00,ns_1@172.23.120.74:<0.196.216>:ale_error_logger_handler:do_log:101]
      =========================CRASH REPORT=========================
        crasher:
          initial call: erlang:apply/2
          pid: <0.196.216>
          registered_name: []
          exception exit: {not_all_nodes_are_ready_yet,['ns_1@172.23.121.77']}
            in function  ns_rebalancer:rebalance_membase_bucket/6 (src/ns_rebalancer.erl, line 654)
            in call from lists:foreach/2 (lists.erl, line 1338)
            in call from ns_rebalancer:rebalance_kv/4 (src/ns_rebalancer.erl, line 605)
            in call from ns_rebalancer:rebalance_body/5 (src/ns_rebalancer.erl, line 556)
            in call from async:'-async_init/4-fun-1-'/3 (src/async.erl, line 191)
          ancestors: [<0.18226.206>,ns_orchestrator_child_sup,ns_orchestrator_sup,
                        mb_master_sup,mb_master,leader_registry_sup,
                        leader_services_sup,<0.11703.0>,ns_server_sup,
                        ns_server_nodes_sup,<0.11292.0>,ns_server_cluster_sup,
                        root_sup,<0.139.0>]
          message_queue_len: 0
          messages: []
          links: [<0.18226.206>]
          dictionary: []
          trap_exit: false
          status: running
          heap_size: 46422
          stack_size: 27
          reductions: 7521
        neighbours:
       
      [user:error,2021-10-05T17:58:13.543-07:00,ns_1@172.23.120.74:<0.18226.206>:ns_orchestrator:log_rebalance_completion:1412]Rebalance exited with reason {not_all_nodes_are_ready_yet,
                                       ['ns_1@172.23.121.77']}.
      Rebalance Operation Id = e68aebfa6a4ca37bc1b2696743d8f161
      

      Timeouts are seen in a window of 1 minute prior to rebalance failure.

      [ns_server:error,2021-10-05T17:58:03.538-07:00,ns_1@172.23.120.74:<0.22804.208>:stats_interface:from_nodes:122]Failed to get buckets_interesting[] stats from ['ns_1@172.23.121.77'] with reason:
      [{'ns_1@172.23.121.77',timeout}]
      [ns_server:error,2021-10-05T17:58:03.759-07:00,ns_1@172.23.120.74:<0.16280.221>:stats_interface:from_nodes:122]Failed to get buckets_interesting[] stats from ['ns_1@172.23.121.77'] with reason:
      [{'ns_1@172.23.121.77',timeout}]
      [ns_server:error,2021-10-05T17:58:07.074-07:00,ns_1@172.23.120.74:<0.21207.220>:stats_interface:from_nodes:122]Failed to get buckets_interesting[] stats from ['ns_1@172.23.121.77'] with reason:
      [{'ns_1@172.23.121.77',timeout}]
      [ns_server:error,2021-10-05T17:58:07.431-07:00,ns_1@172.23.120.74:<0.16532.221>:stats_interface:from_nodes:122]Failed to get buckets_interesting[] stats from ['ns_1@172.23.121.77'] with reason:
      [{'ns_1@172.23.121.77',timeout}]
      [ns_server:info,2021-10-05T17:58:07.715-07:00,ns_1@172.23.120.74:<0.18226.206>:ns_orchestrator:handle_event:486]Skipping janitor in state rebalancing
      [ns_server:error,2021-10-05T17:58:10.645-07:00,ns_1@172.23.120.74:<0.16280.221>:stats_interface:from_nodes:122]Failed to get buckets_interesting[] stats from ['ns_1@172.23.121.77'] with reason:
      [{'ns_1@172.23.121.77',timeout}]
      

      On 172.23.121.77
      memcached.log
      Lots of slow operations around the time of rebalance failure.

      2021-10-05T17:58:05.847607-07:00 WARNING 21046: Slow operation: {"bucket":"bucket9","cid":"127.0.0.1:60273/1000000","command":"DCP_SNAPSHOT_MARKER","duration":"558 ms","packet":{"bodylen":37,"cas":0,"datatype":"raw","extlen":1,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":1,"opcode":"DCP_SNAPSHOT_MARKER","vbucket":991},"peer":{"ip":"127.0.0.1","port":60273},"trace":"request=24021805724386876:558026 execute=24021805724386876:558026","worker_tid":140208107382528}
      2021-10-05T17:58:05.918539-07:00 INFO (bucket8) vb:561 drop collection:id:0x144 from scope:0x4e, seq:788, manifest:0x2a4, replica
      2021-10-05T17:58:05.918652-07:00 WARNING 21122: Slow operation: {"bucket":"bucket8","cid":"127.0.0.1:42608/14000000","command":"DCP_SYSTEM_EVENT","duration":"834 ms","packet":{"bodylen":29,"cas":0,"datatype":"raw","extlen":13,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":20,"opcode":"DCP_SYSTEM_EVENT","vbucket":561},"peer":{"ip":"127.0.0.1","port":42608},"trace":"request=24021805519770220:833772 execute=24021805519770220:833772","worker_tid":140208090597120}
      2021-10-05T17:58:06.015138-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Hashtable resizer on vb:571' on thread NonIoPool0. Schedule overhead: 29 s
      2021-10-05T17:58:06.058360-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Hashtable resizer on vb:580' on thread NonIoPool0. Schedule overhead: 28 s
      2021-10-05T17:58:06.066642-07:00 INFO (bucket9) vb:559 create collection:id:0x139, name:3dLYhuOeNKKR7Zp6Mmv5Y5kX in scope:0x52, seq:737, manifest:0x29a, replica
      2021-10-05T17:58:06.105501-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Hashtable resizer no vbucket assigned' on thread NonIoPool0. Schedule overhead: 28 s
      2021-10-05T17:58:06.105854-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Hashtable resizer on vb:608' on thread NonIoPool0. Schedule overhead: 28 s
      2021-10-05T17:58:06.120761-07:00 INFO (bucket9) vb:560 create collection:id:0x139, name:3dLYhuOeNKKR7Zp6Mmv5Y5kX in scope:0x52, seq:737, manifest:0x29a, replica
      2021-10-05T17:58:06.137936-07:00 INFO (bucket9) vb:479 create collection:id:0x139, name:3dLYhuOeNKKR7Zp6Mmv5Y5kX in scope:0x52, seq:737, manifest:0x29a, replica
      2021-10-05T17:58:06.231609-07:00 WARNING (No Engine) Slow runtime for 'Hashtable resizer on vb:675' on thread NonIoPool0: 126 ms
      2021-10-05T17:58:06.231705-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Hashtable resizer on vb:620' on thread NonIoPool0. Schedule overhead: 28 s
      2021-10-05T17:58:06.247893-07:00 INFO (bucket8) vb:229 drop collection:id:0x144 from scope:0x4e, seq:788, manifest:0x2a4, replica
      2021-10-05T17:58:06.248019-07:00 WARNING 19960: Slow operation: {"bucket":"bucket8","cid":"127.0.0.1:44523/12000000","command":"DCP_SYSTEM_EVENT","duration":"620 ms","packet":{"bodylen":29,"cas":0,"datatype":"raw","extlen":13,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":18,"opcode":"DCP_SYSTEM_EVENT","vbucket":229},"peer":{"ip":"127.0.0.1","port":44523},"trace":"request=24021806062466558:620437 execute=24021806062466558:620437","worker_tid":140207999416064}
      2021-10-05T17:58:06.410734-07:00 WARNING (No Engine) Slow runtime for 'Hashtable resizer on vb:676' on thread NonIoPool0: 179 ms
      2021-10-05T17:58:06.410797-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Hashtable resizer no vbucket assigned' on thread NonIoPool0. Schedule overhead: 28 s
      2021-10-05T17:58:06.433453-07:00 INFO (bucket9) vb:777 create collection:id:0x139, name:3dLYhuOeNKKR7Zp6Mmv5Y5kX in scope:0x52, seq:657, manifest:0x29a, replica
      2021-10-05T17:58:06.517620-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Hashtable resizer no vbucket assigned' on thread NonIoPool0. Schedule overhead: 28 s
      2021-10-05T17:58:06.518047-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Adjusting hash table sizes.' on thread NonIoPool0. Schedule overhead: 28 s
      2021-10-05T17:58:06.518281-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Connection Manager' on thread NonIoPool0. Schedule overhead: 27 s
      2021-10-05T17:58:06.588008-07:00 INFO (bucket8) vb:230 drop collection:id:0x144 from scope:0x4e, seq:788, manifest:0x2a4, replica
      2021-10-05T17:58:06.716590-07:00 INFO (bucket9) vb:815 create collection:id:0x139, name:3dLYhuOeNKKR7Zp6Mmv5Y5kX in scope:0x52, seq:737, manifest:0x29a, replica
      2021-10-05T17:58:06.885982-07:00 INFO (bucket9) vb:854 create collection:id:0x139, name:3dLYhuOeNKKR7Zp6Mmv5Y5kX in scope:0x52, seq:737, manifest:0x29a, replica
      2021-10-05T17:58:06.889197-07:00 INFO (bucket8) vb:231 drop collection:id:0x144 from scope:0x4e, seq:788, manifest:0x2a4, replica
      2021-10-05T17:58:06.945336-07:00 INFO (bucket8) vb:200 drop collection:id:0x141 from scope:0x4e, seq:789, manifest:0x2a4, replica
      2021-10-05T17:58:07.234323-07:00 WARNING 41181: Slow operation: {"bucket":"bucket4","cid":"127.0.0.1:38690/6000000","command":"DCP_MUTATION","duration":"638 ms","packet":{"bodylen":681,"cas":1633481838182268928,"datatype":["JSON","Snappy"],"extlen":31,"key":"<ud>.doc_2</ud>","keylen":6,"magic":"ClientRequest","opaque":6,"opcode":"DCP_MUTATION","vbucket":688},"peer":{"ip":"127.0.0.1","port":38690},"trace":"request=24021807031347798:637801 execute=24021807031347798:637801","worker_tid":140208082204416}
      2021-10-05T17:58:07.238722-07:00 INFO (bucket8) vb:201 drop collection:id:0x141 from scope:0x4e, seq:789, manifest:0x2a4, replica
      2021-10-05T17:58:07.275932-07:00 INFO (bucket9) vb:935 create collection:id:0x139, name:3dLYhuOeNKKR7Zp6Mmv5Y5kX in scope:0x52, seq:737, manifest:0x29a, replica
      2021-10-05T17:58:07.276164-07:00 WARNING 21046: Slow operation: {"bucket":"bucket9","cid":"127.0.0.1:60273/15000000","command":"DCP_SYSTEM_EVENT","duration":"1138 ms","packet":{"bodylen":53,"cas":0,"datatype":"raw","extlen":13,"key":"<ud>3dLYhuOeNKKR7Zp6Mmv5Y5kX</ud>","keylen":24,"magic":"ClientRequest","opaque":21,"opcode":"DCP_SYSTEM_EVENT","vbucket":935},"peer":{"ip":"127.0.0.1","port":60273},"trace":"request=24021806572966708:1137970 execute=24021806572966708:1137970","worker_tid":140208107382528}
      2021-10-05T17:58:07.311655-07:00 INFO (bucket8) vb:235 drop collection:id:0x144 from scope:0x4e, seq:788, manifest:0x2a4, replica
      2021-10-05T17:58:07.343788-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Adjusting hash table sizes.' on thread NonIoPool0. Schedule overhead: 28 s
      2021-10-05T17:58:07.375992-07:00 INFO (bucket8) vb:205 drop collection:id:0x141 from scope:0x4e, seq:789, manifest:0x2a4, replica
      2021-10-05T17:58:07.415508-07:00 WARNING (No Engine) Slow runtime for 'Adjusting hash table sizes.' on thread NonIoPool0: 72 ms
      2021-10-05T17:58:07.415562-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Adjusting hash table sizes.' on thread NonIoPool0. Schedule overhead: 28 s
      2021-10-05T17:58:07.454872-07:00 WARNING (No Engine) Slow runtime for 'Adjusting hash table sizes.' on thread NonIoPool0: 39 ms
      2021-10-05T17:58:07.458753-07:00 INFO (bucket8) vb:206 drop collection:id:0x141 from scope:0x4e, seq:789, manifest:0x2a4, replica
      2021-10-05T17:58:07.485579-07:00 INFO (bucket9) vb:1007 create collection:id:0x139, name:3dLYhuOeNKKR7Zp6Mmv5Y5kX in scope:0x52, seq:737, manifest:0x29a, replica
      2021-10-05T17:58:07.579159-07:00 INFO (bucket8) vb:313 drop collection:id:0x141 from scope:0x4e, seq:789, manifest:0x2a4, replica
      2021-10-05T17:58:07.710379-07:00 INFO (bucket8) vb:229 drop collection:id:0x141 from scope:0x4e, seq:789, manifest:0x2a4, replica
      2021-10-05T17:58:07.710508-07:00 INFO (bucket8) vb:230 drop collection:id:0x141 from scope:0x4e, seq:789, manifest:0x2a4, replica
      2021-10-05T17:58:07.710553-07:00 INFO (bucket8) vb:231 drop collection:id:0x141 from scope:0x4e, seq:789, manifest:0x2a4, replica
      2021-10-05T17:58:07.710598-07:00 INFO (bucket8) vb:235 drop collection:id:0x141 from scope:0x4e, seq:789, manifest:0x2a4, replica
      2021-10-05T17:58:07.710639-07:00 INFO (bucket8) vb:229 drop scope:id:0x4e seq:790, manifest:0x2a5, replica
      2021-10-05T17:58:07.732668-07:00 INFO (bucket8) vb:230 drop scope:id:0x4e seq:790, manifest:0x2a5, replica
      2021-10-05T17:58:07.732775-07:00 INFO (bucket8) vb:231 drop scope:id:0x4e seq:790, manifest:0x2a5, replica
      2021-10-05T17:58:07.773654-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Process checkpoint(s) for DCP producer eq_dcpq:replication:ns_1@172.23.121.77->ns_1@172.23.120.75:bucket7' on thread NonIoPool0. Schedule overhead: 23 s
      2021-10-05T17:58:07.817744-07:00 INFO (bucket8) vb:235 drop scope:id:0x4e seq:790, manifest:0x2a5, replica
      2021-10-05T17:58:07.849485-07:00 INFO (bucket8) vb:495 drop collection:id:0x141 from scope:0x4e, seq:789, manifest:0x2a4, replica
      2021-10-05T17:58:07.849591-07:00 INFO (bucket8) vb:561 drop collection:id:0x141 from scope:0x4e, seq:789, manifest:0x2a4, replica
      2021-10-05T17:58:07.880215-07:00 INFO (bucket8) vb:200 drop scope:id:0x4e seq:790, manifest:0x2a5, replica
      2021-10-05T17:58:07.880315-07:00 INFO (bucket8) vb:201 drop scope:id:0x4e seq:790, manifest:0x2a5, replica
      2021-10-05T17:58:07.942126-07:00 INFO (bucket9) vb:936 create collection:id:0x139, name:3dLYhuOeNKKR7Zp6Mmv5Y5kX in scope:0x52, seq:737, manifest:0x29a, replica
      2021-10-05T17:58:07.942256-07:00 WARNING 21046: Slow operation: {"bucket":"bucket9","cid":"127.0.0.1:60273/14000000","command":"DCP_SYSTEM_EVENT","duration":"666 ms","packet":{"bodylen":53,"cas":0,"datatype":"raw","extlen":13,"key":"<ud>3dLYhuOeNKKR7Zp6Mmv5Y5kX</ud>","keylen":24,"magic":"ClientRequest","opaque":20,"opcode":"DCP_SYSTEM_EVENT","vbucket":936},"peer":{"ip":"127.0.0.1","port":60273},"trace":"request=24021807711163705:665976 execute=24021807711163705:665976","worker_tid":140208107382528}
      2021-10-05T17:58:07.952356-07:00 INFO (bucket8) vb:205 drop scope:id:0x4e seq:790, manifest:0x2a5, replica
      2021-10-05T17:58:07.992900-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Process checkpoint(s) for DCP producer eq_dcpq:xdcr:dcp_5f4b32de7a13b85221a2eb842ed656ea/default/remote_172.23.121.77:11210_0:yCXGC3D_SvvO1B1x3jXD_Q==' on thread NonIoPool0. Schedule overhead: 22 s
      2021-10-05T17:58:08.083106-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Process checkpoint(s) for DCP producer eq_dcpq:replication:ns_1@172.23.121.77->ns_1@172.23.123.26:bucket5' on thread NonIoPool0. Schedule overhead: 23 s
      2021-10-05T17:58:08.099749-07:00 INFO (bucket9) vb:954 create collection:id:0x139, name:3dLYhuOeNKKR7Zp6Mmv5Y5kX in scope:0x52, seq:737, manifest:0x29a, replica
      2021-10-05T17:58:08.112350-07:00 INFO (bucket8) vb:206 drop scope:id:0x4e seq:790, manifest:0x2a5, replica
      2021-10-05T17:58:08.270341-07:00 WARNING (No Engine) Slow scheduling for NON_IO task 'Process checkpoint(s) for DCP producer eq_dcpq:replication:ns_1@172.23.121.77->ns_1@172.23.123.24:bucket7' on thread NonIoPool0. Schedule overhead: 22 s
      

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              sujay.gad Sujay Gad
              sujay.gad Sujay Gad
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty