Couchbase Server / MB-35674

[Intermittent]: Swap rebalance of 1 node failed with reason: pre_rebalance_janitor_run_failed


      Description

      1. Initialize a 3-node cluster: [u'ns_1@10.112.180.101', u'ns_1@10.112.180.103', u'ns_1@10.112.180.102']
      2. Create 2 buckets, bucket-0 and bucket-1, with replica=2 for both.
      3. Load 100k docs into each bucket with durability=Majority.
      4. Start loading another 100k docs into both buckets. In parallel, perform a swap rebalance:

      Adding remote node 10.112.180.104
      Removing node 10.112.180.103

      The rebalance failed immediately. CPU usage hit 100% during step 4 (CPU usage graphs for all nodes are attached).


          Activity

          Daniel Owen added a comment -

          Rebalance failed due to:

          [user:error,2019-08-22T21:56:38.807-07:00,ns_1@10.112.180.101:<0.602.0>:ns_orchestrator:log_rebalance_completion:1472]Rebalance exited with reason {pre_rebalance_janitor_run_failed,"bucket-0",
                                           {error,wait_for_memcached_failed,
                                               ['ns_1@10.112.180.101']}}.
          

          Daniel Owen added a comment (edited) -

          Have attached CPU usage graphs for all nodes. They all show a similar pattern of periodically hitting 100%.

          However, at the point of the rebalance failure (2019-08-22T21:56:38), none of the nodes is above 25% CPU.

          Daniel Owen added a comment -

          I do see that we have a constant flow of "Dropped ticks" warnings. A sample from around the time of the rebalance failure is shown below; a sketch of how these warnings can arise follows the excerpt:

          [stats:warn,2019-08-22T21:56:31.130-07:00,ns_1@10.112.180.101:system_stats_collector<0.476.0>:base_stats_collector:latest_tick:69](Collector: system_stats_collector) Dropped 4 ticks
          [stats:warn,2019-08-22T21:56:32.918-07:00,ns_1@10.112.180.101:<0.2477.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
          [stats:warn,2019-08-22T21:56:32.918-07:00,ns_1@10.112.180.101:<0.2918.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
          [stats:warn,2019-08-22T21:56:34.941-07:00,ns_1@10.112.180.101:<0.2918.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
          [stats:warn,2019-08-22T21:56:34.991-07:00,ns_1@10.112.180.101:<0.2477.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
          [ns_server:warn,2019-08-22T21:56:36.673-07:00,ns_1@10.112.180.101:ns_heart<0.349.0>:ns_heart:handle_info:116]Dropped 1 heartbeats
          [stats:warn,2019-08-22T21:56:36.863-07:00,ns_1@10.112.180.101:system_stats_collector<0.476.0>:base_stats_collector:latest_tick:69](Collector: system_stats_collector) Dropped 5 ticks
          [stats:warn,2019-08-22T21:56:38.804-07:00,ns_1@10.112.180.101:<0.2477.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
          [stats:warn,2019-08-22T21:56:38.804-07:00,ns_1@10.112.180.101:<0.2918.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
          [stats:warn,2019-08-22T21:56:40.354-07:00,ns_1@10.112.180.101:system_stats_collector<0.476.0>:base_stats_collector:latest_tick:69](Collector: system_stats_collector) Dropped 2 ticks
          [stats:warn,2019-08-22T21:56:42.946-07:00,ns_1@10.112.180.101:<0.2918.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
          [stats:warn,2019-08-22T21:56:43.207-07:00,ns_1@10.112.180.101:<0.2477.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
          [stats:warn,2019-08-22T21:56:44.663-07:00,ns_1@10.112.180.101:system_stats_collector<0.476.0>:base_stats_collector:latest_tick:69](Collector: system_stats_collector) Dropped 3 ticks
          [stats:warn,2019-08-22T21:56:46.831-07:00,ns_1@10.112.180.101:system_stats_collector<0.476.0>:base_stats_collector:latest_tick:69](Collector: system_stats_collector) Dropped 2 ticks
          [stats:warn,2019-08-22T22:12:44.678-07:00,ns_1@10.112.180.101:<0.494.0>:base_stats_collector:latest_tick:69](Collector: global_stats_collector) Dropped 7 ticks
          [stats:warn,2019-08-22T22:12:44.692-07:00,ns_1@10.112.180.101:<0.2477.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 7 ticks
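
          What these warnings mean, informally: the collectors expect a tick every second, and when the node is starved for CPU the ticks queue up and are only noticed late. Below is a minimal sketch of that detection pattern. It is an illustrative assumption, not the actual base_stats_collector code, and the module/function names in it are made up:

          %% Illustrative sketch only -- not the real base_stats_collector.
          %% A `tick` message is expected every second; if the process was
          %% blocked, missed ticks pile up in its mailbox. Flushing them and
          %% counting how many were skipped is one way the "Dropped N ticks"
          %% warning can arise on a CPU-starved node.
          -module(dropped_ticks_sketch).
          -export([handle_tick/0]).

          handle_tick() ->
              case flush_queued_ticks(0) of
                  0 -> ok;
                  N -> io:format("Dropped ~p ticks~n", [N])
              end,
              collect_stats_for_latest_tick().

          %% Discard any ticks already queued behind the one being handled.
          flush_queued_ticks(N) ->
              receive
                  tick -> flush_queued_ticks(N + 1)
              after 0 ->
                  N
              end.

          %% Stand-in for the per-tick stats collection work.
          collect_stats_for_latest_tick() ->
              ok.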
          

          I could not spot anything in the memcached logs that would indicate why ns_server failed the rebalance with wait_for_memcached_failed.

          So assigning to ns_server for them to take a deeper look on their side.

          Ajit Yagaty added a comment -

          Steve Watanabe - Can you please take a look at this?

          Steve Watanabe added a comment (edited) -

          As part of rebalance ns_rebalancer:rebalance_body does:

              %% We run the janitor here to make sure that the vbucket map is in sync
              %% with the vbucket states.
              %% Unfortunately, we need to run it once more in rebalance_kv after
              %% the server list for the bucket is updated. So that the states of the
              %% vbucket on newly added nodes are applied.
              lists:foreach(fun (Bucket) ->
                                    run_janitor_pre_rebalance(Bucket)
                            end, ns_bucket:get_bucket_names()),
          Which executes this code with a 10-second timeout:
          run_janitor_pre_rebalance(BucketName) ->
              case ns_janitor:cleanup(BucketName,
                                      [{query_states_timeout, ?REBALANCER_QUERY_STATES_TIMEOUT},
                                       {apply_config_timeout, ?REBALANCER_APPLY_CONFIG_TIMEOUT}]) of
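
          If that cleanup call does not return ok within the timeout, the rebalance is aborted. The following is a sketch of that abort path inferred from the error in the log, not a verbatim copy of ns_rebalancer; the cleanup/1 helper is a hypothetical stand-in for ns_janitor:cleanup/2 returning the failure we hit here:

          %% Sketch only: assumes any non-ok janitor result is turned into an
          %% exit, which ns_orchestrator then logs as the rebalance exit reason.
          -module(janitor_abort_sketch).
          -export([run_janitor_pre_rebalance/1]).

          run_janitor_pre_rebalance(BucketName) ->
              case cleanup(BucketName) of
                  ok ->
                      ok;
                  Error ->
                      exit({pre_rebalance_janitor_run_failed, BucketName, Error})
              end.

          %% Hypothetical stand-in for ns_janitor:cleanup/2, returning the
          %% failure observed in this ticket (the stats query to the local
          %% memcached did not complete in time).
          cleanup(_BucketName) ->
              {error, wait_for_memcached_failed, ['ns_1@10.112.180.101']}.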
          

          Way down in the cleanup's call chain we do

          get_vbucket_details(Sock, all, ReqdKeys) ->
              get_vbucket_details_inner(Sock, <<"vbucket-details">>, ReqdKeys);
          

          which sends the request to memcached

          quick_stats(Sock, Key, CB, CBState, Timeout) ->
              Req = encode(req, #mc_header{opcode=?STAT}, #mc_entry{key=Key}),
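
          So the janitor's health check here reduces to a stats call that must complete within the rebalancer's query-states timeout. Below is a minimal, self-contained sketch of that pattern, not the actual mc_binary code; stat_request/2 is a hypothetical stand-in for encoding the ?STAT request, sending it on the socket and collecting the replies, and the 10-second value mirrors the timeout mentioned above:

          %% Illustrative sketch only. A slow memcached reply (simulated here
          %% with a sleep) exhausts the caller's timeout, and that timeout is
          %% what ultimately surfaces as
          %% {error, wait_for_memcached_failed, [Node]} for the bucket.
          -module(quick_stats_sketch).
          -export([timed_stats/2]).

          -define(QUERY_STATES_TIMEOUT, 10000).  %% ~10s, as used pre-rebalance

          %% Hypothetical stand-in for the real send/receive of the STAT
          %% opcode; sleeping past the timeout models a CPU-starved memcached.
          stat_request(_Sock, _Key) ->
              timer:sleep(15000),
              [{<<"vb_0">>, <<"active">>}].

          timed_stats(Sock, Key) ->
              Parent = self(),
              Ref = make_ref(),
              Pid = spawn(fun () -> Parent ! {Ref, stat_request(Sock, Key)} end),
              receive
                  {Ref, Stats} -> {ok, Stats}
              after ?QUERY_STATES_TIMEOUT ->
                  exit(Pid, kill),
                  {error, timeout}
              end.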
          

          Here's the start of rebalance (from ns_server.debug.log)

          [user:info,2019-08-22T21:56:27.561-07:00,ns_1@10.112.180.101:<0.602.0>:ns_orchestrator:idle:834]Starting rebalance, KeepNodes = ['ns_1@10.112.180.101','ns_1@10.112.180.104',
                                           'ns_1@10.112.180.102'], EjectNodes = ['ns_1@10.112.180.103'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 02f540160a05c8d2179dcdcb176589df
          

          This is activity in rebalance_body

          [rebalance:info,2019-08-22T21:56:28.199-07:00,ns_1@10.112.180.101:<0.9519.35>:ns_rebalancer:drop_old_2i_indexes:1248]Going to drop possible old 2i indexes on nodes ['ns_1@10.112.180.104']
          [rebalance:info,2019-08-22T21:56:28.201-07:00,ns_1@10.112.180.101:<0.9519.35>:ns_rebalancer:drop_old_2i_indexes:1254]Going to keep possible 2i indexes on nodes []
          

          and the failure after ~10 seconds

          [ns_server:debug,2019-08-22T21:56:38.795-07:00,ns_1@10.112.180.101:leader_activities<0.526.0>:leader_activities:handle_activity_down:526]Activity terminated with reason {raised,
                                           {exit,
                                            {pre_rebalance_janitor_run_failed,
                                             "bucket-0",
                                             {error,wait_for_memcached_failed,
          

          During this period we can see that things are running slowly. Note the two dropped heartbeats, each indicating that the heartbeat handling didn't run for 5 seconds:

          [ns_server:warn,2019-08-22T21:56:30.383-07:00,ns_1@10.112.180.101:ns_heart<0.349.0>:ns_heart:handle_info:116]Dropped 1 heartbeats
          [ns_server:debug,2019-08-22T21:56:30.707-07:00,ns_1@10.112.180.101:<0.2501.34>:ns_memcached:verify_report_long_call:368]call {stats,<<"timings">>} took too long: 509695 us
          [stats:warn,2019-08-22T21:56:31.130-07:00,ns_1@10.112.180.101:system_stats_collector<0.476.0>:base_stats_collector:latest_tick:69](Collector: system_stats_collector) Dropped 4 ticks
          [ns_server:info,2019-08-22T21:56:31.131-07:00,ns_1@10.112.180.101:<0.602.0>:ns_orchestrator:handle_info:517]Skipping janitor in state rebalancing
          [stats:warn,2019-08-22T21:56:32.918-07:00,ns_1@10.112.180.101:<0.2477.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
          [stats:warn,2019-08-22T21:56:32.918-07:00,ns_1@10.112.180.101:<0.2918.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
          [stats:warn,2019-08-22T21:56:34.941-07:00,ns_1@10.112.180.101:<0.2918.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
          [stats:warn,2019-08-22T21:56:34.991-07:00,ns_1@10.112.180.101:<0.2477.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
          [ns_server:info,2019-08-22T21:56:36.011-07:00,ns_1@10.112.180.101:<0.602.0>:ns_orchestrator:handle_info:517]Skipping janitor in state rebalancing
          [ns_server:warn,2019-08-22T21:56:36.673-07:00,ns_1@10.112.180.101:ns_heart<0.349.0>:ns_heart:handle_info:116]Dropped 1 heartbeats
          

          Steve Watanabe added a comment -

          Assigning to the memcached folks for triage as to why the "vbucket-details" stats calls are taking a long time to process.

          Steve Watanabe added a comment -

          I ran nutshell on .101 and these items were pointed out:

          [BAD] Installed CPUs : 1 - less than the minimum required (2).
          [BAD] Installed RAM : 1841 MB - less than minimum required (4096 MB)
          [BAD] Transparent Huge Pages: always - Recommended is 'never' or 'madvise'
          [BAD] Memcached vsize : 1,056 MB - greater than Data Quota: 470 MB. Check for memory leaks.

          Steve Watanabe added a comment -

          I discussed this with Ajit and the recommendation is that these tests be run on configurations which meet the minimum requirements.


            People

            Assignee: Ritesh Agarwal
            Reporter: Ritesh Agarwal
            Votes: 0
            Watchers: 6

