1. Initialize a 3 node cluster: [u'ns_1@10.112.180.101', u'ns_1@10.112.180.103', u'ns_1@10.112.180.102']
2. Create 2 buckets: bucket-0, bucket-1 with replica=2 for both.
3. Load 100k docs in each bucket with durability=Majority
4. Start loading another 100k docs in both the buckets. In parallel, do a swap rebalance:
[user:error,2019-08-22T21:56:38.807-07:00,ns_1@10.112.180.101:<0.602.0>:ns_orchestrator:log_rebalance_completion:1472]Rebalance exited with reason {pre_rebalance_janitor_run_failed,"bucket-0",
{error,wait_for_memcached_failed,
['ns_1@10.112.180.101']}}.
Daniel Owen
added a comment - Rebalance failed due to:
[user:error,2019-08-22T21:56:38.807-07:00,ns_1@10.112.180.101:<0.602.0>:ns_orchestrator:log_rebalance_completion:1472]Rebalance exited with reason {pre_rebalance_janitor_run_failed,"bucket-0",
{error,wait_for_memcached_failed,
['ns_1@10.112.180.101']}}.
Daniel Owen
added a comment (edited) - Have attached CPU usage for all nodes. They all show a similar pattern of periodically hitting 100%.
However, at the point of the rebalance failure (2019-08-22T21:56:38), none of the nodes is above 25% CPU.
Daniel Owen
added a comment - I do see that we have a constant flow of "Dropped ticks" warnings. A sample is shown below for the time of the rebalance failure:
[stats:warn,2019-08-22T21:56:31.130-07:00,ns_1@10.112.180.101:system_stats_collector<0.476.0>:base_stats_collector:latest_tick:69](Collector: system_stats_collector) Dropped 4 ticks
[stats:warn,2019-08-22T21:56:32.918-07:00,ns_1@10.112.180.101:<0.2477.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
[stats:warn,2019-08-22T21:56:32.918-07:00,ns_1@10.112.180.101:<0.2918.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
[stats:warn,2019-08-22T21:56:34.941-07:00,ns_1@10.112.180.101:<0.2918.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
[stats:warn,2019-08-22T21:56:34.991-07:00,ns_1@10.112.180.101:<0.2477.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
[ns_server:warn,2019-08-22T21:56:36.673-07:00,ns_1@10.112.180.101:ns_heart<0.349.0>:ns_heart:handle_info:116]Dropped 1 heartbeats
[stats:warn,2019-08-22T21:56:36.863-07:00,ns_1@10.112.180.101:system_stats_collector<0.476.0>:base_stats_collector:latest_tick:69](Collector: system_stats_collector) Dropped 5 ticks
[stats:warn,2019-08-22T21:56:38.804-07:00,ns_1@10.112.180.101:<0.2477.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
[stats:warn,2019-08-22T21:56:38.804-07:00,ns_1@10.112.180.101:<0.2918.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
[stats:warn,2019-08-22T21:56:40.354-07:00,ns_1@10.112.180.101:system_stats_collector<0.476.0>:base_stats_collector:latest_tick:69](Collector: system_stats_collector) Dropped 2 ticks
[stats:warn,2019-08-22T21:56:42.946-07:00,ns_1@10.112.180.101:<0.2918.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
[stats:warn,2019-08-22T21:56:43.207-07:00,ns_1@10.112.180.101:<0.2477.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
[stats:warn,2019-08-22T21:56:44.663-07:00,ns_1@10.112.180.101:system_stats_collector<0.476.0>:base_stats_collector:latest_tick:69](Collector: system_stats_collector) Dropped 3 ticks
[stats:warn,2019-08-22T21:56:46.831-07:00,ns_1@10.112.180.101:system_stats_collector<0.476.0>:base_stats_collector:latest_tick:69](Collector: system_stats_collector) Dropped 2 ticks
[stats:warn,2019-08-22T22:12:44.678-07:00,ns_1@10.112.180.101:<0.494.0>:base_stats_collector:latest_tick:69](Collector: global_stats_collector) Dropped 7 ticks
[stats:warn,2019-08-22T22:12:44.692-07:00,ns_1@10.112.180.101:<0.2477.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 7 ticks
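For context, a "Dropped N ticks" warning means the collector process was scheduled so late that several of its periodic tick messages had already queued up, and all but the newest are discarded. A minimal sketch of that bookkeeping (hypothetical; the function shape and names are assumed, not taken from the real base_stats_collector):

%% Hypothetical sketch: flush any ticks that queued up while the process was
%% starved (e.g. the node was pegged at 100% CPU), keep only the newest one,
%% and log how many were skipped.
latest_tick(TS, Dropped) ->
    receive
        {tick, NewerTS} ->
            latest_tick(NewerTS, Dropped + 1)
    after 0 ->
            case Dropped of
                0 -> ok;
                N -> error_logger:warning_msg("Dropped ~p ticks~n", [N])
            end,
            TS
    end.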
I could not spot anything in the memcached logs that would indicate why ns_server failed the rebalance with wait_for_memcached_failed.
So assigning to ns_server for them to take a deeper look on their side.
Steve Watanabe
added a comment (edited) - As part of rebalance, ns_rebalancer:rebalance_body does:
%% We run the janitor here to make sure that the vbucket map is in sync
%% with the vbucket states.
%% Unfortunately, we need to run it once more in rebalance_kv after
%% the server list for the bucket is updated. So that the states of the
%% vbucket on newly added nodes are applied.
lists:foreach(fun (Bucket) ->
                      run_janitor_pre_rebalance(Bucket)
              end, ns_bucket:get_bucket_names()),
Which executes this code with a 10 second timeout
run_janitor_pre_rebalance(BucketName) ->
    case ns_janitor:cleanup(BucketName,
                            [{query_states_timeout, ?REBALANCER_QUERY_STATES_TIMEOUT},
                             {apply_config_timeout, ?REBALANCER_APPLY_CONFIG_TIMEOUT}]) of
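The snippet above is truncated at the case expression. As a rough sketch (hypothetical, not the actual ns_rebalancer source; the timeout values are assumed from the 10 second figure mentioned above), the failure branch is assumed to look roughly like this, which matches the exit reason seen in the logs:

%% Hypothetical sketch of the failure path; the timeout values and the exact
%% shape of the error handling are assumptions, not copied from the real module.
run_janitor_pre_rebalance(BucketName) ->
    Opts = [{query_states_timeout, 10000},   %% ?REBALANCER_QUERY_STATES_TIMEOUT, assumed 10s
            {apply_config_timeout, 10000}],  %% ?REBALANCER_APPLY_CONFIG_TIMEOUT, value assumed
    case ns_janitor:cleanup(BucketName, Opts) of
        ok ->
            ok;
        Error ->
            %% e.g. {error, wait_for_memcached_failed, ['ns_1@10.112.180.101']}
            exit({pre_rebalance_janitor_run_failed, BucketName, Error})
    end.

In other words, if the janitor cannot get vbucket state from memcached on some node within the 10 second budget, the whole rebalance is aborted with pre_rebalance_janitor_run_failed.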
and way down we do
get_vbucket_details(Sock, all, ReqdKeys) ->
    get_vbucket_details_inner(Sock, <<"vbucket-details">>, ReqdKeys);
which sends the request to memcached
quick_stats(Sock, Key, CB, CBState, Timeout) ->
    Req = encode(req, #mc_header{opcode = ?STAT}, #mc_entry{key = Key}),
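For illustration, here is a minimal sketch of how such a per-request timeout is typically enforced on the socket read. recv_stats_reply/2 is a hypothetical helper, not the real stats code, but it shows why a slow "vbucket-details" response turns into a timeout error that eventually surfaces as wait_for_memcached_failed:

%% Hypothetical helper: wait for the stats reply, giving up after Timeout ms.
%% Assumes the socket is in active mode so replies arrive as Erlang messages.
recv_stats_reply(Sock, Timeout) ->
    receive
        {tcp, Sock, Data} ->
            {ok, Data};
        {tcp_closed, Sock} ->
            {error, closed}
    after Timeout ->
            {error, timeout}
    end.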
Here's the start of rebalance (from ns_server.debug.log)
[user:info,2019-08-22T21:56:27.561-07:00,ns_1@10.112.180.101:<0.602.0>:ns_orchestrator:idle:834]Starting rebalance, KeepNodes = ['ns_1@10.112.180.101','ns_1@10.112.180.104',
'ns_1@10.112.180.102'], EjectNodes = ['ns_1@10.112.180.103'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 02f540160a05c8d2179dcdcb176589df
This is activity in rebalance_body
[rebalance:info,2019-08-22T21:56:28.199-07:00,ns_1@10.112.180.101:<0.9519.35>:ns_rebalancer:drop_old_2i_indexes:1248]Going to drop possible old 2i indexes on nodes ['ns_1@10.112.180.104']
[rebalance:info,2019-08-22T21:56:28.201-07:00,ns_1@10.112.180.101:<0.9519.35>:ns_rebalancer:drop_old_2i_indexes:1254]Going to keep possible 2i indexes on nodes []
and the failure after ~10 seconds
[ns_server:debug,2019-08-22T21:56:38.795-07:00,ns_1@10.112.180.101:leader_activities<0.526.0>:leader_activities:handle_activity_down:526]Activity terminated with reason {raised,
{exit,
{pre_rebalance_janitor_run_failed,
"bucket-0",
{error,wait_for_memcached_failed,
During this period we can see that things are running slowly. Note the two dropped heartbeats, each indicating that heartbeat handling did not run for 5 seconds.
[ns_server:warn,2019-08-22T21:56:30.383-07:00,ns_1@10.112.180.101:ns_heart<0.349.0>:ns_heart:handle_info:116]Dropped 1 heartbeats
[ns_server:debug,2019-08-22T21:56:30.707-07:00,ns_1@10.112.180.101:<0.2501.34>:ns_memcached:verify_report_long_call:368]call {stats,<<"timings">>} took too long: 509695 us
[stats:warn,2019-08-22T21:56:31.130-07:00,ns_1@10.112.180.101:system_stats_collector<0.476.0>:base_stats_collector:latest_tick:69](Collector: system_stats_collector) Dropped 4 ticks
[ns_server:info,2019-08-22T21:56:31.131-07:00,ns_1@10.112.180.101:<0.602.0>:ns_orchestrator:handle_info:517]Skipping janitor in state rebalancing
[stats:warn,2019-08-22T21:56:32.918-07:00,ns_1@10.112.180.101:<0.2477.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
[stats:warn,2019-08-22T21:56:32.918-07:00,ns_1@10.112.180.101:<0.2918.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
[stats:warn,2019-08-22T21:56:34.941-07:00,ns_1@10.112.180.101:<0.2918.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
[stats:warn,2019-08-22T21:56:34.991-07:00,ns_1@10.112.180.101:<0.2477.34>:base_stats_collector:latest_tick:69](Collector: stats_collector) Dropped 1 ticks
[ns_server:info,2019-08-22T21:56:36.011-07:00,ns_1@10.112.180.101:<0.602.0>:ns_orchestrator:handle_info:517]Skipping janitor in state rebalancing
[ns_server:warn,2019-08-22T21:56:36.673-07:00,ns_1@10.112.180.101:ns_heart<0.349.0>:ns_heart:handle_info:116]Dropped 1 heartbeats
Assigning to the memcached folks for triage as to why the "vbucket-details" stats calls are taking a long time to process.
Steve Watanabe
added a comment - I ran nutshell on .101 and these items were pointed out:
[BAD] Installed CPUs : 1 - less than the minimum required (2).
[BAD] Installed RAM : 1841 MB - less than minimum required (4096 MB)
[BAD] Transparent Huge Pages: always - Recommended is 'never' or 'madvise'
[BAD] Memcached vsize : 1,056 MB - greater than Data Quota: 470 MB. Check for memory leaks.
Steve Watanabe
added a comment - I discussed this with Ajit and the recommendation is that these tests be run on configurations that meet the minimum requirements.