Description
Scenario:
- 10 node cluster with build 1942
- Rebalance out 5 nodes (completed successfully)
- Cluster right now: 5 nodes
- Add 5 nodes (with build 1944) and remove 3 nodes.
- Hit rebalance.
- Rebalance failed with reason:
Rebalance exited with reason {badmatch,
[{<0.26283.119>,
badmatch,{error,emfile,
[
,
{misc,try_with_maybe_ignorant_after,2}, {gen_server,terminate,6}, {proc_lib,init_p_do_apply,3}]}}]}
- Tried rebalance again, but it failed repeatedly:
Rebalance exited with reason {{{badmatch,[{<18058.14511.0>,noproc}]},
[{misc,sync_shutdown_many_i_am_trapping_exits,1},
{misc,try_with_maybe_ignorant_after,2},
{gen_server,terminate,6},
{proc_lib,init_p_do_apply,3}]},
{gen_server,call,
[<0.11023.120>,
,
infinity]}}
Will shortly upload logs from one of the nodes that was in the cluster at the time of the rebalance failures.
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
Noticed this on one of the nodes being rebalanced out:
Unable to listen on 'ns_1@ec2-122-248-217-156.ap-southeast-1.compute.amazonaws.com'.
Failed over that node and tried rebalancing again, but the rebalance still failed.
Then added that node back and excluded it from the rebalance operation; this time the rebalance succeeded.
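
Triage note: the `emfile` in the first trace is the POSIX "too many open files" error, so it may be worth checking file-descriptor usage on the affected node. Below is a minimal sketch for doing that, assuming a Linux host with `/proc` available. The helper names `parse_nofile_soft_limit` and `fd_usage` are hypothetical and not part of any Couchbase tooling.

```python
import os


def parse_nofile_soft_limit(limits_text):
    """Extract the soft 'Max open files' limit from /proc/<pid>/limits text.

    Returns an int, or None if the limit is reported as 'unlimited'.
    """
    for line in limits_text.splitlines():
        if line.startswith("Max open files"):
            # Fields: Max open files <soft> <hard> files
            soft = line.split()[3]
            return None if soft == "unlimited" else int(soft)
    raise ValueError("no 'Max open files' line found")


def fd_usage(pid="self"):
    """Return (open_fd_count, soft_limit) for a Linux process.

    Assumes the caller is allowed to read /proc/<pid>/fd, which usually
    means running as the same user as the target process or as root.
    """
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))
    with open(f"/proc/{pid}/limits") as f:
        soft = parse_nofile_soft_limit(f.read())
    return open_fds, soft
```

Calling `fd_usage(pid)` with the pid of the Erlang VM on the suspect node (an assumption about where the descriptors are being consumed) returns the current open-descriptor count and the soft limit. A count close to the limit would be consistent with the `emfile` failure, though it would not by itself identify the cause.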