In the 3->3 swap case described above, the rebalance moves about 33% more vBuckets than the 3->4 rebalance-in.
Regarding why it takes 50-60% more time while moving only about 33% more vBuckets:
In addition to moving a higher number of vBuckets, a swap rebalance has different characteristics compared to rebalance-in/rebalance-out. This interacts with the vBucket scheduling logic, which also plays a role in how fast a rebalance can go.
The vBucket scheduling logic (described in the link below) allows only a limited number of backfills and moves for nodes that are acting as the old or the new master.
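To make the constraint concrete, here is a minimal Python sketch of what a per-node limit on concurrent moves looks like. It only illustrates the idea; it is not the actual ns_server scheduler, and the limit value and helper names are hypothetical placeholders:

```python
# Minimal sketch of the idea (not ns_server's actual scheduler; the limit
# value and names are hypothetical placeholders): every in-flight move counts
# against both its old master and its new master, and a node may take part in
# only a limited number of moves at a time in either role.
from collections import Counter

MAX_MOVES_PER_NODE = 1  # hypothetical per-node cap on concurrent moves/backfills

def pick_wave(pending_moves):
    """Greedily select moves that can run concurrently under the per-node cap.

    Each move is an (old_master, new_master) pair; once a node reaches the cap
    in either role, further moves involving it must wait for a later wave.
    """
    in_flight = Counter()
    wave = []
    for old_master, new_master in pending_moves:
        if (in_flight[old_master] < MAX_MOVES_PER_NODE
                and in_flight[new_master] < MAX_MOVES_PER_NODE):
            in_flight[old_master] += 1
            in_flight[new_master] += 1
            wave.append((old_master, new_master))
    return wave

# Moves that all share the same old master cannot overlap with each other
# under this cap, so they end up serialized across many waves:
print(pick_wave([("N2", "N3"), ("N2", "N3"), ("N2", "N3")]))  # only one selected
```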
Consider a 3-node cluster: N0, N1, N2. (The arithmetic behind the counts below is sketched after the list.)
- 3->3 swap rebalance to remove N2 and replace it with N3:
  - 341 active vBuckets will move from N2 to N3. N2 is the old master for all of these.
  - 341 replica vBuckets will move; the master for these is either N0 or N1.
- 3->4 rebalance-in to add N3:
  - 256 active vBuckets will move to N3. The master for these is one of N0, N1, N2.
  - 256 replica vBuckets will move to N3. The master for these is one of N0, N1, N2.
- 4->3 rebalance-out to remove N3 has characteristics similar to the 3->4 rebalance-in described above.
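Rough back-of-the-envelope arithmetic behind those counts, assuming the default 1024 vBuckets per bucket, one replica, and a perfectly even split (these assumptions are mine, not stated in the ticket):

```python
# Rough arithmetic behind the counts above, assuming the default 1024 vBuckets
# per bucket and one replica copy.
NUM_VBUCKETS = 1024

# 3->3 swap: N3 takes over all of N2's active and replica vBuckets.
swap_active = NUM_VBUCKETS // 3            # ~341 active moves, old master N2
swap_replica = NUM_VBUCKETS // 3           # ~341 replica moves, master N0 or N1
swap_total = swap_active + swap_replica    # ~682 vBucket moves

# 3->4 rebalance-in: N3 receives an even share from N0, N1, N2.
reb_in_active = NUM_VBUCKETS // 4          # 256 active moves
reb_in_replica = NUM_VBUCKETS // 4         # 256 replica moves
reb_in_total = reb_in_active + reb_in_replica  # 512 vBucket moves

print(swap_total, reb_in_total, round(swap_total / reb_in_total - 1, 2))
# -> 682 512 0.33  (the "~33% more vBuckets" in the swap case)
```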
So, in the swap rebalance above, one node (N2) is the old master for the majority of the vBuckets being moved (341).
Whereas for rebalance-in and rebalance-out, the current/old masters for the vBucket movements are more or less evenly distributed across the 3 nodes (roughly 170 each).
This affects the order in which vBuckets are moved and how many are moved at a time.
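A small illustration of that skew, using the same assumed counts as the sketch above (the exact per-node numbers are illustrative, not measured):

```python
# How the master-side role is distributed across nodes in the two cases above.
# In the swap, N2 alone is the (old) master for all ~341 active moves, while in
# the rebalance-in the ~512 moves are spread so each of N0/N1/N2 is the master
# for roughly 170 of them.
from collections import Counter

swap_masters = Counter({"N2": 341})            # active moves: old master is always N2
swap_masters.update({"N0": 170, "N1": 171})    # replica moves: master is N0 or N1

reb_in_masters = Counter({"N0": 171, "N1": 171, "N2": 170})  # ~170 each

print(swap_masters.most_common(1))   # [('N2', 341)] -- one node dominates
print(reb_in_masters)                # roughly even across N0, N1, N2
```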
But I have added a note to the design doc below to investigate whether we can improve swap rebalance time. This will be for Cheshire Cat.