Details
- Bug
- Resolution: Fixed
- Critical
- 7.1.4, 7.0.5, 7.1.0, 7.1.1, 7.1.2, 7.2.0, 7.1.3, 7.1.5
- Untriaged
- 0
- No
- KV 2023-4
Description
As seen in the rebalance test described in MB-57271, a cluster running KV alongside other services (FTS and GSI in that instance) can hang during a (KV) rebalance if KV requires backfills and all of the available backfill slots (default 4096) are consumed by other services.
For example, it was observed that 4231 streams were attempting to backfill:
$ rg 'stream_\d+_state:' stats.log | rg backfilling | wc -l
4231
These were made up of:
$ rg 'stream_\d+_state:' stats.log | rg backfilling | cut -d : -f 2 | sort | uniq -c
3420 fts
   4 replication
 807 secidx
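The same per-client tally can be reproduced without ripgrep. The following is a minimal Python sketch, assuming each stats line is colon-separated with the client type in the second field (which is what `cut -d : -f 2` relies on above). The sample lines, connection names, and the `count_backfilling` helper are hypothetical and for illustration only; they are not taken from the actual stats.log.

```python
import re
from collections import Counter

# Same pattern the rg command above filters on.
STATE_RE = re.compile(r"stream_\d+_state:")


def count_backfilling(lines):
    """Tally backfilling streams by the second colon-separated field."""
    counts = Counter()
    for line in lines:
        # Keep only per-stream state lines that report a backfill.
        if STATE_RE.search(line) and "backfilling" in line:
            fields = line.split(":")
            if len(fields) >= 2:
                counts[fields[1].strip()] += 1
    return counts


# Hypothetical sample lines; the real stats.log layout may differ.
sample = [
    "eq_dcpq:fts:idx-a:stream_0_state: backfilling",
    "eq_dcpq:fts:idx-a:stream_1_state: backfilling",
    "eq_dcpq:fts:idx-b:stream_2_state: in-memory",
    "eq_dcpq:secidx:proj-1:stream_0_state: backfilling",
    "eq_dcpq:replication:node-b:stream_7_state: backfilling",
    "eq_dcpq:secidx:proj-1:backfill_num_pending: 8",
]

counts = count_backfilling(sample)
for client, n in sorted(counts.items()):
    print(f"{n:5d} {client}")
```

To run it against a real file, pass an open file handle (for example `count_backfilling(open("stats.log"))`) instead of the sample list.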
However, given that only 4096 backfills are possible at once, a number of streams were pending (waiting for a slot to become available before they could start):
$ rg 'backfill_num_pending' stats.log | rg -v -w 0 | cut -d: -f 2,4- | column -t
replication:default0:backfill_num_pending:  4
secidx:backfill_num_pending:                8
secidx:backfill_num_pending:                8
secidx:backfill_num_pending:                6
secidx:backfill_num_pending:                7
secidx:backfill_num_pending:                17
secidx:backfill_num_pending:                9
secidx:backfill_num_pending:                20
secidx:backfill_num_pending:                14
secidx:backfill_num_pending:                16
While we can also look at reducing the number of concurrent streams that other services create, ideally we want a solution in which KV is "defensive": irrespective of what other services request, it can always make rebalance progress.
Issue | Resolution |
Data Service rebalance duration was significantly impacted when other DCP clients created a large number of streams that needed to be read from disk, because backfills were not prioritized between rebalance and other DCP clients. | The number of backfills each DCP client can perform concurrently has been limited to allow fairer allocation of resources. |
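To illustrate the idea of limiting concurrent backfills per DCP client, here is a minimal Python sketch. It is not the KV engine's implementation: the `BackfillScheduler` class, its method names, and the limit of 2 are all hypothetical, chosen only to show how a per-client cap keeps one client from starving another (for example, replication streams during rebalance).

```python
from collections import defaultdict, deque


class BackfillScheduler:
    """Hypothetical per-client cap on concurrently running backfills."""

    def __init__(self, per_client_limit):
        self.per_client_limit = per_client_limit
        self.active = defaultdict(int)      # client -> running backfills
        self.pending = defaultdict(deque)   # client -> queued streams

    def request(self, client, stream):
        """Start the backfill now (True) or queue it as pending (False)."""
        if self.active[client] < self.per_client_limit:
            self.active[client] += 1
            return True
        self.pending[client].append(stream)
        return False

    def complete(self, client):
        """Finish one backfill and promote that client's next pending stream.

        Returns the promoted stream, or None if nothing was waiting.
        """
        self.active[client] -= 1
        if self.pending[client]:
            self.active[client] += 1
            return self.pending[client].popleft()
        return None


# Assumed limit of 2 per client, purely for illustration.
sched = BackfillScheduler(per_client_limit=2)

# One client asks for five backfills; only two start, three wait.
fts_started = [sched.request("fts", vb) for vb in range(5)]

# A different client is unaffected by the first client's backlog.
repl_started = sched.request("replication", 0)

# When an fts backfill finishes, the oldest pending fts stream starts.
promoted = sched.complete("fts")
```

With a cap per client rather than a single shared pool, the replication request succeeds even though the other client still has streams waiting, which is the "defensive" property the description asks for.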
Issue Links
- causes
  - MB-58011 Initial index time in Magma GSI test increased from 60 minutes to 250 minutes on build 7.2.1-5878 (Closed)
  - MB-58013 XDCR performance dropped by 15-20% in 7.2.1-5974 due to MB-57304 (Closed)
  - MB-58184 XDCR - high/low priority ineffective due to Sigar changes (Closed)
- duplicates
  - MB-49006 Prioritise backfills for replication DCP streams (Open)
- relates to
  - MB-56768 XDCR sync stalled for a few hours (Closed)
  - MB-49006 Prioritise backfills for replication DCP streams (Open)