Details
- Type: Bug
- Status: Reopened
- Priority: Critical
- Resolution: Unresolved
- Affects Version: 7.1.1
- Environment: Enterprise Edition 7.1.1 build 3067
- Triage: Untriaged
- 1
- Unknown
- Sprint: KV June 2022, KV July 2022, KV Aug 2022
Description
- Create a 3-node KV cluster.
- Create a magma bucket with 1 replica and a RAM quota of 200GB.
- Load 10B 1024-byte documents. This is ~20TB of active + replica data and puts the bucket at 1% DGM.
- Upsert the whole data set to create 50% fragmentation.
- Create 25 datasets on cbas ingesting data from different collections and let the ingestion start. Start a SQL++ load at 10 QPS asynchronously.
- Start an async CRUD data load with the following key ranges:
Read Start: 0
Read End: 100000000
Update Start: 0
Update End: 100000000
Expiry Start: 0
Expiry End: 0
Delete Start: 100000000
Delete End: 200000000
Create Start: 200000000
Create End: 300000000
Final Start: 200000000
Final End: 300000000
- Rebalance in 1 KV node. The rebalance appears to have been stuck for hours (one way to confirm the lack of progress is sketched below).
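One quick way to confirm the lack of rebalance progress (as referenced above) is to poll the rebalance progress REST endpoint on the orchestrator node; the host and credentials below are placeholders for this run:

# Poll rebalance progress every 60s; the reported per-node progress should keep increasing during a healthy rebalance.
while true; do
  curl -s -u Administrator:password http://172.23.110.67:8091/pools/default/rebalanceProgress
  echo
  sleep 60
done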
QE Test:
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/magma_temp_job3.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.Hospital.Murphy.ClusterOpsVolume,nodes_init=3,graceful=True,skip_cleanup=True,num_items=100000000,num_buckets=1,bucket_names=GleamBook,doc_size=1300,bucket_type=membase,eviction_policy=fullEviction,iterations=2,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,assert_crashes_on_load=True,num_collections=50,maxttl=10,num_indexes=25,pc=10,index_nodes=0,cbas_nodes=1,fts_nodes=0,ops_rate=200000,ramQuota=68267,doc_ops=create:update:delete:read,mutation_perc=100,rebl_ops_rate=50000,key_type=RandomKey -m rest'
Attachments
Issue Links
- is duplicated by: MB-52574 - [30TB, 1% KV DGM, FTS]: No progress in data movement during rebalance in of one KV node since 2+ hours of rebalance start. (Closed)
Gerrit Reviews
For Gerrit Dashboard: MB-52490

# | Subject | Branch | Project | Status | CR | V
---|---|---|---|---|---|---
176234,1 | MB-52490: Add BackfillManager::producer member | neo | kv_engine | NEW | -1 | +1
176236,5 | MB-52490: Avoid that a Producer consumes all backfills.maxRunning slots | neo | kv_engine | NEW | -1 | -1
176424,6 | MB-52490: Move Backfill Task to its own source files | neo | kv_engine | NEW | 0 | +1
176712,8 | MB-52490: Pass Producer to BackfillManagerTask | neo | kv_engine | NEW | -1 | -1
176802,5 | MB-52490: Prevent that backfill-busy Producers block others | neo | kv_engine | NEW | -1 | -1
We seem to be hitting MB-44562 again here. We have hit the max number of backfills that can run on a node:
ep_dcp_max_running_backfills: 4096
ep_dcp_num_running_backfills: 4096
Replication streams for vBuckets 1010/1011 stay in the pending queue:
eq_dcpq:replication:ns_1@172.23.110.67->ns_1@172.23.110.70:GleamBookUsers0:backfill_num_pending: 2
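For reference, these backfill counters can be read live on the affected node with cbstats; a minimal sketch, assuming the default data port 11210 and placeholder credentials (the bucket name is the one from this run):

/opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password -b GleamBookUsers0 dcp | grep -E "running_backfills|backfill_num_pending"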
We are close to 10k outbound streams on the node:
cbcollect_info_ns_1@172.23.110.67_20220609-182552 % grep -E "stream_.*opaque" stats.log | wc -l
9623
Most of them are cbas streams:
cbcollect_info_ns_1@172.23.110.67_20220609-182552 % grep -E "stream_.*opaque" stats.log | grep "replication" | wc -l
673
cbcollect_info_ns_1@172.23.110.67_20220609-182552 % grep -E "stream_.*opaque" stats.log | grep "cbas" | wc -l
8950
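A rough one-liner to get the same per-service breakdown in a single pass (it only counts the well-known connection-name substrings, so the totals may not add up exactly to the overall stream count):

cbcollect_info_ns_1@172.23.110.67_20220609-182552 % grep -E "stream_.*opaque" stats.log | grep -oE "replication|cbas|fts" | sort | uniq -c | sort -rn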
Hey Ritesh Agarwal, could you ask the cbas team to have a look here and check whether the high number of open streams is normal/expected behaviour, please?
We had a similar problem with FTS in MB-44562: at some point FTS had introduced a bug where stale streams were left open. Meanwhile, I'm reviewing some possible improvements in the way KV handles this kind of scenario.
Update
Ritesh Agarwal, there's also an ongoing discussion in MB-51950, where the same CBAS behaviour (i.e., creating one stream per collection) pushed a single node to create ~125k streams. CBAS seems to have targeted the fix for that at Morpheus (MB-45591).