Details
- Type: Bug
- Status: Closed
- Priority: Critical
- Resolution: Duplicate
- 7.1.0
- 7.1.0-1312
- Triage: Untriaged
- 1
- Unknown
- Sprint: KV 2021-Nov, KV 2021-Dec
Description
1. Create a 3 kv node and 1 index/n1ql node cluster
2. Create magma bucket, 100 collections under default scope
3. Load 5B items and upsert them
4. Load another 5B items and upsert them as well
5. Create 100 indexes on 100 collections and build them. Start 100 QPS
6. Rebalance In 1 KV node. Rebalance failed
7. At this moment the indexer drain rate is very slow and building indexes is taking ages.
If we look at idx45, the remaining mutations are 10M while the drain rate is ~500/s
For today, the cluster is live at: http://172.23.110.64:8091/ui/index.htm
Note: The total drain rate across indexes is around 10k-75k/s on a 72-core machine with SSD disk and 200 GB indexer RAM
Total disk writes vary between 5 and 30 Mbps
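For scale, a back-of-the-envelope calculation from the figures above (a rough sketch; it assumes the observed drain rates hold steady):
{code}
# Rough catch-up-time estimate from the numbers quoted above.
remaining_mutations = 10_000_000      # idx45: ~10M mutations still pending
drain_rate_per_sec = 500              # idx45: observed drain rate ~500/s

seconds = remaining_mutations / drain_rate_per_sec
print(f"idx45 alone needs ~{seconds / 3600:.1f} hours to drain")   # ~5.6 hours

# At the low end of the aggregate drain rate (10k/s), 5B items would need:
print(f"5B items at 10k/s: ~{5_000_000_000 / 10_000 / 86400:.1f} days")  # ~5.8 days
{code}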
{code:title=QE Test}
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/magma_temp_job3.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.Hospital.Murphy.test_rebalance,nodes_init=3,graceful=True,skip_cleanup=True,num_items=100000000,num_buckets=1,bucket_names=GleamBook,doc_size=1536,bucket_type=membase,eviction_policy=fullEviction,iterations=5,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,assert_crashes_on_load=True,num_collections=50,maxttl=10,num_indexes=100,pc=10,index_nodes=1,cbas_nodes=0,fts_nodes=0,ops_rate=200000,ramQuota=102400,doc_ops=create:update:delete:read,mutation_perc=100,rebl_ops_rate=50000,key_type=RandomKey -m rest'
{code}
Attachments
Issue Links
Activity
Field | Original Value | New Value |
---|---|---|
Attachment | idx45.png [ 160871 ] |
Link to Log File, atop/blg, CBCollectInfo, Core dump |
http://supportal.couchbase.com/snapshot/2bfda4db1650f0892acf6675832aebbd::2
s3://cb-customers-secure/indx_drain_rate/2021-09-23/collectinfo-2021-09-23t072519-ns_1@172.23.110.65.zip |
Labels | volume-test | magma volume-test |
Assignee | Jeelan Poola [ jeelan.poola ] | Varun Velamuri [ varun.velamuri ] |
Labels | magma volume-test | affects-neo-testing magma volume-test |
Component/s | secondary-index [ 11211 ] | |
Component/s | indexing [ 15110 ] |
Component/s | couchbase-bucket [ 10173 ] | |
Component/s | secondary-index [ 11211 ] |
Assignee | Varun Velamuri [ varun.velamuri ] | Daniel Owen [ owend ] |
Epic Link | | |
Rank | Ranked higher |
Labels | affects-neo-testing magma volume-test | magma volume-test |
Assignee | Daniel Owen [ owend ] | Dave Rigby [ drigby ] |
Assignee | Dave Rigby [ drigby ] | Daniel Owen [ owend ] |
Assignee | Daniel Owen [ owend ] | James Harrison [ james.harrison ] |
Status | Open [ 1 ] | In Progress [ 3 ] |
Sprint | KV 2021-Nov [ 1866 ] |
Rank | Ranked lower |
Rank | Ranked higher |
Rank | Ranked lower |
Sprint | KV 2021-Nov [ 1866 ] | KV 2021-Nov, Magma 2021-Sep-21 [ 1866, 1906 ] |
Assignee | James Harrison [ james.harrison ] | Ritesh Agarwal [ ritesh.agarwal ] |
Resolution | Duplicate [ 3 ] | |
Status | In Progress [ 3 ] | Resolved [ 5 ] |
Status | Resolved [ 5 ] | Closed [ 6 ] |
Node 110.67 is the only indexer node.
a. From ns_server.indexer_stats.log, the ts_queue_size stat for the keyspace INIT_STREAM_keyspace_GleamBookUsers0:_default:VolumeCollection5 is almost always "0". This means the indexer is keeping up with all the mutations it receives from the projector
b. The mutation_queue_size for this stream is <1000 most of the time
c. Also, looking at the projector latency stats from the time of the cbcollect, the projector is able to send mutations in <2 ms. This means nothing is stuck on the indexer side:
2021-09-23T00:28:00.951-07:00 projlat {"INIT_STREAM/172.23.110.64:8091/projector_latency":1532399,"INIT_STREAM/172.23.110.65:8091/projector_latency":1257879,"INIT_STREAM/172.23.110.66:8091/projector_latency":1763309}
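As a side note, the same queue stats can be spot-checked live rather than from the logs. A minimal sketch, assuming the default GSI stats endpoint on port 9102 and that the queue stat names appear in the /stats output as they do in indexer_stats.log (credentials are placeholders):
{code}
# Spot-check indexer queue stats over REST (assumes default stats port 9102).
import requests

INDEXER = "172.23.110.67"
AUTH = ("Administrator", "password")   # placeholder credentials

stats = requests.get(f"http://{INDEXER}:9102/stats", auth=AUTH, timeout=30).json()
for key, value in sorted(stats.items()):
    if "ts_queue_size" in key or "mutation_queue_size" in key:
        print(f"{key}: {value}")   # consistently 0 => indexer is keeping up
{code}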
d. Looking at the ENDP stats for each of the projectors at the time of the cbcollect:
Node 172.23.110.64:
2021-09-23T00:28:36.630-07:00 [Info] ENDP[<-(172.23.110.67:9103,1415)<-127.0.0.1:8091 #INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513] stats: {"mutCount":959883083,"upsertCount":490463993,"deleteCount":0,"upsdelCount":409161666,"syncCount":60227416,"beginCount":12226,"endCount":3676,"snapCount":6754,"flushCount":12522552,"collectionCreate":6156,"collectionDrop":0,"collectionFlush":0,"scopeCreate":0,"scopeDrop":0,"collectionChanged":0,"updateSeqno":0,"seqnoAdvanced":1196,"osoSnapshotStart":0,"osoSnapshotEnd":0,"latency.min":10541,"latency.max":51845162353,"latency.avg":20396924,"latency.movingAvg":1210177,"endpChLen":0}
Node 172.23.110.65:
2021-09-23T00:27:56.635-07:00 [Info] ENDP[<-(172.23.110.67:9103,b452)<-127.0.0.1:8091 #INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513] stats: {"mutCount":1186039171,"upsertCount":606346512,"deleteCount":1871,"upsdelCount":519622318,"syncCount":60038097,"beginCount":12284,"endCount":3751,"snapCount":6820,"flushCount":13867551,"collectionCreate":6138,"collectionDrop":0,"collectionFlush":0,"scopeCreate":0,"scopeDrop":0,"collectionChanged":0,"updateSeqno":0,"seqnoAdvanced":1380,"osoSnapshotStart":0,"osoSnapshotEnd":0,"latency.min":9757,"latency.max":51780373606,"latency.avg":7777851,"latency.movingAvg":1341601,"endpChLen":0}
Node 172.23.110.66:
2021-09-23T00:27:54.181-07:00 [Info] ENDP[<-(172.23.110.67:9103, 7ac)<-127.0.0.1:8091 #INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513] stats: {"mutCount":1155055689,"upsertCount":585126006,"deleteCount":3675,"upsdelCount":509897373,"syncCount":59998286,"beginCount":12276,"endCount":3751,"snapCount":6820,"flushCount":14300014,"collectionCreate":6138,"collectionDrop":0,"collectionFlush":0,"scopeCreate":0,"scopeDrop":0,"collectionChanged":0,"updateSeqno":0,"seqnoAdvanced":1364,"osoSnapshotStart":0,"osoSnapshotEnd":0,"latency.min":8640,"latency.max":51718189901,"latency.avg":5483651,"latency.movingAvg":1876106,"endpChLen":0}
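(For reference, a small helper to pull the JSON payload out of one of these ENDP stats lines. It assumes the latency fields are reported in nanoseconds, which is consistent with the <2 ms projector latency noted in (c); the sample line below is abridged.)
{code}
# Parse an ENDP stats line and report channel length / latency in ms.
import json
import re

line = ('2021-09-23T00:28:36.630-07:00 [Info] ENDP[...] stats: '
        '{"mutCount":959883083,"endpChLen":0,'
        '"latency.avg":20396924,"latency.movingAvg":1210177}')

stats = json.loads(re.search(r"stats: (\{.*\})", line).group(1))
print("endpChLen:", stats["endpChLen"])                      # 0 => endpoint not backed up
print("movingAvg latency: %.2f ms" % (stats["latency.movingAvg"] / 1e6))
{code}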
The endpoint channel length (endpChLen) is almost always "0" in the endpoint stats. This means that either (1) the projector is taking a long time to process mutations, or (2) memcached is taking a long time to stream mutations
e. The projector taking a long time to process mutations can be ruled out, since the projector latency is <2 ms from (c)
f. Looking at the DCPT stats on node 110.64, the DCP connection's receive channel length (rcvchLen) is almost always "0". This confirms that memcached is slow in sending mutations:
2021-09-23T00:28:36.620-07:00 [Info] DCPT[secidx:proj-GleamBookUsers0:_default:VolumeCollection5-INIT_STREAM_TOPIC_53c16e3f091971da1cf99c771ba1e513-7806952980894807309/3] ##3 stats: {"bytes":24318484283,"bufferacks":5914,"toAckBytes":32,"streamreqs":170,"snapshots":170,"mutations":16602714,"collectionCreate":85,"collectionDrop":0,"collectionFlush":0,"scopeCreate":0,"scopeDrop":0,"collectionChanged":0,"seqnoAdvanced":170,"osoSnapshotStart":0,"osoSnapshotEnd":0,"streamends":85,"closestreams":0,"lastAckTime":3h56m11.511080795s,"lastNoopSend":18.363453791s,"lastNoopRecv":18.363475492s,"lastMsgSend":18.363453791s,"lastMsgRecv":18.363475492s,"rcvchLen":0,"incomingMsg":16603399}
g. From the memcached logs, it looks like KV is having issues with backfill:
2021-09-23T00:28:21.828256-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool3: 366 ms
2021-09-23T00:28:21.859324-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool7: 376 ms
2021-09-23T00:28:22.146628-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool1: 39 s
2021-09-23T00:28:24.601576-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool7: 372 ms
2021-09-23T00:28:24.676034-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool1: 420 ms
2021-09-23T00:28:25.310269-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool1: 316 ms
2021-09-23T00:28:25.618340-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool1: 308 ms
2021-09-23T00:28:25.854327-07:00 WARNING (No Engine) Slow runtime for 'Backfilling items for a DCP Connection' on thread AuxIoPool2: 38 s
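A quick way to gauge how widespread these slow backfills are is to summarise the warnings from the memcached.log in the cbcollect. A rough sketch (the log path is an example):
{code}
# Summarise "Slow runtime ... Backfilling" warnings per AuxIO thread.
import re
from collections import Counter

pattern = re.compile(
    r"Slow runtime for 'Backfilling items for a DCP Connection' on thread (\S+): (\d+) (ms|s)")

counts, worst_ms = Counter(), 0.0
with open("memcached.log") as f:        # example path from the cbcollect
    for line in f:
        m = pattern.search(line)
        if m:
            thread, value, unit = m.group(1), float(m.group(2)), m.group(3)
            ms = value * (1000 if unit == "s" else 1)
            counts[thread] += 1
            worst_ms = max(worst_ms, ms)

print("slow-backfill warnings per AuxIO thread:", dict(counts))
print(f"worst observed: {worst_ms:.0f} ms")
{code}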
Daniel Owen, I do not see any issue on the indexer side that would explain the slow mutation rate. Can you please take a look from the KV side to see why the mutation transfer from memcached to the projector is so slow?