Details
-
Bug
-
Resolution: Won't Do
-
Major
-
master
-
None
-
Untriaged
-
Unknown
Description
The R-0 Swap Rebalance test http://perf.jenkins.couchbase.com/job/hera-hidd/245 shows a very slow rebalance progression (8% completed in approximately 9 hours, then I aborted the test).
The baseline Couchstore test http://perf.jenkins.couchbase.com/job/hera-hidd/242 performs a full rebalance in approximately 3 hours.
The test scenario is:
- 3 KV nodes
- 1 bucket, 1 replica
- First: Load Phase of 2 billion items
- Second: Access Phase (90% reads, 5% creates, 5% updated) + Swap Rebalance of 1 node (swap-out 172.23.96.117, swap-in 172.23.96.120)
On the Producer (172.23.96.117), I see that the BackfillManagerTask is constantly very slow during the entire rebalance (from T14:35 to T23:29):
2018-04-10T14:35:17.216210Z WARNING (bucket-1) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_0: 1442 ms
|
2018-04-10T14:35:18.522261Z WARNING (bucket-1) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_0: 1306 ms
|
2018-04-10T14:35:19.929839Z WARNING (bucket-1) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_0: 1407 ms
|
2018-04-10T14:35:21.315669Z WARNING (bucket-1) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_0: 1385 ms
|
2018-04-10T14:35:22.765742Z WARNING (bucket-1) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_0: 1450 ms
|
2018-04-10T14:35:24.195643Z WARNING (bucket-1) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_0: 1429 ms
|
2018-04-10T14:35:25.928889Z WARNING (bucket-1) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_0: 1733 ms
|
..
|
2018-04-10T23:29:53.839544Z WARNING (bucket-1) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_2: 753 ms
|
2018-04-10T23:29:54.535201Z WARNING (bucket-1) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_2: 695 ms
|
In 'kv_trace.json' (which contains the Phosphor event tracing), I see that the slow runtime of the BackfillManagerTask is caused by the slow runtime of the RocksDBKVStore::scan() function, e.g.:
{"name":"BackFillManagerTask","cat":"ep-engine\/task","ph":"B","ts":50446059825,"pid":5677,"tid":5824,"args":{}},
|
{"name":"scan","cat":"RocksDBKVStore","ph":"B","ts":50446059827,"pid":5677,"tid":5824,"args":{"vbid":120,"startSeqno":11628545}},
|
{"name":"scan","cat":"RocksDBKVStore","ph":"E","ts":50447374056,"pid":5677,"tid":5824,"args":{}},
|
{"name":"BackFillManagerTask","cat":"ep-engine\/task","ph":"E","ts":50447374058,"pid":5677,"tid":5824,"args":{}},
|