Details
-
Bug
-
Resolution: Fixed
-
Critical
-
6.5.0
-
6.5.0-4558
-
Untriaged
-
Unknown
Description
Note: The test has been executed on the perf cluster with machines having RAM of 128GB
Steps to reproduce:
1. Create a 4 node cluster
2. Create a bucket with RAM=400GB and replica=1
3. Load 50M items in the bucket:
Bucket statistics
|
+----------------+---------+----------+-----+----------+--------------+--------------+-------------+
|
| Bucket | Type | Replicas | TTL | Items | RAM Quota | RAM Used | Disk Used |
|
+----------------+---------+----------+-----+----------+--------------+--------------+-------------+
|
| GleamBookUsers | membase | 1 | 0 | 50000000 | 431270920192 | 136479113600 | 40289584016 |
|
+----------------+---------+----------+-----+----------+--------------+--------------+-------------+
|
4. Rebalance in 1 node and start loading another 10M creates, 10M deletes, 20M updates all in parallel. It is observed that rebalance could not start as data loading is happening.
16:24:08 +-------------+----------+--------------+
|
16:24:08 | Nodes | Services | Status |
|
16:24:08 +-------------+----------+--------------+
|
16:24:08 | 172.23.97.5 | [u'kv'] | Cluster node |
|
16:24:08 | 172.23.97.3 | [u'kv'] | Cluster node |
|
16:24:08 | 172.23.97.6 | [u'kv'] | Cluster node |
|
16:24:08 | 172.23.97.4 | [u'kv'] | Cluster node |
|
16:24:08 | 172.23.97.7 | None | <--- IN --- |
|
16:24:08 +-------------+----------+--------------+
|
After waiting for 240s for rebalance to start which it couldn't, the test failed and hence data loading also stops. Observed the cluster after the test failed, rebalance started immediately.
Rebalance started at 16:24:08 and until 16:28:10 there is no progress in rebalance and test failed:
16:24:08 2019-10-23 03:54:08,944 | test | INFO | pool-2-thread-7 | [table_view:display:72] Rebalance Overview
|
16:24:08 +-------------+----------+--------------+
|
16:24:08 | Nodes | Services | Status |
|
16:24:08 +-------------+----------+--------------+
|
16:24:08 | 172.23.97.5 | [u'kv'] | Cluster node |
|
16:24:08 | 172.23.97.3 | [u'kv'] | Cluster node |
|
16:24:08 | 172.23.97.6 | [u'kv'] | Cluster node |
|
16:24:08 | 172.23.97.4 | [u'kv'] | Cluster node |
|
16:24:08 | 172.23.97.7 | None | <--- IN --- |
|
16:24:08 +-------------+----------+--------------+
|
16:24:08 2019-10-23 03:54:08,960 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:24:19 2019-10-23 03:54:18,999 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:24:29 2019-10-23 03:54:29,048 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:24:39 2019-10-23 03:54:39,098 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:24:49 2019-10-23 03:54:49,148 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:24:59 2019-10-23 03:54:59,198 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:25:09 2019-10-23 03:55:09,247 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:25:19 2019-10-23 03:55:19,296 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:25:29 2019-10-23 03:55:29,358 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:25:39 2019-10-23 03:55:39,410 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:25:49 2019-10-23 03:55:49,466 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:25:59 2019-10-23 03:55:59,523 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:26:09 2019-10-23 03:56:09,579 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:26:19 2019-10-23 03:56:19,634 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:26:29 2019-10-23 03:56:29,693 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:26:39 2019-10-23 03:56:39,750 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:26:49 2019-10-23 03:56:49,805 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:26:59 2019-10-23 03:56:59,865 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:27:09 2019-10-23 03:57:09,921 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:27:20 2019-10-23 03:57:19,979 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:27:30 2019-10-23 03:57:30,032 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:27:40 2019-10-23 03:57:40,081 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:27:50 2019-10-23 03:57:50,135 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:28:00 2019-10-23 03:58:00,190 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:28:10 2019-10-23 03:58:10,246 | test | INFO | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
|
16:28:10 2019-10-23 03:58:10,322 | test | ERROR | pool-2-thread-7 | [task:call:222] Rebalance Failed: seems like rebalance hangs. please check logs!
|
16:28:10 FAIL
|
Also, rebalance took too much time to complete. It almost took more that 1 hour while on other cluster with 7 nodes(20GB boxes) and bucket RAM=140GB, 50M items, rebalance finished in ~300s.
Logs:
https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.10.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.3.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.4.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.5.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.6.zip
Same issue is observed on 6.5.0-4667-enterprise build also on the perf cluster.
Attachments
Issue Links
- relates to
-
MB-36249 SyncWrite throughput for persistence-related levels is ~30% slower than OBSERVE-based durability
- Closed