Loading...

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 6.5.0
Affects Version/s: 6.5.0
Component/s: couchbase-bucket
Labels:
Environment:
6.5.0-4558

Triage:
Untriaged
Is this a Regression?:
Unknown

Description

Note: The test has been executed on the perf cluster with machines having RAM of 128GB

Steps to reproduce:
1. Create a 4 node cluster
2. Create a bucket with RAM=400GB and replica=1
3. Load 50M items in the bucket:

Bucket statistics

+----------------+---------+----------+-----+----------+--------------+--------------+-------------+

| Bucket         | Type    | Replicas | TTL | Items    | RAM Quota    | RAM Used     | Disk Used   |

+----------------+---------+----------+-----+----------+--------------+--------------+-------------+

| GleamBookUsers | membase | 1        | 0   | 50000000 | 431270920192 | 136479113600 | 40289584016 |

+----------------+---------+----------+-----+----------+--------------+--------------+-------------+

4. Rebalance in 1 node and start loading another 10M creates, 10M deletes, 20M updates all in parallel. It is observed that rebalance could not start as data loading is happening.

16:24:08 +-------------+----------+--------------+

16:24:08 | Nodes       | Services | Status       |

16:24:08 +-------------+----------+--------------+

16:24:08 | 172.23.97.5 | [u'kv']  | Cluster node |

16:24:08 | 172.23.97.3 | [u'kv']  | Cluster node |

16:24:08 | 172.23.97.6 | [u'kv']  | Cluster node |

16:24:08 | 172.23.97.4 | [u'kv']  | Cluster node |

16:24:08 | 172.23.97.7 | None     | <--- IN ---  |

16:24:08 +-------------+----------+--------------+

After waiting for 240s for rebalance to start which it couldn't, the test failed and hence data loading also stops. Observed the cluster after the test failed, rebalance started immediately.

Rebalance started at 16:24:08 and until 16:28:10 there is no progress in rebalance and test failed:

16:24:08 2019-10-23 03:54:08,944 | test  | INFO    | pool-2-thread-7 | [table_view:display:72] Rebalance Overview

16:24:08 +-------------+----------+--------------+

16:24:08 | Nodes       | Services | Status       |

16:24:08 +-------------+----------+--------------+

16:24:08 | 172.23.97.5 | [u'kv']  | Cluster node |

16:24:08 | 172.23.97.3 | [u'kv']  | Cluster node |

16:24:08 | 172.23.97.6 | [u'kv']  | Cluster node |

16:24:08 | 172.23.97.4 | [u'kv']  | Cluster node |

16:24:08 | 172.23.97.7 | None     | <--- IN ---  |

16:24:08 +-------------+----------+--------------+

16:24:08 2019-10-23 03:54:08,960 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:24:19 2019-10-23 03:54:18,999 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:24:29 2019-10-23 03:54:29,048 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:24:39 2019-10-23 03:54:39,098 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:24:49 2019-10-23 03:54:49,148 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:24:59 2019-10-23 03:54:59,198 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:25:09 2019-10-23 03:55:09,247 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:25:19 2019-10-23 03:55:19,296 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:25:29 2019-10-23 03:55:29,358 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:25:39 2019-10-23 03:55:39,410 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:25:49 2019-10-23 03:55:49,466 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:25:59 2019-10-23 03:55:59,523 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:26:09 2019-10-23 03:56:09,579 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:26:19 2019-10-23 03:56:19,634 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:26:29 2019-10-23 03:56:29,693 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:26:39 2019-10-23 03:56:39,750 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:26:49 2019-10-23 03:56:49,805 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:26:59 2019-10-23 03:56:59,865 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:27:09 2019-10-23 03:57:09,921 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:27:20 2019-10-23 03:57:19,979 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:27:30 2019-10-23 03:57:30,032 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:27:40 2019-10-23 03:57:40,081 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:27:50 2019-10-23 03:57:50,135 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:28:00 2019-10-23 03:58:00,190 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:28:10 2019-10-23 03:58:10,246 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0

16:28:10 2019-10-23 03:58:10,322 | test  | ERROR   | pool-2-thread-7 | [task:call:222] Rebalance Failed: seems like rebalance hangs. please check logs!

16:28:10 FAIL

Also, rebalance took too much time to complete. It almost took more that 1 hour while on other cluster with 7 nodes(20GB boxes) and bucket RAM=140GB, 50M items, rebalance finished in ~300s.

Logs:
https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.10.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.3.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.4.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.5.zip
https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.6.zip

Same issue is observed on 6.5.0-4667-enterprise build also on the perf cluster.

Attachments

Issue Links

relates to

MB-36249 SyncWrite throughput for persistence-related levels is ~30% slower than OBSERVE-based durability

Closed

Gerrit Reviews

- Issue Only
- Show All Reviews
- Show Open Reviews
- Show All Issues
- Show Open Issues

No reviews matched the request. Check your Options in the drop-down menu of this sections header.

Rebalance is not starting until the data loading stops

Details

Description

Attachments

Issue Links

Gerrit Reviews

Activity

People

Dates

Gerrit Reviews

PagerDuty