Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-36627

Rebalance is not starting until the data loading stops

    XMLWordPrintable

Details

    • Untriaged
    • Unknown

    Description

      Note: The test has been executed on the perf cluster with machines having RAM of 128GB

      Steps to reproduce:
      1. Create a 4 node cluster
      2. Create a bucket with RAM=400GB and replica=1
      3. Load 50M items in the bucket:

      Bucket statistics
      +----------------+---------+----------+-----+----------+--------------+--------------+-------------+
      | Bucket         | Type    | Replicas | TTL | Items    | RAM Quota    | RAM Used     | Disk Used   |
      +----------------+---------+----------+-----+----------+--------------+--------------+-------------+
      | GleamBookUsers | membase | 1        | 0   | 50000000 | 431270920192 | 136479113600 | 40289584016 |
      +----------------+---------+----------+-----+----------+--------------+--------------+-------------+
      

      4. Rebalance in 1 node and start loading another 10M creates, 10M deletes, 20M updates all in parallel. It is observed that rebalance could not start as data loading is happening.

      16:24:08 +-------------+----------+--------------+
      16:24:08 | Nodes       | Services | Status       |
      16:24:08 +-------------+----------+--------------+
      16:24:08 | 172.23.97.5 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.3 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.6 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.4 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.7 | None     | <--- IN ---  |
      16:24:08 +-------------+----------+--------------+
      

      After waiting for 240s for rebalance to start which it couldn't, the test failed and hence data loading also stops. Observed the cluster after the test failed, rebalance started immediately.

      Rebalance started at 16:24:08 and until 16:28:10 there is no progress in rebalance and test failed:

      16:24:08 2019-10-23 03:54:08,944 | test  | INFO    | pool-2-thread-7 | [table_view:display:72] Rebalance Overview
      16:24:08 +-------------+----------+--------------+
      16:24:08 | Nodes       | Services | Status       |
      16:24:08 +-------------+----------+--------------+
      16:24:08 | 172.23.97.5 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.3 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.6 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.4 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.7 | None     | <--- IN ---  |
      16:24:08 +-------------+----------+--------------+
      16:24:08 2019-10-23 03:54:08,960 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:24:19 2019-10-23 03:54:18,999 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:24:29 2019-10-23 03:54:29,048 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:24:39 2019-10-23 03:54:39,098 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:24:49 2019-10-23 03:54:49,148 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:24:59 2019-10-23 03:54:59,198 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:25:09 2019-10-23 03:55:09,247 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:25:19 2019-10-23 03:55:19,296 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:25:29 2019-10-23 03:55:29,358 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:25:39 2019-10-23 03:55:39,410 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:25:49 2019-10-23 03:55:49,466 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:25:59 2019-10-23 03:55:59,523 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:26:09 2019-10-23 03:56:09,579 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:26:19 2019-10-23 03:56:19,634 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:26:29 2019-10-23 03:56:29,693 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:26:39 2019-10-23 03:56:39,750 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:26:49 2019-10-23 03:56:49,805 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:26:59 2019-10-23 03:56:59,865 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:27:09 2019-10-23 03:57:09,921 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:27:20 2019-10-23 03:57:19,979 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:27:30 2019-10-23 03:57:30,032 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:27:40 2019-10-23 03:57:40,081 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:27:50 2019-10-23 03:57:50,135 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:28:00 2019-10-23 03:58:00,190 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:28:10 2019-10-23 03:58:10,246 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:28:10 2019-10-23 03:58:10,322 | test  | ERROR   | pool-2-thread-7 | [task:call:222] Rebalance Failed: seems like rebalance hangs. please check logs!
      16:28:10 FAIL
      

      Also, rebalance took too much time to complete. It almost took more that 1 hour while on other cluster with 7 nodes(20GB boxes) and bucket RAM=140GB, 50M items, rebalance finished in ~300s.

      Logs:
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.10.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.3.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.4.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.5.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.6.zip

      Same issue is observed on 6.5.0-4667-enterprise build also on the perf cluster.

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              prateek.kumar Prateek Kumar (Inactive)
              ritesh.agarwal Ritesh Agarwal
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty