Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-36627

Rebalance is not starting until the data loading stops

    XMLWordPrintable

Details

    • Untriaged
    • Unknown

    Description

      Note: The test has been executed on the perf cluster with machines having RAM of 128GB

      Steps to reproduce:
      1. Create a 4 node cluster
      2. Create a bucket with RAM=400GB and replica=1
      3. Load 50M items in the bucket:

      Bucket statistics
      +----------------+---------+----------+-----+----------+--------------+--------------+-------------+
      | Bucket         | Type    | Replicas | TTL | Items    | RAM Quota    | RAM Used     | Disk Used   |
      +----------------+---------+----------+-----+----------+--------------+--------------+-------------+
      | GleamBookUsers | membase | 1        | 0   | 50000000 | 431270920192 | 136479113600 | 40289584016 |
      +----------------+---------+----------+-----+----------+--------------+--------------+-------------+
      

      4. Rebalance in 1 node and start loading another 10M creates, 10M deletes, 20M updates all in parallel. It is observed that rebalance could not start as data loading is happening.

      16:24:08 +-------------+----------+--------------+
      16:24:08 | Nodes       | Services | Status       |
      16:24:08 +-------------+----------+--------------+
      16:24:08 | 172.23.97.5 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.3 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.6 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.4 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.7 | None     | <--- IN ---  |
      16:24:08 +-------------+----------+--------------+
      

      After waiting for 240s for rebalance to start which it couldn't, the test failed and hence data loading also stops. Observed the cluster after the test failed, rebalance started immediately.

      Rebalance started at 16:24:08 and until 16:28:10 there is no progress in rebalance and test failed:

      16:24:08 2019-10-23 03:54:08,944 | test  | INFO    | pool-2-thread-7 | [table_view:display:72] Rebalance Overview
      16:24:08 +-------------+----------+--------------+
      16:24:08 | Nodes       | Services | Status       |
      16:24:08 +-------------+----------+--------------+
      16:24:08 | 172.23.97.5 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.3 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.6 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.4 | [u'kv']  | Cluster node |
      16:24:08 | 172.23.97.7 | None     | <--- IN ---  |
      16:24:08 +-------------+----------+--------------+
      16:24:08 2019-10-23 03:54:08,960 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:24:19 2019-10-23 03:54:18,999 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:24:29 2019-10-23 03:54:29,048 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:24:39 2019-10-23 03:54:39,098 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:24:49 2019-10-23 03:54:49,148 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:24:59 2019-10-23 03:54:59,198 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:25:09 2019-10-23 03:55:09,247 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:25:19 2019-10-23 03:55:19,296 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:25:29 2019-10-23 03:55:29,358 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:25:39 2019-10-23 03:55:39,410 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:25:49 2019-10-23 03:55:49,466 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:25:59 2019-10-23 03:55:59,523 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:26:09 2019-10-23 03:56:09,579 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:26:19 2019-10-23 03:56:19,634 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:26:29 2019-10-23 03:56:29,693 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:26:39 2019-10-23 03:56:39,750 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:26:49 2019-10-23 03:56:49,805 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:26:59 2019-10-23 03:56:59,865 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:27:09 2019-10-23 03:57:09,921 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:27:20 2019-10-23 03:57:19,979 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:27:30 2019-10-23 03:57:30,032 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:27:40 2019-10-23 03:57:40,081 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:27:50 2019-10-23 03:57:50,135 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:28:00 2019-10-23 03:58:00,190 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:28:10 2019-10-23 03:58:10,246 | test  | INFO    | pool-2-thread-7 | [task:check:306] Rebalance - status: running, progress: 0
      16:28:10 2019-10-23 03:58:10,322 | test  | ERROR   | pool-2-thread-7 | [task:call:222] Rebalance Failed: seems like rebalance hangs. please check logs!
      16:28:10 FAIL
      

      Also, rebalance took too much time to complete. It almost took more that 1 hour while on other cluster with 7 nodes(20GB boxes) and bucket RAM=140GB, 50M items, rebalance finished in ~300s.

      Logs:
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.10.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.3.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.4.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.5.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/SlowRebalance/collectinfo-2019-10-23T084313-ns_1%40172.23.97.6.zip

      Same issue is observed on 6.5.0-4667-enterprise build also on the perf cluster.

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            ritam.sharma Ritam Sharma added a comment -

            Similar issue was logged by Prateek - https://issues.couchbase.com/browse/MB-36607.

            ritam.sharma Ritam Sharma added a comment - Similar issue was logged by Prateek - https://issues.couchbase.com/browse/MB-36607 .
            drigby Dave Rigby added a comment -

            Ritesh Agarwal Is this a regression? If so what was the last successful build?

            drigby Dave Rigby added a comment - Ritesh Agarwal Is this a regression? If so what was the last successful build?
            drigby Dave Rigby added a comment -

            Ritesh Agarwal Could you please attach the logs for the node coming into the cluster (97.7) ?

            drigby Dave Rigby added a comment - Ritesh Agarwal Could you please attach the logs for the node coming into the cluster (97.7) ?
            owend Daniel Owen added a comment -

            Hi Ritesh Agarwal,

            Could you try re-running on couchbase-server-6.5.0-4631?
            It may be a problem with replacing notification pipe FDs with libevent event_active().

            If that works OK - could you re-run with couchbase-server-6.5.0-4632 - that's when the change was introduced - so will help to isolate.

            thanks

            owend Daniel Owen added a comment - Hi Ritesh Agarwal , Could you try re-running on couchbase-server-6.5.0-4631? It may be a problem with replacing notification pipe FDs with libevent event_active(). If that works OK - could you re-run with couchbase-server-6.5.0-4632 - that's when the change was introduced - so will help to isolate. thanks

            Prateek Kumar - Can you please do a re-run and update the ticket.

            ritam.sharma Ritam Sharma added a comment - Prateek Kumar - Can you please do a re-run and update the ticket.
            lynn.straus Lynn Straus added a comment -

            Added due date field (preset to Nov 15). Please update the due date to the current ETA for a fix. Thanks.

            lynn.straus Lynn Straus added a comment - Added due date field (preset to Nov 15). Please update the due date to the current ETA for a fix. Thanks.
            owend Daniel Owen added a comment -

            Hi Prateek Kumar
            Given that the replacing notification pipe FDs with libevent event_active() has been reverted you can run with a later build.

            Could you re-run with couchbase-server-6.5.0-4724 or later, as we also had some serious locking bug that is addressed in 4724?

            thanks

            owend Daniel Owen added a comment - Hi Prateek Kumar Given that the replacing notification pipe FDs with libevent event_active() has been reverted you can run with a later build. Could you re-run with couchbase-server-6.5.0-4724 or later, as we also had some serious locking bug that is addressed in 4724? thanks
            owend Daniel Owen added a comment -

            Given the patch https://github.com/couchbase/kv_engine/commit/50429d8c9fefb4f1a9018ba83b8671c4a76a3bd6 from MB-36249 was reverted.
            Will mark as resolved. Please re-open if see in couchbase-server-6.5.0-4724 or later.

            thanks

            owend Daniel Owen added a comment - Given the patch https://github.com/couchbase/kv_engine/commit/50429d8c9fefb4f1a9018ba83b8671c4a76a3bd6 from MB-36249 was reverted. Will mark as resolved. Please re-open if see in couchbase-server-6.5.0-4724 or later. thanks

            People

              prateek.kumar Prateek Kumar (Inactive)
              ritesh.agarwal Ritesh Agarwal
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty