Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-44720

[Collections] Rebalance exited with dcp_wait_for_data_move failed

    XMLWordPrintable

Details

    Description

      Summary:
      Rebalance in volume test failed at this step:

      2021-03-02 22:52:40,614 | test  | INFO    | MainThread | [Collections_autofailover:test_volume_taf:289] Step 14: restart_machine -> Autoreprovision -> Rebalance

      Buckets statistics at this step

      2021-03-02 22:52:40,614 | test  | INFO    | MainThread | [table_view:display:72] Bucket statistics
      +---------+-----------+----------+------------+-----+----------+--------------+-------------+-----------+
      | Bucket  | Type      | Replicas | Durability | TTL | Items    | RAM Quota    | RAM Used    | Disk Used |
      +---------+-----------+----------+------------+-----+----------+--------------+-------------+-----------+
      | bucket1 | ephemeral | 2        | none       | 0   | 1500     | 1887436800   | 92255216    | 204       |
      | bucket2 | ephemeral | 2        | none       | 0   | 1500     | 1887436800   | 91652128    | 204       |
      | default | ephemeral | 2        | none       | 0   | 20000000 | 106954752000 | 20736031328 | 204       |
      +---------+-----------+----------+------------+-----+----------+--------------+-------------+-----------+

      Timeline for this step:
      -> Induce failure (restart_machine) on 172.23.106.233

      2021-03-02 22:53:01,595 | test  | INFO    | MainThread | [Collections_autofailover:rebalance_after_autofailover:103] Inducing failure restart_machine on nodes: [ip:172.23.106.233 port:8091 ssh_username:root]

      -> Sleep and wait for autoreprovison and then rebalance

      2021-03-02 23:02:28,339 | test  | INFO    | MainThread | [common_lib:sleep:22] Sleep 300 seconds. Reason: keep the cluster with the failure(autofailover test)/warmup to complete(autoreprovision test)
      2021-03-02 23:07:37,821 | test  | INFO    | pool-1-thread-17 | [table_view:display:72] Rebalance Overview
      +----------------+----------+-----------------------+---------------+--------------+
      | Nodes          | Services | Version               | CPU           | Status       |
      +----------------+----------+-----------------------+---------------+--------------+
      | 172.23.105.175 | kv       | 7.0.0-4588-enterprise | 15.7456472369 | Cluster node |
      | 172.23.106.250 | kv       | 7.0.0-4588-enterprise | 13.1572327044 | Cluster node |
      | 172.23.106.236 | kv       | 7.0.0-4588-enterprise | 8.60201511335 | Cluster node |
      | 172.23.106.251 | kv       | 7.0.0-4588-enterprise | 12.4278182275 | Cluster node |
      | 172.23.106.233 | kv       | 7.0.0-4588-enterprise | 88.3878241263 | Cluster node |
      | 172.23.106.238 | kv       | 7.0.0-4588-enterprise | 5.36836811667 | Cluster node |
      +----------------+----------+-----------------------+---------------+--------------+

      Observations
      1. The CPU usage on .233 (node that was restarted) was unsually high ~88% at the time of rebalance. All other nodes had a very low CPU usage. (So rebalance failure could be be due unusual high cpu usage on .233) 
      2. Even after the rebalance failed, and so when the cluster was idle the node's cpu usage continued to remain high at 88% for some time
      Running top command

         PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                           
        2722 couchba+  20   0 7939468   4.2g   9244 S 692.0 17.9 525:00.41 memcached                                                                         
       11867 couchba+  20   0 3060060  23556   3224 S   8.3  0.1   0:00.25 beam.smp                                                                          
        2540 couchba+  20   0 4586772 318860   4804 S   5.6  1.3   6:10.55 beam.smp                                                                          
       11877 couchba+  20   0    4356    348    276 S   5.6  0.0   0:00.17 erl_child_setup                                                                   
        2676 couchba+  20   0  972856 136808  21888 S   0.7  0.6   3:05.17 prometheus                                                                        
        2027 root       0 -20    4348    540    432 S   0.3  0.0   0:04.52 atopacctd                                                                         
       10740 root      20   0  156984   5872   4424 S   0.3  0.0   0:00.22 sshd                                                                              
       10787 couchba+  20   0    6644   1340    952 S   0.3  0.0   0:00.09 goport                                                                            
       10793 couchba+  20   0  283396  12472   4524 S   0.3  0.1   0:00.89 python3                                                                           
           1 root      20   0  193644   6684   4160 S   0.0  0.0   0:02.42 systemd                                                                           
           2 root      20   0       0      0      0 S   0.0  0.0   0:00.01 kthreadd    

      3. grep WARN memcached.log*  | grep -v Slow | grep -v "The stream closed early because the conn was disconnected"
      on .175 (looks similar to MB-44685)

      2021-03-03T00:11:54.865556-08:00 WARNING (default) CheckpointManager::addStats: An error occurred while adding stats: std::bad_alloc
      2021-03-03T00:11:54.865663-08:00 WARNING (default) CheckpointManager::addStats: An error occurred while adding stats: std::bad_alloc
      2021-03-03T00:11:54.866164-08:00 WARNING (default) StatCheckpointVisitor::addCheckpointStat: error building stats: std::bad_alloc
      2021-03-03T00:11:54.866393-08:00 WARNING (default) CheckpointManager::addStats: An error occurred while adding stats: std::bad_alloc

      on .233

      2021-03-02T23:23:49.809649-08:00 WARNING 95: [ {"ip":"127.0.0.1","port":43762} - {"ip":"127.0.0.1","port":11209} (<ud>@ns_server</ud>) ] dcp.mutation returned cb::engine_errc::disconnect

       

      Attachments

        1. autoreprovison.png
          autoreprovison.png
          541 kB
        2. consoleText.txt
          3.16 MB
        3. cpu_usage_hour.png
          cpu_usage_hour.png
          324 kB
        4. node_233_cpu_usage.png
          node_233_cpu_usage.png
          185 kB
        5. rebalance_failure.png
          rebalance_failure.png
          522 kB
        6. Screenshot 2021-03-03 at 09.02.32.png
          Screenshot 2021-03-03 at 09.02.32.png
          58 kB
        7. Screenshot 2021-03-03 at 12.11.29.png
          Screenshot 2021-03-03 at 12.11.29.png
          63 kB
        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            sumedh.basarkod Sumedh Basarkod (Inactive)
            sumedh.basarkod Sumedh Basarkod (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty