Loading...

XML

Word

Printable

Details

Type: Bug
Resolution: Fixed
Priority: Critical
Fix Version/s: 7.0.0
Affects Version/s: Cheshire-Cat
Component/s: couchbase-bucket
Labels:
Environment:
Centos 7 64 bit; CB EE 7.0.0-4588

Triage:
Untriaged
Operating System:
Centos 64-bit
Link to Log File, atop/blg, CBCollectInfo, Core dump:

Hide
http://supportal.couchbase.com/snapshot/b6024ccb3bbe9924c39b9c8b74629d35::0

s3://cb-customers-secure/collections_autoreprovison_volume/2021-03-03/collectinfo-2021-03-03t081009-ns_1@172.23.105.175.zip
s3://cb-customers-secure/collections_autoreprovison_volume/2021-03-03/collectinfo-2021-03-03t081009-ns_1@172.23.106.233.zip
s3://cb-customers-secure/collections_autoreprovison_volume/2021-03-03/collectinfo-2021-03-03t081009-ns_1@172.23.106.236.zip
s3://cb-customers-secure/collections_autoreprovison_volume/2021-03-03/collectinfo-2021-03-03t081009-ns_1@172.23.106.238.zip
s3://cb-customers-secure/collections_autoreprovison_volume/2021-03-03/collectinfo-2021-03-03t081009-ns_1@172.23.106.251.zip
s3://cb-customers-secure/collections_autoreprovison_volume/2021-03-03/collectinfo-2021-03-03t081009-ns_1@172.23.106.250.zip

Show
http://supportal.couchbase.com/snapshot/b6024ccb3bbe9924c39b9c8b74629d35::0 s3://cb-customers-secure/collections_autoreprovison_volume/2021-03-03/collectinfo-2021-03-03t081009-ns_1@172.23.105.175.zip s3://cb-customers-secure/collections_autoreprovison_volume/2021-03-03/collectinfo-2021-03-03t081009-ns_1@172.23.106.233.zip s3://cb-customers-secure/collections_autoreprovison_volume/2021-03-03/collectinfo-2021-03-03t081009-ns_1@172.23.106.236.zip s3://cb-customers-secure/collections_autoreprovison_volume/2021-03-03/collectinfo-2021-03-03t081009-ns_1@172.23.106.238.zip s3://cb-customers-secure/collections_autoreprovison_volume/2021-03-03/collectinfo-2021-03-03t081009-ns_1@172.23.106.251.zip s3://cb-customers-secure/collections_autoreprovison_volume/2021-03-03/collectinfo-2021-03-03t081009-ns_1@172.23.106.250.zip
Epic Link:
KV: Collections
Story Points:
1
Is this a Regression?:
Unknown
Sprint:
KV-Engine 2021-March

Description

Summary:
Rebalance in volume test failed at this step:

2021-03-02 22:52:40,614 | test  | INFO    | MainThread | [Collections_autofailover:test_volume_taf:289] Step 14: restart_machine -> Autoreprovision -> Rebalance

Buckets statistics at this step

2021-03-02 22:52:40,614 | test  | INFO    | MainThread | [table_view:display:72] Bucket statistics

+---------+-----------+----------+------------+-----+----------+--------------+-------------+-----------+

| Bucket  | Type      | Replicas | Durability | TTL | Items    | RAM Quota    | RAM Used    | Disk Used |

+---------+-----------+----------+------------+-----+----------+--------------+-------------+-----------+

| bucket1 | ephemeral | 2        | none       | 0   | 1500     | 1887436800   | 92255216    | 204       |

| bucket2 | ephemeral | 2        | none       | 0   | 1500     | 1887436800   | 91652128    | 204       |

| default | ephemeral | 2        | none       | 0   | 20000000 | 106954752000 | 20736031328 | 204       |

+---------+-----------+----------+------------+-----+----------+--------------+-------------+-----------+

Timeline for this step:
-> Induce failure (restart_machine) on 172.23.106.233

2021-03-02 22:53:01,595 | test  | INFO    | MainThread | [Collections_autofailover:rebalance_after_autofailover:103] Inducing failure restart_machine on nodes: [ip:172.23.106.233 port:8091 ssh_username:root]

-> Sleep and wait for autoreprovison and then rebalance

2021-03-02 23:02:28,339 | test  | INFO    | MainThread | [common_lib:sleep:22] Sleep 300 seconds. Reason: keep the cluster with the failure(autofailover test)/warmup to complete(autoreprovision test)

2021-03-02 23:07:37,821 | test  | INFO    | pool-1-thread-17 | [table_view:display:72] Rebalance Overview

+----------------+----------+-----------------------+---------------+--------------+

| Nodes          | Services | Version               | CPU           | Status       |

+----------------+----------+-----------------------+---------------+--------------+

| 172.23.105.175 | kv       | 7.0.0-4588-enterprise | 15.7456472369 | Cluster node |

| 172.23.106.250 | kv       | 7.0.0-4588-enterprise | 13.1572327044 | Cluster node |

| 172.23.106.236 | kv       | 7.0.0-4588-enterprise | 8.60201511335 | Cluster node |

| 172.23.106.251 | kv       | 7.0.0-4588-enterprise | 12.4278182275 | Cluster node |

| 172.23.106.233 | kv       | 7.0.0-4588-enterprise | 88.3878241263 | Cluster node |

| 172.23.106.238 | kv       | 7.0.0-4588-enterprise | 5.36836811667 | Cluster node |

+----------------+----------+-----------------------+---------------+--------------+

Observations
1. The CPU usage on .233 (node that was restarted) was unsually high ~88% at the time of rebalance. All other nodes had a very low CPU usage. (So rebalance failure could be be due unusual high cpu usage on .233)
2. Even after the rebalance failed, and so when the cluster was idle the node's cpu usage continued to remain high at 88% for some time
Running top command

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND

  2722 couchba+  20   0 7939468   4.2g   9244 S 692.0 17.9 525:00.41 memcached

 11867 couchba+  20   0 3060060  23556   3224 S   8.3  0.1   0:00.25 beam.smp

  2540 couchba+  20   0 4586772 318860   4804 S   5.6  1.3   6:10.55 beam.smp

 11877 couchba+  20   0    4356    348    276 S   5.6  0.0   0:00.17 erl_child_setup

  2676 couchba+  20   0  972856 136808  21888 S   0.7  0.6   3:05.17 prometheus

  2027 root       0 -20    4348    540    432 S   0.3  0.0   0:04.52 atopacctd

 10740 root      20   0  156984   5872   4424 S   0.3  0.0   0:00.22 sshd

 10787 couchba+  20   0    6644   1340    952 S   0.3  0.0   0:00.09 goport

 10793 couchba+  20   0  283396  12472   4524 S   0.3  0.1   0:00.89 python3

     1 root      20   0  193644   6684   4160 S   0.0  0.0   0:02.42 systemd

     2 root      20   0       0      0      0 S   0.0  0.0   0:00.01 kthreadd

3. grep WARN memcached.log* | grep -v Slow | grep -v "The stream closed early because the conn was disconnected"
on .175 (looks similar to ~~MB-44685~~)

2021-03-03T00:11:54.865556-08:00 WARNING (default) CheckpointManager::addStats: An error occurred while adding stats: std::bad_alloc

2021-03-03T00:11:54.865663-08:00 WARNING (default) CheckpointManager::addStats: An error occurred while adding stats: std::bad_alloc

2021-03-03T00:11:54.866164-08:00 WARNING (default) StatCheckpointVisitor::addCheckpointStat: error building stats: std::bad_alloc

2021-03-03T00:11:54.866393-08:00 WARNING (default) CheckpointManager::addStats: An error occurred while adding stats: std::bad_alloc

on .233

2021-03-02T23:23:49.809649-08:00 WARNING 95: [ {"ip":"127.0.0.1","port":43762} - {"ip":"127.0.0.1","port":11209} (<ud>@ns_server</ud>) ] dcp.mutation returned cb::engine_errc::disconnect

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

autoreprovison.png
541 kB
03/Mar/21 12:53 AM
consoleText.txt
3.16 MB
03/Mar/21 12:51 AM
cpu_usage_hour.png
324 kB
03/Mar/21 12:52 AM
node_233_cpu_usage.png
185 kB
03/Mar/21 1:18 AM
rebalance_failure.png
522 kB
03/Mar/21 12:51 AM
Screenshot 2021-03-03 at 09.02.32.png
58 kB
03/Mar/21 1:03 AM
Screenshot 2021-03-03 at 12.11.29.png
63 kB
03/Mar/21 4:12 AM

Gerrit Reviews

- Issue Only
- Show All Reviews
- Show Open Reviews
- Show All Issues
- Show Open Issues

No reviews matched the request. Check your Options in the drop-down menu of this sections header.

Activity

People

Assignee:: Sumedh Basarkod (Inactive)

Reporter:: Sumedh Basarkod (Inactive)

Votes:: 0 Vote for this issue

Watchers:: 6 Start watching this issue

Dates

Created:: 03/Mar/21 12:41 AM

Updated:: 17/Jun/21 3:44 PM

Resolved:: 10/Mar/21 5:04 AM

Gerrit Reviews

There are no open Gerrit changes

Show There is 1 closed Gerrit change

Hide There is 1 closed Gerrit change

MB-44720: HTTombstonePurger needs to use shouldContinueVisiting: Gerrit Review:

[Collections] Rebalance exited with dcp_wait_for_data_move failed

Details

Description

Attachments

Attachments

Gerrit Reviews

Activity

People

Dates

Gerrit Reviews

PagerDuty