Details
Bug
Resolution: Incomplete
Critical
None
6.6.1
None
Enterprise Edition 6.6.1 build 9117 ‧ IPv4 © 2020 Couchbase, Inc.
Untriaged
Centos 64-bit
1
No
Description
Script to Repro
./testrunner -i /tmp/win10-bucket-ops.ini rerun=False -t volumetests.test_system_orchestrator heartbeats and timeouts.volume.test_volume_MB_41562,nodes_init=6,initial_load=10000,replicas=2
Steps to Repro
1. Set the following values on the nodes:
curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({mb_master, heartbeat_interval}, 500).'
curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({mb_master, timeout_interval_count}, 3).'
curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({leader_lease_acquire_worker, lease_time}, 5000).'
curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({leader_lease_acquire_worker, lease_grace_time}, 2000).'
curl http://localhost:9000/diag/eval -u Administrator:asdasd -d 'ns_config:set({leader_lease_acquire_worker, lease_renew_after}, 500).'
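The five overrides from step 1 could also be applied in a single loop. This is only a sketch: HOST, CREDS, DRY_RUN and the two function names are hypothetical and not part of the original repro, and by default it prints the curl commands instead of running them.

```shell
#!/bin/sh
# Sketch only: HOST, CREDS, DRY_RUN, build_payload and apply_settings are
# hypothetical names introduced here. The endpoint, credentials and values
# are the ones used in step 1.
HOST="${HOST:-http://localhost:9000}"
CREDS="${CREDS:-Administrator:asdasd}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands, 0 = actually run them

# Build the Erlang expression that is posted to /diag/eval.
build_payload() {
  printf 'ns_config:set({%s, %s}, %s).' "$1" "$2" "$3"
}

# Read "key subkey value" triples and apply (or print) each one.
apply_settings() {
  while read -r key subkey value; do
    payload=$(build_payload "$key" "$subkey" "$value")
    if [ "$DRY_RUN" = "1" ]; then
      printf "curl %s/diag/eval -u %s -d '%s'\n" "$HOST" "$CREDS" "$payload"
    else
      curl -sS "$HOST/diag/eval" -u "$CREDS" -d "$payload"
    fi
  done <<'EOF'
mb_master heartbeat_interval 500
mb_master timeout_interval_count 3
leader_lease_acquire_worker lease_time 5000
leader_lease_acquire_worker lease_grace_time 2000
leader_lease_acquire_worker lease_renew_after 500
EOF
}

apply_settings
```

Keeping the settings in one table makes it harder for a quoting slip in a single command to go unnoticed, since every payload is built by the same function.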
2. Create a 6-node cluster: 4 kv, 1 index and 1 n1ql
------------------------------------
Nodes | Services | Status |
------------------------------------
172.23.105.175 | kv | Cluster node |
172.23.106.250 | index | Cluster node |
172.23.106.236 | kv | Cluster node |
172.23.106.251 | n1ql | Cluster node |
172.23.106.233 | kv | Cluster node |
172.23.106.238 | kv | Cluster node |
------------------------------------
3. Create 4 buckets and 4 primary indexes, then load data.
-----------------------------------------------------------------------
Bucket | Type | Replicas | Durability | TTL | Items | RAM Quota | RAM Used | Disk Used |
-----------------------------------------------------------------------
bucket1 | membase | 2 | none | 0 | 10000 | 19960692736 | 101458576 | 106809202 |
bucket2 | membase | 2 | none | 0 | 10000 | 19960692736 | 101470928 | 85743474 |
bucket3 | membase | 2 | none | 0 | 10000 | 19960692736 | 102381568 | 82729851 |
bucket4 | membase | 2 | none | 0 | 10000 | 19960692736 | 101491312 | 73828438 |
-----------------------------------------------------------------------
4. Start the data load again, run n1ql queries in parallel, and rebalance in a node.
------------------------------------
Nodes | Services | Status |
------------------------------------
172.23.105.175 | kv | Cluster node |
172.23.106.250 | index | Cluster node |
172.23.106.236 | kv | Cluster node |
172.23.106.251 | n1ql | Cluster node |
172.23.106.233 | kv | Cluster node |
172.23.106.238 | kv | Cluster node |
172.23.120.87 | None | <--- IN — |
------------------------------------
5. Call the following API to identify the orchestrator node.
pools/default/terseClusterInfo
In this case it was ns_1@172.23.105.175
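For scripting step 5, the orchestrator name could be pulled out of the response with a small filter. This is a sketch under assumptions: extract_orchestrator is a hypothetical helper, the JSON below is a made-up sample rather than captured output, and it assumes the response carries the node name in a string field called "orchestrator".

```shell
#!/bin/sh
# Sketch only: extract_orchestrator is a hypothetical helper and the sample
# JSON is invented for illustration. The "orchestrator" field name is an
# assumption about the terseClusterInfo response.
extract_orchestrator() {
  sed -n 's/.*"orchestrator"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p'
}

sample='{"orchestrator":"ns_1@172.23.105.175"}'
printf '%s\n' "$sample" | extract_orchestrator   # prints ns_1@172.23.105.175

# Against a live node this might look like (not run here):
# curl -s -u Administrator:asdasd http://localhost:9000/pools/default/terseClusterInfo | extract_orchestrator
```

sed is used here only to avoid a dependency on a JSON tool; where one is available, a real JSON parser would be the more robust choice.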
6. Run the following commands on the orchestrator for verification
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
13337
13457
13541
[root@localhost ~]# ps aux | grep 'beam.smp' | grep ns_babysitter_bootstrap | awk '{print $2}'
13337
[root@localhost ~]#
7. Kill the babysitter using the following command.
kill -9 $(ps aux | grep 'beam.smp' | grep ns_babysitter_bootstrap | awk '{print $2}')
8. Start couchbase-server
[root@localhost ~]# systemctl start couchbase-server.service
[root@localhost ~]# systemctl status couchbase-server.service
● couchbase-server.service - Couchbase Server
Loaded: loaded (/usr/lib/systemd/system/couchbase-server.service; enabled; vendor preset: disabled)
Active: active (running) since Tue 2020-10-06 23:11:41 PDT; 2s ago
Docs: http://docs.couchbase.com
Process: 16907 ExecStop=/opt/couchbase/bin/couchbase-server -k (code=exited, status=0/SUCCESS)
Main PID: 16979 (beam.smp)
CGroup: /system.slice/couchbase-server.service
├─16979 /opt/couchbase/lib/erlang/erts-9.3.3.9/bin/beam.smp -A 16 -sbwt none -- -root /opt/couchbase/lib/erlang -progname erl -- -home /opt/couchbase -- -smp enable -kernel error_logger false inetrc "/opt/couchbase/etc/couch...
├─16992 /opt/couchbase/lib/erlang/erts-9.3.3.9/bin/epmd -daemon
├─17062 erl_child_setup 70000
├─17090 /opt/couchbase/bin/gosecrets
├─17096 /opt/couchbase/lib/erlang/erts-9.3.3.9/bin/beam.smp -A 16 -sbt u -P 327680 -K true -swt low -sbwt none -MMmcs 30 -e102400 -- -root /opt/couchbase/lib/erlang -progname erl -- -home /opt/couchbase -- -smp enable -setco...
├─17118 erl_child_setup 70000
└─17147 sh -s disksup

Oct 06 23:11:41 localhost.localdomain systemd[1]: Started Couchbase Server.
[root@localhost ~]#
9. The CB server never comes up. The babysitter PID remains constant, but the other PIDs keep changing, which suggests that those processes are repeatedly crashing.
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep 'beam.smp' | grep ns_babysitter_bootstrap | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep 'beam.smp' | grep ns_babysitter_bootstrap | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep 'beam.smp' | grep ns_babysitter_bootstrap | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
17613
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
17613
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
17613
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
18010
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
[root@localhost ~]# ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}'
16979
26696
[root@localhost ~]#
This keeps happening: the buckets never warm up and the CB server UI is not accessible, so subsequent steps such as delta recovery could not be carried out. In other scenarios I have seen all 3 PIDs after the kill, but the 2 PIDs other than the babysitter's keep changing.
I tried this set of steps around 5 times and was able to reproduce it consistently every time.
cbcollect_info attached
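The PID churn seen in step 9 can be summarised mechanically instead of by eye. The following is a rough sketch, not part of the original repro: classify_pids is a hypothetical helper that takes one sample per line (space-separated beam.smp PIDs) and marks a PID as stable if it appears in every sample and transient otherwise. The sample data reuses PIDs observed above.

```shell
#!/bin/sh
# Sketch only: classify_pids is a hypothetical helper. Each input line is one
# sample of the beam.smp PIDs, separated by spaces.
classify_pids() {
  awk '
    { samples++; for (i = 1; i <= NF; i++) seen[$i]++ }
    END {
      for (pid in seen)
        printf "%s %s\n", pid, (seen[pid] == samples ? "stable" : "transient")
    }
  ' | sort -n
}

# Samples taken from the step 9 transcript (babysitter PID is 16979).
classify_pids <<'EOF'
16979
16979 17613
16979
16979 18010
16979 26696
EOF

# Live sampling might look like this (not run here):
# for n in 1 2 3 4 5; do
#   ps aux | grep -v grep | grep 'beam.smp' | awk '{print $2}' | tr '\n' ' '
#   echo; sleep 2
# done | classify_pids
```

On a healthy node every beam.smp PID would be expected to show up as stable; with the data above only 16979 does.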