Details
- Bug
- Resolution: Fixed
- Critical
- 7.1.0
- 7.1.0-2383-enterprise
- Untriaged
- Centos 64-bit
- 1
- Unknown
Description
Script to Repro
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.42624.ini GROUP=P0_set0,rerun=False,get-cbcollect-info=True,autoCompactionDefined=true,get-cbcollect-info=True,infra_log_level=info,log_level=info,upgrade_version=7.1.0-2383 -t failover.DiskFailoverTests.DiskAutofailoverTests.test_disk_autofailover_and_addback_of_node,doc_size=512,upgrade_version=7.1.0-2383,crash_warning=True,data_location=/root,timeout=10,rerun=False,GROUP=P0_set0,bucket_spec=magma_dgm.10_percent_dgm.4_node_1_replica_magma_512,data_load_spec=volume_test_load_with_CRUD_on_collections,get-cbcollect-info=True,log_level=info,recovery_strategy=delta,failover_action=disk_failure,nodes_init=4,autoCompactionDefined=true,disk_timeout=15,num_node_failures=1,randomize_value=True,infra_log_level=info'
Steps to Repro
1. Create a 4 node cluster.
2022-02-28 02:05:12,075 | test | INFO | MainThread | [table_view:display:72] Cluster statistics
+----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
| Node           | Services | CPU_utilization | Mem_total | Mem_free  | Swap_mem_used       | Active / Replica  | Version               |
+----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
| 172.23.108.251 | kv       | 0.263157894737  | 11.45 GiB | 10.81 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2383-enterprise |
| 172.23.106.204 | kv       | 0.25078369906   | 11.45 GiB | 10.53 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2383-enterprise |
| 172.23.106.211 | kv       | 0.802507836991  | 11.45 GiB | 10.49 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2383-enterprise |
| 172.23.109.68  | kv       | 0.514429109159  | 11.45 GiB | 10.53 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2383-enterprise |
+----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
2. Create buckets/scopes/collections/data.
2022-02-28 02:14:57,960 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
+---------+-----------+-----------------+----------+------------+-----+----------+-----------+------------+------------+---------------+
| Bucket  | Type      | Storage Backend | Replicas | Durability | TTL | Items    | RAM Quota | RAM Used   | Disk Used  | ARR           |
+---------+-----------+-----------------+----------+------------+-----+----------+-----------+------------+------------+---------------+
| bucket1 | couchbase | couchstore      | 1        | none       | 0   | 100000   | 7.81 GiB  | 117.93 MiB | 97.52 MiB  | 100           |
| bucket2 | couchbase | magma           | 1        | none       | 0   | 50000    | 3.91 GiB  | 293.18 MiB | 207.10 MiB | 100           |
| default | couchbase | magma           | 1        | none       | 0   | 19575000 | 2.00 GiB  | 1.32 GiB   | 4.72 GiB   | 18.1990293742 |
+---------+-----------+-----------------+----------+------------+-----+----------+-----------+------------+------------+---------------+
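For reference, step 2's bucket2 from the table above corresponds to a bucket-creation request along these lines (a minimal sketch; field names follow the Couchbase /pools/default/buckets REST API, and the quota value is only inferred from the 3.91 GiB shown in the table):

```python
from urllib.parse import urlencode

# Form body for POST /pools/default/buckets describing bucket2 from the
# table above: couchbase bucket, magma storage backend, 1 replica.
bucket2_body = urlencode({
    "name": "bucket2",
    "bucketType": "couchbase",
    "storageBackend": "magma",
    "replicaNumber": 1,
    "ramQuota": 4000,  # MiB; the table shows a 3.91 GiB quota
})
# POST this to http://<node>:8091/pools/default/buckets with admin credentials.
```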
3. Set the disk autofailover timeout to 15s. Start loading data into the buckets (CRUD on collections and docs).
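The 15s disk timeout from step 3 maps onto the cluster's auto-failover settings. A minimal sketch of the request body (parameter names follow the Couchbase /settings/autoFailover REST API; the host, credentials, and 120s base timeout are placeholders, not values from this ticket):

```python
from urllib.parse import urlencode

def autofailover_settings(timeout_s, disk_time_period_s):
    """Build the form body for POST /settings/autoFailover.

    failoverOnDataDiskIssues[*] enables failover when a node's data
    disk stops responding; timePeriod is the disk timeout in seconds.
    """
    return urlencode({
        "enabled": "true",
        "timeout": timeout_s,
        "failoverOnDataDiskIssues[enabled]": "true",
        "failoverOnDataDiskIssues[timePeriod]": disk_time_period_s,
    })

# disk_timeout=15, matching the test's command line:
body = autofailover_settings(120, 15)
# POST to http://<orchestrator>:8091/settings/autoFailover with admin credentials.
```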
4. Induce disk failure on node 172.23.108.251 by unmounting the data directory. Disk failover is successful.
command = "umount -l {0}; df -Thl".format(location)  # lazy-unmount the data path, then list mounts
output, error = self.execute_command(command)
return output, error
172.23.109.68 2:18:53 AM 28 Feb, 2022
Node ('ns_1@172.23.108.251') was automatically failed over. Reason: Disk writes failed on following buckets: default, bucket1, bucket2.
5. After a minute or so, recover the node by stopping couchbase-server, remounting the data directory, and starting couchbase-server again. See CBQE-7470.
6. Do a delta recovery and start a rebalance.
172.23.109.68 2:24:59 AM 28 Feb, 2022
Starting rebalance, KeepNodes = ['ns_1@172.23.108.251','ns_1@172.23.106.204', 'ns_1@172.23.106.211','ns_1@172.23.109.68'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.108.251'], Delta recovery buckets = all; Operation Id = e1759b0137cc052c6e61747c8e497cac
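Step 6 corresponds to two REST calls: marking the node for delta recovery, then rebalancing. A sketch of the request bodies (endpoint and parameter names follow the Couchbase /controller/setRecoveryType and /controller/rebalance REST API; node names are taken from the rebalance log above):

```python
from urllib.parse import urlencode

# Mark the failed-over node for delta (rather than full) recovery.
recovery_body = urlencode({
    "otpNode": "ns_1@172.23.108.251",
    "recoveryType": "delta",
})

# Then rebalance, keeping all four nodes and ejecting none.
rebalance_body = urlencode({
    "knownNodes": ",".join([
        "ns_1@172.23.108.251", "ns_1@172.23.106.204",
        "ns_1@172.23.106.211", "ns_1@172.23.109.68",
    ]),
    "ejectedNodes": "",
})
# POST each body to port 8091 on the orchestrator with admin credentials.
```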
Rebalance fails as shown below.
172.23.109.68 2:30:00 AM 28 Feb, 2022
Rebalance exited with reason {prepare_delta_recovery_failed,"bucket2", {error, {failed_nodes, [{'ns_1@172.23.108.251',{error,aborted}}]}}}. Rebalance Operation Id = e1759b0137cc052c6e61747c8e497cac
172.23.108.251
-bash-4.2# grep CRITICAL memcached.log.0001* | grep 'should always increase monotonically between write batches'
memcached.log.000135.txt:2022-02-28T02:25:01.409270-08:00 CRITICAL [(bucket2) magma_1]Fatal error: kvstore-97/rev-000000001: seqno (110) should always increase monotonically between write batches (111) numCommits:2
memcached.log.000135.txt:2022-02-28T02:25:01.510601-08:00 CRITICAL (bucket2) MagmaKVStore Magma open failed. Status:Invalid: kvstore-97/rev-000000001: seqno (110) should always increase monotonically between write batches (111) numCommits:2
memcached.log.000135.txt:2022-02-28T02:25:01.846814-08:00 CRITICAL OneShotTask::run("Create bucket [bucket2]") received exception: MagmaKVStore Magma open failed. Status:Invalid: kvstore-97/rev-000000001: seqno (110) should always increase monotonically between write batches (111) numCommits:2
-bash-4.2#
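The invariant behind the CRITICAL messages: each write batch magma commits must start above the previously committed high seqno, and on open it rejects a kvstore whose batches regress (here a batch at seqno 110 follows one committed at 111). A minimal illustration of the check, with hypothetical names, not magma's actual code:

```python
class SeqnoMonotonicError(Exception):
    pass

def check_batches(batch_high_seqnos):
    """Verify that successive write batches commit at strictly
    increasing seqnos, as magma requires on open."""
    last = -1
    for seqno in batch_high_seqnos:
        if seqno <= last:
            raise SeqnoMonotonicError(
                "seqno ({0}) should always increase monotonically "
                "between write batches ({1})".format(seqno, last))
        last = seqno

# The failing kvstore-97 recorded a batch at seqno 110 after one at 111,
# so check_batches([111, 110]) raises, mirroring the Magma open failure.
```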
cbcollect_info attached.