Details
- Bug
- Resolution: Fixed
- Critical
- 7.1.0
- 7.1.0-2383-enterprise
- Untriaged
- Centos 64-bit
- 1
- Unknown
Description
Script to Repro
guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.42624.ini GROUP=P0_set0,rerun=False,get-cbcollect-info=True,autoCompactionDefined=true,get-cbcollect-info=True,infra_log_level=info,log_level=info,upgrade_version=7.1.0-2383 -t failover.DiskFailoverTests.DiskAutofailoverTests.test_disk_autofailover_and_addback_of_node,doc_size=512,upgrade_version=7.1.0-2383,crash_warning=True,data_location=/root,timeout=10,rerun=False,GROUP=P0_set0,bucket_spec=magma_dgm.10_percent_dgm.4_node_1_replica_magma_512,data_load_spec=volume_test_load_with_CRUD_on_collections,get-cbcollect-info=True,log_level=info,recovery_strategy=delta,failover_action=disk_failure,nodes_init=4,autoCompactionDefined=true,disk_timeout=15,num_node_failures=1,randomize_value=True,infra_log_level=info'
Steps to Repro
1. Create a 4 node cluster.
2022-02-28 02:05:12,075 | test | INFO | MainThread | [table_view:display:72] Cluster statistics
+----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
| Node           | Services | CPU_utilization | Mem_total | Mem_free  | Swap_mem_used       | Active / Replica  | Version               |
+----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
| 172.23.108.251 | kv       | 0.263157894737  | 11.45 GiB | 10.81 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2383-enterprise |
| 172.23.106.204 | kv       | 0.25078369906   | 11.45 GiB | 10.53 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2383-enterprise |
| 172.23.106.211 | kv       | 0.802507836991  | 11.45 GiB | 10.49 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2383-enterprise |
| 172.23.109.68  | kv       | 0.514429109159  | 11.45 GiB | 10.53 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2383-enterprise |
+----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
2. Create buckets/scopes/collections/data.
2022-02-28 02:14:57,960 | test | INFO | MainThread | [table_view:display:72] Bucket statistics
+---------+-----------+-----------------+----------+------------+-----+----------+-----------+------------+------------+---------------+
| Bucket  | Type      | Storage Backend | Replicas | Durability | TTL | Items    | RAM Quota | RAM Used   | Disk Used  | ARR           |
+---------+-----------+-----------------+----------+------------+-----+----------+-----------+------------+------------+---------------+
| bucket1 | couchbase | couchstore      | 1        | none       | 0   | 100000   | 7.81 GiB  | 117.93 MiB | 97.52 MiB  | 100           |
| bucket2 | couchbase | magma           | 1        | none       | 0   | 50000    | 3.91 GiB  | 293.18 MiB | 207.10 MiB | 100           |
| default | couchbase | magma           | 1        | none       | 0   | 19575000 | 2.00 GiB  | 1.32 GiB   | 4.72 GiB   | 18.1990293742 |
+---------+-----------+-----------------+----------+------------+-----+----------+-----------+------------+------------+---------------+
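For reference, step 2's bucket2 from the table above corresponds to a bucket-creation request along these lines (a minimal sketch; field names follow the Couchbase /pools/default/buckets REST API, and the quota value is only inferred from the 3.91 GiB shown in the table):

```python
from urllib.parse import urlencode

# Form body for POST /pools/default/buckets describing bucket2 from the
# table above: couchbase bucket, magma storage backend, 1 replica.
bucket2_body = urlencode({
    "name": "bucket2",
    "bucketType": "couchbase",
    "storageBackend": "magma",
    "replicaNumber": 1,
    "ramQuota": 4000,  # MiB; the table shows a 3.91 GiB quota
})
# POST this to http://<node>:8091/pools/default/buckets with admin credentials.
```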
3. Set the disk autofailover timeout to 15s. Start loading data into the buckets (CRUD on collections and docs).
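The 15s disk timeout from step 3 maps onto the cluster's auto-failover settings. A minimal sketch of the request body (parameter names follow the Couchbase /settings/autoFailover REST API; the host, credentials, and 120s base timeout are placeholders, not values from this ticket):

```python
from urllib.parse import urlencode

def autofailover_settings(timeout_s, disk_time_period_s):
    """Build the form body for POST /settings/autoFailover.

    failoverOnDataDiskIssues[*] enables failover when a node's data
    disk stops responding; timePeriod is the disk timeout in seconds.
    """
    return urlencode({
        "enabled": "true",
        "timeout": timeout_s,
        "failoverOnDataDiskIssues[enabled]": "true",
        "failoverOnDataDiskIssues[timePeriod]": disk_time_period_s,
    })

# disk_timeout=15, matching the test's command line:
body = autofailover_settings(120, 15)
# POST to http://<orchestrator>:8091/settings/autoFailover with admin credentials.
```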
4. Induce disk failure on node 172.23.108.251 by unmounting the data directory. Disk failover is successful.
command = "umount -l {0}; df -Thl".format(location)  # lazy-unmount the data path, then list mounts
output, error = self.execute_command(command)
return output, error
172.23.109.68 2:18:53 AM 28 Feb, 2022
Node ('ns_1@172.23.108.251') was automatically failed over. Reason: Disk writes failed on following buckets: default, bucket1, bucket2.
5. After a minute or so, recover the node by stopping couchbase-server, remounting the data directory, and starting couchbase-server again. See CBQE-7470.
6. Do a delta recovery and start a rebalance.
172.23.109.68 2:24:59 AM 28 Feb, 2022
Starting rebalance, KeepNodes = ['ns_1@172.23.108.251','ns_1@172.23.106.204', 'ns_1@172.23.106.211','ns_1@172.23.109.68'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.108.251'], Delta recovery buckets = all; Operation Id = e1759b0137cc052c6e61747c8e497cac
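Step 6 corresponds to two REST calls: marking the node for delta recovery, then rebalancing. A sketch of the request bodies (endpoint and parameter names follow the Couchbase /controller/setRecoveryType and /controller/rebalance REST API; node names are taken from the rebalance log above):

```python
from urllib.parse import urlencode

# Mark the failed-over node for delta (rather than full) recovery.
recovery_body = urlencode({
    "otpNode": "ns_1@172.23.108.251",
    "recoveryType": "delta",
})

# Then rebalance, keeping all four nodes and ejecting none.
rebalance_body = urlencode({
    "knownNodes": ",".join([
        "ns_1@172.23.108.251", "ns_1@172.23.106.204",
        "ns_1@172.23.106.211", "ns_1@172.23.109.68",
    ]),
    "ejectedNodes": "",
})
# POST each body to port 8091 on the orchestrator with admin credentials.
```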
Rebalance fails as shown below.
172.23.109.68 2:30:00 AM 28 Feb, 2022
Rebalance exited with reason {prepare_delta_recovery_failed,"bucket2", {error, {failed_nodes, [{'ns_1@172.23.108.251',{error,aborted}}]}}}. Rebalance Operation Id = e1759b0137cc052c6e61747c8e497cac
172.23.108.251
-bash-4.2# grep CRITICAL memcached.log.0001* | grep 'should always increase monotonically between write batches'
memcached.log.000135.txt:2022-02-28T02:25:01.409270-08:00 CRITICAL [(bucket2) magma_1]Fatal error: kvstore-97/rev-000000001: seqno (110) should always increase monotonically between write batches (111) numCommits:2
memcached.log.000135.txt:2022-02-28T02:25:01.510601-08:00 CRITICAL (bucket2) MagmaKVStore Magma open failed. Status:Invalid: kvstore-97/rev-000000001: seqno (110) should always increase monotonically between write batches (111) numCommits:2
memcached.log.000135.txt:2022-02-28T02:25:01.846814-08:00 CRITICAL OneShotTask::run("Create bucket [bucket2]") received exception: MagmaKVStore Magma open failed. Status:Invalid: kvstore-97/rev-000000001: seqno (110) should always increase monotonically between write batches (111) numCommits:2
-bash-4.2#
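The invariant behind the CRITICAL messages: each write batch magma commits must start above the previously committed high seqno, and on open it rejects a kvstore whose batches regress (here a batch at seqno 110 follows one committed at 111). A minimal illustration of the check, with hypothetical names, not magma's actual code:

```python
class SeqnoMonotonicError(Exception):
    pass

def check_batches(batch_high_seqnos):
    """Verify that successive write batches commit at strictly
    increasing seqnos, as magma requires on open."""
    last = -1
    for seqno in batch_high_seqnos:
        if seqno <= last:
            raise SeqnoMonotonicError(
                "seqno ({0}) should always increase monotonically "
                "between write batches ({1})".format(seqno, last))
        last = seqno

# The failing kvstore-97 recorded a batch at seqno 110 after one at 111,
# so check_batches([111, 110]) raises, mirroring the Magma open failure.
```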
cbcollect_info attached.