Couchbase Server / MB-51235

[Magma] - MagmaKVStore Magma open failed. Status:Invalid: seqno (110) should always increase monotonically between write batches (111) numCommits:2


Details

    • Untriaged
    • Centos 64-bit
    • 1
    • Unknown

    Description

      Script to Repro

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.42624.ini GROUP=P0_set0,rerun=False,get-cbcollect-info=True,autoCompactionDefined=true,get-cbcollect-info=True,infra_log_level=info,log_level=info,upgrade_version=7.1.0-2383 -t failover.DiskFailoverTests.DiskAutofailoverTests.test_disk_autofailover_and_addback_of_node,doc_size=512,upgrade_version=7.1.0-2383,crash_warning=True,data_location=/root,timeout=10,rerun=False,GROUP=P0_set0,bucket_spec=magma_dgm.10_percent_dgm.4_node_1_replica_magma_512,data_load_spec=volume_test_load_with_CRUD_on_collections,get-cbcollect-info=True,log_level=info,recovery_strategy=delta,failover_action=disk_failure,nodes_init=4,autoCompactionDefined=true,disk_timeout=15,num_node_failures=1,randomize_value=True,infra_log_level=info'
      

      Steps to Repro
      1. Create a 4 node cluster.

      2022-02-28 02:05:12,075 | test  | INFO    | MainThread | [table_view:display:72] Cluster statistics
      +----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
      | Node           | Services | CPU_utilization | Mem_total | Mem_free  | Swap_mem_used       | Active / Replica  | Version               |
      +----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
      | 172.23.108.251 | kv       | 0.263157894737  | 11.45 GiB | 10.81 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2383-enterprise |
      | 172.23.106.204 | kv       | 0.25078369906   | 11.45 GiB | 10.53 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2383-enterprise |
      | 172.23.106.211 | kv       | 0.802507836991  | 11.45 GiB | 10.49 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2383-enterprise |
      | 172.23.109.68  | kv       | 0.514429109159  | 11.45 GiB | 10.53 GiB | 0.0 Byte / 3.50 GiB | 0 / 0             | 7.1.0-2383-enterprise |
      +----------------+----------+-----------------+-----------+-----------+---------------------+-------------------+-----------------------+
      

      2. Create buckets/scopes/collections/data.

      2022-02-28 02:14:57,960 | test  | INFO    | MainThread | [table_view:display:72] Bucket statistics
      +---------+-----------+-----------------+----------+------------+-----+----------+-----------+------------+------------+---------------+
      | Bucket  | Type      | Storage Backend | Replicas | Durability | TTL | Items    | RAM Quota | RAM Used   | Disk Used  | ARR           |
      +---------+-----------+-----------------+----------+------------+-----+----------+-----------+------------+------------+---------------+
      | bucket1 | couchbase | couchstore      | 1        | none       | 0   | 100000   | 7.81 GiB  | 117.93 MiB | 97.52 MiB  | 100           |
      | bucket2 | couchbase | magma           | 1        | none       | 0   | 50000    | 3.91 GiB  | 293.18 MiB | 207.10 MiB | 100           |
      | default | couchbase | magma           | 1        | none       | 0   | 19575000 | 2.00 GiB  | 1.32 GiB   | 4.72 GiB   | 18.1990293742 |
      +---------+-----------+-----------------+----------+------------+-----+----------+-----------+------------+------------+---------------+
      

      3. Set the disk auto-failover timeout to 15s and start loading data into the buckets (CRUD on collections and docs).
      4. Induce a disk failure on node 172.23.108.251 by unmounting the data directory. The disk failover succeeds.

      # Lazily unmount the data directory, then list local filesystems to confirm.
      command = "umount -l {0}; df -Thl".format(location)
      output, error = self.execute_command(command)
      return output, error

      172.23.109.68 2:18:53 AM 28 Feb, 2022

      Node ('ns_1@172.23.108.251') was automatically failed over. Reason: Disk writes failed on following buckets: default, bucket1, bucket2. 
      

      5. After about a minute, recover the node: stop the Couchbase Server service, remount the data directory, and start the service again. See CBQE-7470.
      6. Do a delta recovery and start a rebalance.
      172.23.109.68 2:24:59 AM 28 Feb, 2022

      Starting rebalance, KeepNodes = ['ns_1@172.23.108.251','ns_1@172.23.106.204', 'ns_1@172.23.106.211','ns_1@172.23.109.68'], EjectNodes = [], Failed over and being ejected nodes = [], Delta recovery nodes = ['ns_1@172.23.108.251'], Delta recovery buckets = all; Operation Id = e1759b0137cc052c6e61747c8e497cac
      

      Rebalance fails as shown below.
      172.23.109.68 2:30:00 AM 28 Feb, 2022

      Rebalance exited with reason {prepare_delta_recovery_failed,"bucket2", {error, {failed_nodes, [{'ns_1@172.23.108.251',{error,aborted}}]}}}. Rebalance Operation Id = e1759b0137cc052c6e61747c8e497cac
      

      172.23.108.251

      -bash-4.2# grep CRITICAL memcached.log.0001* | grep 'should always increase monotonically between write batches'
      memcached.log.000135.txt:2022-02-28T02:25:01.409270-08:00 CRITICAL [(bucket2) magma_1]Fatal error: kvstore-97/rev-000000001: seqno (110) should always increase monotonically between write batches (111) numCommits:2
      memcached.log.000135.txt:2022-02-28T02:25:01.510601-08:00 CRITICAL (bucket2) MagmaKVStore Magma open failed. Status:Invalid: kvstore-97/rev-000000001: seqno (110) should always increase monotonically between write batches (111) numCommits:2
      memcached.log.000135.txt:2022-02-28T02:25:01.846814-08:00 CRITICAL OneShotTask::run("Create bucket [bucket2]") received exception: MagmaKVStore Magma open failed. Status:Invalid: kvstore-97/rev-000000001: seqno (110) should always increase monotonically between write batches (111) numCommits:2
      -bash-4.2# 
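For triage across many nodes, the fields in these CRITICAL lines can be pulled out with a small script. The following is a minimal sketch, not part of the test framework: the function name `parse_seqno_error` and the dictionary keys are hypothetical, and the regular expression is derived only from the three log lines quoted above, so other message variants may not match.

```python
import re

# Pattern built from the log lines quoted in this ticket (an assumption:
# other Magma versions may word the message differently).
SEQNO_ERROR = re.compile(
    r"(?P<kvstore>kvstore-\d+)/(?P<rev>rev-\d+): "
    r"seqno \((?P<seqno>\d+)\) should always increase monotonically "
    r"between write batches \((?P<other>\d+)\) numCommits:(?P<commits>\d+)"
)


def parse_seqno_error(line):
    """Return the fields of a monotonicity error line, or None if no match."""
    match = SEQNO_ERROR.search(line)
    if match is None:
        return None
    return {
        "kvstore": match.group("kvstore"),
        "rev": match.group("rev"),
        # First parenthesised number in the message.
        "seqno": int(match.group("seqno")),
        # Second parenthesised number in the message.
        "other_seqno": int(match.group("other")),
        "num_commits": int(match.group("commits")),
    }
```

Running this over each line of `memcached.log` and collecting the non-`None` results would give the set of affected kvstores per bucket; in the excerpt above, all three lines refer to the same kvstore and revision.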
      

      cbcollect_info attached.
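The error message states an invariant: seqnos should increase monotonically between write batches. The sketch below illustrates that kind of check in isolation. It is not Magma's implementation; the function name `first_seqno_regression`, the one-seqno-per-batch model, and the strict (`<=`) comparison are all assumptions made for illustration.

```python
from typing import Iterable, Optional, Tuple


def first_seqno_regression(
    batch_seqnos: Iterable[int],
) -> Optional[Tuple[int, int, int]]:
    """Find the first batch whose seqno does not exceed the previous one.

    Returns (batch_index, seqno, previous_seqno), or None when every
    batch seqno is strictly greater than the one before it.
    """
    previous = None
    for index, seqno in enumerate(batch_seqnos):
        if previous is not None and seqno <= previous:
            return (index, seqno, previous)
        previous = seqno
    return None
```

Under this model, a sequence such as `[109, 111, 110]` is flagged at the third batch, where 110 follows 111, which is one plausible reading of the pair of numbers (110 and 111) reported in the log.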

          People

            Balakumaran.Gopal Balakumaran Gopal