Couchbase Server / MB-44832

Prepare seqno of abort in backfill snapshot may be lower than snap start seqno


Details

    Description

      On Node 172.23.121.115:

      2021-03-09T00:53:34.840274-08:00 ERROR 1267: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":7,"opcode":"DCP_ABORT","status":"Invalid arguments"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default, vb:143, state:backfilling
      2021-03-09T00:53:34.840325-08:00 ERROR 1267: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":7,"opcode":"DCP_ABORT","status":"Invalid arguments"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default, vb:143, state:backfilling
      2021-03-09T00:53:34.840353-08:00 ERROR 1267: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":7,"opcode":"DCP_ABORT","status":"Invalid arguments"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default, vb:143, state:backfilling
      2021-03-09T00:53:34.840382-08:00 ERROR 1267: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":7,"opcode":"DCP_ABORT","status":"Invalid arguments"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default, vb:143, state:backfilling
      2021-03-09T00:53:34.840409-08:00 ERROR 1267: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":7,"opcode":"DCP_ABORT","status":"Invalid arguments"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default, vb:143, state:backfilling
      2021-03-09T00:53:34.840438-08:00 ERROR 1267: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":7,"opcode":"DCP_ABORT","status":"Invalid arguments"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default, vb:143, state:backfilling
      2021-03-09T00:53:34.840464-08:00 ERROR 1267: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":7,"opcode":"DCP_ABORT","status":"Invalid arguments"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default, vb:143, state:backfilling
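When triaging logs like the excerpt above, it can help to pull out the rejected opcode, the returned status, the vBucket, and the stream state. The following is a minimal sketch that assumes the log line layout shown above; the regular expression and the `parse_disconnect` helper are illustrative and are not part of any Couchbase tooling.

```python
import json
import re

# Sample line copied from the first entry of the log excerpt above.
LOG_LINE = (
    '2021-03-09T00:53:34.840274-08:00 ERROR 1267: (default) DCP (Producer) '
    'eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default - '
    'DcpProducer::handleResponse disconnecting, received unexpected response:'
    '{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,'
    '"magic":"ClientResponse","opaque":7,"opcode":"DCP_ABORT",'
    '"status":"Invalid arguments"} for stream:stream name:'
    'eq_dcpq:replication:ns_1@172.23.121.115->ns_1@172.23.121.124:default, '
    'vb:143, state:backfilling'
)

# Assumed layout: a flat JSON object after "received unexpected response:",
# followed by the stream name, the vBucket id, and the stream state.
PATTERN = re.compile(
    r'received unexpected response:(?P<response>\{.*?\}) for stream:'
    r'.*?, vb:(?P<vb>\d+), state:(?P<state>\w+)'
)


def parse_disconnect(line):
    """Return the key fields of a disconnect log line, or None if no match."""
    match = PATTERN.search(line)
    if match is None:
        return None
    response = json.loads(match.group('response'))
    return {
        'opcode': response['opcode'],
        'status': response['status'],
        'vb': int(match.group('vb')),
        'state': match.group('state'),
    }


info = parse_disconnect(LOG_LINE)
print(info)
```

For the sample line this reports that a `DCP_ABORT` was rejected with status `Invalid arguments` on vb 143 while the stream was in the `backfilling` state.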
      

      QE Test:

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/test_job_magma.ini  -t magma.magma_crash_recovery.MagmaCrashTests.test_crash_during_ops,num_items=5000000,infra_log_level=debug,log_level=debug,rerun=False,skip_cleanup=true,doc_size=1024,randomize_value=False,nodes_init=20,num_crashes=20,sdk_timeout=60,bucket_storage=magma,replicas=1,vbuckets=1024,graceful=False,doc_ops=create:update:delete:expiry,wait_warmup=False,maxttl=10,get-cbcollect-info=True,multiplier=100,process_concurrency=5,durability=MAJORITY,stop_server_on_crash=false'
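The `-t` argument above packs the test name and its parameters into one comma-separated string. The sketch below splits it so that the settings most relevant to this issue (storage backend, durability level, replica count) are easy to read. It assumes the spec is simply a test name followed by `key=value` pairs; `split_test_spec` is a hypothetical helper and not how testrunner itself parses the argument.

```python
# Test spec copied from the -t argument of the command above.
TEST_SPEC = (
    'magma.magma_crash_recovery.MagmaCrashTests.test_crash_during_ops,'
    'num_items=5000000,infra_log_level=debug,log_level=debug,rerun=False,'
    'skip_cleanup=true,doc_size=1024,randomize_value=False,nodes_init=20,'
    'num_crashes=20,sdk_timeout=60,bucket_storage=magma,replicas=1,'
    'vbuckets=1024,graceful=False,doc_ops=create:update:delete:expiry,'
    'wait_warmup=False,maxttl=10,get-cbcollect-info=True,multiplier=100,'
    'process_concurrency=5,durability=MAJORITY,stop_server_on_crash=false'
)


def split_test_spec(spec):
    """Split 'name,k1=v1,k2=v2,...' into (name, {k1: v1, ...}).

    Assumption: values never contain a comma; values are kept as strings.
    """
    name, *pairs = spec.split(',')
    params = dict(pair.split('=', 1) for pair in pairs)
    return name, params


name, params = split_test_spec(TEST_SPEC)
print(name)
for key in ('bucket_storage', 'durability', 'replicas', 'num_crashes', 'graceful'):
    print(f'{key}={params[key]}')
```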
      

      Note: This issue has not been seen with couchstore.


        Activity

          ben.huddleston Ben Huddleston added a comment - Not merged into master yet, but that change is up for review, so resolving this now as the fix is in 6.6.2.

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-4727 contains kv_engine commit 028f229 with commit message:
          MB-44832: Allow abort with prepare seqno < snap start at backfill

          ankush.sharma Ankush Sharma added a comment - I ran a couple of iterations of the test and didn't observe this issue. (Verified on 7.0.0-4797)
          drigby Dave Rigby added a comment - Ben Huddleston, please could you add a description for the release notes here?

          ben.huddleston Ben Huddleston added a comment - Description for release notes:

          Summary: Known issue. If a replication connection is disconnected in the middle of a replica backfill, at a point between the prepare seqno of an abort and the abort itself, the replication connection could subsequently be torn down. The replication stream will not be able to progress until the abort has been overwritten or purged (the duration of the metadata purge interval).

          Workaround: Overwrite aborted docs (retry durable writes) immediately.
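The condition in the summary above can be illustrated with a small model. This is a simplified sketch and not kv_engine code: the `Snapshot` and `Abort` types, the `validate_abort` function, the `allow_low_prepare_at_backfill` flag, and all seqno values are hypothetical. Only the "Invalid arguments" status and the rule named in the commit message (allow an abort whose prepare seqno is below the snapshot start at backfill) come from this issue.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    start: int           # first seqno covered by the snapshot
    end: int             # last seqno covered by the snapshot
    from_backfill: bool  # True when the snapshot is served from disk


@dataclass
class Abort:
    prepare_seqno: int   # seqno of the prepare being aborted
    abort_seqno: int     # seqno of the abort itself


def validate_abort(snapshot, abort, allow_low_prepare_at_backfill):
    """Hypothetical receiver-side check; returns a status string."""
    # The abort itself must fall inside the snapshot being received.
    if not (snapshot.start <= abort.abort_seqno <= snapshot.end):
        return "Invalid arguments"
    # A prepare always precedes its abort.
    if abort.prepare_seqno >= abort.abort_seqno:
        return "Invalid arguments"
    # The case from this issue: the stream resumed after the prepare seqno,
    # so the prepare lies before the start of the backfill snapshot.
    if abort.prepare_seqno < snapshot.start:
        if snapshot.from_backfill and allow_low_prepare_at_backfill:
            return "Success"
        return "Invalid arguments"
    return "Success"


# Hypothetical numbers: prepare at seqno 10, abort at seqno 12, and the
# stream resumes with a backfill snapshot covering seqnos 11 to 20.
snap = Snapshot(start=11, end=20, from_backfill=True)
abort = Abort(prepare_seqno=10, abort_seqno=12)

before_fix = validate_abort(snap, abort, allow_low_prepare_at_backfill=False)
after_fix = validate_abort(snap, abort, allow_low_prepare_at_backfill=True)
print(before_fix, "->", after_fix)
```

In this model the strict check rejects the abort every time the stream resumes, which matches the repeated disconnects in the description; relaxing the check for backfill snapshots lets the stream progress.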

          People

            ritesh.agarwal Ritesh Agarwal