Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-47318

DCP consumer acknowledgement under/over counting in replication flow control

    XMLWordPrintable

Details

    Description

      2021-07-11T07:28:58.094400+00:00 WARNING (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.102->ns_1@172.23.106.217:default - Attempting to release 7173586 bytes which is greater than bytesOutstanding:7173282
      

      Steps to Recreate:

      1. Create a 4 node cluster (RAM=256MB per node, replicas=1)
      2. Create 40 million items(doc size = 256)
      3. Start upserting all the existing 40 million items
      4. While upsert load is going on keep restarting couchbase server 5 times with a wait of 3s in between each restart
      5. Wait for the bucket to warmup.
      6. Repeat step 4 and 5 , Observed " Exception occurred during packet execution. Closing connection: ThrowExceptionUnderflowPolicy" in logs

      Memcached Logs:
      Error found on Node 172.23.107.102

      ERROR 45016: Exception occurred during packet execution. Closing connection: ThrowExceptionUnderflowPolicy current:7173282 arg:7173586. Cookies: [{"aiostat":"success","connection":"[ {\\"ip\\":\\"172.23.106.217\\",\\"port\\":55810} - {\\"ip\\":\\"172.23.107.102\\",\\"port\\":11209} (System, <ud>@ns_server</ud>) ]","engine_storage":"0x0000000000000000","ewouldblock":false,"packet":{"bodylen":4,"cas":0,"datatype":"raw","extlen":4,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":102,"opcode":"DCP_BUFFER_ACKNOWLEDGEMENT","vbucket":0},"refcount":1}] Exception thrown from: ["#0  /opt/couchbase/bin/memcached() [0x400000+0x9e454]","#1  /opt/couchbase/bin/memcached() [0x400000+0xc22fe]","#2  /opt/couchbase/bin/memcached() [0x400000+0x3c7cf4]","#3  /opt/couchbase/bin/memcached() [0x400000+0x3c7e4d]","#4  /opt/couchbase/bin/memcached() [0x400000+0x3c8054]","#5  /opt/couchbase/bin/memcached() [0x400000+0x288739]","#6  /opt/couchbase/bin/memcached() [0x400000+0x17ee4f]","#7  /opt/couchbase/bin/memcached() [0x400000+0x21b3b0]","#8  /opt/couchbase/bin/memcached() [0x400000+0x1e7aa2]","#9  /opt/couchbase/bin/memcached() [0x400000+0x1e94c0]","#10 /opt/couchbase/bin/memcached() [0x400000+0x1ce269]","#11 /opt/couchbase/bin/memcached() [0x400000+0x1ce623]","#12 /opt/couchbase/bin/memcached() [0x400000+0x1cec9b]","#13 /opt/couchbase/lib/libevent_core-2.1.so.7() [0x7fbbc63c5000+0xf8ae]","#14 /opt/couchbase/lib/libevent_core-2.1.so.7() [0x7fbbc63c5000+0x18821]","#15 /opt/couchbase/lib/libevent_core-2.1.so.7(event_base_loop+0x3bf) [0x7fbbc63c5000+0x1906f]","#16 /opt/couchbase/bin/memcached() [0x400000+0x766e47]","#17 /opt/couchbase/bin/memcached() [0x400000+0x767316]","#18 /opt/couchbase/bin/memcached() [0x400000+0x768cd6]","#19 /opt/couchbase/bin/memcached() [0x400000+0x1a0e29]","#20 /opt/couchbase/bin/memcached() [0x400000+0x6f97f9]","#21 /lib/x86_64-linux-gnu/libpthread.so.0() [0x7fbbc671c000+0x9609]","#22 /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fbbc5ab2000+0x122293]"]\n']
      

      QE-TEST:

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.103712.ini bucket_storage=magma,rerun=false,GROUP=P0;256MB,randomize_value=true,bucket_eviction_policy=fullEviction,replicas=1,nodes_init=4,enable_dp=false,collect_pcaps=True,key_size=12,get-cbcollect-info=True,infra_log_level=critical,log_level=error,bucket_storage=magma,upgrade_version=7.1.0-1061 -t storage.magma.magma_crash_recovery.MagmaCrashTests.test_crash_during_ops,graceful=True,bucket_ram_quota=256,multiplier=1,batch_size=500,num_items=40000000,doc_size=256,doc_ops=update,GROUP=P0;256MB'
      

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            owend Daniel Owen added a comment -

            Hi Ankush Sharma. Could you grab the backtrace from the machine? (avoids us having to set-up a duplicate environment with matching glibc libraries etc.)

            thanks

            owend Daniel Owen added a comment - Hi Ankush Sharma . Could you grab the backtrace from the machine? (avoids us having to set-up a duplicate environment with matching glibc libraries etc.) thanks
            ankush.sharma Ankush Sharma added a comment - - edited

            Hi Daniel Owen, there were no dmp/core files on the machines, I just observed this error in Memcached logs. Do you want me to capture thread dumps?

            ankush.sharma Ankush Sharma added a comment - - edited Hi Daniel Owen , there were no dmp/core files on the machines, I just observed this error in Memcached logs. Do you want me to capture thread dumps?
            owend Daniel Owen added a comment -

            Hi Ankush Sharma. Sorry my fault - I should have spotted the exception was caught so no core file - apologies
            Is the debug package installed?
            (as be good if we had the source files corresponding in the exception backtrace e.g.

            /opt/couchbase/bin/memcached() [0x400000+0x9e454]","#1  /opt/couchbase/bin/memcached() 
            

            thanks

            owend Daniel Owen added a comment - Hi Ankush Sharma . Sorry my fault - I should have spotted the exception was caught so no core file - apologies Is the debug package installed? (as be good if we had the source files corresponding in the exception backtrace e.g. /opt/couchbase/bin/memcached() [0x400000+0x9e454]","#1 /opt/couchbase/bin/memcached() thanks

            Hi Daniel Owen, Debug infos were not installed. Will try to repro this issue after installing debug info.

            ankush.sharma Ankush Sharma added a comment - Hi Daniel Owen , Debug infos were not installed. Will try to repro this issue after installing debug info.
            owend Daniel Owen added a comment -

            Many thanks Ankush Sharma. Ideally we should have the debug package installed for all tests - as will ensure we have the most information available.

            thanks

            owend Daniel Owen added a comment - Many thanks Ankush Sharma . Ideally we should have the debug package installed for all tests - as will ensure we have the most information available. thanks
            ankush.sharma Ankush Sharma added a comment - - edited

            Daniel Owen, Agree with you. We do have this support in TAF(we install debugInfos), In TestRunner we have recently added this(and is not merged yet). Future weekly runs will have debugInfos installed for all tests

            ankush.sharma Ankush Sharma added a comment - - edited Daniel Owen , Agree with you. We do have this support in TAF(we install debugInfos), In TestRunner we have recently added this(and is not merged yet). Future weekly runs will have debugInfos installed for all tests
            owend Daniel Owen added a comment -

            Perfect! - many thanks

            owend Daniel Owen added a comment - Perfect! - many thanks

            Daniel Owen, I have tried many runs on different builds, But I didn't hit this issue. Can we proceed with the logs attached in the bug?

            ankush.sharma Ankush Sharma added a comment - Daniel Owen , I have tried many runs on different builds, But I didn't hit this issue. Can we proceed with the logs attached in the bug?
            jwalker Jim Walker added a comment - - edited

            We can manually line up the addresses with symbols - but given the context of the disconnect I don't think we need to

            This is a buffer-ack issue, one side of the connection has miscounted causing an underflow in the ackedBytes (which will throw when dev-asserts is on). Both sides of this connection are KV replication, which should be pretty good with counting packets We did have a bug in this a while back (counting the prepare namespace), but should be fixed in this build (will check)

            172.23.107.102 throws and 172.23.106.217 is the consumer doing the ack

            jwalker Jim Walker added a comment - - edited We can manually line up the addresses with symbols - but given the context of the disconnect I don't think we need to This is a buffer-ack issue, one side of the connection has miscounted causing an underflow in the ackedBytes (which will throw when dev-asserts is on). Both sides of this connection are KV replication, which should be pretty good with counting packets We did have a bug in this a while back (counting the prepare namespace), but should be fixed in this build (will check) 172.23.107.102 throws and 172.23.106.217 is the consumer doing the ack
            jwalker Jim Walker added a comment -

            Suspected accounting bug was MB-46482 - this is fixed in this server version so no longer considered.

            jwalker Jim Walker added a comment - Suspected accounting bug was MB-46482 - this is fixed in this server version so no longer considered.
            jwalker Jim Walker added a comment -

            172.23.106.* logs are offset by 7 hours compared to 172.23.107.*

            jwalker Jim Walker added a comment - 172.23.106.* logs are offset by 7 hours compared to 172.23.107.*
            jwalker Jim Walker added a comment -

            Ankush Sharma in the description you say "steps to recreate" does this mean you saw this issue repeatedly? Or are they just the steps to reproduce the test?

            jwalker Jim Walker added a comment - Ankush Sharma in the description you say "steps to recreate" does this mean you saw this issue repeatedly? Or are they just the steps to reproduce the test?
            jwalker Jim Walker added a comment - - edited

            Found an issue in DCP consumer that can acknowledge incorrectly - this can be too much (causing this bug) or too little (causing an indefinite pause).

            The issue i've found is in the following code added from http://review.couchbase.org/c/kv_engine/+/143712

            Whenever we process a delete we have code which may modify the Item - now if after that modification DCP has to buffer the deletion, when it is later processed it will acknowledge with an incorrect number of bytes depending on how the item was modified. I've reproduced both variants

            • If the delete has no datatype but a value, it will have the body removed and will acknowledge less
            • If the delete has a snappy value it can have the value expanded, and will acknowledge more

            In the MB logs, from the consumer just before the producer disconnected we see a deletion was buffered.

            2021-07-11T00:28:47.372380-07:00 WARNING 6680: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.102->ns_1@172.23.106.217:default - vb:788 Got error 'no memory' while trying to process deletion with seqno:45047
            

            If this test is able to generate deletes with bodies this is very likely the issue, note the client doesn't have to be using snappy, the DCP stream can send compressed 'copies' of items from backfill (as part of stream resume following one of the crashes).

            This is unlikely a magma bug - but a side effect of the high memory forcing buffering and exercising this code in a way which doesn't have test coverage.

            jwalker Jim Walker added a comment - - edited Found an issue in DCP consumer that can acknowledge incorrectly - this can be too much (causing this bug) or too little (causing an indefinite pause). The issue i've found is in the following code added from http://review.couchbase.org/c/kv_engine/+/143712 https://github.com/couchbase/kv_engine/blob/c1e166cfd7fe99c9bbb3ac671f90819e453a1518/engines/ep/src/dcp/consumer.cc#L639 Whenever we process a delete we have code which may modify the Item - now if after that modification DCP has to buffer the deletion, when it is later processed it will acknowledge with an incorrect number of bytes depending on how the item was modified. I've reproduced both variants If the delete has no datatype but a value, it will have the body removed and will acknowledge less If the delete has a snappy value it can have the value expanded, and will acknowledge more In the MB logs, from the consumer just before the producer disconnected we see a deletion was buffered. 2021-07-11T00:28:47.372380-07:00 WARNING 6680: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.102->ns_1@172.23.106.217:default - vb:788 Got error 'no memory' while trying to process deletion with seqno:45047 If this test is able to generate deletes with bodies this is very likely the issue, note the client doesn't have to be using snappy, the DCP stream can send compressed 'copies' of items from backfill (as part of stream resume following one of the crashes). This is unlikely a magma bug - but a side effect of the high memory forcing buffering and exercising this code in a way which doesn't have test coverage.
            jwalker Jim Walker added a comment - - edited

            In terms of fixing - I would prefer not to put something just in the deletion path to 'remember' the un-modified size, it seems like there's a more robust solution here.

            • Ideally we wouldn't modify the incoming item until we're certain we will store it
            • Ideally we would have a const Item& passed around because the consumer assumes that when it calculates the UpdateFlowControl size
            • Maybe we can review the consumer acking - maybe the 'frontend' code which uses UpdateFlowControl doesn't have to 'give-up' when it sees a temporary_failure, as far as we know the message will be processed, so can just ack provided the message passes all validation and not defer to the buffer processor
            jwalker Jim Walker added a comment - - edited In terms of fixing - I would prefer not to put something just in the deletion path to 'remember' the un-modified size, it seems like there's a more robust solution here. Ideally we wouldn't modify the incoming item until we're certain we will store it Ideally we would have a const Item& passed around because the consumer assumes that when it calculates the UpdateFlowControl size Maybe we can review the consumer acking - maybe the 'frontend' code which uses UpdateFlowControl doesn't have to ' give-up ' when it sees a temporary_failure, as far as we know the message will be processed, so can just ack provided the message passes all validation and not defer to the buffer processor

            Build couchbase-server-7.1.0-1432 contains kv_engine commit b41d7c3 with commit message:
            MB-47318: Add DocKey::makeWireEncodedString

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1432 contains kv_engine commit b41d7c3 with commit message: MB-47318 : Add DocKey::makeWireEncodedString

            Build couchbase-server-7.1.0-1440 contains kv_engine commit e503cc6 with commit message:
            MB-47318: Add DcpControl to consumer and allow flow control override

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1440 contains kv_engine commit e503cc6 with commit message: MB-47318 : Add DcpControl to consumer and allow flow control override

            Build couchbase-server-7.1.0-1441 contains kv_engine commit b813501 with commit message:
            MB-47318: Add DcpConsumer related client connection methods

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1441 contains kv_engine commit b813501 with commit message: MB-47318 : Add DcpConsumer related client connection methods

            Build couchbase-server-7.1.0-1441 contains kv_engine commit 8276092 with commit message:
            MB-47318: Add PassiveStream always buffered mode

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1441 contains kv_engine commit 8276092 with commit message: MB-47318 : Add PassiveStream always buffered mode

            Some 12 ns-server jobs are getting hung at the first test itself hence marking this as "affects-neo-testing".

            ritesh.agarwal Ritesh Agarwal added a comment - Some 12 ns-server jobs are getting hung at the first test itself hence marking this as "affects-neo-testing".

            Build couchbase-server-7.1.0-1487 contains kv_engine commit a68f1f9 with commit message:
            MB-47318: Add a make_wire_encoded_string factory method in xattr utils

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1487 contains kv_engine commit a68f1f9 with commit message: MB-47318 : Add a make_wire_encoded_string factory method in xattr utils

            Build couchbase-server-7.1.0-1571 contains kv_engine commit bb20f27 with commit message:
            MB-47318: Account for packet data when received, but ACK on success

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1571 contains kv_engine commit bb20f27 with commit message: MB-47318 : Account for packet data when received, but ACK on success
            jwalker Jim Walker added a comment - - edited

            Ashwin Govindarajulu as requested some ideas for reproduction...

            There are two cases to consider for reproduction and validation - over-acknowledgement and under-acknowledgement (over-ack and under-ack). However for a release of Couchbase Server under-ack is the issue which will cause a noticeable problem.

            Case 1. over-ack

            This is where the DcpConsumer acknowledges more that it was sent (the DCP protocol is literally counting bytes received and sending an ACK message "I have received n bytes").

            For a development build this yields an underflow exception when the DcpProducer processes the ACK message and lead to this MB, for a release the underflow is masked and the DcpProducer will just accept the ACK, zero it's unacked_bytes counter and carry on (maybe logging a message).

            Case 2. under-ack

            This is where the DcpConsumer acknowledges less than it was sent, eventually the DcpProducer will pause sending data because the consumer hasn't correctly acknowledged all previously sent data. This is the more concerning issue as it affects GA releases. It also can have a negative effect on DCP latency as the DcpProducer may frequently pause/resume replication as the unacked_bytes value gets closer to the upper bound.

            For repro/validation of under-ack, which is the most problematic we need to start here https://issues.couchbase.com/browse/MB-43205

            • In that MB Ashwin was able to create deleted documents with xattr+value
            • When that xattr+value document is replicated onto for example a 6.6.2 node which has MB-47318 unfixed, KV will 'fix' the delete, throwing away some of the value.

            To test MB-47318 we need many documents like that to be replicated to a 6.6.2 node whilst the node is under memory pressure (and may need to background process the DCP message).

            So maybe something like this

            • With a single 6.5.2 node (nodeA), a low memory quota and full eviction generate many 'bad' documents.
            • Now offline upgrade to 6.6.2
            • Now add a second 6.6.2 node (nodeB)

            The idea is that if enough documents are replicated (so that nodeB uses up all of its quota) nodeB will begin buffering and fixing documents then under-ack

            We will know if we have reproduced the issue because either

            • rebalance fails and nodeA shows that its replication to nodeB is paused due to unacked_bytes exceeding the limit
            • rebalance is successful, the system is quiesced and nodeA has a non 0 unacked_bytes
            jwalker Jim Walker added a comment - - edited Ashwin Govindarajulu as requested some ideas for reproduction... There are two cases to consider for reproduction and validation - over-acknowledgement and under-acknowledgement (over-ack and under-ack). However for a release of Couchbase Server under-ack is the issue which will cause a noticeable problem. Case 1. over-ack This is where the DcpConsumer acknowledges more that it was sent (the DCP protocol is literally counting bytes received and sending an ACK message "I have received n bytes"). For a development build this yields an underflow exception when the DcpProducer processes the ACK message and lead to this MB, for a release the underflow is masked and the DcpProducer will just accept the ACK, zero it's unacked_bytes counter and carry on (maybe logging a message). Case 2. under-ack This is where the DcpConsumer acknowledges less than it was sent, eventually the DcpProducer will pause sending data because the consumer hasn't correctly acknowledged all previously sent data. This is the more concerning issue as it affects GA releases. It also can have a negative effect on DCP latency as the DcpProducer may frequently pause/resume replication as the unacked_bytes value gets closer to the upper bound. For repro/validation of under-ack, which is the most problematic we need to start here https://issues.couchbase.com/browse/MB-43205 In that MB Ashwin was able to create deleted documents with xattr+value When that xattr+value document is replicated onto for example a 6.6.2 node which has MB-47318 unfixed, KV will 'fix' the delete, throwing away some of the value. To test MB-47318 we need many documents like that to be replicated to a 6.6.2 node whilst the node is under memory pressure (and may need to background process the DCP message). So maybe something like this With a single 6.5.2 node (nodeA), a low memory quota and full eviction generate many 'bad' documents. Now offline upgrade to 6.6.2 Now add a second 6.6.2 node (nodeB) The idea is that if enough documents are replicated (so that nodeB uses up all of its quota) nodeB will begin buffering and fixing documents then under-ack We will know if we have reproduced the issue because either rebalance fails and nodeA shows that its replication to nodeB is paused due to unacked_bytes exceeding the limit rebalance is successful, the system is quiesced and nodeA has a non 0 unacked_bytes
            jwalker Jim Walker added a comment -

            backport to mad-hatter in progress

            jwalker Jim Walker added a comment - backport to mad-hatter in progress
            jwalker Jim Walker added a comment -

            Looks like this fix has introduced a bug - MH backport on hold

            MB-49329

            jwalker Jim Walker added a comment - Looks like this fix has introduced a bug - MH backport on hold MB-49329
            jwalker Jim Walker added a comment -

            Also MB-49346 which is definitley an issue with this change. MB-49329 is still not clear as to the cause

            jwalker Jim Walker added a comment - Also MB-49346 which is definitley an issue with this change. MB-49329 is still not clear as to the cause
            jwalker Jim Walker added a comment -

            Will revert final patch of this series (the fix) causing more problems than it fixes - tweaks or a new approach required.

            jwalker Jim Walker added a comment - Will revert final patch of this series (the fix) causing more problems than it fixes - tweaks or a new approach required.

            Build couchbase-server-7.1.0-1650 contains kv_engine commit 1efadb5 with commit message:
            MB-47318: Partially revert bb20f27c

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1650 contains kv_engine commit 1efadb5 with commit message: MB-47318 : Partially revert bb20f27c
            paolo.cocchi Paolo Cocchi added a comment -

            For reference, MB-49412 is another issues caused by this fix. The revert here fixes MB-49412 too.

            paolo.cocchi Paolo Cocchi added a comment - For reference, MB-49412 is another issues caused by this fix. The revert here fixes MB-49412 too.

            Build couchbase-server-7.1.0-1723 contains kv_engine commit 2ea007f with commit message:
            MB-47318: Save the 'pristine' message size when consumer buffers

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1723 contains kv_engine commit 2ea007f with commit message: MB-47318 : Save the 'pristine' message size when consumer buffers

            Build couchbase-server-7.1.0-1723 contains kv_engine commit b9098f6 with commit message:
            MB-47318: Push UpdateFlowControl through to PassiveStream

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1723 contains kv_engine commit b9098f6 with commit message: MB-47318 : Push UpdateFlowControl through to PassiveStream
            ritesh.agarwal Ritesh Agarwal added a comment - - edited

            Seems like i am hitting the same bug in the volume test on node 172.23.106.233. Here are the logs for the same:
            [^collectinfo-2021-11-24T093650-ns_1@172.23.105.175.zip]
            [^collectinfo-2021-11-24T093650-ns_1@172.23.106.233.zip]
            [^collectinfo-2021-11-24T093650-ns_1@172.23.106.236.zip]
            [^collectinfo-2021-11-24T093650-ns_1@172.23.106.238.zip]
            [^collectinfo-2021-11-24T093650-ns_1@172.23.121.74.zip]

            Test:

            guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/magma_temp_job1.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.Hospital.Murphy.SystemTestMagma,nodes_init=4,graceful=True,skip_cleanup=True,num_items=10000000,num_buckets=1,bucket_names=GleamBook,doc_size=1024,bucket_type=membase,eviction_policy=fullEviction,iterations=10,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,assert_crashes_on_load=True,num_collections=50,maxttl=10,num_indexes=50,pc=25,index_nodes=0,cbas_nodes=0,fts_nodes=0,ops_rate=30000,ramQuota=4096,doc_ops=create:update:delete:read,rebl_ops_rate=20000,key_type=RandomKey,vbuckets=1024,mutation_perc=100 -m rest'
            

            ritesh.agarwal Ritesh Agarwal added a comment - - edited Seems like i am hitting the same bug in the volume test on node 172.23.106.233. Here are the logs for the same: [^collectinfo-2021-11-24T093650-ns_1@172.23.105.175.zip] [^collectinfo-2021-11-24T093650-ns_1@172.23.106.233.zip] [^collectinfo-2021-11-24T093650-ns_1@172.23.106.236.zip] [^collectinfo-2021-11-24T093650-ns_1@172.23.106.238.zip] [^collectinfo-2021-11-24T093650-ns_1@172.23.121.74.zip] Test: guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/magma_temp_job1.ini -p bucket_storage=magma,bucket_eviction_policy=fullEviction,rerun=False -t aGoodDoctor.Hospital.Murphy.SystemTestMagma,nodes_init=4,graceful=True,skip_cleanup=True,num_items=10000000,num_buckets=1,bucket_names=GleamBook,doc_size=1024,bucket_type=membase,eviction_policy=fullEviction,iterations=10,batch_size=1000,sdk_timeout=60,log_level=debug,infra_log_level=debug,rerun=False,skip_cleanup=True,key_size=18,randomize_doc_size=False,randomize_value=True,assert_crashes_on_load=True,num_collections=50,maxttl=10,num_indexes=50,pc=25,index_nodes=0,cbas_nodes=0,fts_nodes=0,ops_rate=30000,ramQuota=4096,doc_ops=create:update:delete:read,rebl_ops_rate=20000,key_type=RandomKey,vbuckets=1024,mutation_perc=100 -m rest'
            jwalker Jim Walker added a comment - - edited

            It's the same error message - but same bug is unknown.

            DCP Consumer has acked 954 too much (which will get masked when we disable debug exceptions). Acking too much is not great, but at least won't lead to a hang.

            WARNING (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.106.233->ns_1@172.23.121.74:GleamBookUsers0 - Attempting to release 52430534 bytes which is greater than bytesOutstanding:52429580
             
            ThrowExceptionUnderflowPolicy current:52429580 arg:52430534
            

            It could be.

            • There are 2 buffer ack bugs - i've only fixed one
              • producer and consumer both manually have to do the accounting - it's a pain and has caused so many bugs now
            • My latest changes are buggy (they aren't hugely invasive though)
            jwalker Jim Walker added a comment - - edited It's the same error message - but same bug is unknown. DCP Consumer has acked 954 too much (which will get masked when we disable debug exceptions). Acking too much is not great, but at least won't lead to a hang. WARNING (GleamBookUsers0) DCP (Producer) eq_dcpq:replication:ns_1@172.23.106.233->ns_1@172.23.121.74:GleamBookUsers0 - Attempting to release 52430534 bytes which is greater than bytesOutstanding:52429580   ThrowExceptionUnderflowPolicy current:52429580 arg:52430534 It could be. There are 2 buffer ack bugs - i've only fixed one producer and consumer both manually have to do the accounting - it's a pain and has caused so many bugs now My latest changes are buggy (they aren't hugely invasive though)

            Hey Jim Walker, Let me know if we need a separate ticket to debug/analyse it further?

            ritesh.agarwal Ritesh Agarwal added a comment - Hey Jim Walker , Let me know if we need a separate ticket to debug/analyse it further?
            jwalker Jim Walker added a comment -

            I think i've spotted it - an issue in 2ea007f when we are closing a stream

            jwalker Jim Walker added a comment - I think i've spotted it - an issue in 2ea007f when we are closing a stream

            Build couchbase-server-7.1.0-1817 contains kv_engine commit d83d53e with commit message:
            MB-47318: Address double acking when closing a stream

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1817 contains kv_engine commit d83d53e with commit message: MB-47318 : Address double acking when closing a stream

            Build couchbase-server-6.6.4-9940 contains kv_engine commit caac463 with commit message:
            MB-47318: [BP] Add DcpControl to consumer and allow flow control override

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.4-9940 contains kv_engine commit caac463 with commit message: MB-47318 : [BP] Add DcpControl to consumer and allow flow control override

            Build couchbase-server-6.6.4-9940 contains kv_engine commit 75d6f95 with commit message:
            MB-47318: [BP] Add DocKey::makeWireEncodedString

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.4-9940 contains kv_engine commit 75d6f95 with commit message: MB-47318 : [BP] Add DocKey::makeWireEncodedString

            Build couchbase-server-6.6.4-9940 contains kv_engine commit 5786728 with commit message:
            MB-47318: Refactor: Move dcp_snapshot_marker_codec.h/cc

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.4-9940 contains kv_engine commit 5786728 with commit message: MB-47318 : Refactor: Move dcp_snapshot_marker_codec.h/cc

            Build couchbase-server-6.6.4-9941 contains kv_engine commit 3010db4 with commit message:
            MB-47318: [BP] Add PassiveStream always buffered mode

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.4-9941 contains kv_engine commit 3010db4 with commit message: MB-47318 : [BP] Add PassiveStream always buffered mode

            Build couchbase-server-6.6.4-9942 contains kv_engine commit 7131d27 with commit message:
            MB-47318: [BP] Push UpdateFlowControl through to PassiveStream

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.4-9942 contains kv_engine commit 7131d27 with commit message: MB-47318 : [BP] Push UpdateFlowControl through to PassiveStream

            Build couchbase-server-6.6.4-9942 contains kv_engine commit 645f02e with commit message:
            MB-47318: [BP] Add disabled tests for the issue

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.4-9942 contains kv_engine commit 645f02e with commit message: MB-47318 : [BP] Add disabled tests for the issue

            Build couchbase-server-6.6.4-9942 contains kv_engine commit 08f5eb9 with commit message:
            MB-47318: [BP] Add a make_wire_encoded_string factory method in xattr utils

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.4-9942 contains kv_engine commit 08f5eb9 with commit message: MB-47318 : [BP] Add a make_wire_encoded_string factory method in xattr utils

            Build couchbase-server-6.6.4-9942 contains kv_engine commit e6fe669 with commit message:
            MB-47318: [BP] Add DcpConsumer related client connection methods

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.4-9942 contains kv_engine commit e6fe669 with commit message: MB-47318 : [BP] Add DcpConsumer related client connection methods

            Build couchbase-server-6.6.4-9946 contains kv_engine commit 515dce7 with commit message:
            MB-47318: [BP] Save the 'pristine' message size when consumer buffers

            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.4-9946 contains kv_engine commit 515dce7 with commit message: MB-47318 : [BP] Save the 'pristine' message size when consumer buffers

            Build couchbase-server-7.0.3-7020 contains kv_engine commit 75d6f95 with commit message:
            MB-47318: [BP] Add DocKey::makeWireEncodedString

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.3-7020 contains kv_engine commit 75d6f95 with commit message: MB-47318 : [BP] Add DocKey::makeWireEncodedString

            Build couchbase-server-7.0.3-7020 contains kv_engine commit 5786728 with commit message:
            MB-47318: Refactor: Move dcp_snapshot_marker_codec.h/cc

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.3-7020 contains kv_engine commit 5786728 with commit message: MB-47318 : Refactor: Move dcp_snapshot_marker_codec.h/cc
            jwalker Jim Walker added a comment -

            Fixed in all releases

            jwalker Jim Walker added a comment - Fixed in all releases

            Build couchbase-server-7.0.4-7201 contains kv_engine commit 515dce7 with commit message:
            MB-47318: [BP] Save the 'pristine' message size when consumer buffers

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.4-7201 contains kv_engine commit 515dce7 with commit message: MB-47318 : [BP] Save the 'pristine' message size when consumer buffers

            Build couchbase-server-7.0.4-7201 contains kv_engine commit 7131d27 with commit message:
            MB-47318: [BP] Push UpdateFlowControl through to PassiveStream

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.4-7201 contains kv_engine commit 7131d27 with commit message: MB-47318 : [BP] Push UpdateFlowControl through to PassiveStream

            Build couchbase-server-7.0.4-7201 contains kv_engine commit 645f02e with commit message:
            MB-47318: [BP] Add disabled tests for the issue

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.4-7201 contains kv_engine commit 645f02e with commit message: MB-47318 : [BP] Add disabled tests for the issue

            Build couchbase-server-7.0.4-7201 contains kv_engine commit 08f5eb9 with commit message:
            MB-47318: [BP] Add a make_wire_encoded_string factory method in xattr utils

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.4-7201 contains kv_engine commit 08f5eb9 with commit message: MB-47318 : [BP] Add a make_wire_encoded_string factory method in xattr utils

            Build couchbase-server-7.0.4-7201 contains kv_engine commit e6fe669 with commit message:
            MB-47318: [BP] Add DcpConsumer related client connection methods

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.4-7201 contains kv_engine commit e6fe669 with commit message: MB-47318 : [BP] Add DcpConsumer related client connection methods

            Build couchbase-server-7.1.0-1915 contains kv_engine commit 515dce7 with commit message:
            MB-47318: [BP] Save the 'pristine' message size when consumer buffers

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1915 contains kv_engine commit 515dce7 with commit message: MB-47318 : [BP] Save the 'pristine' message size when consumer buffers

            Build couchbase-server-7.1.0-1915 contains kv_engine commit 7131d27 with commit message:
            MB-47318: [BP] Push UpdateFlowControl through to PassiveStream

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1915 contains kv_engine commit 7131d27 with commit message: MB-47318 : [BP] Push UpdateFlowControl through to PassiveStream

            Build couchbase-server-7.1.0-1915 contains kv_engine commit 645f02e with commit message:
            MB-47318: [BP] Add disabled tests for the issue

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1915 contains kv_engine commit 645f02e with commit message: MB-47318 : [BP] Add disabled tests for the issue

            Build couchbase-server-7.1.0-1915 contains kv_engine commit 08f5eb9 with commit message:
            MB-47318: [BP] Add a make_wire_encoded_string factory method in xattr utils

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1915 contains kv_engine commit 08f5eb9 with commit message: MB-47318 : [BP] Add a make_wire_encoded_string factory method in xattr utils

            Build couchbase-server-7.1.0-1915 contains kv_engine commit e6fe669 with commit message:
            MB-47318: [BP] Add DcpConsumer related client connection methods

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1915 contains kv_engine commit e6fe669 with commit message: MB-47318 : [BP] Add DcpConsumer related client connection methods

            Build couchbase-server-7.1.0-1915 contains kv_engine commit 3010db4 with commit message:
            MB-47318: [BP] Add PassiveStream always buffered mode

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1915 contains kv_engine commit 3010db4 with commit message: MB-47318 : [BP] Add PassiveStream always buffered mode

            Build couchbase-server-7.1.0-1915 contains kv_engine commit caac463 with commit message:
            MB-47318: [BP] Add DcpControl to consumer and allow flow control override

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1915 contains kv_engine commit caac463 with commit message: MB-47318 : [BP] Add DcpControl to consumer and allow flow control override

            Build couchbase-server-7.1.0-1915 contains kv_engine commit 75d6f95 with commit message:
            MB-47318: [BP] Add DocKey::makeWireEncodedString

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1915 contains kv_engine commit 75d6f95 with commit message: MB-47318 : [BP] Add DocKey::makeWireEncodedString

            Build couchbase-server-7.1.0-1915 contains kv_engine commit 5786728 with commit message:
            MB-47318: Refactor: Move dcp_snapshot_marker_codec.h/cc

            build-team Couchbase Build Team added a comment - Build couchbase-server-7.1.0-1915 contains kv_engine commit 5786728 with commit message: MB-47318 : Refactor: Move dcp_snapshot_marker_codec.h/cc
            ritam.sharma Ritam Sharma added a comment -

            Ankush Sharma - Can you please validate defect for 6.6.5.

            ritam.sharma Ritam Sharma added a comment - Ankush Sharma - Can you please validate defect for 6.6.5.
            ashwin.govindarajulu Ashwin Govindarajulu added a comment - - edited

            Update:

            I tried reproducing the under-ack scenario using the following process,

            • Single node cluster running 6.5.2 build 6634
            • Couchbase bucket with 100M RAM, replica=1, eviction=full
            • Load 1M docs to bring the bucket ~ 20% DGM
            • Load more docs with xattr and doc_ttl=30seconds
            • Wait for docs to expire then read the docs to create tombstones
            • Perform offline upgrade of single node cluster to 6.6.2-9588
            • Still seeing bad docs on disk (MB-43205)

            Doc seq: 978
                 id: (collection:0x0:default) test_docs-00749872
                 rev: 6
                 content_meta: 128
                 size (on disk): 108
                 cas: 1641304587755323392, expiry: 1641304588, flags: 33554432, datatype: 0x05 (json,xattr)
                 doc deleted (explicit)
                 size: 103
                 xattrs: {"xattr_1":"xattr_val"}
                 data: (snappy) {"mutated":0,"body":"","first_name":"sharon","mutation_type":"ADD","age":1}

            * Added second node running 6.6.2-9588 and triggered rebalance

            • Rebalance succeeds

            Not able to hit the scenario of unacked bytes in dcp

            ]# /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password all | egrep "wat|mem_u"
             ep_cursor_dropping_checkpoint_mem_upper_mark:          50
             ep_mem_high_wat:                                       89128960
             ep_mem_high_wat_percent:                               0.85
             ep_mem_low_wat:                                        78643200
             ep_mem_low_wat_percent:                                0.75
             ep_mem_used_merge_threshold_percent:                   0.5
             ep_storage_age_highwat:                                1370144
             iovused_high_watermark:                                1
             mem_used:                                              83156792
             mem_used_estimate:                                     83156792
             msgused_high_watermark:                                1
             
            ]# /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password dcp | grep unacked
             eq_dcpq:replication:ns_1@10.112.212.101->ns_1@10.112.212.102:default:unacked_bytes:                             0
             eq_dcpq:replication:ns_1@10.112.212.102->ns_1@10.112.212.101:default:unacked_bytes:                             0

            Jim Walker can you please help validating the steps ?

            ashwin.govindarajulu Ashwin Govindarajulu added a comment - - edited Update : I tried reproducing the under-ack scenario using the following process, Single node cluster running 6.5.2 build 6634 Couchbase bucket with 100M RAM, replica=1, eviction=full Load 1M docs to bring the bucket ~ 20% DGM Load more docs with xattr and doc_ttl=30seconds Wait for docs to expire then read the docs to create tombstones Perform offline upgrade of single node cluster to 6.6.2-9588 Still seeing bad docs on disk (MB-43205) Doc seq: 978 id: (collection:0x0:default) test_docs-00749872 rev: 6 content_meta: 128 size (on disk): 108 cas: 1641304587755323392, expiry: 1641304588, flags: 33554432, datatype: 0x05 (json,xattr) doc deleted (explicit) size: 103 xattrs: {"xattr_1":"xattr_val"} data: (snappy) {"mutated":0,"body":"","first_name":"sharon","mutation_type":"ADD","age":1} * Added second node running 6.6.2-9588 and triggered rebalance Rebalance succeeds Not able to hit the scenario of unacked bytes in dcp ]# /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password all | egrep "wat|mem_u" ep_cursor_dropping_checkpoint_mem_upper_mark: 50 ep_mem_high_wat: 89128960 ep_mem_high_wat_percent: 0.85 ep_mem_low_wat: 78643200 ep_mem_low_wat_percent: 0.75 ep_mem_used_merge_threshold_percent: 0.5 ep_storage_age_highwat: 1370144 iovused_high_watermark: 1 mem_used: 83156792 mem_used_estimate: 83156792 msgused_high_watermark: 1   ]# /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password dcp | grep unacked eq_dcpq:replication:ns_1@10.112.212.101->ns_1@10.112.212.102:default:unacked_bytes: 0 eq_dcpq:replication:ns_1@10.112.212.102->ns_1@10.112.212.101:default:unacked_bytes: 0 Jim Walker can you please help validating the steps ?
            jwalker Jim Walker added a comment -

            You need more memory pressure (go for even more DGM)

            If you don't see the messages like the following in memcached.log of the new node you are no triggering the scenario (e.g. grep for while trying to process )

            2021-07-11T00:28:47.372380-07:00 WARNING 6680: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.102->ns_1@172.23.106.217:default - vb:788 Got error 'no memory' while trying to process deletion with seqno:45047
            

            jwalker Jim Walker added a comment - You need more memory pressure (go for even more DGM) If you don't see the messages like the following in memcached.log of the new node you are no triggering the scenario (e.g. grep for while trying to process ) 2021-07-11T00:28:47.372380-07:00 WARNING 6680: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.102->ns_1@172.23.106.217:default - vb:788 Got error 'no memory' while trying to process deletion with seqno:45047

            Update:

            Had multiple runs with lower DGM levels (< 9%) + bad documents in the system.

            As suggested by Jim Walker, tried toggling flusher on/off using cbepctl command in the rebalance_in node to create the mem pressure.

            Still could not get the expected error (no memory) in the consumer node or "bytes which is greater than" in the producer side.

            Based on the discussions, will mark this bug as closed on unit_test results.

             

            ashwin.govindarajulu Ashwin Govindarajulu added a comment - Update : Had multiple runs with lower DGM levels (< 9%) + bad documents in the system. As suggested by Jim Walker , tried toggling flusher on/off using cbepctl command in the rebalance_in node to create the mem pressure. Still could not get the expected error ( no memory ) in the consumer node or " bytes which is greater than" in the producer side. Based on the discussions, will mark this bug as closed on unit_test results.  

            Closing based on unit_test runs.

            ashwin.govindarajulu Ashwin Govindarajulu added a comment - Closing based on unit_test runs.
            wayne Wayne Siu added a comment -

            Ashwin Govindarajulu 

            Can you also verify the fixes in 7.0.4?  Thanks.

            wayne Wayne Siu added a comment - Ashwin Govindarajulu   Can you also verify the fixes in 7.0.4?  Thanks.

            People

              ashwin.govindarajulu Ashwin Govindarajulu
              ankush.sharma Ankush Sharma
              Votes:
              0 Vote for this issue
              Watchers:
              10 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are 3 open Gerrit changes

                  PagerDuty