Couchbase Server / MB-47318

DCP consumer acknowledgement under/over counting in replication flow control


Details

    Description

      2021-07-11T07:28:58.094400+00:00 WARNING (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.102->ns_1@172.23.106.217:default - Attempting to release 7173586 bytes which is greater than bytesOutstanding:7173282
      

      Steps to Recreate:

      1. Create a 4-node cluster (RAM = 256 MB per node, replicas = 1)
      2. Create 40 million items (doc size = 256)
      3. Start upserting all of the existing 40 million items
      4. While the upsert load is running, keep restarting Couchbase Server 5 times with a 3 s wait between each restart (see the restart-loop sketch after this list)
      5. Wait for the bucket to warm up.
      6. Repeat steps 4 and 5. Observed "Exception occurred during packet execution. Closing connection: ThrowExceptionUnderflowPolicy" in the logs
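
      The restart loop in step 4 can be driven with something like the following sketch. It assumes a systemd-managed Linux install where the service is named couchbase-server; the service name, restart count and wait are placeholders for the test environment:

      import subprocess
      import time

      # Step 4: restart Couchbase Server 5 times, waiting 3 seconds between
      # restarts, while the upsert load keeps running elsewhere.
      for _ in range(5):
          subprocess.run(["systemctl", "restart", "couchbase-server"], check=True)
          time.sleep(3)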

      Memcached Logs:
      Error found on Node 172.23.107.102

      ERROR 45016: Exception occurred during packet execution. Closing connection: ThrowExceptionUnderflowPolicy current:7173282 arg:7173586. Cookies: [{"aiostat":"success","connection":"[ {\\"ip\\":\\"172.23.106.217\\",\\"port\\":55810} - {\\"ip\\":\\"172.23.107.102\\",\\"port\\":11209} (System, <ud>@ns_server</ud>) ]","engine_storage":"0x0000000000000000","ewouldblock":false,"packet":{"bodylen":4,"cas":0,"datatype":"raw","extlen":4,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":102,"opcode":"DCP_BUFFER_ACKNOWLEDGEMENT","vbucket":0},"refcount":1}] Exception thrown from: ["#0  /opt/couchbase/bin/memcached() [0x400000+0x9e454]","#1  /opt/couchbase/bin/memcached() [0x400000+0xc22fe]","#2  /opt/couchbase/bin/memcached() [0x400000+0x3c7cf4]","#3  /opt/couchbase/bin/memcached() [0x400000+0x3c7e4d]","#4  /opt/couchbase/bin/memcached() [0x400000+0x3c8054]","#5  /opt/couchbase/bin/memcached() [0x400000+0x288739]","#6  /opt/couchbase/bin/memcached() [0x400000+0x17ee4f]","#7  /opt/couchbase/bin/memcached() [0x400000+0x21b3b0]","#8  /opt/couchbase/bin/memcached() [0x400000+0x1e7aa2]","#9  /opt/couchbase/bin/memcached() [0x400000+0x1e94c0]","#10 /opt/couchbase/bin/memcached() [0x400000+0x1ce269]","#11 /opt/couchbase/bin/memcached() [0x400000+0x1ce623]","#12 /opt/couchbase/bin/memcached() [0x400000+0x1cec9b]","#13 /opt/couchbase/lib/libevent_core-2.1.so.7() [0x7fbbc63c5000+0xf8ae]","#14 /opt/couchbase/lib/libevent_core-2.1.so.7() [0x7fbbc63c5000+0x18821]","#15 /opt/couchbase/lib/libevent_core-2.1.so.7(event_base_loop+0x3bf) [0x7fbbc63c5000+0x1906f]","#16 /opt/couchbase/bin/memcached() [0x400000+0x766e47]","#17 /opt/couchbase/bin/memcached() [0x400000+0x767316]","#18 /opt/couchbase/bin/memcached() [0x400000+0x768cd6]","#19 /opt/couchbase/bin/memcached() [0x400000+0x1a0e29]","#20 /opt/couchbase/bin/memcached() [0x400000+0x6f97f9]","#21 /lib/x86_64-linux-gnu/libpthread.so.0() [0x7fbbc671c000+0x9609]","#22 /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fbbc5ab2000+0x122293]"]\n']
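
      For context, the numbers in the error line are the producer's flow-control accounting: current:7173282 is the producer's bytesOutstanding and arg:7173586 is the size of the DCP_BUFFER_ACKNOWLEDGEMENT sent by the consumer, i.e. the consumer acknowledged more bytes than the producer currently has outstanding. A minimal, purely illustrative Python model of that invariant (class and method names are hypothetical, not kv_engine's actual types):

      class FlowControlUnderflow(Exception):
          """Consumer acknowledged more bytes than are outstanding."""

      class ProducerBufferLog:
          """Toy model of the producer-side flow-control byte accounting."""

          def __init__(self) -> None:
              self.bytes_outstanding = 0

          def insert(self, message_bytes: int) -> None:
              # Producer sends a DCP message; the consumer owes an ack for it.
              self.bytes_outstanding += message_bytes

          def acknowledge(self, acked_bytes: int) -> None:
              # Consumer sends DCP_BUFFER_ACKNOWLEDGEMENT for processed bytes.
              if acked_bytes > self.bytes_outstanding:
                  # The equivalent of ThrowExceptionUnderflowPolicy firing.
                  raise FlowControlUnderflow(
                      f"Attempting to release {acked_bytes} bytes which is "
                      f"greater than bytesOutstanding:{self.bytes_outstanding}")
              self.bytes_outstanding -= acked_bytes

      log = ProducerBufferLog()
      log.insert(7173282)
      try:
          log.acknowledge(7173586)   # mirrors current:7173282 arg:7173586
      except FlowControlUnderflow as e:
          print(e)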
      

      QE-TEST:

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.103712.ini bucket_storage=magma,rerun=false,GROUP=P0;256MB,randomize_value=true,bucket_eviction_policy=fullEviction,replicas=1,nodes_init=4,enable_dp=false,collect_pcaps=True,key_size=12,get-cbcollect-info=True,infra_log_level=critical,log_level=error,bucket_storage=magma,upgrade_version=7.1.0-1061 -t storage.magma.magma_crash_recovery.MagmaCrashTests.test_crash_during_ops,graceful=True,bucket_ram_quota=256,multiplier=1,batch_size=500,num_items=40000000,doc_size=256,doc_ops=update,GROUP=P0;256MB'
      

      Attachments

        Issue Links


          Activity

            Ashwin Govindarajulu added a comment - edited

            Update:

            I tried reproducing the under-ack scenario using the following process,

            • Single-node cluster running 6.5.2 build 6634
            • Couchbase bucket with 100 MB RAM quota, replicas=1, eviction=full
            • Load 1M docs to bring the bucket to ~20% DGM
            • Load more docs with xattrs and doc_ttl=30 seconds
            • Wait for the docs to expire, then read them to create tombstones
            • Perform an offline upgrade of the single-node cluster to 6.6.2-9588
            • Still seeing bad docs on disk (MB-43205)

            Doc seq: 978
                 id: (collection:0x0:default) test_docs-00749872
                 rev: 6
                 content_meta: 128
                 size (on disk): 108
                 cas: 1641304587755323392, expiry: 1641304588, flags: 33554432, datatype: 0x05 (json,xattr)
                 doc deleted (explicit)
                 size: 103
                 xattrs: {"xattr_1":"xattr_val"}
                 data: (snappy) {"mutated":0,"body":"","first_name":"sharon","mutation_type":"ADD","age":1}

            • Added a second node running 6.6.2-9588 and triggered rebalance

            • Rebalance succeeds

            Not able to hit the scenario of unacked bytes in DCP.

            ]# /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password all | egrep "wat|mem_u"
             ep_cursor_dropping_checkpoint_mem_upper_mark:          50
             ep_mem_high_wat:                                       89128960
             ep_mem_high_wat_percent:                               0.85
             ep_mem_low_wat:                                        78643200
             ep_mem_low_wat_percent:                                0.75
             ep_mem_used_merge_threshold_percent:                   0.5
             ep_storage_age_highwat:                                1370144
             iovused_high_watermark:                                1
             mem_used:                                              83156792
             mem_used_estimate:                                     83156792
             msgused_high_watermark:                                1
             
            ]# /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password dcp | grep unacked
             eq_dcpq:replication:ns_1@10.112.212.101->ns_1@10.112.212.102:default:unacked_bytes:                             0
             eq_dcpq:replication:ns_1@10.112.212.102->ns_1@10.112.212.101:default:unacked_bytes:                             0

            Jim Walker, can you please help validate the steps?

            Jim Walker added a comment -

            You need more memory pressure (push the bucket even further into DGM).

            If you don't see messages like the following in the new node's memcached.log, you are not triggering the scenario (e.g. grep for "while trying to process"):

            2021-07-11T00:28:47.372380-07:00 WARNING 6680: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.102->ns_1@172.23.106.217:default - vb:788 Got error 'no memory' while trying to process deletion with seqno:45047
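
            One plausible, purely illustrative way a failed-then-retried item could produce the over-acknowledgement seen on the producer: if the bytes of a message that fails with 'no memory' are counted towards the next buffer ack and then counted again when the retry succeeds, the consumer acknowledges more than the producer has outstanding. The sketch below is hypothetical Python, not kv_engine's actual consumer code; it only illustrates that kind of double count:

            from collections import deque

            class ConsumerBuffer:
                """Illustrative consumer buffer with a deliberate double-count bug."""

                def __init__(self) -> None:
                    self.pending = deque()   # (message, size) awaiting processing
                    self.bytes_to_ack = 0    # bytes for the next DCP_BUFFER_ACKNOWLEDGEMENT

                def receive(self, message: str, size: int) -> None:
                    self.pending.append((message, size))

                def process_one(self, have_memory: bool) -> None:
                    message, size = self.pending[0]
                    # Hypothetical bookkeeping bug: count the bytes before knowing
                    # whether processing will succeed...
                    self.bytes_to_ack += size
                    if not have_memory:
                        # ...then bail out on 'no memory', leaving the item buffered
                        # so the retry counts the same bytes a second time.
                        return
                    self.pending.popleft()

                def drain_ack(self) -> int:
                    acked, self.bytes_to_ack = self.bytes_to_ack, 0
                    return acked

            buf = ConsumerBuffer()
            buf.receive("deletion seqno:45047", 304)
            buf.process_one(have_memory=False)   # 'no memory': item stays buffered
            buf.process_one(have_memory=True)    # retry succeeds
            print(buf.drain_ack())               # 608 acked for a 304-byte message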
            


            Ashwin Govindarajulu added a comment -

            Update:

            Had multiple runs with lower DGM levels (< 9%) plus bad documents in the system.

            As suggested by Jim Walker, tried toggling the flusher on/off using the cbepctl command on the rebalanced-in node to create memory pressure (roughly as in the sketch below).
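
            A rough sketch of that toggle loop, assuming cbepctl's stop/start persistence sub-commands; the node address, credentials, bucket name and timings are placeholders for the test environment:

            import subprocess
            import time

            CBEPCTL = "/opt/couchbase/bin/cbepctl"
            NODE = "localhost:11210"
            AUTH = ["-u", "Administrator", "-p", "password", "-b", "default"]

            # Repeatedly stop and restart persistence (the flusher) so that dirty
            # items accumulate in memory and build up memory pressure on the node.
            for _ in range(10):
                subprocess.run([CBEPCTL, NODE, *AUTH, "stop"], check=True)   # stop persistence
                time.sleep(30)                                               # let dirty data pile up
                subprocess.run([CBEPCTL, NODE, *AUTH, "start"], check=True)  # resume persistence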

            Still could not get the expected error ('no memory') on the consumer node or "bytes which is greater than" on the producer side.

            Based on the discussions, will mark this bug as closed based on the unit_test results.

             


            Ashwin Govindarajulu added a comment -

            Closing based on the unit_test runs.

            Wayne Siu added a comment -

            Ashwin Govindarajulu

            Can you also verify the fixes in 7.0.4? Thanks.


            People

              Ashwin Govindarajulu
              Ankush Sharma
              Votes: 0
              Watchers: 10


                Gerrit Reviews

                  There is 1 open Gerrit change
