Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-49329

[Jepsen] Underflow exception seen while processing {{DCP_BUFFER_ACKNOWLEDGEMENT}}

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 7.1.0
    • 7.1.0
    • couchbase-bucket
    • Couchbase version 7.1.0-1620
      Ubuntu 18.04

    Description

      Underflow exception seeing during kv-engine-neo-jepsen-nightly-114 while processing a DCP_BUFFER_ACKNOWLEDGEMENT op.

      2021-11-02T06:42:08.608731-07:00 ERROR 64: Exception occurred during packet execution. Closing connection: ThrowExceptionUnderflowPolicy current:4262112 arg:17049478. Cookies: [{"aiostat":"success","connection":"[ {\"ip\":\"172.23.107.227\",\"port\":59326} - {\"ip\":\"172.23.106.61\",\"port\":11209} (System, <ud>@ns_server</ud>) ]","engine_storage":"0x0000000000000000","ewouldblock":false,"packet":{"bodylen":4,"cas":0,"datatype":"raw","extlen":4,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":268,"opcode":"DCP_BUFFER_ACKNOWLEDGEMENT","vbucket":0},"refcount":1}] Exception thrown from: ["#0  /opt/couchbase/bin/memcached() [0x400000+0x1339e4]","#1  /opt/couchbase/bin/memcached() [0x400000+0x3e7748]","#2  /opt/couchbase/bin/memcached() [0x400000+0x3e7fa0]","#3  /opt/couchbase/bin/memcached() [0x400000+0x3e81ad]","#4  /opt/couchbase/bin/memcached() [0x400000+0x3e84b4]","#5  /opt/couchbase/bin/memcached() [0x400000+0x29c6bb]","#6  /opt/couchbase/bin/memcached() [0x400000+0x1858ef]","#7  /opt/couchbase/bin/memcached() [0x400000+0x224dc0]","#8  /opt/couchbase/bin/memcached() [0x400000+0x1f793b]","#9  /opt/couchbase/bin/memcached() [0x400000+0x1f8ce0]","#10 /opt/couchbase/bin/memcached() [0x400000+0x1d8ad1]","#11 /opt/couchbase/bin/memcached() [0x400000+0x1d8d63]","#12 /opt/couchbase/bin/memcached() [0x400000+0x1d93eb]","#13 /opt/couchbase/bin/../lib/libevent_core-2.1.so.7() [0x7fc7055a6000+0xed8e]","#14 /opt/couchbase/bin/../lib/libevent_core-2.1.so.7() [0x7fc7055a6000+0x17d01]","#15 /opt/couchbase/bin/../lib/libevent_core-2.1.so.7(event_base_loop+0x3bf) [0x7fc7055a6000+0x1854f]","#16 /opt/couchbase/bin/memcached() [0x400000+0x7c7577]","#17 /opt/couchbase/bin/memcached() [0x400000+0x7c7a46]","#18 /opt/couchbase/bin/memcached() [0x400000+0x7c9406]","#19 /opt/couchbase/bin/memcached() [0x400000+0x1a84d9]","#20 /opt/couchbase/bin/memcached() [0x400000+0x7532b9]","#21 /lib/x86_64-linux-gnu/libpthread.so.0() [0x7fc7062eb000+0x76db]","#22 /lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7fc703a76000+0x12171f]"]
      

      Logs also include pcaps.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          Closing issue as we're no longer seeing the exception on Jepsen nightly test runs for neo

          richard.demellow Richard deMellow added a comment - Closing issue as we're no longer seeing the exception on Jepsen nightly test runs for neo
          jwalker Jim Walker added a comment -

          Reverted MB-47318 which resolves this issue

          jwalker Jim Walker added a comment - Reverted MB-47318 which resolves this issue
          jwalker Jim Walker added a comment -

          Spotted issue - we're double counting unacked bytes when we set a stream to dead.

          Near to the exception on .187 we can see .227 closed a stream that is processing data between the two.

          2021-11-02T06:42:03.907873-07:00 INFO 58: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.105.187->ns_1@172.23.107.227:default - (vb:253) Setting stream to dead state, last_seqno is 0, unAckedBytes is 0, status is The stream closed early due to a close stream message
          

          jwalker Jim Walker added a comment - Spotted issue - we're double counting unacked bytes when we set a stream to dead. https://github.com/couchbase/kv_engine/blob/master/engines/ep/src/dcp/consumer.cc#L371 Near to the exception on .187 we can see .227 closed a stream that is processing data between the two. 2021-11-02T06:42:03.907873-07:00 INFO 58: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.105.187->ns_1@172.23.107.227:default - (vb:253) Setting stream to dead state, last_seqno is 0, unAckedBytes is 0, status is The stream closed early due to a close stream message
          jwalker Jim Walker added a comment - - edited

          The multiple threads calling back shouldn't happen - dcp step (consumer) should be invoked by the same frontend thread - but this issue suggests we may have had concurrent calls to handleFlowCtl and produced duplicate acks? Or something completely different

          We can see underflow captured twice in this log set. 172.23.106.61 and 172.23.105.187 logged the issue.

          172.23.106.61 underflowed from 172.23.107.227

           Exception occurred during packet execution. Closing connection: ThrowExceptionUnderflowPolicy current:4262112 arg:17049478
          

          172.23.105.187 underflowed from 172.23.107.227

           Exception occurred during packet execution. Closing connection: ThrowExceptionUnderflowPolicy current:28 arg:17049535.
          

          I looked at the PCAP data (Couchbase-172.23.107.227.pcap.1635860378) for 172.23.105.187 <-> 172.23.107.227, the log message from the underflow details the opaque (244), and we do see a DCP ack with that opaque for the overflowed size, this is packet 66897 from the capture.

          Searching for DCP acks only, opaque 242 is an ack for 2131311, and opaque 240 is an ack for 2131339

          So it seems we've done

          • ack 2131339
          • ack 2131311
          • ack 17049535 <- bad

          Maybe not concurrent acks (i'd expect the ack amounts to be closer) - but some other accounting issue...

          jwalker Jim Walker added a comment - - edited The multiple threads calling back shouldn't happen - dcp step (consumer) should be invoked by the same frontend thread - but this issue suggests we may have had concurrent calls to handleFlowCtl and produced duplicate acks? Or something completely different We can see underflow captured twice in this log set. 172.23.106.61 and 172.23.105.187 logged the issue. 172.23.106.61 underflowed from 172.23.107.227 Exception occurred during packet execution. Closing connection: ThrowExceptionUnderflowPolicy current:4262112 arg:17049478 172.23.105.187 underflowed from 172.23.107.227 Exception occurred during packet execution. Closing connection: ThrowExceptionUnderflowPolicy current:28 arg:17049535. I looked at the PCAP data (Couchbase-172.23.107.227.pcap.1635860378) for 172.23.105.187 <-> 172.23.107.227, the log message from the underflow details the opaque (244), and we do see a DCP ack with that opaque for the overflowed size, this is packet 66897 from the capture. Searching for DCP acks only, opaque 242 is an ack for 2131311, and opaque 240 is an ack for 2131339 So it seems we've done ack 2131339 ack 2131311 ack 17049535 <- bad Maybe not concurrent acks (i'd expect the ack amounts to be closer) - but some other accounting issue...
          jwalker Jim Walker added a comment -

          Suspect that we are triggering multiple threads to callback into the consumer and concurrently calling handleFlowCtl

          Even with the use of atomics, I can't see what would stop two threads reading freedBytes and then acking twice.

          Trying to reproduce this situation using the 'alwaysBuffered' hidden DCP consumer option - but nothing yet

          jwalker Jim Walker added a comment - Suspect that we are triggering multiple threads to callback into the consumer and concurrently calling handleFlowCtl https://github.com/couchbase/kv_engine/blob/15b36716b5cd32f337b0a58251071ad953b9911c/engines/ep/src/dcp/flow-control.cc#L39 Even with the use of atomics, I can't see what would stop two threads reading freedBytes and then acking twice. Trying to reproduce this situation using the 'alwaysBuffered' hidden DCP consumer option - but nothing yet
          jwalker Jim Walker added a comment - - edited

          Suspect this is picking up an issue with MB-47318, unproven but that did change acking logic.

          The test does encounter dcp consumer buffering

          Got error 'no memory' while trying to process prepare with seqno:1
          

          Maybe we've double acked - checking the code now for any incorrect/missing locks etc...

          jwalker Jim Walker added a comment - - edited Suspect this is picking up an issue with MB-47318 , unproven but that did change acking logic. The test does encounter dcp consumer buffering Got error 'no memory' while trying to process prepare with seqno:1 Maybe we've double acked - checking the code now for any incorrect/missing locks etc...

          People

            richard.demellow Richard deMellow
            richard.demellow Richard deMellow
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty