  Couchbase Server / MB-52170

queuedSeqnoAck: invariant failed


Details

    • Untriaged
    • Centos 64-bit
    • 1
    • Yes
    • KV June 2022

    Description

       

      Build:  7.1.1-3056

      Scenario:

      • 3 kv node cluster
      • Couchbase bucket with replicas=2
      • Delta failover (172.23.108.97) and rebalance out the node

      CB failure log:

      Rebalance exited with reason {{badmatch, {error, failed_nodes,['ns_1@172.23.108.98']}}},
      [{ns_janitor,cleanup_apply_config_body,4,[{file,"src/ns_janitor.erl"},{line,295}]},
      {ns_janitor,'-cleanup_apply_config/4-fun-0-',4,[{file,"src/ns_janitor.erl"},{line,215}]},
      {async,'-async_init/4-fun-1-',3,[{file,"src/async.erl"},{line,191}]}]}.
      Rebalance Operation Id = 4d82719bd30c809e7ebc3b32e1f55882

      TAF test:

       

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.77890.ini GROUP=P0;durability,upgrade_version=7.1.1-3056 -t failover.failovertests.FailoverTests.test_failover_then_add_back_and_rebalance_in,nodes_init=3,replicas=2,num_failed_nodes=1,num_items=300000,recoveryType=delta:full,graceful=True,durability=MAJORITY,GROUP=P0;durability;windows;luks'

       

       

      Attachments

        Issue Links


          Activity

            navdeep.boparai Navdeep Boparai added a comment - - edited

            The rebalance fails when the janitor running on the orchestrator (ns_1@172.23.108.97) fails on the "apply_new_config_replicas_phase" call to ns_1@172.23.108.98, as shown below:

            ns_1@172.23.108.97 log:

            [ns_server:info,2022-05-16T22:04:11.447-07:00,ns_1@172.23.108.97:<0.2997.2>:janitor_agent:call_on_servers:289]default:Some janitor state change requests (apply_new_config_replicas_phase) have failed:
            [{'ns_1@172.23.108.98',
                 {'EXIT',
                     {{{noproc,
                           {gen_server,call,
                               [undefined,
                                {setup_replication......}
            

            Moving to ns_1@172.23.108.98 to see why the call failed, we can see that "apply_new_config_replicas_phase" kicks off the replication manager on ns_1@172.23.108.98 and starts DCP replication from ns_1@172.23.108.96 to ns_1@172.23.108.98 (see the sketch after the log excerpt):

            [ns_server:debug,2022-05-16T22:04:09.558-07:00,ns_1@172.23.108.98:dcp_replication_manager-default<0.31265.0>:dcp_sup:start_replicator:48]Starting DCP replication from 'ns_1@172.23.108.96' for bucket "default"...
            [ns_server:debug,2022-05-16T22:04:09.562-07:00,ns_1@172.23.108.98:<0.31464.0>:dcp_commands:open_connection:72]Open consumer connection "replication:ns_1@172.23.108.96->ns_1@172.23.108.98:default" on socket #Port<0.1656>: Body <<"{\"consumer_name\":\"ns_1@172.23.108.98\"}">>
            [ns_server:debug,2022-05-16T22:04:09.563-07:00,ns_1@172.23.108.98:dcp_replicator-default-ns_1@172.23.108.96<0.31463.0>:dcp_replicator:init:53]Opened connection to local memcached <0.31464.0>
            [ns_server:debug,2022-05-16T22:04:09.563-07:00,ns_1@172.23.108.98:dcp_replicator-default-<0.31466.0>:dcp_replicator:connect_to_producer:81]initiated new dcp replication with consumer side: <0.31464.0> and producer side: <0.31465.0>
            [ns_server:debug,2022-05-16T22:04:09.570-07:00,ns_1@172.23.108.98:<0.31464.0>:dcp_commands:add_stream:83]Add stream for partition 0, opaque = 0x0, type = add
            [ns_server:debug,2022-05-16T22:04:09.571-07:00,ns_1@172.23.108.98:<0.31464.0>:dcp_commands:add_stream:83]Add stream for partition 1, opaque = 0x1, type = add
            [ns_server:debug,2022-05-16T22:04:09.571-07:00,ns_1@172.23.108.98:<0.31464.0>:dcp_commands:add_stream:83]Add stream for partition 2, opaque = 0x2, type = add
            [ns_server:debug,2022-05-16T22:04:09.571-07:00,ns_1@172.23.108.98:<0.31464.0>:dcp_commands:add_stream:83]Add stream for partition 3, opaque = 0x3, type = add
            [ns_server:debug,2022-05-16T22:04:09.571-07:00,ns_1@172.23.108.98:<0.31464.0>:dcp_commands:add_stream:83]Add stream for partition 4, opaque = 0x4, type = add
            
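            For readers less familiar with this path, below is a minimal, self-contained C++ model of the consumer-side setup sequence visible in the log above: one named consumer connection is opened, then an add-stream request is issued per partition, each tagged with an incrementing opaque so responses can be matched to requests. This is illustrative only; the class and function names are hypothetical and do not come from kv_engine or ns_server.

            // Illustrative model only - names are hypothetical, not kv_engine/ns_server classes.
            #include <cstdint>
            #include <cstdio>
            #include <string>
            #include <utility>
            #include <vector>

            struct AddStreamRequest {
                uint16_t vbucket; // partition id, as in "Add stream for partition N"
                uint32_t opaque;  // echoed back in the response for request matching
            };

            class DcpConsumerModel {
            public:
                explicit DcpConsumerModel(std::string connName)
                    : connName(std::move(connName)) {}

                // Corresponds to dcp_commands:open_connection in the log above.
                void openConnection() {
                    std::printf("Open consumer connection \"%s\"\n", connName.c_str());
                }

                // Corresponds to dcp_commands:add_stream, one call per partition.
                AddStreamRequest addStream(uint16_t vbucket) {
                    AddStreamRequest req{vbucket, nextOpaque++};
                    std::printf("Add stream for partition %u, opaque = 0x%x\n",
                                static_cast<unsigned>(req.vbucket),
                                static_cast<unsigned>(req.opaque));
                    outstanding.push_back(req);
                    return req;
                }

            private:
                std::string connName;
                uint32_t nextOpaque = 0;
                std::vector<AddStreamRequest> outstanding;
            };

            int main() {
                DcpConsumerModel consumer(
                    "replication:ns_1@172.23.108.96->ns_1@172.23.108.98:default");
                consumer.openConnection();
                for (uint16_t vb = 0; vb < 5; ++vb) {
                    consumer.addStream(vb); // mirrors opaques 0x0..0x4 in the log
                }
                return 0;
            }

            Once these streams are established they carry mutations from the active vBuckets on ns_1@172.23.108.96 to the new replica on ns_1@172.23.108.98, and the consumer acknowledges received seqnos back to the producer, which is where the failure described below occurs.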

            Further down the log we see a failure on the DCP stream because the socket for the connection is closed:

            =========================ERROR REPORT=========================
            ** Generic server <0.31465.0> terminating 
            ** Last message in was {tcp_closed,#Port<0.1657>}
            ** When Server state == {state,#Port<0.1657>,
                                        {producer,
                                            "replication:ns_1@172.23.108.96->ns_1@172.23.108.98:default",
                                            'ns_1@172.23.108.96',"default"},
                                        undefined,<<>>,dcp_producer_conn,[],#Port<0.1656>,
                                        <0.31464.0>,true}
            ** Reason for termination ==
            ** socket_closed
            

            The connection closing leads to the replication manager dying, which explains why the initial "apply_new_config_replicas_phase" call failed:

            [ns_server:debug,2022-05-16T22:04:11.436-07:00,ns_1@172.23.108.98:replication_manager-default<0.31266.0>:replication_manager:terminate:71]Replication manager died {{noproc,
                                          {gen_server,call,
                                              [undefined,
                                               {setup_replication,
            

            [error_logger:error,2022-05-16T22:04:11.441-07:00,ns_1@172.23.108.98:janitor_agent-default<0.31269.0>:ale_error_logger_handler:do_log:101]
            =========================ERROR REPORT=========================
            ** Generic server 'janitor_agent-default' terminating 
            ** Last message in was {apply_new_config_replicas_phase,
                                       [{map,
            

            Looking further into the cause of the connection closing, the memcached log for ns_1@172.23.108.96 during the relevant time period shows the failure that caused the connection to be closed (see the sketch after the log excerpt):

            2022-05-16T22:04:09.825400-07:00 ERROR 269: Exception occurred during packet execution. Closing connection: Monotonic<l> (queuedSeqnoAck: ns_1@172.23.108.98) invariant failed: new value (565) breaks invariant on current value (565). Cookies: [{"aiostat":"success","connection":"[ {\"ip\":\"172.23.108.98\",\"port\":44228} - {\"ip\":\"172.23.108.96\",\"port\":11209} (System, <ud>@ns_server</ud>) ]","engine_storage":"0x0000000000000000","ewouldblock":false,"packet":{"bodylen":8,"cas":0,"datatype":"raw","extlen":8,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":5,"opcode":"DCP_SEQNO_ACKNOWLEDGED","vbucket":2},"refcount":1}] Exception thrown from: ["#0  /opt/couchbase/bin/memcached() [0x400000+0x13a726]","#1  /opt/couchbase/bin/memcached() [0x400000+0x3f21ed]","#2  /opt/couchbase/bin/memcached() [0x400000+0x3f6b9a]","#3  /opt/couchbase/bin/memcached() [0x400000+0x3f7052]","#4  /opt/couchbase/bin/memcached() [0x400000+0x4ad164]","#5  /opt/couchbase/bin/memcached() [0x400000+0x3df318]","#6  /opt/couchbase/bin/memcached() [0x400000+0x2a569e]","#7  /opt/couchbase/bin/memcached() [0x400000+0x190303]","#8  /opt/couchbase/bin/memcached() [0x400000+0x22f4db]","#9  /opt/couchbase/bin/memcached() [0x400000+0x20159b]","#10 /opt/couchbase/bin/memcached() [0x400000+0x202940]","#11 /opt/couchbase/bin/memcached() [0x400000+0x1e4141]","#12 /opt/couchbase/bin/memcached() [0x400000+0x1e43d3]","#13 /opt/couchbase/bin/memcached() [0x400000+0x1e4a5b]","#14 /opt/couchbase/bin/../lib/libevent_core-2.1.so.7() [0x7fa0dba98000+0xed8e]","#15 /opt/couchbase/bin/../lib/libevent_core-2.1.so.7() [0x7fa0dba98000+0x17d01]","#16 /opt/couchbase/bin/../lib/libevent_core-2.1.so.7(event_base_loop+0x3bf) [0x7fa0dba98000+0x1854f]","#17 /opt/couchbase/bin/memcached() [0x400000+0x8085f7]","#18 /opt/couchbase/bin/memcached() [0x400000+0x808ac6]","#19 /opt/couchbase/bin/memcached() [0x400000+0x80a486]","#20 /opt/couchbase/bin/memcached() [0x400000+0x1b0b59]","#21 /opt/couchbase/bin/memcached() [0x400000+0x78d7e9]","#22 /lib64/libpthread.so.0() [0x7fa0dc7dd000+0x7ea5]","#23 /lib64/libc.so.6(clone+0x6d) [0x7fa0da027000+0xfe8dd]"]
            2022-05-16T22:04:09.825411-07:00 INFO 269: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.108.96->ns_1@172.23.108.98:default - (vb:767) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.23.108.96->ns_1@172.23.108.98:default", backfill:false, seqno:623
            2022-05-16T22:04:09.825445-07:00 INFO 269: (No Engine) DCP (Producer) eq_dcpq:replication:ns_1@172.23.108.96->ns_1@172.23.108.98:default - Removing connection [ {"ip":"172.23.108.98","port":44228} - {"ip":"172.23.108.96","port":11209} (System, <ud>@ns_server</ud> ]
            
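            The message "new value (565) breaks invariant on current value (565)" comes from a strictly-monotonic guard on the producer's per-replica queuedSeqnoAck counter: a DCP_SEQNO_ACKNOWLEDGED carrying a seqno equal to the currently stored value fails the "strictly greater" check, the check throws, and memcached closes the connection (as the log shows). The sketch below is illustrative only, not kv_engine's actual Monotonic template; class and function names are hypothetical. It shows how a strict (>) policy turns a repeated ack for seqno 565 into an exception, whereas a weakly-monotonic (>=) policy would simply accept the duplicate ack.

            // Illustrative sketch only - not kv_engine's actual Monotonic template.
            #include <cstdint>
            #include <cstdio>
            #include <sstream>
            #include <stdexcept>
            #include <string>
            #include <utility>

            class StrictlyIncreasingSeqno {
            public:
                explicit StrictlyIncreasingSeqno(std::string label)
                    : label(std::move(label)) {}

                // Reject any value that is not strictly greater than the current
                // one; this mirrors the failure mode reported in the log, where
                // new == current (565 == 565) is treated as an invariant break.
                void acknowledge(uint64_t newValue) {
                    if (newValue <= current) {
                        std::ostringstream msg;
                        msg << "(" << label << ") invariant failed: new value ("
                            << newValue << ") breaks invariant on current value ("
                            << current << ")";
                        throw std::logic_error(msg.str());
                    }
                    current = newValue;
                }

                uint64_t value() const { return current; }

            private:
                std::string label;
                uint64_t current = 0;
            };

            int main() {
                StrictlyIncreasingSeqno queuedSeqnoAck(
                    "queuedSeqnoAck: ns_1@172.23.108.98");
                queuedSeqnoAck.acknowledge(565); // first ack for seqno 565: accepted
                try {
                    // A duplicate ack for the same seqno violates the strict policy.
                    queuedSeqnoAck.acknowledge(565);
                } catch (const std::logic_error& e) {
                    // In the real system memcached logs the exception and closes
                    // the DCP connection, which then cascades into the janitor failure.
                    std::fprintf(stderr, "%s\n", e.what());
                }
                return 0;
            }

            Why the consumer ends up re-sending an ack for an already-acknowledged seqno in this delta-failover scenario is the open question handed over for further analysis below.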


            navdeep.boparai Navdeep Boparai added a comment -

            Assigning further for analysis on the memcached failure that caused the DCP stream to close out.


            Balakumaran.Gopal Balakumaran Gopal added a comment -

            We see around 50+ tests failing with this error on the weekly run we had on 7.1.1-3070.

            owend Daniel Owen added a comment -

            Hi Ashwin Govindarajulu, this is marked as a regression and affects 7.1.1-3056. Do you have the last build it works with, so we can track down when the bug was introduced?

            many thanks


            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Hi Dan,

            The test works fine up to 7.1.1-3045 but fails on build 3050.

            owend Daniel Owen added a comment -

            Many thanks for identifying the last working build - 3045.

            We believe it is due to patches that were submitted to address MB-51606.
            The set of patches went in between builds 3046 and 3064.

            The patches are currently in the process of being reverted - see https://review.couchbase.org/c/kv_engine/+/175268

            Once the above patch is in, we should retest.

            owend Daniel Owen added a comment -

            Hey Ashwin Govindarajulu, the revert patches are now in. Could you please retest to check whether the issue has been addressed?

            many thanks


            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Hello Daniel Owen, we are not seeing this issue on last week's regression run on 7.1.1-3166.

            Ran the targeted test individually to validate the same.

            drigby Dave Rigby added a comment -

            Ashwin Govindarajulu, can this issue be resolved now?


            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Yes Dave Rigby, we can mark this as resolved.


            People

              Assignee: ashwin.govindarajulu Ashwin Govindarajulu
              Reporter: ashwin.govindarajulu Ashwin Govindarajulu
              Votes: 0
              Watchers: 7
