  Couchbase Server
  MB-52170

queuedSeqnoAck: invariant failed


Details

    • Triage: Untriaged
    • Operating System: Centos 64-bit
    • 1
    • Is this a Regression?: Yes
    • Sprint: KV June 2022

    Description

       

      Build:  7.1.1-3056

      Scenario:

      • 3 kv node cluster
      • Couchbase bucket with replicas=2
      • Delta failover a node (172.23.108.97) and rebalance the node out

      CB failure log:

      Rebalance exited with reason {{badmatch, {error, failed_nodes,['ns_1@172.23.108.98']}}},
      [{ns_janitor,cleanup_apply_config_body,4,[{file,"src/ns_janitor.erl"},{line,295}]},
      {ns_janitor,'-cleanup_apply_config/4-fun-0-',4,[{file,"src/ns_janitor.erl"},{line,215}]},
      {async,'-async_init/4-fun-1-',3, [{file,"src/async.erl"},{line,191}]}]}.
      Rebalance Operation Id = 4d82719bd30c809e7ebc3b32e1f55882

      TAF test:

       

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.77890.ini GROUP=P0;durability,upgrade_version=7.1.1-3056 -t failover.failovertests.FailoverTests.test_failover_then_add_back_and_rebalance_in,nodes_init=3,replicas=2,num_failed_nodes=1,num_items=300000,recoveryType=delta:full,graceful=True,durability=MAJORITY,GROUP=P0;durability;windows;luks'

       

       

      Attachments

        Issue Links


          Activity

            ashwin.govindarajulu Ashwin Govindarajulu created issue -
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Summary changed from: Failover: Rebalance out of failed node failed with during "cleanup_apply_config_body"
            to: Failover: Rebalance out of failed node failed during "cleanup_apply_config_body"
            Abhijeeth.Nuthan Abhijeeth Nuthan made changes -
            Assignee: Abhijeeth Nuthan → Navdeep Boparai
            navdeep.boparai Navdeep Boparai added a comment - - edited

            The rebalance fails when the janitor running on the orchestrator (ns_1@172.23.108.97) fails on the "apply_new_config_replicas_phase" call to ns_1@172.23.108.98, as shown below:

            ns_1@172.23.108.97 log:

            [ns_server:info,2022-05-16T22:04:11.447-07:00,ns_1@172.23.108.97:<0.2997.2>:janitor_agent:call_on_servers:289]default:Some janitor state change requests (apply_new_config_replicas_phase) have failed:
            [{'ns_1@172.23.108.98',
                 {'EXIT',
                     {{{noproc,
                           {gen_server,call,
                               [undefined,
                                {setup_replication......}
            

            Going to ns_1@172.23.108.98 to see why the call failed, we can see that "apply_new_config_replicas_phase" kicks off the replication manager on ns_1@172.23.108.98, which starts a DCP stream from ns_1@172.23.108.96 to ns_1@172.23.108.98:

            [ns_server:debug,2022-05-16T22:04:09.558-07:00,ns_1@172.23.108.98:dcp_replication_manager-default<0.31265.0>:dcp_sup:start_replicator:48]Starting DCP replication from 'ns_1@172.23.108.96' for bucket "default"...
            [ns_server:debug,2022-05-16T22:04:09.562-07:00,ns_1@172.23.108.98:<0.31464.0>:dcp_commands:open_connection:72]Open consumer connection "replication:ns_1@172.23.108.96->ns_1@172.23.108.98:default" on socket #Port<0.1656>: Body <<"{\"consumer_name\":\"ns_1@172.23.108.98\"}">>
            [ns_server:debug,2022-05-16T22:04:09.563-07:00,ns_1@172.23.108.98:dcp_replicator-default-ns_1@172.23.108.96<0.31463.0>:dcp_replicator:init:53]Opened connection to local memcached <0.31464.0>
            [ns_server:debug,2022-05-16T22:04:09.563-07:00,ns_1@172.23.108.98:dcp_replicator-default-<0.31466.0>:dcp_replicator:connect_to_producer:81]initiated new dcp replication with consumer side: <0.31464.0> and producer side: <0.31465.0>
            [ns_server:debug,2022-05-16T22:04:09.570-07:00,ns_1@172.23.108.98:<0.31464.0>:dcp_commands:add_stream:83]Add stream for partition 0, opaque = 0x0, type = add
            [ns_server:debug,2022-05-16T22:04:09.571-07:00,ns_1@172.23.108.98:<0.31464.0>:dcp_commands:add_stream:83]Add stream for partition 1, opaque = 0x1, type = add
            [ns_server:debug,2022-05-16T22:04:09.571-07:00,ns_1@172.23.108.98:<0.31464.0>:dcp_commands:add_stream:83]Add stream for partition 2, opaque = 0x2, type = add
            [ns_server:debug,2022-05-16T22:04:09.571-07:00,ns_1@172.23.108.98:<0.31464.0>:dcp_commands:add_stream:83]Add stream for partition 3, opaque = 0x3, type = add
            [ns_server:debug,2022-05-16T22:04:09.571-07:00,ns_1@172.23.108.98:<0.31464.0>:dcp_commands:add_stream:83]Add stream for partition 4, opaque = 0x4, type = add
            

            Further down the log we see a failure on the DCP stream because the socket for the connection is closed:

            =========================ERROR REPORT=========================
            ** Generic server <0.31465.0> terminating 
            ** Last message in was {tcp_closed,#Port<0.1657>}
            ** When Server state == {state,#Port<0.1657>,
                                        {producer,
                                            "replication:ns_1@172.23.108.96->ns_1@172.23.108.98:default",
                                            'ns_1@172.23.108.96',"default"},
                                        undefined,<<>>,dcp_producer_conn,[],#Port<0.1656>,
                                        <0.31464.0>,true}
            ** Reason for termination ==
            ** socket_closed
            

            The connection closing leads to the replication manager dying, which explains why the initial "apply_new_config_replicas_phase" call failed with noproc:

            [ns_server:debug,2022-05-16T22:04:11.436-07:00,ns_1@172.23.108.98:replication_manager-default<0.31266.0>:replication_manager:terminate:71]Replication manager died {{noproc,
                                          {gen_server,call,
                                              [undefined,
                                               {setup_replication,
            

            [error_logger:error,2022-05-16T22:04:11.441-07:00,ns_1@172.23.108.98:janitor_agent-default<0.31269.0>:ale_error_logger_handler:do_log:101]
            =========================ERROR REPORT=========================
            ** Generic server 'janitor_agent-default' terminating 
            ** Last message in was {apply_new_config_replicas_phase,
                                       [{map,
            

            Looking further into the cause of the connection closing, the memcached log on ns_1@172.23.108.96 for the relevant time period shows the failure that caused the connection to be closed:

            2022-05-16T22:04:09.825400-07:00 ERROR 269: Exception occurred during packet execution. Closing connection: Monotonic<l> (queuedSeqnoAck: ns_1@172.23.108.98) invariant failed: new value (565) breaks invariant on current value (565). Cookies: [{"aiostat":"success","connection":"[ {\"ip\":\"172.23.108.98\",\"port\":44228} - {\"ip\":\"172.23.108.96\",\"port\":11209} (System, <ud>@ns_server</ud>) ]","engine_storage":"0x0000000000000000","ewouldblock":false,"packet":{"bodylen":8,"cas":0,"datatype":"raw","extlen":8,"key":"<ud></ud>","keylen":0,"magic":"ClientRequest","opaque":5,"opcode":"DCP_SEQNO_ACKNOWLEDGED","vbucket":2},"refcount":1}] Exception thrown from: ["#0  /opt/couchbase/bin/memcached() [0x400000+0x13a726]","#1  /opt/couchbase/bin/memcached() [0x400000+0x3f21ed]","#2  /opt/couchbase/bin/memcached() [0x400000+0x3f6b9a]","#3  /opt/couchbase/bin/memcached() [0x400000+0x3f7052]","#4  /opt/couchbase/bin/memcached() [0x400000+0x4ad164]","#5  /opt/couchbase/bin/memcached() [0x400000+0x3df318]","#6  /opt/couchbase/bin/memcached() [0x400000+0x2a569e]","#7  /opt/couchbase/bin/memcached() [0x400000+0x190303]","#8  /opt/couchbase/bin/memcached() [0x400000+0x22f4db]","#9  /opt/couchbase/bin/memcached() [0x400000+0x20159b]","#10 /opt/couchbase/bin/memcached() [0x400000+0x202940]","#11 /opt/couchbase/bin/memcached() [0x400000+0x1e4141]","#12 /opt/couchbase/bin/memcached() [0x400000+0x1e43d3]","#13 /opt/couchbase/bin/memcached() [0x400000+0x1e4a5b]","#14 /opt/couchbase/bin/../lib/libevent_core-2.1.so.7() [0x7fa0dba98000+0xed8e]","#15 /opt/couchbase/bin/../lib/libevent_core-2.1.so.7() [0x7fa0dba98000+0x17d01]","#16 /opt/couchbase/bin/../lib/libevent_core-2.1.so.7(event_base_loop+0x3bf) [0x7fa0dba98000+0x1854f]","#17 /opt/couchbase/bin/memcached() [0x400000+0x8085f7]","#18 /opt/couchbase/bin/memcached() [0x400000+0x808ac6]","#19 /opt/couchbase/bin/memcached() [0x400000+0x80a486]","#20 /opt/couchbase/bin/memcached() [0x400000+0x1b0b59]","#21 /opt/couchbase/bin/memcached() [0x400000+0x78d7e9]","#22 /lib64/libpthread.so.0() [0x7fa0dc7dd000+0x7ea5]","#23 /lib64/libc.so.6(clone+0x6d) [0x7fa0da027000+0xfe8dd]"]
            2022-05-16T22:04:09.825411-07:00 INFO 269: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.108.96->ns_1@172.23.108.98:default - (vb:767) ActiveStream::registerCursor name "eq_dcpq:replication:ns_1@172.23.108.96->ns_1@172.23.108.98:default", backfill:false, seqno:623
            2022-05-16T22:04:09.825445-07:00 INFO 269: (No Engine) DCP (Producer) eq_dcpq:replication:ns_1@172.23.108.96->ns_1@172.23.108.98:default - Removing connection [ {"ip":"172.23.108.98","port":44228} - {"ip":"172.23.108.96","port":11209} (System, <ud>@ns_server</ud> ]
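
            For context on what this invariant means: the "Monotonic<l> (queuedSeqnoAck: ...)" label indicates the producer tracks the consumer's acknowledged seqno in a strictly monotonic counter, so each new DCP_SEQNO_ACKNOWLEDGED value must be strictly greater than the current one. Acknowledging 565 while the current value is already 565 therefore throws, and memcached closes the connection. A minimal sketch of this kind of guard (an illustration only, not the actual kv_engine Monotonic<> implementation) is:

            #include <cstdint>
            #include <iostream>
            #include <sstream>
            #include <stdexcept>
            #include <string>
            #include <utility>

            // Sketch of a strictly monotonic counter, loosely modeled on the
            // Monotonic<> wrapper named in the log above (hypothetical code,
            // not kv_engine's). The label mirrors the "queuedSeqnoAck: <node>"
            // tag seen in the error message.
            class StrictMonotonicSeqno {
            public:
                explicit StrictMonotonicSeqno(std::string label, int64_t initial = 0)
                    : label(std::move(label)), value(initial) {
                }

                // A new ack must be strictly greater than the current value,
                // otherwise throw; memcached surfaces such exceptions as
                // "Exception occurred during packet execution. Closing connection".
                void ack(int64_t newValue) {
                    if (newValue <= value) {
                        std::ostringstream msg;
                        msg << "Monotonic (" << label << ") invariant failed: new value ("
                            << newValue << ") breaks invariant on current value ("
                            << value << ")";
                        throw std::logic_error(msg.str());
                    }
                    value = newValue;
                }

            private:
                std::string label;
                int64_t value;
            };

            int main() {
                StrictMonotonicSeqno queuedSeqnoAck("queuedSeqnoAck: ns_1@172.23.108.98");
                queuedSeqnoAck.ack(565);     // first DCP_SEQNO_ACKNOWLEDGED for seqno 565: accepted
                try {
                    queuedSeqnoAck.ack(565); // duplicate ack of seqno 565: 565 > 565 is false
                } catch (const std::logic_error& e) {
                    std::cerr << e.what() << '\n'; // matches the message format in the log
                }
            }

            If this reading is right, the trigger is the consumer acknowledging a seqno the producer had already recorded; a weakly monotonic check (>=) would have tolerated the duplicate ack.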
            


            navdeep.boparai Navdeep Boparai added a comment -

            Assigning further for analysis on the memcached failure that caused the DCP stream to close out.
            navdeep.boparai Navdeep Boparai made changes -
            Assignee: Navdeep Boparai → Daniel Owen
            Balakumaran.Gopal Balakumaran Gopal made changes -
            Priority: Major → Critical

            Balakumaran.Gopal Balakumaran Gopal added a comment -

            Seeing around 50+ tests failing with this error on the weekly run we had on 7.1.1-3070.
            owend Daniel Owen made changes -
            Component/s: ns_server → couchbase-bucket
            owend Daniel Owen made changes -
            Summary changed from: Failover: Rebalance out of failed node failed during "cleanup_apply_config_body"
            to: queuedSeqnoAck: invariant failed
            owend Daniel Owen added a comment -

            Hi Ashwin Govindarajulu, This is marked as a regression and affects 7.1.1-3056. Do you have the last build it works with, so we can track down when the bug was introduced?

            many thanks

            owend Daniel Owen made changes -
            Assignee: Daniel Owen → Ashwin Govindarajulu
            owend Daniel Owen made changes -
            Link: This issue is duplicated by MB-52171
            owend Daniel Owen made changes -
            Link: This issue is duplicated by MB-52171
            owend Daniel Owen made changes -
            Link: This issue relates to MB-52171
            owend Daniel Owen made changes -
            Link: This issue relates to MB-52171
            owend Daniel Owen made changes -
            Link: This issue is duplicated by MB-52171

            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Hi Dan,

            The test works fine up to 7.1.1-3045 but fails from build 3050.
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Assignee: Ashwin Govindarajulu → Daniel Owen
            owend Daniel Owen added a comment -

            Many thanks for identifying the last working build, 3045.

            We believe it is due to patches that were submitted to address MB-51606.
            The set of patches went in between builds 3046 and 3064.

            The patches are currently in the process of being reverted - see https://review.couchbase.org/c/kv_engine/+/175268

            Once the above patch is in, we should retest.

            owend Daniel Owen made changes -
            Link: This issue relates to MB-51606
            owend Daniel Owen made changes -
            Link: This issue relates to MB-52169
            lynn.straus Lynn Straus made changes -
            Fix Version/s: 7.1.1
            Fix Version/s: 7.1.2
            owend Daniel Owen added a comment -

            Hey Ashwin Govindarajulu, the revert patches are now in. Could you please retest to check that the issue has been addressed?

            many thanks

            owend Daniel Owen made changes -
            Assignee: Daniel Owen → Ashwin Govindarajulu

            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Hello Daniel Owen, Not seeing this issue in last week's regression run on 7.1.1-3166.

            Ran the targeted test individually to validate the same.
            drigby Dave Rigby added a comment -

            Ashwin Govindarajulu Can this issue be resolved now?

            drigby Dave Rigby made changes -
            Sprint: KV June 2022
            drigby Dave Rigby made changes -
            Rank: Ranked lower

            ashwin.govindarajulu Ashwin Govindarajulu added a comment -

            Yes Dave Rigby, we can mark this as resolved.
            owend Daniel Owen made changes -
            Resolution: Fixed
            Status: Open → Resolved
            ashwin.govindarajulu Ashwin Govindarajulu made changes -
            Status: Resolved → Closed

            People

              Assignee: Ashwin Govindarajulu
              Reporter: Ashwin Govindarajulu
              Votes: 0
              Watchers: 7

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes
