Couchbase Server / MB-36338

Expelling erroneously allows deduping of SyncWrites in checkpoint


Details

    • Untriaged
    • Unknown
    • KV-Engine Mad-Hatter GA

    Description

      Build : 6.5.0-4471
      Test : -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml
      Scale : 3
      Iteration : 1st

      Steps:

      [2019-10-04T09:44:15-07:00, sequoiatools/couchbase-cli:6.5:6a4039] server-add -c 172.23.108.103:8091 --server-add https://172.23.106.100 -u Administrator -p password --server-add-username Administrator --server-add-password password --services data
      [2019-10-04T09:44:38-07:00, sequoiatools/couchbase-cli:6.5:b4a767] rebalance -c 172.23.108.103:8091 -u Administrator -p password
       
      Error occurred on container - sequoiatools/couchbase-cli:6.5:[rebalance -c 172.23.108.103:8091 -u Administrator -p password]
       
      docker logs b4a767
      docker start b4a767
       
      *Unable to display progress bar on this os
      ERROR: Rebalance failed. See logs for detailed reason. You can try again.
      [2019-10-04T09:55:07-07:00, sequoiatools/cmd:5a4414] 60
      [2019-10-04T09:56:16-07:00, sequoiatools/cmd:348ec6] 300
      

      Error (from diag.log on 172.23.108.103):

      2019-10-04T09:54:52.724-07:00, compaction_daemon:0:warning:message(ns_1@172.23.97.119) - Compactor for view `mapreduce_view/default/_design/scale/replica` (pid [{type,
                                                                               view},
                                                                              {important,
                                                                               true},
                                                                              {name,
                                                                               <<"mapreduce_view/default/_design/scale/replica">>},
                                                                              {fa,
                                                                               {#Fun<compaction_daemon.23.62703019>,
                                                                                [<<"default">>,
                                                                                 <<"_design/scale">>,
                                                                                 mapreduce_view,
                                                                                 replica,
                                                                                 {config,
                                                                                  {30,
                                                                                   undefined},
                                                                                  {30,
                                                                                   undefined},
                                                                                  undefined,
                                                                                  false,
                                                                                  false,
                                                                                  {daemon_config,
                                                                                   30,
                                                                                   131072,
                                                                                   20971520}},
                                                                                 true,
                                                                                 {[{type,
                                                                                    bucket}]}]}}]) terminated unexpectedly: {updater_died,
                                                                                                                             {updater_error,
                                                                                                                              timeout}}
      2019-10-04T09:54:52.930-07:00, ns_rebalancer:2:info:message(ns_1@172.23.108.103) - Bad replicators after rebalance:
      Missing = [{'ns_1@172.23.108.103','ns_1@172.23.97.119',113},
                 {'ns_1@172.23.108.103','ns_1@172.23.97.119',114},
                 {'ns_1@172.23.108.103','ns_1@172.23.97.119',115},
                 {'ns_1@172.23.108.103','ns_1@172.23.97.119',116},
                 {'ns_1@172.23.108.103','ns_1@172.23.97.119',117},
                 {'ns_1@172.23.108.103','ns_1@172.23.97.119',118},
                 {'ns_1@172.23.108.103','ns_1@172.23.97.119',119},
                 {'ns_1@172.23.108.103','ns_1@172.23.97.119',120},
                 {'ns_1@172.23.108.103','ns_1@172.23.97.119',121}]
      Extras = []
      2019-10-04T09:54:52.933-07:00, ns_orchestrator:0:critical:message(ns_1@172.23.108.103) - Rebalance exited with reason bad_replicas.
      Rebalance Operation Id = 0b67a052f94895e5418ff35de8a74354
      -------------------------------
       
       
      per_node_processes('ns_1@172.23.108.103') =
           {<0.32074.207>,
            [{backtrace,
                 [<<"Program counter: 0x00007f550eb437c8 (gen_server:loop/7 + 288)">>,
                  <<"CP: 0x0000000000000000 (invalid)">>,<<"arity = 0">>,<<>>,
                  <<"0x00007f538ccd6128 Return addr 0x00007f55553e6ff0 (proc_lib:init_p_do_apply/3 + 72)">>,
                  <<"y(0)     []">>,<<"y(1)     infinity">>,
                  <<"y(2)     dcp_proxy">>,
                  <<"(3)     {state,#Port<0.236464>,{producer,\"replication:ns_1@172.23.106.100->ns_1@172.23.108.103:NEW_ORDER\",'ns_1@172.23.1">>,
                  <<"y(4)     <0.32074.207>">>,<<"y(5)     <0.7577.208>">>,<<>>,
                  <<"0x00007f538ccd6160 Return addr 0x0000000000942608 (<terminate process normally>)">>,
                  <<"y(0)     []">>,
                  <<"y(1)     Catch 0x00007f55553e7010 (proc_lib:init_p_do_apply/3 + 104)">>,
                  <<>>]},
             {messages,[]},
             {dictionary,
                 [{'$initial_call',{dcp_proxy,init,1}},
                  {'$ancestors',
                      ['dcp_replicator-NEW_ORDER-ns_1@172.23.106.100',
                       'dcp_sup-NEW_ORDER','single_bucket_kv_sup-NEW_ORDER',
                       ns_bucket_sup,ns_bucket_worker_sup,ns_server_sup,
                       ns_server_nodes_sup,<0.205.0>,ns_server_cluster_sup,root_sup,
                       <0.117.0>]}]},
             {registered_name,[]},
             {status,waiting},
             {initial_call,{proc_lib,init_p,5}},
             {error_handler,error_handler},
             {garbage_collection,
                 [{max_heap_size,#{error_logger => true,kill => true,size => 0}},
                  {min_bin_vheap_size,46422},
                  {min_heap_size,233},
                  {fullsweep_after,512},
                  {minor_gcs,98}]},
             {garbage_collection_info,
                 [{old_heap_block_size,6772},
                  {heap_block_size,1598},
                  {mbuf_size,0},
                  {recent_size,24},
                  {stack_size,10},
                  {old_heap_size,2339},
                  {heap_size,445},
      
      

      Logs:

      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.104.156.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.104.157.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.104.164.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.104.61.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.104.67.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.104.69.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.104.70.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.104.87.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.106.100.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.106.188.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.108.103.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.96.148.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.96.251.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.96.252.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.96.253.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.96.56.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.96.95.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.97.119.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.97.121.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.97.122.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.97.239.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.97.242.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.98.135.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.99.20.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.99.21.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1570208407/collectinfo-2019-10-04T170009-ns_1%40172.23.99.25.zip

          Activity

            Aliaksey Artamonau added a comment:

            Memcached repeatedly drops the replication connection on us on node 172.23.97.119:

            [ns_server:error,2019-10-04T09:49:44.542-07:00,ns_1@172.23.97.119:<0.31735.117>:dcp_proxy:handle_info:117]Socket #Port<0.216600> was closed. Closing myself. State = {state,
                                                                        #Port<0.216600>,
                                                                        {consumer,
                                                                         "replication:ns_1@172.23.108.103->ns_1@172.23.97.119:default",
                                                                         'ns_1@172.23.97.119',
                                                                         "default"},
                                                                        undefined,<<>>,
                                                                        dcp_consumer_conn,
                                                                        {state,idle,
                                                                         "789qrstuvwxy"},
                                                                        #Port<0.216601>,
                                                                        <0.31738.117>,
                                                                        true}
            [error_logger:error,2019-10-04T09:49:44.542-07:00,ns_1@172.23.97.119:error_logger<0.32.0>:ale_error_logger_handler:do_log:203]** Generic server <0.31735.117> terminating 
            ** Last message in was {tcp_closed,#Port<0.216600>}
            ** When Server state == {state,#Port<0.216600>,
                                        {consumer,
                                            "replication:ns_1@172.23.108.103->ns_1@172.23.97.119:default",
                                            'ns_1@172.23.97.119',"default"},
                                        undefined,<<>>,dcp_consumer_conn,
                                        {state,idle,"789qrstuvwxy"},
                                        #Port<0.216601>,<0.31738.117>,true}
            ** Reason for termination == 
            ** socket_closed
            

            There are lots of errors of the following kind in the memcached log:

            2019-10-04T09:49:44.383606-07:00 ERROR 427: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:48279 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007ff0d557e310","ewouldblock":false,"packet":{"bodylen":41,"cas":1570207457743667200,"datatype":"raw","extlen":21,"key":"<ud>.9A95B082-102_634200</ud>","keylen":20,"magic":"ClientRequest","opaque":7,"opcode":"DCP_DELETION","vbucket":114},"refcount":1}] - closing connection ([ 127.0.0.1:48279 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): PassiveDurabilityMonitor::completeSyncWrite vb:114 No tracked, but received commit for key <ud>cid:0x0:9A95B082-102_634200</ud>
            
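            For context, a hypothetical illustration of the replica-side invariant that error reports (PassiveDM, trackedPrepares and this completeSyncWrite() signature are simplified assumptions, not the actual PassiveDurabilityMonitor interface): completing a SyncWrite for a key that has no tracked prepare is rejected, and the connection is torn down.

            #include <stdexcept>
            #include <string>
            #include <unordered_set>

            // Simplified stand-in for the replica-side durability bookkeeping; the
            // real PassiveDurabilityMonitor interface differs.
            struct PassiveDM {
                std::unordered_set<std::string> trackedPrepares;

                // Called when a commit (or a deletion completing a SyncWrite)
                // arrives over DCP for the given key.
                void completeSyncWrite(const std::string& key) {
                    if (trackedPrepares.erase(key) == 0) {
                        // Corresponds to "No tracked, but received commit for key ..."
                        throw std::logic_error(
                                "completeSyncWrite: no tracked SyncWrite for key " + key);
                    }
                }
            };

            int main() {
                PassiveDM dm;
                dm.trackedPrepares.insert("9A95B082-102_634200");
                dm.completeSyncWrite("9A95B082-102_634200"); // ok: a prepare was tracked
                try {
                    dm.completeSyncWrite("9A95B082-102_634200"); // nothing tracked now
                } catch (const std::logic_error&) {
                    // memcached closes the connection at this point
                }
                return 0;
            }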


            James Harrison added a comment:

            Caused by checkpoint expelling.
            As of http://review.couchbase.org/#/c/115757/ (made to fix another issue), entries in the checkpoint keyIndex are invalidated rather than removed, and are tagged with whether the item was a SyncWrite, so that we can avoid deduping SyncWrites after the queued_item has been freed. However, expelling splices items out from the start of the list to just before the earliest cursor, then swaps the value of the dummy item with the value of the item the cursor points to.
            This was (as I understand it) done so that this last item can also be expelled without invalidating the cursor iterators, while still ensuring the checkpoint begins with a dummy item, as required.

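            A minimal sketch of the splice-and-swap mechanics described above, assuming a plain std::list<queued_item> and a simplified Item type; this is illustrative only, not the actual Checkpoint/CheckpointManager code.

            #include <cassert>
            #include <iterator>
            #include <list>
            #include <memory>
            #include <string>
            #include <utility>

            struct Item {
                std::string key;
                bool isSyncWrite = false;
            };
            using queued_item = std::shared_ptr<Item>;

            int main() {
                // A checkpoint always begins with a dummy item.
                std::list<queued_item> toWrite;
                toWrite.push_back(std::make_shared<Item>(Item{"dummy", false}));
                toWrite.push_back(std::make_shared<Item>(Item{"a", false}));
                toWrite.push_back(std::make_shared<Item>(Item{"b", true})); // a SyncWrite

                // The earliest cursor points at the last item ("b").
                auto cursor = std::prev(toWrite.end());

                // Expel: splice everything from the start of the list up to (but not
                // including) the cursor into a side list, then swap the value held by
                // the (now expelled) dummy node with the value held by the cursor
                // node, so the checkpoint still starts with a dummy and the cursor
                // iterator stays valid.
                std::list<queued_item> expelled;
                expelled.splice(expelled.begin(), toWrite, toWrite.begin(), cursor);
                std::swap(expelled.front(), *cursor);

                // The node the cursor points at is still in toWrite, but it now holds
                // the dummy value; the SyncWrite it used to hold has moved into the
                // expelled list.
                assert((*cursor)->key == "dummy");
                assert(!(*cursor)->isSyncWrite);
                assert(expelled.front()->key == "b" && expelled.front()->isSyncWrite);
                return 0;
            }
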
            When invalidating a keyIndex entry, the entry's iterator is dereferenced, expecting to find the correct item (now residing in the list the items were spliced out into; splicing does not invalidate iterators) in order to check whether the contained value is a SyncWrite. In the failing case seen in this ticket, the last item to be expelled is a SyncWrite. As noted above, that element of the list was not spliced out, but instead had its value swapped with that of the dummy item (which was spliced out). The dummy value is decidedly not a SyncWrite.

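            In the same simplified terms (again a sketch; the IndexEntry type, its fields and the invalidate() helper are assumptions, not the real kv_engine code), the invalidation path dereferences the stored iterator to record whether the entry referred to a SyncWrite, and for the one node whose value was swapped with the dummy it records the wrong answer.

            #include <cassert>
            #include <iterator>
            #include <list>
            #include <memory>
            #include <string>
            #include <unordered_map>
            #include <utility>

            struct Item {
                std::string key;
                bool isSyncWrite = false;
            };
            using queued_item = std::shared_ptr<Item>;
            using CheckpointQueue = std::list<queued_item>;

            // keyIndex entry: an iterator into the checkpoint queue, plus the flags
            // recorded when the entry is invalidated during expelling.
            struct IndexEntry {
                CheckpointQueue::iterator position;
                bool invalidated = false;
                bool wasSyncWrite = false;
            };

            // The invalidation described above: dereference the iterator and
            // remember whether the queued value was a SyncWrite.
            void invalidate(IndexEntry& entry) {
                entry.wasSyncWrite = (*entry.position)->isSyncWrite;
                entry.invalidated = true;
            }

            int main() {
                CheckpointQueue toWrite;
                toWrite.push_back(std::make_shared<Item>(Item{"dummy", false}));
                toWrite.push_back(std::make_shared<Item>(Item{"k", true})); // a SyncWrite

                std::unordered_map<std::string, IndexEntry> keyIndex;
                auto cursor = std::prev(toWrite.end());
                keyIndex["k"] = IndexEntry{cursor};

                // Expel up to the cursor and swap the dummy value into the cursor
                // node, exactly as in the previous sketch.
                CheckpointQueue expelled;
                expelled.splice(expelled.begin(), toWrite, toWrite.begin(), cursor);
                std::swap(expelled.front(), *cursor);

                // Invalidate the keyIndex entry for the expelled item. The entry for
                // "k" still points at a valid node, but that node now holds the
                // dummy, so the SyncWrite flag is lost.
                invalidate(keyIndex["k"]);
                assert(keyIndex["k"].invalidated);
                assert(!keyIndex["k"].wasSyncWrite); // wrong: "k" really was a SyncWrite
                return 0;
            }
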
            The checkpoint then later allows a duplicate SyncWrite to be (erroneously) queued, because the keyIndex indicates the existing item was not a SyncWrite. Eventually this is streamed to a replica which has not performed expelling, and which therefore correctly identifies that the second SyncWrite should not appear in the same checkpoint.

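            And a hypothetical sketch of the queueing decision that then goes wrong (QueueDecision, decide() and existingWasSyncWrite are made-up names, not the real CheckpointManager::queueDirty interface): an entry that was a SyncWrite should force the new mutation into a new checkpoint rather than allow dedup, but the incorrectly recorded flag lets the duplicate into the same checkpoint.

            #include <cassert>

            // Hypothetical decision outcomes; not the real kv_engine API.
            enum class QueueDecision {
                QueueNewItem,       // key not yet present in this checkpoint
                Dedup,              // the new item supersedes the existing entry
                ForceNewCheckpoint  // existing entry was a SyncWrite: dedup not allowed
            };

            QueueDecision decide(bool keyInIndex, bool existingWasSyncWrite) {
                if (!keyInIndex) {
                    return QueueDecision::QueueNewItem;
                }
                if (existingWasSyncWrite) {
                    // A SyncWrite for this key already exists (or existed, before
                    // being expelled) in this checkpoint; it must not be
                    // deduplicated, so the new mutation goes into a new checkpoint.
                    return QueueDecision::ForceNewCheckpoint;
                }
                return QueueDecision::Dedup;
            }

            int main() {
                // With the flag recorded correctly, a second SyncWrite for the key
                // is pushed into a new checkpoint.
                assert(decide(true, true) == QueueDecision::ForceNewCheckpoint);
                // With the flag lost by the expel/swap above, the duplicate is
                // erroneously allowed into the same checkpoint.
                assert(decide(true, false) == QueueDecision::Dedup);
                return 0;
            }
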
            Fix in progress.


            Couchbase Build Team added a comment:

            Build couchbase-server-6.5.0-4625 contains kv_engine commit aa36b57 with commit message:
            MB-36338: Ensure keyIndex entries are invalidated correctly

            Couchbase Build Team added a comment:

            Build couchbase-server-7.0.0-1017 contains kv_engine commit aa36b57 with commit message:
            MB-36338: Ensure keyIndex entries are invalidated correctly

            Couchbase Build Team added a comment:

            Build couchbase-server-6.5.0-4644 contains kv_engine commit 9004066 with commit message:
            MB-36338: deinline Item::isAnySyncWriteOp()

            Couchbase Build Team added a comment:

            Build couchbase-server-7.0.0-1018 contains kv_engine commit 9004066 with commit message:
            MB-36338: deinline Item::isAnySyncWriteOp()

            People

              Assignee: Girish Benakappa
              Reporter: Girish Benakappa