Couchbase Server
MB-40480

Non-complete, unpersisted, "deleted" prepare can be removed from HashTable by the persistence of previous abort


    Details

      Description

      Build: 6.6.0-7880-enterprise

      Scenario:

      • 4-node cluster, Couchbase bucket (replicas=2)
      • Rebalance out 1 node from the cluster
      • Initiate a transaction in parallel with the rebalance-out operation

        +----------------+-----------------+--------------+
        | Nodes          | Services        | Status       |
        +----------------+-----------------+--------------+
        | 172.23.107.52  | index, kv, n1ql | Cluster node |
        | 172.23.123.101 | kv              | --- OUT ---> |
        | 172.23.123.102 | kv              | Cluster node |
        | 172.23.123.100 | kv              | Cluster node |
        +----------------+-----------------+--------------+

      Observation:

      Seeing a rebalance failure followed by a memcached crash on the master node (172.23.107.52).

      Service 'memcached' exited with status 134. Restarting. Messages:
      2020-07-14T23:30:54.406403-07:00 CRITICAL /opt/couchbase/bin/../lib/libstdc++.so.6() [0x7f2e4bcfd000+0x8f213]
      2020-07-14T23:30:54.406414-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f2e466a5000+0xccc10]
      2020-07-14T23:30:54.406426-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f2e466a5000+0xc805a]
      2020-07-14T23:30:54.406434-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f2e466a5000+0xca463]
      2020-07-14T23:30:54.406441-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f2e466a5000+0x18f5a0]
      2020-07-14T23:30:54.406447-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f2e466a5000+0xcf98d]
      2020-07-14T23:30:54.406454-07:00 CRITICAL /opt/couchbase/bin/../lib/../lib/ep.so() [0x7f2e466a5000+0x12b864]
      2020-07-14T23:30:54.406459-07:00 CRITICAL /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7f2e4ddac000+0x8f17]
      2020-07-14T23:30:54.406467-07:00 CRITICAL /lib64/libpthread.so.0() [0x7f2e4b5c8000+0x7dd5]
      2020-07-14T23:30:54.406499-07:00 CRITICAL /lib64/libc.so.6(clone+0x6d) [0x7f2e4b1fb000+0xfdead]
       
      Rebalance exited with reason {mover_crashed,
      {unexpected_exit,
      {'EXIT',<0.6670.0>,
      {{{{{child_interrupted,
      {'EXIT',<17502.2478.0>,socket_closed}},
      [{dcp_replicator,spawn_and_wait,1,
      [{file,"src/dcp_replicator.erl"}, {line,266}]},
      {dcp_replicator,handle_call,3,
      [{file,"src/dcp_replicator.erl"}, {line,121}]},
      {gen_server,try_handle_call,4,
      [{file,"gen_server.erl"},{line,636}]},
      {gen_server,handle_msg,6,
      [{file,"gen_server.erl"},{line,665}]},
      {proc_lib,init_p_do_apply,3,
      [{file,"proc_lib.erl"},{line,247}]}]},
      {gen_server,call,
      [<17502.2476.0>,get_partitions,infinity]}},
      {gen_server,call,
      ['dcp_replication_manager-default',
      {get_replicator_pid,543}, infinity]}},
      {gen_server,call,
      [{'janitor_agent-default',
      'ns_1@172.23.123.102'},
      {if_rebalance,<0.3620.0>,
      {update_vbucket_state,979,active,paused, undefined,
      [['ns_1@172.23.123.102', 'ns_1@172.23.123.101', 'ns_1@172.23.107.52']]}},
      infinity]}}}}}.
      Rebalance Operation Id = 322d92a2335598e144eb0bb97f14f1a3 
       
      Worker <0.6325.0> (for action {move,{979,
      ['ns_1@172.23.123.102',
      'ns_1@172.23.123.101',
      'ns_1@172.23.107.52'],
      ['ns_1@172.23.107.52',
      'ns_1@172.23.123.100',
      'ns_1@172.23.123.102'],
      []}}) exited with reason {unexpected_exit,
      {'EXIT', <0.6670.0>,
      {{{{{child_interrupted,
      {'EXIT', <17502.2478.0>, socket_closed}},
      [{dcp_replicator, spawn_and_wait, 1,
      [{file, "src/dcp_replicator.erl"}, {line, 266}]},
      {dcp_replicator, handle_call, 3,
      [{file, "src/dcp_replicator.erl"}, {line, 121}]},
      {gen_server, try_handle_call, 4,
      [{file, "gen_server.erl"}, {line, 636}]},
      {gen_server, handle_msg, 6,
      [{file, "gen_server.erl"}, {line, 665}]},
      {proc_lib, init_p_do_apply, 3,
      [{file, "proc_lib.erl"}, {line, 247}]}]},
      {gen_server, call,
      [<17502.2476.0>,
      get_partitions, infinity]}},
      {gen_server, call,
      ['dcp_replication_manager-default',
      {get_replicator_pid, 543}, infinity]}},
      {gen_server, call,
      [{'janitor_agent-default',
      'ns_1@172.23.123.102'},
      {if_rebalance, <0.3620.0>,
      {update_vbucket_state,
      979, active, paused, undefined,
      [['ns_1@172.23.123.102',
      'ns_1@172.23.123.101',
      'ns_1@172.23.107.52']]}},
      infinity]}}}}

      Test to run:

      Atomicity.doc_isolation.IsolationDocTest.test_transaction_with_rebalance,nodes_init=4,replicas=2,num_items=20000,rebalance_type=out,nodes_out=1,doc_op=create,durability=PERSIST_TO_MAJORITY,services_init=kv;n1ql;index,rerun=False
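
      For context, the failure window is the overlap between the rebalance-out and the durable transactional writes. The snippet below is a minimal, self-contained sketch of that overlap only; the helper functions are placeholders, not the actual TAF tasks, and merely illustrate the two operations racing while doc_op=create runs with durability=PERSIST_TO_MAJORITY.

      import time
      from concurrent.futures import ThreadPoolExecutor

      def rebalance_out(node):
          # Placeholder for "rebalance out 172.23.123.101"; the real test
          # drives this through ns_server / the cluster REST API.
          print("rebalance-out started for", node)
          time.sleep(2)
          print("rebalance-out finished")

      def run_transactional_creates(num_docs):
          # Placeholder for the transactional create load
          # (num_items=20000, durability=PERSIST_TO_MAJORITY in the test).
          for i in range(num_docs):
              pass  # real test: txn insert of "test_docs-%08d" % i
          print(num_docs, "transactional creates issued")

      with ThreadPoolExecutor(max_workers=2) as pool:
          futures = [pool.submit(rebalance_out, "172.23.123.101"),
                     pool.submit(run_transactional_creates, 20000)]
          for f in futures:
              f.result()  # surface any exception from either task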
      


          Activity

          Ashwin Govindarajulu created issue
          Ritam Sharma made changes - Description edited
          Daniel Owen added a comment -

          On 172.23.107.52 seeing:

          2020-07-14T23:30:47.535365-07:00 ERROR 92: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.52->ns_1@172.23.123.101:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":124,"opcode":"DCP_COMMIT","status":"Not found"}
          2020-07-14T23:30:49.918099-07:00 ERROR 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.107.52->ns_1@172.23.123.101:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":124,"opcode":"DCP_COMMIT","status":"Not found"}
          2020-07-14T23:30:54.336833-07:00 ERROR (default) VBucket::abort (vb:243) - active failed as no HashTableitem found with key:<ud>cid:0x0:test_docs-00007567</ud>
          
          

          Daniel Owen added a comment -

          On 172.23.123.100 seeing:

          2020-07-14T23:28:33.404296-07:00 ERROR 62: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.123.100->ns_1@172.23.123.102:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":150,"opcode":"DCP_COMMIT","status":"Not found"}
          2020-07-14T23:28:35.202402-07:00 ERROR 55: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.123.100->ns_1@172.23.123.102:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":22,"opcode":"DCP_COMMIT","status":"Not found"}
          

          Daniel Owen added a comment -

          On 172.23.123.101 seeing:

          2020-07-14T23:30:47.535475-07:00 ERROR (default) VBucket::commit (vb:164) failed as no HashTable item found with key:<ud>cid:0x0:test_docs-00020907</ud> prepare_seqno:29, commit_seqno: 30
          2020-07-14T23:30:49.888694-07:00 ERROR (default) VBucket::commit (vb:164) failed as no HashTable item found with key:<ud>cid:0x0:test_docs-00020907</ud> prepare_seqno:29, commit_seqno: 30
          

          Daniel Owen added a comment -

          On 172.23.123.102 seeing:

          2020-07-14T23:28:33.404193-07:00 ERROR (default) VBucket::commit (vb:489) failed as no HashTable item found with key:<ud>cid:0x0:test_docs-00006681</ud> prepare_seqno:3, commit_seqno: 4
          2020-07-14T23:28:35.167405-07:00 ERROR (default) VBucket::commit (vb:309) failed as no HashTable item found with key:<ud>cid:0x0:test_docs-00026690</ud> prepare_seqno:3, commit_seqno: 4
          2020-07-14T23:28:35.194229-07:00 ERROR 55: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:44227 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007fcc15e6c210","ewouldblock":false,"packet":{"bodylen":34,"cas":0,"datatype":"raw","extlen":16,"key":"<ud>test_docs-00006681</ud>","keylen":18,"magic":"ClientRequest","opaque":155,"opcode":"DCP_COMMIT","vbucket":489},"refcount":1}] - closing connection ([ 127.0.0.1:44227 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): PassiveDurabilityMonitor::completeSyncWrite vb:489 Pending resolution for 'SW @0x7fcc160b4e50 qi:[key:'<ud>cid:0x0:test_docs-00006681</ud>' seqno:3 reqs:{persist_to_majority, timeout=infinite}]  status:Pending', but received unexpected commit for key <ud>cid:0x0:test_docs-00006681</ud> different prepare seqno: 5
          2020-07-14T23:28:39.958943-07:00 ERROR 56: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:42668 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007fcc15e70310","ewouldblock":false,"packet":{"bodylen":39,"cas":1594794515272695808,"datatype":"raw","extlen":21,"key":"<ud>test_docs-00026690</ud>","keylen":18,"magic":"ClientRequest","opaque":24,"opcode":"DCP_DELETION","vbucket":309},"refcount":1}] - closing connection ([ 127.0.0.1:42668 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): PassiveDurabilityMonitor::completeSyncWrite vb:309 Pending resolution for 'SW @0x7fcc15eb9050 qi:[key:'<ud>cid:0x0:test_docs-00026690</ud>' seqno:3 reqs:{persist_to_majority, timeout=infinite}]  status:Pending', but received unexpected commit for key <ud>cid:0x0:test_docs-00026690</ud> different prepare seqno: 5
          2020-07-14T23:28:44.960641-07:00 ERROR 56: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:33903 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007fcc15e6d610","ewouldblock":false,"packet":{"bodylen":39,"cas":1594794515272695808,"datatype":"raw","extlen":21,"key":"<ud>test_docs-00026690</ud>","keylen":18,"magic":"ClientRequest","opaque":24,"opcode":"DCP_DELETION","vbucket":309},"refcount":1}] - closing connection ([ 127.0.0.1:33903 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): PassiveDurabilityMonitor::completeSyncWrite vb:309 Pending resolution for 'SW @0x7fcc15eb9050 qi:[key:'<ud>cid:0x0:test_docs-00026690</ud>' seqno:3 reqs:{persist_to_majority, timeout=infinite}]  status:Pending', but received unexpected commit for key <ud>cid:0x0:test_docs-00026690</ud> different prepare seqno: 5
          

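          For quick triage across the per-node logs, the error signatures quoted in the comments above can be pulled out with a small script like the sketch below. The glob path is an assumption about where the collected log bundles were extracted, and one pattern allows for the missing space in "HashTableitem".

          import glob
          import re

          SIGNATURES = [
              re.compile(r"no HashTable ?item found"),
              re.compile(r"DcpProducer::handleResponse disconnecting"),
              re.compile(r"PassiveDurabilityMonitor::completeSyncWrite"),
          ]

          # Assumed layout: logs/<node-ip>/memcached.log* from each collected bundle.
          for path in glob.glob("logs/*/memcached.log*"):
              with open(path, errors="replace") as f:
                  for line in f:
                      if any(sig.search(line) for sig in SIGNATURES):
                          print(path + ": " + line.rstrip())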
          Daniel Owen made changes - Assignee: Daniel Owen [owend] → Ben Huddleston [ben.huddleston]
          Daniel Owen made changes - Due Date: 20/Jul/20
          Ben Huddleston added a comment - edited

          Nothing stands out in the logs here as an obvious cause.

          I found that vBuckets 489 and 309 had errors on node 102. Both had some slightly odd-looking vBucket state changes before the errors, but they were deleted and recreated before things went wrong.

          On node 101, vBucket 164 had the same error without the state changes; the only interesting thing was that a stream was made, nothing was received, and then it was recreated before the crash.

          I refreshed my understanding of the Stream/VBucket durability code, but nothing is jumping out as the obvious cause for this (no HashTable item for commit). An interesting note is that these crashes look to be from purely in-memory streams (i.e. no disk backfills).

          Ashwin Govindarajulu added a comment -

          *Update*

          While trying to collect logs, noticed a TAF task-related issue.

          Fixed using http://review.couchbase.org/c/TAF/+/132654 and restarted the failed jobs again.

          Will collect logs if we see the same failure again.

          Ashwin Govindarajulu made changes - Description edited
          Daniel Owen made changes - Summary: '[Doc isolation] Seeing rebalance failure with reason "mover crashed" followed by memcached crash' → '[Doc isolation] failed as no HashTable item found with key:<ud>cid:0x0:test_docs-00020907</ud> prepare_seqno:29, commit_seqno: 30'
          Daniel Owen made changes - Summary: '[Doc isolation] failed as no HashTable item found with key:<ud>cid:0x0:test_docs-00020907</ud> prepare_seqno:29, commit_seqno: 30' → '[Doc isolation] failed as no HashTable item found with key:.... prepare_seqno:29, commit_seqno: 30'
          Ben Huddleston added a comment -

          Tried a repro without TAF today with no luck. Have been trying a TAF repro and hitting the following error:

          2020-07-17 18:13:41,720 | test  | ERROR   | pool-2-thread-4 | [rest_client:_http_request:746] POST http://127.0.0.1:9003/pools/default body: memoryQuota=13007 headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjphc2Rhc2Q=\n', 'Content-Type': 'application/x-www-form-urlencoded'} error: 400 reason: unknown {"errors":{"_":"Total quota (13007MB) exceeds the maximum allowed quota (9157MB) on node 'n_3@127.0.0.1'"}} auth: Administrator:asdasd
          

          Ashwin Govindarajulu, how do you specify memory quota for the test?

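          For reference, the failing request in the log above is the memory quota update against /pools/default. A minimal equivalent using the requests library is sketched below; the host, port, credentials, and the 9000 MB value are placeholders, and the quota has to stay within the per-node maximum reported in the 400 response.

          import requests

          resp = requests.post(
              "http://127.0.0.1:9003/pools/default",
              data={"memoryQuota": 9000},          # MB; must not exceed the node's allowed maximum
              auth=("Administrator", "password"),  # placeholder credentials
          )
          resp.raise_for_status()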
          Ashwin Govindarajulu added a comment -

          As discussed with Ben Huddleston, he was using the CE build of the server, so bucket creation was failing with the following log message in TAF:
          2020-07-17 18:23:56,805 | test | INFO | MainProcess | pool-2-thread-5 | [table_view:display:72] Rebalance Overview
          --------------------------------------
          | Nodes     | Services        | Status       |
          --------------------------------------
          | 127.0.0.1 | index, kv, n1ql | Cluster node |
          | 127.0.0.1 | None            | <--- IN ---  |
          | 127.0.0.1 | None            | <--- IN ---  |
          | 127.0.0.1 | None            | <--- IN ---  |
          --------------------------------------
          2020-07-17 18:23:56,855 | test | INFO | MainProcess | pool-2-thread-5 | [task:check:320] Rebalance - status: running, progress: 0
          2020-07-17 18:24:06,959 | test | INFO | MainProcess | pool-2-thread-5 | [task:check:320] Rebalance - status: none, progress: 100
          2020-07-17 18:24:06,982 | test | INFO | MainProcess | pool-2-thread-5 | [task:check:377] Rebalance completed with progress: 100% in 10.1910002232 sec
          2020-07-17 18:24:07,403 | infra | ERROR | MainProcess | pool-2-thread-6 | [Rest_Connection:_http_request:205] POST http://127.0.0.1:9000/pools/default/buckets body: storageBackend=couchstore&replicaIndex=1&maxTTL=0&flushEnabled=0&durabilityMinLevel=none&compressionMode=passive&bucketType=couchbase&conflictResolutionType=seqno&name=default&replicaNumber=2&ramQuotaMB=100&threadsNumber=3&evictionPolicy=valueOnly headers: {'Accept': '/', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjphc2Rhc2Q=', 'Content-Type': 'application/x-www-form-urlencoded'} error: 400 reason: unknown {"errors":{"maxTTL":"Max TTL is supported in enterprise edition only","compressionMode":"Compression mode is supported in enterprise edition only"},"summaries":{"ramSummary":{"total":22187868160,"otherBuckets":0,"nodesCount":4,"perNodeMegs":100,"thisAlloc":419430400,"thisUsed":0,"free":21768437760},"hddSummary":{"total":1999852683264,"otherData":1719873307604,"otherBuckets":0,"thisUsed":0,"free":279979375660}}} auth: Administrator:asdas
          So I asked him to switch to EE and try again.

          Daniel Owen made changes - Due Date: 20/Jul/20 → 24/Jul/20
          Ben Huddleston added a comment -

          Tried a repro with a TAF cluster_run and hit other issues. Ashwin Govindarajulu said that this test won't work properly with a cluster_run, so I switched to vagrants.

          Hit a gsl assert in the ADM when running this test with vagrants. I expect it could have the same root cause, so it's worth mentioning here. Here's the backtrace.

          (gdb) bt full
          #0  0x00007f04936025c9 in raise () from /usr/lib64/libc-2.17.so
          No symbol table info available.
          #1  0x00007f0493603cd8 in abort () from /usr/lib64/libc-2.17.so
          No symbol table info available.
          #2  0x00007f0494154195 in __gnu_cxx::__verbose_terminate_handler () at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/vterminate.cc:95
                  terminating = false
                  t = <optimized out>
          #3  0x0000000000549e42 in backtrace_terminate_handler () at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/utilities/terminate_handler.cc:86
          No locals.
          #4  0x00007f0494151f86 in __cxxabiv1::__terminate (handler=<optimized out>) at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:47
          No locals.
          #5  0x00007f0494151fd1 in std::terminate () at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/eh_terminate.cc:57
          No locals.
          #6  0x00007f0494152213 in __cxxabiv1::__cxa_throw (obj=obj@entry=0x7f047c0060c0, tinfo=0x7f9ab0 <typeinfo for std::logic_error>, dest=0x418c50 <_ZNSt11logic_errorD1Ev@plt>) at /tmp/deploy/gcc-7.3.0/libstdc++-v3/libsupc++/eh_throw.cc:93
                  globals = <optimized out>
                  header = 0x7f047c006040
          #7  0x00007f048eb71c10 in ActiveDurabilityMonitor::throwException<std::logic_error> (this=this@entry=0x7f046100b0f0, thrower=..., error=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/durability/active_durability_monitor.cc:1482
          No locals.
          #8  0x00007f048eb6d05a in ActiveDurabilityMonitor::abort (this=this@entry=0x7f046100b0f0, sw=...) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/durability/active_durability_monitor.cc:872
                  key = <optimized out>
                  result = ENGINE_KEY_ENOENT
                  __func__ = "abort"
                  s = <optimized out>
          #9  0x00007f048eb6f463 in ActiveDurabilityMonitor::processCompletedSyncWriteQueue (this=0x7f046100b0f0) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/durability/active_durability_monitor.cc:450
                  sw = {storage_ = {{emptyState = -96 '\240', value = {<DurabilityMonitor::SyncWrite> = {item = {value = 0x7f046873e2a0}, startTime = {__d = {__r = 813149465068}}, status = ToAbort}, cookie = 0x7f04606fd680, firstChain = {
                            ackCount = {<IgnorePolicy<unsigned char>> = {<No data fields>}, val = 1 '\001'}, chainPtr = 0x7f0460f1a590}, secondChain = {ackCount = {<IgnorePolicy<unsigned char>> = {<No data fields>}, val = 0 '\000'}, chainPtr = 0x0},
                          expiryTime = {<boost::optional_detail::optional_base<std::chrono::time_point<std::chrono::_V2::steady_clock, std::chrono::duration<long, std::ratio<1, 1000000000> > > >> = {<boost::optional_detail::optional_tag> = {<No data fields>}, m_initialized = true,
                              m_storage = {dummy_ = {data = "\236p\230ٽ\000\000", aligner_ = {<No data fields>}}}}, <No data fields>}}}, hasValue = true}}
                  lock = {_M_device = @0x7f04610111a8}
          #10 0x00007f048ec345a0 in VBucket::processResolvedSyncWrites (this=0x7f0461035800) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/vbucket.cc:697
                  rlh = {lock_ = 0x7f0461035a50, token_ = {type_ = folly::SharedMutexToken::INLINE_SHARED, slot_ = 25215}}
          #11 0x00007f048eb7498d in DurabilityCompletionTask::run (this=0x7f048fd64510) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/durability/durability_completion_task.cc:53
                  vb = {<std::__shared_ptr<VBucket, (__gnu_cxx::_Lock_policy)2>> = {<std::__shared_ptr_access<VBucket, (__gnu_cxx::_Lock_policy)2, false, false>> = {<No data fields>}, _M_ptr = 0x7f0461035800, _M_refcount = {_M_pi = 0x7f0461044f20}}, <No data fields>}
                  runtime = <optimized out>
                  startTime = {__d = {__r = 815400727662}}
                  pendingVb = {vbid = 33}
          #12 0x00007f048ebd0864 in ExecutorThread::run (this=0x7f048fd6ab20) at /home/couchbase/jenkins/workspace/couchbase-server-unix/kv_engine/engines/ep/src/executorthread.cc:190
                  curTaskDescr = {static npos = 18446744073709551615, _M_dataplus = {<std::allocator<char>> = {<__gnu_cxx::new_allocator<char>> = {<No data fields>}, <No data fields>}, _M_p = 0x7f04615cf8a0 <Address 0x7f04615cf8a0 out of bounds>}, _M_string_length = 24, {
                      _M_local_buf = "\030\000\000\000\000\000\000\000pressor", _M_allocated_capacity = 24}}
                  woketime = <optimized out>
                  scheduleOverhead = <optimized out>
                  again = <optimized out>
                  runtime = <optimized out>
                  q = <optimized out>
                  tick = 249 '\371'
          #13 0x00007f049617af17 in run (this=0x7f0492fc3110) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:58
          No locals.
          #14 platform_thread_wrap (arg=0x7f0492fc3110) at /home/couchbase/jenkins/workspace/couchbase-server-unix/platform/src/cb_pthreads.cc:71
                  context = {_M_t = {
                      _M_t = {<std::_Tuple_impl<0, CouchbaseThread*, std::default_delete<CouchbaseThread> >> = {<std::_Tuple_impl<1, std::default_delete<CouchbaseThread> >> = {<std::_Head_base<1, std::default_delete<CouchbaseThread>, true>> = {<std::default_delete<CouchbaseThread>> = {<No data fields>}, <No data fields>}, <No data fields>}, <std::_Head_base<0, CouchbaseThread*, false>> = {_M_head_impl = 0x7f0492fc3110}, <No data fields>}, <No data fields>}}}
          #15 0x00007f0493995df3 in start_thread () from /usr/lib64/libpthread-2.17.so
          No symbol table info available.
          #16 0x00007f04936c31ad in clone () from /usr/lib64/libc-2.17.so
          No symbol table info available.
          

          Will investigate the logs.

          Ben Huddleston added a comment -

Didn't manage to get the logs for the above crash; they'd already rotated out.

Managed to repro the issue on vagrants (I believe twice). My attempt at getting pcaps was unsuccessful though, so I'm running again; TAF can capture them with the option "collect_pcaps=True", which is handy. I believe I've got a repro again, but it's stuck in a txn cleanup phase (this is one snippet of the logs).

          15:38:07.564 244/pool-2-thread-15/841a5 ErrorWrapper retryTransaction=false
          15:38:07.564 244/pool-2-thread-15/841a5/e8cb6 converted ErrorWrapper TRANSACTION_EXPIRED to final error com.couchbase.transactions.error.TransactionExpired: Transaction has expired configured timeout of 100000msecs.  The transaction is not committed.
          TransactionCleanupEndRunEvent: start={default/_default/clientId=1583f,index=0,numClients=2,ATRs={checking=512,total=1024},runLength=60000millis},end={taken=10128millis,ATRs={tried=0,present=0,errored=0},entries={expired=0,failed=0,total=0},largestNumEntries=0}
          TransactionCleanupEndRunEvent: start={default/_default/clientId=6b822,index=1,numClients=2,ATRs={checking=512,total=1024},runLength=60000millis},end={taken=10127millis,ATRs={tried=0,present=0,errored=0},entries={expired=0,failed=0,total=0},largestNumEntries=0}
          

I think this phase is running some kind of txn that is constantly failing because the previous errors have left at least one node stuck, with the DCP connections constantly being torn down with errors. It hasn't spat out the pcaps or cbcollects yet though.

          lynn.straus Lynn Straus made changes -
          Labels 6.6.0 Transactions functional-test 6.6.0 Transactions approved-for-6.6.0 functional-test
          lynn.straus Lynn Straus made changes -
          Link This issue blocks MB-38724 [ MB-38724 ]
          ben.huddleston Ben Huddleston added a comment -

Ashwin Govindarajulu told me that the easiest way to get pcaps in this case is to just ssh to each vagrant node and kill the tcpdump process. While attempting to do so I found that the tcpdump process was not actually running: the tcpdump command in pytests/basetestcase.py assumes the eth0 device, which is not the case for my VMs.

          https://github.com/couchbaselabs/TAF/blob/69c32d49cba9caff1e64bb30345ffee419ed0295/pytests/basetestcase.py#L398

          tcpdump -C 500 -w pcaps/pack-dump-file.pcap  -i eth0 -s 0 tcp
          

After changing this to use an appropriate device, and a bit of messing about to rebuild my local Wireshark, I managed to get pcaps of a reproduction.
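For reference, a minimal sketch of how the capture device could be derived from the node's IP instead of being hardcoded. This is a hypothetical illustration, not the actual TAF change; it assumes the snippet runs on the captured node (or through the test's existing shell connection) and that the iproute2 "ip" tool is available there.

# Hypothetical sketch: derive the tcpdump capture device from the node IP
# rather than hardcoding eth0 (which is wrong on some Vagrant VMs).
import subprocess

def interface_for_ip(node_ip):
    # Parse 'ip -o -4 addr show' output, e.g.
    # "2: eth1    inet 10.112.202.104/24 brd 10.112.202.255 scope global eth1 ..."
    out = subprocess.check_output(["ip", "-o", "-4", "addr", "show"]).decode()
    for line in out.splitlines():
        fields = line.split()
        if fields[3].split("/")[0] == node_ip:
            return fields[1]
    return "eth0"  # fall back to the original default

def tcpdump_command(node_ip, out_file="pcaps/pack-dump-file.pcap"):
    device = interface_for_ip(node_ip)
    return "tcpdump -C 500 -w %s -i %s -s 0 tcp" % (out_file, device)

if __name__ == "__main__":
    print(tcpdump_command("10.112.202.104"))

Matching on the node's cluster IP picks the NIC that actually carries the cluster/DCP traffic, which is what the pcaps need to show.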

          ben.huddleston Ben Huddleston made changes -
          ben.huddleston Ben Huddleston added a comment -

Log snippets for node .104, which hit this issue.

          2020-07-20T18:41:32.612223+00:00 INFO (default) VBucket: created vb:55 with state:replica initialState:dead lastSeqno:0 persistedRange:{0,0} max_cas:0 uuid:179188313359479 topology:null
          2020-07-20T18:41:32.612244+00:00 INFO (default) VBucket::setState: transitioning vb:55 with high seqno:0 from:replica to:replica
          2020-07-20T18:41:32.834898+00:00 INFO 51: (default) DCP (Consumer) eq_dcpq:replication:ns_1@10.112.202.101->ns_1@10.112.202.104:default - (vb:55) Attempting to add stream: opaque_:15, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:179188313359479, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
          2020-07-20T18:41:33.117764+00:00 INFO 53: (default) DCP (Consumer) eq_dcpq:replication:ns_1@10.112.202.103->ns_1@10.112.202.104:default - (vb:555) Attempting to add stream: opaque_:1, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:128582366262475, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
          2020-07-20T18:41:33.117785+00:00 INFO 53: (default) DCP (Consumer) eq_dcpq:replication:ns_1@10.112.202.103->ns_1@10.112.202.104:default - (vb:556) Attempting to add stream: opaque_:2, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:158066772463846, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
          2020-07-20T18:41:33.117802+00:00 INFO 53: (default) DCP (Consumer) eq_dcpq:replication:ns_1@10.112.202.103->ns_1@10.112.202.104:default - (vb:557) Attempting to add stream: opaque_:3, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:194833567066844, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
          2020-07-20T18:41:33.117811+00:00 INFO 53: (default) DCP (Consumer) eq_dcpq:replication:ns_1@10.112.202.103->ns_1@10.112.202.104:default - (vb:558) Attempting to add stream: opaque_:4, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:160753835799648, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
          2020-07-20T18:41:33.117818+00:00 INFO 53: (default) DCP (Consumer) eq_dcpq:replication:ns_1@10.112.202.103->ns_1@10.112.202.104:default - (vb:559) Attempting to add stream: opaque_:5, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:28356807138515, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
          2020-07-20T18:43:11.619383+00:00 ERROR (default) VBucket::commit (vb:55) failed as no HashTable item found with key:<ud>cid:0x0:test_docs-20049</ud> prepare_seqno:13, commit_seqno: 14
          

The pcap shows this is a deleted prepare that was preceded by an abort for the same key.

With the recent txn changes to stage documents as deleted, every prepare should be marked deleted.

          ben.huddleston Ben Huddleston made changes -
          ben.huddleston Ben Huddleston added a comment - - edited

Based on seqno acks in the pcap, we know that this node has persisted the prepare that "went missing" (the highlighted prepare).

          ben.huddleston Ben Huddleston added a comment -

Think I've found the bug here. There's a race condition when flushing documents that requires both an abort and a deleted prepare for the same key, as follows:

          1. Create and flush a prepare
          2. Create an abort for that prepare
          3. Run the flusher to the point that it has acquired the mutations from the CheckpointManager but not yet invoked the PersistenceCallback
          4. Create a deleted prepare with the same key
          5. Finish running the PersistenceCallback

At step 5 we call "VBucket::deletedOnDiskCbk" for the abort. We call this function because an abort has the deleted flag set on the item and is not a prepare. This function then attempts to find the item in the HashTable; if the item exists, is deleted, and has the same rev seqno, we delete it from the HashTable. This is the cause of the "prepare not found" errors. It can manifest on a replica as an "ERROR" message that causes the DCP stream to shut down, or, on the active, as a crash in the ActiveDurabilityMonitor.

This is most likely a latent bug with SyncDeletes that has surfaced now because the change to transactions to store staged mutations as deleted has increased the likelihood of hitting it.
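To make the failure mode concrete, here is a toy model of the race above together with a seqno-comparison guard in the persistence callback (the eventual fix commit is titled "MB-40480: Compare seqno at VBucket::deletedOnDiskCbk"). It is written in Python purely for illustration; the names and the single seqno field are assumptions and do not mirror the real kv_engine types, which already compared the rev seqno — the toy's compare_seqno flag stands in for the extra comparison against the persisted item's seqno.

# Toy model of the race; not kv_engine code. A single "seqno" field stands in
# for the real item identity checks.
from dataclasses import dataclass

@dataclass
class StoredValue:
    key: str
    seqno: int
    deleted: bool
    is_prepare: bool

hash_table = {}

def deleted_on_disk_cbk(key, persisted_seqno, compare_seqno):
    # Persistence callback for a deleted item (here, the abort). Without the
    # seqno comparison it removes whatever deleted item currently sits in the
    # HashTable for this key, even a newer, still-unpersisted deleted prepare.
    sv = hash_table.get(key)
    if sv is None or not sv.deleted:
        return
    if compare_seqno and sv.seqno != persisted_seqno:
        return  # the HashTable holds a newer item; leave it alone
    del hash_table[key]

# Steps 1-5 from the comment above, compressed:
hash_table["k"] = StoredValue("k", seqno=2, deleted=True, is_prepare=False)  # abort (step 2)
# Step 3: the flusher has picked up the abort (seqno 2) but not yet run its callback.
hash_table["k"] = StoredValue("k", seqno=3, deleted=True, is_prepare=True)   # deleted prepare (step 4)

deleted_on_disk_cbk("k", persisted_seqno=2, compare_seqno=False)             # step 5, old behaviour
assert "k" not in hash_table   # the unpersisted prepare has "gone missing"

hash_table["k"] = StoredValue("k", seqno=3, deleted=True, is_prepare=True)
deleted_on_disk_cbk("k", persisted_seqno=2, compare_seqno=True)              # step 5 with the guard
assert "k" in hash_table       # the deleted prepare survives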

          drigby Dave Rigby added a comment -

          This is most likely a latent bug with SyncDeletes that has come up now as changing transactions to store staged mutations as deleted has increased the likelihood of hitting it.

          Can you confirm, and if so update the affectsVersion to include 6.5.0 and 6.5.1 please?

          ben.huddleston Ben Huddleston made changes -
          Affects Version/s 6.5.0 [ 15037 ]
          Affects Version/s 6.5.1 [ 16622 ]
          ben.huddleston Ben Huddleston made changes -
          Summary [Doc isolation] failed as no HashTable item found with key:.... prepare_seqno:29, commit_seqno: 30 Non-complete, unpersisted, "deleted" prepare can be removed from HashTable by the persistence of previous abort
          ben.huddleston Ben Huddleston added a comment -

          Affects 6.5.0 and 6.5.1.

          drigby Dave Rigby made changes -
          Assignee Ben Huddleston [ ben.huddleston ] Ashwin Govindarajulu [ ashwin.govindarajulu ]
          Resolution Fixed [ 1 ]
          Status Open [ 1 ] Resolved [ 5 ]
          build-team Couchbase Build Team added a comment -

          Build couchbase-server-6.6.0-7888 contains kv_engine commit 6842000 with commit message:
          MB-40480: Compare seqno at VBucket::deletedOnDiskCbk

          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

Not seeing this issue in the latest build. Verified using 6.6.0-7891.

          Hence closing this ticket.

          ashwin.govindarajulu Ashwin Govindarajulu made changes -
VERIFICATION STEPS Ran the same test a few times and did not hit this issue.
          Job: http://qa.sc.couchbase.com/job/oel6-4node-rebalance_in_jython/1128/console
          Status Resolved [ 5 ] Closed [ 6 ]
          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.0.0-2712 contains kv_engine commit 6842000 with commit message:
          MB-40480: Compare seqno at VBucket::deletedOnDiskCbk


            People

            Assignee:
            ashwin.govindarajulu Ashwin Govindarajulu
            Reporter:
            ashwin.govindarajulu Ashwin Govindarajulu
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

              Dates

              Due:
              Created:
              Updated:
              Resolved:

                Gerrit Reviews

                There are no open Gerrit changes
