Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-43120

Ephemeral PDM: "No Prepare waiting for completion, but received commit for key" after node restart

    XMLWordPrintable

Details

    Description

      Script to Repo

      /testrunner -i /tmp/testexec.13082.ini GROUP=auto_reprovision_durability,rerun=False,get-cbcollect-info=True,infra_log_level=critical,log_level=error,upgrade_version=7.0.0-3938 -t failover.AutoFailoverTests.AutoFailoverTests.test_autofailover_during_rebalance,upgrade_version=7.0.0-3938,nodes_in=0,timeout=5,rerun=False,auto_reprovision=True,GROUP=auto_reprovision_durability,nodes_out=1,bucket_spec=single_bucket.buckets_all_ephemeral_for_rebalance_tests_more_collections,data_load_spec=volume_test_load_with_CRUD_on_collections,get-cbcollect-info=True,replicas=2,durability=MAJORITY,log_level=error,failover_action=restart_server,nodes_init=4,can_abort_rebalance=False,num_node_failures=1,override_spec_params=durability;replicas,infra_log_level=critical
      

      Summary:
      The test introduces server_restart on a node with autoreprovison enabled, and expects the ongoing rebalance to fail.  After node comes up back again, bucket gets autoreprovisioned and a subsequent rebalance is triggered, which in this case failed. 
      (Not a new test)

       

      Observations and timeline

      Inject failure condition of "server_restart" on .119 and expect an ongoing rebalance to fail 

      2020-12-06 04:12:14,760 | test  | ERROR   | pool-1-thread-20 | [rest_client:print_UI_logs:2596] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.123.117', u'tstamp': 1607256732759L, u'shortText': u'message', u'serverTime': u'2020-12-06T04:12:12.759Z', u'text': u'Rebalance exited with reason {mover_crashed,\n                              {unexpected_exit,\n    
      

      .119 node comes up

      2020-12-06 04:15:27,631 | test  | ERROR   | pool-1-thread-6 | [rest_client:print_UI_logs:2596] {u'code': 0, u'module': u'auto_reprovision', u'type': u'info', u'node': u'ns_1@172.23.123.117', u'tstamp': 1607256747350L, u'shortText': u'message', u'serverTime': u'2020-12-06T04:12:27.350Z', u'text': u'auto-reprovision is disabled as maximum number of nodes (1) that can be auto-reprovisioned has been reached.'}
      2020-12-06 04:15:27,631 | test  | ERROR   | pool-1-thread-6 | [rest_client:print_UI_logs:2596] {u'code': 0, u'module': u'auto_reprovision', u'type': u'info', u'node': u'ns_1@172.23.123.117', u'tstamp': 1607256747348L, u'shortText': u'message', u'serverTime': u'2020-12-06T04:12:27.348Z', u'text': u'Bucket "default" has been reprovisioned on following nodes: [\'ns_1@172.23.123.119\']. Nodes on which the data service restarted: [\'ns_1@172.23.123.119\'].'}
      

      Subsequent rebalance fails (expected it to pass)

      2020-12-06 04:21:37,841 | test  | ERROR   | pool-1-thread-21 | [rest_client:print_UI_logs:2596] {u'code': 0, u'module': u'ns_orchestrator', u'type': u'critical', u'node': u'ns_1@172.23.123.117', u'tstamp': 1607257293145L, u'shortText': u'message', u'serverTime': u'2020-12-06T04:21:33.145Z', u'text': u'Rebalance exited with reason {mover_crashed,\n                              {unexpected_exit,\n          
      

      Few error messages

      172.23.123.119 : Found exceptionRunLoop message in /opt/couchbase/var/lib/couchbase/logs/memcached.log.000005.txt
      2020-12-06T04:21:32.126048-08:00 WARNING 792: exception occurred in runloop during packet execution. Closing connection: PassiveDurabilityMonitor::completeSyncWrite vb:767 No Prepare waiting for completion, but received commit for key <ud>cid:0xaa:test_collections-161</ud>. Cookies: [{"aiostat":"success","connection":"[ {\"ip\":\"127.0.0.1\",\"port\":56137} - {\"ip\":\"127.0.0.1\",\"port\":11209} (<ud>@ns_server</ud>) ]","engine_storage":"0x0000000000000000","ewouldblock":false,"packet":{"bodylen":38,"cas":0,"datatype":"raw","extlen":16,"key":"<ud>..test_collections-161</ud>","keylen":22,"magic":"ClientRequest","opaque":202,"opcode":"DCP_COMMIT","vbucket":767},"refcount":1}]
      

       

      172.23.121.216 : Found ERROR message in /opt/couchbase/var/lib/couchbase/logs/memcached.log.000004.txt
      2020-12-06T04:21:31.944275-08:00 ERROR (default) VBucket::commit (vb:767) failed as no HashTable item found with key:<ud>cid:0xaa:test_collections-161</ud> prepare_seqno:1803, commit_seqno:2032

       

      172.23.121.217 : Found ERROR message in /opt/couchbase/var/lib/couchbase/logs/memcached.log.000004.txt
      2020-12-06T04:21:31.945663-08:00 ERROR 57: (default) DCP (Producer) eq_dcpq:replication:ns_1@172.23.121.217->ns_1@172.23.121.216:default - DcpProducer::handleResponse disconnecting, received unexpected response:{"bodylen":0,"cas":0,"datatype":"raw","extlen":0,"keylen":0,"magic":"ClientResponse","opaque":198,"opcode":"DCP_COMMIT","status":"Not found"} for stream:stream name:eq_dcpq:replication:ns_1@172.23.121.217->ns_1@172.23.121.216:default, vb:767, state:in-memory
      

       

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            Looked fine on last weekly build 7.0.0-3784.

            sumedh.basarkod Sumedh Basarkod (Inactive) added a comment - Looked fine on last weekly build 7.0.0-3784.
            drigby Dave Rigby added a comment -

            The error messages seen sound similar to MB-41089 (which was fixed in build 7.0.0-3701).

            drigby Dave Rigby added a comment - The error messages seen sound similar to MB-41089 (which was fixed in build 7.0.0-3701).

            Mass closing all defects that has Resolution != Fixed

            mihir.kamdar Mihir Kamdar (Inactive) added a comment - Mass closing all defects that has Resolution != Fixed

            People

              sumedh.basarkod Sumedh Basarkod (Inactive)
              sumedh.basarkod Sumedh Basarkod (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty