Couchbase Server / MB-35135

"received unexpected commit for key" from PassiveDurabilityMonitor::completeSyncWrite [2019/7/19]


    Details

    • Triage:
      Triaged
    • Is this a Regression?:
      Unknown
    • Sprint:
      KV-Engine Mad-Hatter Beta

      Description

Error message seen on node 172.28.128.83 in memcached.log.
      Jepsen test was trying to perform Majority SyncWrites and failing over nodes with the following test parameters:
      workload=failover,failover-type=hard,recovery-type=delta,replicas=2,no-autofailover,disrupt-count=1,rate=0,durability=0:100:0:0,eviction-policy=value.

      2019-07-17T08:29:46.999576-07:00 ERROR 47: exception occurred in runloop during packet execution. Cookie info: [{"aiostat":"success","connection":"[ 127.0.0.1:37724 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]","engine_storage":"0x00007fa4f6f57010","ewouldblock":false,"packet":{"bodylen":27,"cas":0,"datatype":"raw","extlen":16,"key":"<ud>.jepsen0013</ud>","keylen":11,"magic":"ClientRequest","opaque":74,"opcode":"DCP_COMMIT","vbucket":413},"refcount":1}] - closing connection ([ 127.0.0.1:37724 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]): PassiveDurabilityMonitor::completeSyncWrite vb:413 Pending resolution for 'SW @0x7fa4f70e71d0 cookie:0 qi:[key:'<ud>cid:0x0:jepsen0013</ud>' seqno:1269 reqs:{Majority, timeout=infinite}] maj:0 #1stChainAcks:0 1stChainAcks:[] #2ndChainAcks:0 2ndChainAcks:[]', but received unexpected commit for key cid:0x0:jepsen0013
      2019-07-17T08:29:46.999628-07:00 INFO 47: (No Engine) DCP (Consumer) eq_dcpq:replication:ns_1@172.28.128.187->ns_1@172.28.128.83:default - Removing connection [ 127.0.0.1:37724 - 127.0.0.1:11209 (<ud>@ns_server</ud>) ]
      

        Attachments


          Activity

          richard.demellow Richard deMellow added a comment - - edited

This looks like it's causing a rebalance crash for Jepsen:

          2019-07-18 07:25:24,154{GMT}	WARN	[jepsen nemesis] jepsen.core: Process :nemesis crashed
          java.lang.RuntimeException: Rebalance failed
          	at couchbase.util$wait_for_rebalance_complete.invokeStatic(util.clj:123) ~[na:na]
          	at couchbase.util$wait_for_rebalance_complete.invoke(util.clj:112) ~[na:na]
          	at couchbase.util$wait_for_rebalance_complete.invokeStatic(util.clj:114) ~[na:na]
          	at couchbase.util$wait_for_rebalance_complete.invoke(util.clj:112) ~[na:na]
          	at couchbase.util$wait_for_rebalance_complete.invokeStatic(util.clj:113) ~[na:na]
          	at couchbase.util$wait_for_rebalance_complete.invoke(util.clj:112) ~[na:na]
          	at couchbase.util$rebalance.invokeStatic(util.clj:153) ~[na:na]
          	at couchbase.util$rebalance.invoke(util.clj:136) ~[na:na]
          	at couchbase.util$rebalance.invokeStatic(util.clj:138) ~[na:na]
          	at couchbase.util$rebalance.invoke(util.clj:136) ~[na:na]
          	at couchbase.nemesis$couchbase$reify__3387.invoke_BANG_(nemesis.clj:162) ~[na:na]
          	at jepsen.nemesis$invoke_compat_BANG_.invokeStatic(nemesis.clj:40) ~[jepsen-0.1.14.jar:na]
          	at jepsen.nemesis$invoke_compat_BANG_.invoke(nemesis.clj:36) ~[jepsen-0.1.14.jar:na]
          	at couchbase.core$_main$fn__4725$fn__4726.invoke(core.clj:248) ~[na:na]
          	at jepsen.core$nemesis_invoke_op_BANG_$fn__5523.invoke(core.clj:266) ~[jepsen-0.1.14.jar:na]
          	at jepsen.core$nemesis_invoke_op_BANG_.invokeStatic(core.clj:266) [jepsen-0.1.14.jar:na]
          	at jepsen.core$nemesis_invoke_op_BANG_.invoke(core.clj:261) [jepsen-0.1.14.jar:na]
          	at jepsen.core$nemesis_apply_op_BANG_.invokeStatic(core.clj:301) [jepsen-0.1.14.jar:na]
          	at jepsen.core$nemesis_apply_op_BANG_.invoke(core.clj:293) [jepsen-0.1.14.jar:na]
          	at jepsen.core.NemesisWorker.run_worker_BANG_(core.clj:417) [jepsen-0.1.14.jar:na]
          	at jepsen.core$run_workers_BANG_$run__5507.invoke(core.clj:214) [jepsen-0.1.14.jar:na]
          	at dom_top.core$real_pmap_helper$build_thread__214$fn__215.invoke(core.clj:146) [jepsen-0.1.14.jar:na]
          	at clojure.lang.AFn.applyToHelper(AFn.java:152) [clojure-1.10.1.jar:na]
          	at clojure.lang.AFn.applyTo(AFn.java:144) [clojure-1.10.1.jar:na]
          	at clojure.core$apply.invokeStatic(core.clj:665) [clojure-1.10.1.jar:na]
          	at clojure.core$with_bindings_STAR_.invokeStatic(core.clj:1973) [clojure-1.10.1.jar:na]
          	at clojure.core$with_bindings_STAR_.doInvoke(core.clj:1973) [clojure-1.10.1.jar:na]
          	at clojure.lang.RestFn.invoke(RestFn.java:425) [clojure-1.10.1.jar:na]
          	at clojure.lang.AFn.applyToHelper(AFn.java:156) [clojure-1.10.1.jar:na]
          	at clojure.lang.RestFn.applyTo(RestFn.java:132) [clojure-1.10.1.jar:na]
          	at clojure.core$apply.invokeStatic(core.clj:669) [clojure-1.10.1.jar:na]
          	at clojure.core$bound_fn_STAR_$fn__5749.doInvoke(core.clj:2003) [clojure-1.10.1.jar:na]
          	at clojure.lang.RestFn.invoke(RestFn.java:397) [clojure-1.10.1.jar:na]
          	at clojure.lang.AFn.run(AFn.java:22) [clojure-1.10.1.jar:na]
          	at java.lang.Thread.run(Thread.java:748) [na:1.8.0_212]
          

          When running test:
          workload=failover,failover-type=hard,recovery-type=delta,replicas=2,no-autofailover,disrupt-count=1,rate=0,durability=0:100:0:0

I've attached a second set of logs here: 20190718T072033.000-0700.zip

          ben.huddleston Ben Huddleston added a comment -

"Exception occurred in runloop" memcached errors typically result in a rebalance failure due to bulk_set_vbucket_state_failed in the ns_server logs. Do you have the rebalance failure message from the ns_server logs, or the node which failed?

          ben.huddleston Ben Huddleston added a comment - - edited

The above theory is not quite right. I think the disconnect needs to happen in the middle of the snapshot containing the prepare. That avoids updating the HPS; otherwise we would remove the old SyncWrite when the duplicate PRE arrives.

          ben.huddleston Ben Huddleston added a comment -

I now have a test case that hits this. The key was the replica not receiving a full snapshot before the disconnect, which causes the HPS not to be moved.

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-6.5.0-3819 contains kv_engine commit 34086a0 with commit message:
          MB-35135: Deal with old PRE in trackedWrites due to partial snapshot


            People

            • Assignee:
              richard.demellow Richard deMellow
              Reporter:
              richard.demellow Richard deMellow
            • Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:

                Gerrit Reviews

                There are no open Gerrit changes
