Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-35060

[Durability]: Rollback of SyncWrite-related operations can result in invalid state on replicas [2019/7/19]

    XMLWordPrintable

Details

    • Untriaged
    • Centos 64-bit
    • No
    • KV-Engine Mad-Hatter Beta

    Description

       

      Build: 6.5.0-3748

      Scenario:

      1. 4 node cluster, Couchbase bucket (replicas=3)
      2. Initially load 10 docs with durability=MAJORITY (Succeeds)
      3. Started loading 10K docs in background
      4. Kill Memcached on node-4 (10.112.191.104), when SyncWrite creates are happening in parallel

      Observation:

      Memcached on node-4 recovers, but still all the docs ops fails with `com.couchbase.client.core.error.RequestTimeoutException` exception.

      Once the step-3 is completed with errors, trying to retry the fails docs one-by-one using the separate SDK client with sdk_timeout=5 seconds (Same as step-2).

      Seeing `com.couchbase.client.core.error.DurabilityAmbiguousException` for all individual operation, but the doc_creation succeeds. After this memcached on replica nodes (node-2, node-3) crashes at the same time with exit code #139 with back-trace given below. node-1 crashes 16s later (05:19:43.384)
      Expected behavior:

      As soon as the node-4 restarts, all sync_writes should go through fine without any timeouts / ambiguous exceptions. No Memcached crash is expected when writing to replica nodes.

      2019-07-13T05:19:27.211-07:00, ns_log:0:info:message(ns_1@10.112.191.103) - Service 'memcached' exited with status 139. Restarting. Messages:
      2019-07-13T05:19:27.183050-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x8f300]
      2019-07-13T05:19:27.183054-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x8f3ab]
      2019-07-13T05:19:27.183059-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x9bf4c]
      2019-07-13T05:19:27.183065-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x30557]
      2019-07-13T05:19:27.183068-07:00 CRITICAL     /opt/couchbase/bin/../lib/libevent_core.so.2.1.8() [0x7fabf4202000+0x17107]
      2019-07-13T05:19:27.183071-07:00 CRITICAL     /opt/couchbase/bin/../lib/libevent_core.so.2.1.8(event_base_loop+0x39f) [0x7fabf4202000+0x1767f]
      2019-07-13T05:19:27.183077-07:00 CRITICAL     /opt/couchbase/bin/memcached() [0x400000+0x701a9]
      2019-07-13T05:19:27.183079-07:00 CRITICAL     /opt/couchbase/bin/../lib/libplatform_so.so.0.1.0() [0x7fabf4f23000+0x8f27]
      2019-07-13T05:19:27.183083-07:00 CRITICAL     /lib64/libpthread.so.0() [0x7fabf2953000+0x7df3]
      2019-07-13T05:19:27.183107-07:00 CRITICAL     /lib64/libc.so.6(clone+0x6d) [0x7fabf2592000+0xf61ad]
      

      Bucket state after the test execution:

       

      TAF Testcase:

       

      crash_test.crash_process.CrashTest:
        test_crash_process,nodes_init=4,replicas=3,num_items=10,process=memcached,service=memcached,sig_type=sigkill,target_node=active,durability=MAJORITY

       

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              ashwin.govindarajulu Ashwin Govindarajulu
              ashwin.govindarajulu Ashwin Govindarajulu
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty