Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-31410

DCP Consumer tries to process out-of-order messages

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 5.1.3, 5.5.3, 6.0.1, 6.5.0
    • 5.0.0, 5.0.1, 5.1.0, 5.1.1, 5.1.2, 5.5.0, 5.5.1, 5.5.2
    • couchbase-bucket
    • None
    • Triaged
    • Unknown

    Description

      Rebalance fails while performing the following test (steps are in order for reproducing on both debug/rel vulcan builds on local OSX):

      1. export COUCHBASE_NUM_VBUCKETS=4 && ./cluster_run -n 2
      2. ./couchbase-cli cluster-init --cluster=localhost:9000 --cluster-username=admin --cluster-password=admin1 --services=data --cluster-ramsize=256
      3. ./couchbase-cli bucket-create -c localhost:9000 -u admin -p admin1 --bucket=example --bucket-type=couchbase --bucket-ramsize=100 --bucket-replica=1 --bucket-eviction-policy=fullEviction --enable-flush=1 --wait
      4. cbc-pillowfight --spec="couchbase://127.0.0.1:12000/example" --username=admin --password=admin1 --batch-size=1 --num-items=200000 --num-threads=2 --set-pct=100 --min-size=1024 --max-size=1024 --random-body --populate-only
      5. ./couchbase-cli cluster-init --cluster=localhost:9001 --cluster-username=admin --cluster-password=admin1 --services=data --cluster-ramsize=256
      6. ./couchbase-cli server-add --cluster=127.0.0.1:9000 --username=admin --password=admin1 --server-add=127.0.0.1:9001 --server-add-username=admin --server-add-password=admin1
      7. ./couchbase-cli rebalance -c localhost:9000 -u admin -p admin1

      Rebalance fails with:

      Rebalancing                                                                                                                                                  
      Bucket: 01/01 (example)                                                                                                                 100000 docs remaining
      [==========================================================================================================================                         ] 83.33%
      ERROR: Rebalance failed. See logs for detailed reason. You can try again.
      

      We log disconnections in memcached.log at n_0:

      2018-09-26T11:59:03.860260Z INFO (No Engine) DCP (Consumer) eq_dcpq:replication:n_1@127.0.0.1->n_0@192.168.4.65:example - Removing connection [ 192.168.4.65:58533 - 192.168.4.65:11999 (<ud>@ns_server</ud>) ]
      2018-09-26T11:59:03.860292Z WARNING 44: (example) DCP (Consumer) eq_dcpq:replication:n_1@127.0.0.1->n_0@192.168.4.65:example - (vb 2) Setting stream to dead state, last_seqno is 50000, unAckedBytes is 0, status is The stream closed early because the conn was disconnected
      2018-09-26T11:59:03.860303Z WARNING 44: (example) DCP (Consumer) eq_dcpq:replication:n_1@127.0.0.1->n_0@192.168.4.65:example - (vb 3) Setting stream to dead state, last_seqno is 50000, unAckedBytes is 0, status is The stream closed early because the conn was disconnected
      2018-09-26T11:59:03.860909Z INFO (No Engine) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.65->n_1@127.0.0.1:example - Removing connection [ 192.168.4.65:58476 - 192.168.4.65:11999 (<ud>@ns_server</ud>) ]
      2018-09-26T11:59:03.877834Z INFO 42: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.65->n_1@127.0.0.1:example - (vb 0) Stream closing, sent until seqno 34369 remaining items 0, reason: The stream closed early because the conn was disconnected
      2018-09-26T11:59:03.877847Z INFO 42: (example) DCP (Producer) eq_dcpq:replication:n_0@192.168.4.65->n_1@127.0.0.1:example - (vb 1) Stream closing, sent until seqno 50000 remaining items 0, reason: The stream closed early because the conn was disconnected
      

      Here the reason of the disconnection in memcached.log at n_1:

      2018-09-26T11:59:03.676598Z CRITICAL *** Fatal error encountered during exception handling ***
      2018-09-26T11:59:03.676638Z CRITICAL Caught unhandled std::exception-derived exception. what(): Monotonic<x> invariant failed: new value (24871) breaks invariant on current value (24872)
      2018-09-26T11:59:03.678398Z CRITICAL Call stack:     /Users/paoloc/dev/source/install/lib/libplatform_so.0.1.0.dylib(print_backtrace_to_buffer+0x4e) [0x10c63c000+0xc2ee]
          /Users/paoloc/dev/source/install/lib/libmcd_util.1.0.0.dylib(_ZL13log_backtracev+0x33) [0x10c3e0000+0x55853]
          /Users/paoloc/dev/source/install/lib/libmcd_util.1.0.0.dylib(_ZL27backtrace_terminate_handlerv+0x58) [0x10c3e0000+0x55638]
          /usr/lib/libc++abi.dylib(_ZSt11__terminatePFvvE+0x8) [0x7fffbaa84000+0x22d49]
          /usr/lib/libc++abi.dylib(_ZN10__cxxabiv1L22exception_cleanup_funcE19_Unwind_Reason_CodeP17_Unwind_Exception+0) [0x7fffbaa84000+0x227be]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN20ThrowExceptionPolicyIxE12nonMonotonicERKxS2_+0xc8c) [0x11b200000+0x66c1c]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN9MonotonicIx26DefaultOrderReversedPolicyNSt3__17greaterEEaSERKx+0x67) [0x11b200000+0x4f7b7]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN17CheckpointManager10setBySeqnoEx+0x65) [0x11b200000+0x1a08c5]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN9EPVBucket17queueBackfillItemER19SingleThreadedRCPtrI4ItemPS1_NSt3__114default_deleteIS1_EEE15GenerateBySeqno+0x190) [0x11b200000+0x1a05d0]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN7VBucket10queueDirtyER11StoredValue15GenerateBySeqno11GenerateCasbP22PreLinkDocumentContext+0x440) [0x11b200000+0x39bb80]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN7VBucket10queueDirtyER11StoredValueRK14VBQueueItemCtx+0x90) [0x11b200000+0x3b3130]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN9EPVBucket17addNewStoredValueERKN9HashTable14HashBucketLockERK4ItemRK14VBQueueItemCtx16GenerateRevSeqno+0x91) [0x11b200000+0x1a3001]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN7VBucket10processSetERKN9HashTable14HashBucketLockERP11StoredValueR4ItemybbRK14VBQueueItemCtxN2cb13StoreIfStatusEbb+0x15b9) [0x11b200000+0x3a0459]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN7VBucket15addBackfillItemER4Item15GenerateBySeqno+0x220) [0x11b200000+0x3a1610]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN8KVBucket15addBackfillItemER4Item15GenerateBySeqnoP16ExtendedMetaData+0x219) [0x11b200000+0x2e9829]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN13PassiveStream15processMutationEP16MutationResponse+0x8a3) [0x11b200000+0x15c9b3]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN13PassiveStream23processBufferedMessagesERjm+0x34c) [0x11b200000+0x15f62c]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN11DcpConsumer25drainStreamsBufferedItemsENSt3__110shared_ptrI13PassiveStreamEEm+0x1bb) [0x11b200000+0xc5b1b]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN11DcpConsumer20processBufferedItemsEv+0x101) [0x11b200000+0xc7351]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN15DcpConsumerTask3runEv+0x383) [0x11b200000+0xd7663]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZN14ExecutorThread3runEv+0x1212) [0x11b200000+0x263052]
          /Users/paoloc/dev/source/install/lib/ep.dylib(_ZL22launch_executor_threadPv+0x1d) [0x11b200000+0x26197d]
          /Users/paoloc/dev/source/install/lib/libplatform_so.0.1.0.dylib(_ZN15CouchbaseThread3runEv+0x21e) [0x10c63c000+0x474e]
          /Users/paoloc/dev/source/install/lib/libplatform_so.0.1.0.dylib(_ZL20platform_thread_wrapPv+0x102) [0x10c63c000+0x2c02]
          /usr/lib/system/libsystem_pthread.dylib(_pthread_body+0xb4) [0x7fffbc0b3000+0x393b]
          /usr/lib/system/libsystem_pthread.dylib(_pthread_body+0) [0x7fffbc0b3000+0x3887]
          /usr/lib/system/libsystem_pthread.dylib(thread_start+0xd) [0x7fffbc0b3000+0x308d]
      

      Notes:

      • The error is easily reproducible with the steps and the setup above
      • But, it is not deterministic (e.g., I don't see the issue if I repeat the test with COUCHBASE_NUM_VBUCKETS=1024)
      • Symptoms are very similar to the ones in MB-29501

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              owend Daniel Owen
              paolo.cocchi Paolo Cocchi
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                PagerDuty