Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-40052

DCP producer not backfilling data after previously reaching maxBackfills limit

    XMLWordPrintable

Details

    • Triaged
    • 1
    • Yes
    • KV Sprint 2020-June

    Description

      A couple of our regression tests started intermittently failing recently because of some missing documents that are supposed to be streamed via DCP within a certain timeout (60 seconds).

      The tests use cluster_run with 3 nodes as follows:
      Node n0: data + analytics
      Node n1: data
      Node n2: analytics

      At around time 2020-06-14T19:52:40.348-07:00, the beer-sample bucket is created and loaded. After that, Analytics starts DCP streams to the bucket.
      When I checked the logs, I noticed that we are receiving all the documents from one of the data nodes but none from the other.

      Following the memcached logs for a vbucket from each node, we can see the following:

      Node n0 (vb:1):

      2020-06-14T19:52:40.471388-07:00 INFO 57: HELO [cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:1] XERROR, Select bucket [ 172.22.0.2:35852 - 172.22.0.2:12000 (<ud>@cbas</ud>) ]
      2020-06-14T19:52:40.528744-07:00 INFO 57: (beer-sample) DCP (Producer) eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:1 - (vb:1) Creating stream with start seqno 0 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none
      2020-06-14T19:52:40.528785-07:00 INFO 57: (beer-sample) DCP (Producer) eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:1 - (vb:1) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:1" backfill:true, seqno:9
      2020-06-14T19:52:40.528801-07:00 INFO 57: (beer-sample) DCP (Producer) eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:1 - (vb:1) Scheduling backfill from 1 to 8, reschedule flag : False
      2020-06-14T19:52:40.528977-07:00 INFO 57: (beer-sample) DCP (Producer) eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:1 - (vb:1) ActiveStream::markDiskSnapshot: Sending disk snapshot with start 0, end 8, and high completed --, max visible 8
      2020-06-14T19:52:40.528986-07:00 INFO 57: (beer-sample) DCP (Producer) eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:1 - (vb:1) ActiveStream::registerCursor name "eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:1", backfill:true, seqno:9
      2020-06-14T19:52:40.639592-07:00 INFO 57: (beer-sample) DCP (Producer) eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:1 - (vb:1) Backfill complete, 0 items read from disk, 8 from memory, last seqno read: 8, pendingBackfill : False
      2020-06-14T19:52:40.639657-07:00 INFO 57: (beer-sample) DCP (Producer) eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:1 - (vb:1) ActiveStream::transitionState: Transitioning from backfilling to in-memory
      2020-06-14T19:53:40.931213-07:00 WARNING 57: (beer-sample) DCP (Producer) eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:1 - (vb:1) Stream closing, sent until seqno 8 remaining items 0, reason: The stream closed early because the conn was disconnected
      

      Node: n1 (vb:516) the backfill snapshot is never sent:

      2020-06-14T19:52:40.508361-07:00 INFO 50: HELO [cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:0] XERROR, Select bucket [ 172.22.0.2:37014 - 172.22.0.2:12002 (<ud>@cbas</ud>) ]
      2020-06-14T19:52:40.541900-07:00 INFO 50: (beer-sample) DCP (Producer) eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:0 - (vb:516) Creating stream with start seqno 0 and end seqno 18446744073709551615; requested end seqno was 18446744073709551615, collections-manifest uid:none, sid:none
      2020-06-14T19:52:40.541936-07:00 INFO 50: (beer-sample) DCP (Producer) eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:0 - (vb:516) ActiveStream::scheduleBackfill_UNLOCKED register cursor with name "eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:0" backfill:true, seqno:6
      2020-06-14T19:52:40.541951-07:00 INFO (beer-sample) Backfill for eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:0 vb:516 is pending
      2020-06-14T19:52:40.541956-07:00 INFO 50: (beer-sample) DCP (Producer) eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:0 - (vb:516) Scheduling backfill from 1 to 5, reschedule flag : False
      2020-06-14T19:53:40.916919-07:00 WARNING 50: (beer-sample) DCP (Producer) eq_dcpq:cbas:Default.Local:beer-sample:78d4ab44c7eb544b6cd41208352a4783:0 - (vb:516) Stream closing, sent until seqno 0 remaining items 0, reason: The stream closed early because the conn was disconnected
      

      We haven't made any major changes to the DCP streaming client and that test used to pass all the time within that configured timeout of 60 seconds.
      cbcollect_info from the nodes is attached.

      Attachments

        Issue Links

          For Gerrit Dashboard: MB-40052
          # Subject Branch Project Status CR V

          Activity

            People

              ashwin.govindarajulu Ashwin Govindarajulu
              murtadha.hubail Murtadha Hubail
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty