  1. Couchbase Server
  2. MB-48671

XDCR - DCP Nozzle may not be cleanly shut down

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 7.1.0
    • 6.6.0, 6.6.1, 6.6.2, 6.6.3, 7.0.0, 7.0.1, 7.1.0
    • XDCR
    • None
    • Untriaged
    • 1
    • No

    Description

      2021-09-28T08:04:21.398-07:00 INFO GOXDCR.DcpNozzle: dcp_backfill_a30a022d663b52ee88384f1a57e757f7/default/remote_172.23.96.148:11210_1 received 3515 items (0 compressed), sent 3515 items. streams inactive: map[308:1 310:1 314:1 326:1 333:1 338:1 539:0 540:0 541:0 542:0 543:0 544:0 545:0 546:0]
       
      2021-09-28T08:04:22.182-07:00 ERRO GOXDCR.DcpNozzle: Error: dcpStreamHelper for vbno: 546 internal version overflow
      

      VB 546 had been in the non-init state for a long time. The DCP nozzle has a restarter (part of dcpNozzle.startUprStreams()) that picks up non-init VBs and tries to start them.

      It does this every 100 milliseconds.
      The maximum value of a uint16 is 65535.
      Incrementing by 1 every 100 milliseconds therefore takes 6553500 milliseconds, or about 1 hour 49 minutes, to overflow.

      Before the panic, this is the DCP nozzle's startUprStreams invocation:

      cbcollect_info_ns_1@172.23.96.148_20210928-172611/ns_server.goxdcr.log:2021-09-28T07:20:57.511-07:00 INFO GOXDCR.DcpNozzle: dcp_backfill_a30a022d663b52ee88384f1a57e757f7/default/remote_172.23.96.148:11210_1: startUprStreams for [308 309 310 314 315 317 319 321 322 323 324 325 326 328 330 331 333 334 335 336 337 338 340 341 539 540 541 542 543 544 545 546]...
      

      Before the panic occurred, the DCP nozzle repeatedly logged VB 546 as inactive:

      cbcollect_info_ns_1@172.23.96.148_20210928-172611/ns_server.goxdcr.log:2021-09-28T07:21:51.398-07:00 INFO GOXDCR.DcpNozzle: dcp_backfill_a30a022d663b52ee88384f1a57e757f7/default/remote_172.23.96.148:11210_1 received 3515 items (0 compressed), sent 3515 items. streams inactive: map[308:1 310:1 314:1 326:1 333:1 338:1 539:0 540:0 541:0 542:0 543:0 544:0 545:0 546:0]
      cbcollect_info_ns_1@172.23.96.148_20210928-172611/ns_server.goxdcr.log:2021-09-28T07:21:54.453-07:00 INFO GOXDCR.DcpNozzle: dcp_backfill_a30a022d663b52ee88384f1a57e757f7/default/remote_172.23.96.148:11210_1 received 7217 items (109 compressed), sent 7217 items. streams inactive: map[310:1 326:1 338:1 539:0 540:0 541:0 542:0 543:0 544:0 545:0 546:0]
      cbcollect_info_ns_1@172.23.96.148_20210928-172611/ns_server.goxdcr.log:2021-09-28T07:21:58.707-07:00 INFO GOXDCR.DcpNozzle: dcp_backfill_a30a022d663b52ee88384f1a57e757f7/default/remote_172.23.96.148:11210_1 received 3515 items (0 compressed), sent 3515 items. streams inactive: map[308:1 310:1 314:1 326:1 333:1 338:1 539:0 540:0 541:0 542:0 543:0 544:0 545:0 546:0]
      cbcollect_info_ns_1@172.23.96.148_20210928-172611/ns_server.goxdcr.log:2021-09-28T07:22:04.014-07:00 INFO GOXDCR.DcpNozzle: dcp_backfill_a30a022d663b52ee88384f1a57e757f7/default/remote_172.23.96.148:11210_1 received 12099 items (961 compressed), sent 12099 items. streams inactive: map[310:1 326:1 338:1 539:0 540:0 541:0 542:0 543:0 544:0 545:0 546:0]
      

      (repeat)

      Then the panic occurred:

      2021-09-28T08:04:22.182-07:00 ERRO GOXDCR.DcpNozzle: Error: dcpStreamHelper for vbno: 546 internal version overflow
      

      The many DCP nozzle log lines reporting different sent/received item counts suggest to me that there are stray DCP nozzles that were not shut down correctly and keep running in the background, eventually exhausting the uint16 counter.

      It seems that the closure of dcp.finCh, which has been set up this way for a long time, could be missed if the Start() and Stop() functions execute in parallel. I suspect that the problem has been there all along, but with the introduction of the Backfill Pipeline it shows up more often.

      The panic was a fortunate error, since it allowed us to catch the situation in a long-running system.

      This may be related to MB-48380.

            People

              neil.huang Neil Huang
              neil.huang Neil Huang