  1. Couchbase Server
  2. MB-51616

XDCR - check value range validity for XmemSelfMonitorIntervalConfig


Details

    • Bug
    • Resolution: Unresolved
    • Major
    • Morpheus
    • 6.5.1, 6.6.0, 6.6.1, 6.6.2, 6.5.2, 6.5.0, 6.6.3, Morpheus, 6.6.4, 6.6.5, 7.0.0, 7.0.1, 7.0.2, 7.0.3, 7.1.0
    • XDCR
    • None
    • Untriaged
    • 1
    • No

    Description

      Background
      This setting is used to detect whether the XMEM (outgoing) nozzle is stuck. The nozzle keeps a counter: each time an "interval" elapses (6 seconds by default), the counter increments if the nozzle has not moved any queued data.
      The original design is for the interval to fire every 6 seconds.
      After 60 intervals with no data movement (and no connection issues), meaning the same data that was queued to be sent is still unsent, a monitor fires an "Xmem is stuck" error and forces the pipeline to restart.
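
The counter logic described above can be sketched roughly as follows. This is an illustrative model only, not the goxdcr implementation: the class and method names are hypothetical, and resetting the counter whenever data moves is an assumption rather than something the description states.

```python
from dataclasses import dataclass

# From the description: the error fires after 60 intervals without progress.
MAX_STUCK_INTERVALS = 60


@dataclass
class StuckMonitor:
    """Hypothetical model of the XMEM self-monitor counter."""

    stuck_count: int = 0
    last_sent: int = 0

    def on_interval(self, queued: int, total_sent: int) -> bool:
        """Call once per interval; returns True when a restart should be forced."""
        if queued > 0 and total_sent == self.last_sent:
            # Data is queued but nothing was sent since the last interval.
            self.stuck_count += 1
        else:
            # Assumption: any progress (or an empty queue) resets the counter.
            self.stuck_count = 0
        self.last_sent = total_sent
        return self.stuck_count >= MAX_STUCK_INTERVALS
```

With the default 6 second interval, a monitor like this would report a stuck nozzle after 60 * 6 = 360 seconds without progress.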

       

      Back in 6.5.0, an MB was fixed:
      https://issues.couchbase.com/browse/MB-31762

      This MB was simple: it changed the unit of one of XDCR's internal settings, "XmemSelfMonitorInterval", from milliseconds to seconds.

      By default, customer environments should have the value "6" for this setting. With MB-31762, the interval is correctly interpreted as "6 seconds".

      However, it is likely that some customers, prior to MB-31762, saw the issue and set the value to "6000" to match the millisecond unit. With MB-31762 fixed, these customers' environments now have the setting at "6000 seconds".

      This means that for these customers, the monitor will fire only after (6000 seconds * 60 times) = 360,000 seconds, or 4 days and 4 hours.
      Given that the purge interval is 3 days by default, if a pipeline is stuck for 4 days and 4 hours and then finally restarts, there is a good chance that it hits the "purge seqno > resumeSeqno" issue and rolls back to 0. We don't want that.
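
The arithmetic above can be checked with a few lines of Python. The 60-interval threshold and the 3 day default purge interval come from this description; the function and constant names are made up for illustration.

```python
from datetime import timedelta

STUCK_INTERVALS = 60                        # intervals before the error fires
DEFAULT_PURGE_INTERVAL = timedelta(days=3)  # default purge interval per the description


def time_until_stuck_error(interval_seconds: int) -> timedelta:
    """Total time a stuck pipeline waits before the monitor forces a restart."""
    return timedelta(seconds=interval_seconds * STUCK_INTERVALS)


for interval in (6, 6000):
    wait = time_until_stuck_error(interval)
    exceeds = wait > DEFAULT_PURGE_INTERVAL
    print(f"interval={interval}s wait={wait} exceeds_purge_interval={exceeds}")
```

For an interval of 6 the wait is 6 minutes; for 6000 it is 4 days and 4 hours, which is longer than the 3 day purge interval.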

      Granted, we have seen some customers who do have it set to 6000 seconds, and nothing bad has happened thus far.

      https://couchbase.slack.com/archives/C038LN6MZ35/p1648489445603079?thread_ts=1648231528.847189&cid=C038LN6MZ35

       

      This MB is filed to see whether we can add a validity check for this value. A good goal would be to set a maximum (currently MaxInt32) such that the monitor's total wait does not exceed the purge interval. The hope is that as customers upgrade to a later version of CB Server, this value can be checked and automatically brought back in line as part of the upgrade.
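
One possible shape for such a check is sketched below. This is not a proposed patch: the helper names are hypothetical, and both clamping to the largest safe value (rather than resetting to the default of 6) and falling back to the default for non-positive values are assumptions the ticket does not decide.

```python
STUCK_INTERVALS = 60              # intervals before the "Xmem is stuck" error fires
MAX_INT32 = 2**31 - 1             # current upper bound per the description
DEFAULT_INTERVAL_SECONDS = 6
DEFAULT_PURGE_SECONDS = 3 * 24 * 3600


def max_safe_interval(purge_interval_seconds: int) -> int:
    """Largest interval for which interval * 60 stays strictly below the purge interval."""
    return max(1, (purge_interval_seconds - 1) // STUCK_INTERVALS)


def normalize_interval(value: int, purge_interval_seconds: int = DEFAULT_PURGE_SECONDS) -> int:
    """Return a value that keeps the stuck monitor inside the purge interval."""
    if value < 1:
        # Assumption: non-positive values fall back to the default.
        return DEFAULT_INTERVAL_SECONDS
    upper = min(MAX_INT32, max_safe_interval(purge_interval_seconds))
    # Assumption: out-of-range values are clamped rather than rejected.
    return min(value, upper)
```

With the default 3 day purge interval, this sketch would leave 6 unchanged and bring 6000 down to 4319 seconds, so that 60 intervals add up to just under 3 days.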


            People

              neil.huang Neil Huang
              neil.huang Neil Huang