Description
Background
The XDCR internal setting "XmemSelfMonitorInterval" is used to detect whether the XMEM (outgoing) nozzle is stuck. The monitor keeps a counter: every time an "interval" elapses (i.e. every 6 sec), the counter increments if the nozzle hasn't moved any queued data.
The original design is that the interval fires every 6 seconds.
After 60 such intervals with no data movement (and no connection issues), meaning the same data that was queued to be sent still has not been sent, a monitor fires an "Xmem is stuck" error and forces the pipeline to restart.
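The counter logic described above can be sketched roughly as follows. This is a hypothetical model, not the actual XDCR code: the class name, method name, and the choice to reset the counter whenever data moves are all assumptions made for illustration.

```python
from dataclasses import dataclass

# From the description: 60 intervals without data movement trigger the error.
MAX_IDLE_INTERVALS = 60


@dataclass
class StuckMonitor:
    """Hypothetical model of the self-monitor counter (not the XDCR implementation)."""

    interval_sec: int = 6   # value of XmemSelfMonitorInterval, in seconds
    idle_count: int = 0     # intervals seen so far with queued data but no movement
    last_sent: int = 0      # total items sent as of the previous interval

    def on_interval(self, queued: int, sent_total: int) -> bool:
        """Call once per interval; returns True when the nozzle is deemed stuck."""
        if queued > 0 and sent_total == self.last_sent:
            # Data is waiting but nothing was sent since the last check.
            self.idle_count += 1
        else:
            # Assumption: any progress (or an empty queue) resets the counter.
            self.idle_count = 0
        self.last_sent = sent_total
        return self.idle_count >= MAX_IDLE_INTERVALS
```

With the default 6 second interval, a nozzle that never sends would be flagged on the 60th check, i.e. after about 6 minutes.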
Back in 6.5.0, an MB was fixed:
https://issues.couchbase.com/browse/MB-31762
That MB was a simple change: it switched the unit of one of XDCR's internal settings, "XmemSelfMonitorInterval", from milliseconds to seconds.
By default, customer environments should have the value "6" for this setting. With MB-31762, this means the interval is now correctly set to 6 seconds.
However, it is likely that some customers, prior to MB-31762, saw the issue and set the value to "6000" to match the millisecond unit. With MB-31762 fixed, these customers' environments now have the setting at 6000 seconds.
This means that for these customers, the monitor will fire only after (6000 seconds * 60 times) = 360,000 seconds, which is 100 hours, or 4 days and 4 hours.
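As a quick check of the arithmetic, the time until the monitor fires can be computed for both values. The function name below is made up for illustration; the 60-interval threshold is the one given in the description.

```python
from datetime import timedelta


def time_to_fire(interval_sec: int, idle_intervals: int = 60) -> timedelta:
    """Time a nozzle must be stuck before the monitor fires."""
    return timedelta(seconds=interval_sec * idle_intervals)


print(time_to_fire(6))     # 0:06:00
print(time_to_fire(6000))  # 4 days, 4:00:00
```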
Given that the purge interval is 3 days by default, if a pipeline is stuck for 4 days and 4 hours and then finally restarts, there's a good chance that it hits the "purge seqno > resumeSeqno" issue and experiences a rollback to 0. We don't want that.
Granted, we have seen some customers who do have it set to 6000 seconds, and nothing bad has happened thus far.
This MB is filed to see if we can do a validity check for this value. A good goal would be to enforce a maximum value (currently the maximum is MaxInt32) so that the time the monitor takes to fire doesn't exceed the purge interval. The hope is that as customers upgrade to a later version of CB Server, this value can be checked and automatically brought back in line as part of the upgrade.
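One possible shape for such a check is sketched below. It is only an illustration of the idea, not a proposed patch: the function names are hypothetical, and clamping the value (rather than rejecting it or resetting it to the default) is an assumption about the desired behavior.

```python
# Defaults taken from the description above.
DEFAULT_PURGE_INTERVAL_SEC = 3 * 24 * 60 * 60  # 3 days
IDLE_INTERVALS = 60                            # intervals before the monitor fires


def max_monitor_interval(purge_interval_sec: int = DEFAULT_PURGE_INTERVAL_SEC,
                         idle_intervals: int = IDLE_INTERVALS) -> int:
    """Largest interval (sec) whose total time to fire stays below the purge interval."""
    return (purge_interval_sec - 1) // idle_intervals


def validate_monitor_interval(value_sec: int,
                              purge_interval_sec: int = DEFAULT_PURGE_INTERVAL_SEC) -> int:
    """Hypothetical upgrade-time check: clamp oversized values, keep valid ones."""
    if value_sec < 1:
        raise ValueError("monitor interval must be at least 1 second")
    return min(value_sec, max_monitor_interval(purge_interval_sec))


print(validate_monitor_interval(6))     # 6
print(validate_monitor_interval(6000))  # 4319
```

With the default 3 day purge interval, a legacy value of 6000 would be brought down to 4319 seconds, so the monitor fires just under the 3 day mark instead of after 4 days and 4 hours.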
Issue Links
- is caused by
  - MB-31762 Xmem self monitoring interval is too short (Closed)