Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
Cheshire-Cat
-
None
-
Untriaged
-
1
-
Unknown
Description
In Cheshire Cat the following function has a syntax error in the timer callback (I have a source bucket with 10M docs and only make about 19.5K timers) oddly the code editor is showing the error count promptly at the problematic line - but the main statistic doesn't show the count for minutes - yet my timer fires as soon as possible!
function TimerCallback(context) {
|
// intentional error pegging as a failure is too slow
|
e = 10;
|
// correct syntax
|
// var e = 10;
|
}
|
function OnUpdate(doc, meta) {
|
// only process 1/512 or 2 vBuckets 0.19% of 10M or 19.5 k timers
|
if (meta.vb && ((meta.vb % 512) !== 0)) return;
|
|
// intentional error pegging as a failure is too slow
|
// e = 10;
|
// correct syntax
|
// var e = 10;
|
|
createTimer(TimerCallback, new Date(), meta.id, {id: meta.id, doc: doc});
|
}
|
If the syntax error is as shown in the code block above my failure: statistic across 10M incorrectly stays at 0 - but I can View the code and hover over the "caution" symbol where the failure occurs and se he count incrementing each time I open the code editor.
Deployment | Statistics | ||
---|---|---|---|
Success | 9.98M | backlog | 3066 |
Timeout | 0 | failure | 0 |
However if I wait a few more minutes the failure: statistic will correctly display 19572 - this is delayed by too much as the timer was firing from now, i.e. new Date(). I would expect a 7-14 second delay not several minutes for the failure: statistic to peg. We really want more prompt feedback that something is failing.
Deployment | Statistics | ||
---|---|---|---|
Success | 9.98M | backlog | 3066 |
Timeout | 0 | failure | 19572 |
If the syntax error is in the OnUpdate block it works as expected uncomment "e = 10;" and my failure: statistic across 10M rows is about 0.19% of the bucket or ~ 19572. However I do end up with a "backlog" of 3066 when the backlog should be 0.
Hi Jon Strabala, Jeelan Poola:
1.
Eventing was already incrementing a counter (timer_callback_failure) for the case where a timer callback invocation threw an exception. ns_server periodically polls eventing, creates some synthetic / derived counters for display using the stats returned by eventing, and displays them in the UI.
I've added timer_callback_failure value in the computation of the derived value (failure count) that gets displayed in the UI. With this change, failures in timer callback causes the failure count to show an appropriate value.
I tried out the case where 100% of OnUpdate succeeded and 100% of TimerCallback failed. The UI then shows success as 10K (total # of mutations reflecting OnUpdate success) and failure as 10K as well (reflecting TimerCallback failure). This looks weird, but is the correct representation of what happened.
2.
To complete this picture accurately, though, we should have an addition counter to track successful timer callback invocations. At present, we maintain counters for successful invocations of OnUpdate and OnDelete, but not one for TimerCallback invocations.
In my opinion, the success count should reflect successful JS function invocations, regardless of the specific function invoked, and hence, should include TimerCallback invocations as well. Adding just the TimerCallback failures to failure_count is not sufficient.
If we do this change, then:
The sum total of success + failure would be 20K to reflect 20K JS Function invocations.
Thoughts?
3.
There is a lag in the UI as to when the UI displays the counters. This is in the ns_server code / browser side - things do get updated, but it is not on a very predictable time. Not sure if this is a bug, though.
4.
As you would know, the backlog is totally unrelated to the timer callback failures. Backlog comes into picture only for mutations processed from DCP stream resulting in invocations of OnUpdate or OnDelete depending on what the mutation was. Timer invocations are not from the DCP and do not cause the DCP backlog to either increase or decrease. I am unable to see any issue with the backlog computation in my setup. Regardless, we should remove the backlog issue from this bug, file a separate one for that if needed - given with Collections we may not be able to support backlog anyways, it needs to be prioritised accordingly.
Thanks!