Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-43914

Errors timer callback does not promptly peg the failure statistic and also leaves non-zero backlog

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • Cheshire-Cat
    • 7.0.0
    • documentation, eventing
    • None
    • Untriaged
    • 1
    • Unknown

    Description

      In Cheshire Cat the following function has a syntax error in the timer callback (I have a source bucket with 10M docs and only make about 19.5K timers) oddly the code editor is showing the error count promptly at the problematic line - but the main statistic doesn't show the count for minutes - yet my timer fires as soon as possible!

      function TimerCallback(context) {
       // intentional error pegging as a failure is too slow
       e = 10;
       // correct syntax
       // var e = 10;
      }
      function OnUpdate(doc, meta) {
       // only process 1/512 or 2 vBuckets 0.19% of 10M or 19.5 k timers  
       if (meta.vb && ((meta.vb % 512) !== 0)) return; 
       
       // intentional error pegging as a failure is too slow
       // e = 10;
       // correct syntax
       // var e = 10;
       
       createTimer(TimerCallback, new Date(), meta.id, {id: meta.id, doc: doc}); 
      }
      

      If the syntax error is as shown in the code block above my failure: statistic across 10M incorrectly stays at 0 - but I can View the code and hover over the "caution" symbol where the failure occurs and se he count incrementing each time I open the code editor. 

      Deployment Statistics    
      Success 9.98M backlog 3066
      Timeout 0 failure 0

      However if I wait a few more minutes the failure: statistic will correctly display 19572 - this is delayed by too much as the timer was firing from now, i.e. new Date().  I would expect a 7-14 second delay not several minutes for the failure: statistic to peg.  We really want more prompt feedback that something is failing.

      Deployment Statistics    
      Success 9.98M backlog 3066
      Timeout 0 failure 19572

      If the syntax error is in the OnUpdate block it works as expected  uncomment "e = 10;" and my failure: statistic across 10M rows is about 0.19% of the bucket or ~ 19572.  However I do end up with a "backlog" of 3066 when the backlog should be 0.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          Hi Jon Strabala, Jeelan Poola:

          1.
          Eventing was already incrementing a counter (timer_callback_failure) for the case where a timer callback invocation threw an exception. ns_server periodically polls eventing, creates some synthetic / derived counters for display using the stats returned by eventing, and displays them in the UI.

          I've added timer_callback_failure value in the computation of the derived value (failure count) that gets displayed in the UI. With this change, failures in timer callback causes the failure count to show an appropriate value.

          I tried out the case where 100% of OnUpdate succeeded and 100% of TimerCallback failed. The UI then shows success as 10K (total # of mutations reflecting OnUpdate success) and failure as 10K as well (reflecting TimerCallback failure). This looks weird, but is the correct representation of what happened.

          2.
          To complete this picture accurately, though, we should have an addition counter to track successful timer callback invocations. At present, we maintain counters for successful invocations of OnUpdate and OnDelete, but not one for TimerCallback invocations.

          In my opinion, the success count should reflect successful JS function invocations, regardless of the specific function invoked, and hence, should include TimerCallback invocations as well. Adding just the TimerCallback failures to failure_count is not sufficient.

          If we do this change, then:

          • Upon processing the DCP stream in my example, the success count would show 10K. Upon failed invocations of the timer callback, the failure counter would show 10K.
          • Alternatively, if the timer callback did not fail, the success count would show 20K, and failure count would show 0.

          The sum total of success + failure would be 20K to reflect 20K JS Function invocations.

          Thoughts?

          3.
          There is a lag in the UI as to when the UI displays the counters. This is in the ns_server code / browser side - things do get updated, but it is not on a very predictable time. Not sure if this is a bug, though.

          4.
          As you would know, the backlog is totally unrelated to the timer callback failures. Backlog comes into picture only for mutations processed from DCP stream resulting in invocations of OnUpdate or OnDelete depending on what the mutation was. Timer invocations are not from the DCP and do not cause the DCP backlog to either increase or decrease. I am unable to see any issue with the backlog computation in my setup. Regardless, we should remove the backlog issue from this bug, file a separate one for that if needed - given with Collections we may not be able to support backlog anyways, it needs to be prioritised accordingly.

          Thanks!

          srinivasan.raman Srinivasan Raman added a comment - Hi Jon Strabala , Jeelan Poola : 1. Eventing was already incrementing a counter (timer_callback_failure) for the case where a timer callback invocation threw an exception. ns_server periodically polls eventing, creates some synthetic / derived counters for display using the stats returned by eventing, and displays them in the UI. I've added timer_callback_failure value in the computation of the derived value (failure count) that gets displayed in the UI. With this change, failures in timer callback causes the failure count to show an appropriate value. I tried out the case where 100% of OnUpdate succeeded and 100% of TimerCallback failed. The UI then shows success as 10K (total # of mutations reflecting OnUpdate success) and failure as 10K as well (reflecting TimerCallback failure). This looks weird, but is the correct representation of what happened. 2. To complete this picture accurately, though, we should have an addition counter to track successful timer callback invocations. At present, we maintain counters for successful invocations of OnUpdate and OnDelete, but not one for TimerCallback invocations. In my opinion, the success count should reflect successful JS function invocations, regardless of the specific function invoked, and hence, should include TimerCallback invocations as well. Adding just the TimerCallback failures to failure_count is not sufficient. If we do this change, then: Upon processing the DCP stream in my example, the success count would show 10K. Upon failed invocations of the timer callback, the failure counter would show 10K. Alternatively, if the timer callback did not fail, the success count would show 20K, and failure count would show 0. The sum total of success + failure would be 20K to reflect 20K JS Function invocations. Thoughts? 3. There is a lag in the UI as to when the UI displays the counters. This is in the ns_server code / browser side - things do get updated, but it is not on a very predictable time. Not sure if this is a bug, though. 4. As you would know, the backlog is totally unrelated to the timer callback failures. Backlog comes into picture only for mutations processed from DCP stream resulting in invocations of OnUpdate or OnDelete depending on what the mutation was. Timer invocations are not from the DCP and do not cause the DCP backlog to either increase or decrease. I am unable to see any issue with the backlog computation in my setup. Regardless, we should remove the backlog issue from this bug, file a separate one for that if needed - given with Collections we may not be able to support backlog anyways, it needs to be prioritised accordingly. Thanks!
          srinivasan.raman Srinivasan Raman added a comment - - edited

          Based on the discussions, the following are taken care of in this MB:

          • Add new counter timer_callback_success and expose via stats endpoints.
          • Updated ns-server to incorporate this new metric in relevant derived metrics.
          • Cleaned up text descriptions displayed in various Admin UI pages.
          srinivasan.raman Srinivasan Raman added a comment - - edited Based on the discussions, the following are taken care of in this MB: Add new counter timer_callback_success and expose via stats endpoints. Updated ns-server to incorporate this new metric in relevant derived metrics. Cleaned up text descriptions displayed in various Admin UI pages.

          Build couchbase-server-7.0.0-4358 contains ns_server commit fd87358 with commit message:
          MB-43914: Add metric to track successful timer callbacks

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-4358 contains ns_server commit fd87358 with commit message: MB-43914 : Add metric to track successful timer callbacks

          Build couchbase-server-7.0.0-4360 contains eventing commit 26a0023 with commit message:
          MB-43914: Add metric to track successful timer callbacks

          build-team Couchbase Build Team added a comment - Build couchbase-server-7.0.0-4360 contains eventing commit 26a0023 with commit message: MB-43914 : Add metric to track successful timer callbacks
          sujay.gad Sujay Gad added a comment -

          Verified using 7.0.0-4960
          Timer callback success field is added.

             "timer_callback_failure": 0,
             "timer_callback_success": 214842
          

          sujay.gad Sujay Gad added a comment - Verified using 7.0.0-4960 Timer callback success field is added. "timer_callback_failure" : 0 , "timer_callback_success" : 214842

          People

            chanabasappa.ghali Chanabasappa Ghali
            jon.strabala Jon Strabala
            Votes:
            0 Vote for this issue
            Watchers:
            4 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty