Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-45125

SDK KV Write operations may timeout with default timeout of 2.5secs

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Not a Bug
    • Cheshire-Cat
    • 7.0.0
    • clients
    • 7.0.0-4342
    • Untriaged
    • 1
    • Unknown

    Description

      We are seeing the following error in most of the timer based tests - 

      eventing-consumer [worker_perf-test1_2:/tmp/127.0.0.1:8091_2_4121979234.sock:194805] Timer Creation failed with message: Set Timer Failed with KV Error 2021-02-01T07:18:15.326-08:00 [Info] eventing-consumer [worker_perf-test1_2:/tmp/127.0.0.1:8091_2_4121979234.sock:194805] file : /home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/eventing/v8_consumer/src/v8worker.cc line : 870 caller : SendUpdate : Returning empty valueError executing on_doc_update 2021-02-01T07:18:15.327-08:00 [Info] eventing-consumer [worker_perf-test1_21:/tmp/127.0.0.1:8091_21_4121979234.sock:195113] Exception: { "message": { "code": 1, "desc": "LCB_ERR_TIMEOUT (201): The request was not completed by the user-defined timeout", "name": "LCB_ERR_TIMEOUT" }, "stack": "Error\n at OnUpdate (perf-test1.js:6:5)"} Location: perf-test1.js:6 Code: createTimer(timerCallback, fireAt, meta.id, context); Stack: Error at OnUpdate (perf-test1.js:6:5) 2021-02-01T07:18:15.327-08:00 [Info] eventing-consumer [worker_perf-test1_2:/tmp/127.0.0.1:8091_2_4121979234.sock:194805] Exception: { "message": { "code": 1, "desc": "LCB_ERR_TIMEOUT (201): The request was not completed by the user-defined timeout", "name": "LCB_ERR_TIMEOUT" }, "stack": "Error\n at OnUpdate (perf-test1.js:6:5)"} Location: perf-test1.js:6 Code: createTimer(timerCallback, fireAt, meta.id, context); Stack: Error at OnUpdate (perf-test1.js:6:5) 2021-02-01T07:18:18.129-08:00 [Info] DCPT[eventing:nTq05Ae8-4:{eventing:nTq05Ae8-3:worker_perf-test1_0_172.23.96.16:11210_172.23.97.177:8096}/0] DCP-socket -> eventing blocked 100.733852ms (0.899432%)   
      

      Logs for the job are attached in the test here : http://perf.jenkins.couchbase.com/job/themis/8575/console  

      Eventing Logs : https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-themis-8575/172.23.97.177.zip 

      "timer_create_failure": 1"timer_create_counter": 49999999
      

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            Bulk closing non-fixed defects. Feel free to reopen if necessary.

            mihir.kamdar Mihir Kamdar (Inactive) added a comment - Bulk closing non-fixed defects. Feel free to reopen if necessary.

            Jeelan Poola: Personally, I consider this pretty serious and I agree with you that some users are likely to see this when rebalance upgrading.

            I don't think there is anything we should do on the SDKs to compensate for the defect in MB-38978 though. There would be unintended consequences. Also, we have a large amount of documentation about handling timeouts and any app developer needs to already handle timeouts. They are sort of a fact of life when working with a distributed system that you're using for operational (a.k.a. expected to be interactive and fast) purposes. You wouldn't expect this to happen during routine maintenance, such as gathering telemetry for support. But that's why it's a bug to be fixed.

            I'll take two actions here:
            1) Open a bug to release note this on Couchbase Server, since that's where the issue is and the tool that triggers this and
            2) Comment over on MB-38978 and chat with Daniel Owen

            ingenthr Matt Ingenthron added a comment - Jeelan Poola : Personally, I consider this pretty serious and I agree with you that some users are likely to see this when rebalance upgrading. I don't think there is anything we should do on the SDKs to compensate for the defect in MB-38978 though. There would be unintended consequences. Also, we have a large amount of documentation about handling timeouts and any app developer needs to already handle timeouts. They are sort of a fact of life when working with a distributed system that you're using for operational (a.k.a. expected to be interactive and fast) purposes. You wouldn't expect this to happen during routine maintenance, such as gathering telemetry for support. But that's why it's a bug to be fixed. I'll take two actions here: 1) Open a bug to release note this on Couchbase Server, since that's where the issue is and the tool that triggers this and 2) Comment over on MB-38978 and chat with Daniel Owen
            jeelan.poola Jeelan Poola added a comment - - edited

            Matt Ingenthron At a minimum, IMO, this should be release noted from SDK perspective so that customers who use default timeout of 2.5secs are not caught unawares when they upgrade to 7.0 and their KV writes start to timeout sporadically. Not sure if SDK can compensate somehow for MB-38978 (in eventing we had to increase default timeout to 5secs). Depending on how serious SDK team considers this problem, you may want to request expediting fix for MB-38978. I logged this so we track impact of MB-38978 on SDK/Customer applications.

            jeelan.poola Jeelan Poola added a comment - - edited Matt Ingenthron At a minimum, IMO, this should be release noted from SDK perspective so that customers who use default timeout of 2.5secs are not caught unawares when they upgrade to 7.0 and their KV writes start to timeout sporadically. Not sure if SDK can compensate somehow for MB-38978 (in eventing we had to increase default timeout to 5secs). Depending on how serious SDK team considers this problem, you may want to request expediting fix for MB-38978 . I logged this so we track impact of MB-38978 on SDK/Customer applications.

            Hey Jeelan Poola, in this case, the timeout is because the underlying service is not responding, correct? I might be missing it, but what do you see as the defect in the client in this case? What is it the client should be doing differently?

            ingenthr Matt Ingenthron added a comment - Hey Jeelan Poola , in this case, the timeout is because the underlying service is not responding, correct? I might be missing it, but what do you see as the defect in the client in this case? What is it the client should be doing differently?
            jeelan.poola Jeelan Poola added a comment - - edited

            This is a spin-off from MB-44024 where it was observed that some lcb kv write ops were timing out with LCB_ERR_TIMEOUT with default timeout configuration of 2.5secs due to MB-38978 with same test & testbed as that off 6.6.x. This potentially will affect existing customer deployments upgrading to 7.0.

            Cc Michael Nitschinger

            jeelan.poola Jeelan Poola added a comment - - edited This is a spin-off from MB-44024 where it was observed that some lcb kv write ops were timing out with LCB_ERR_TIMEOUT with default timeout configuration of 2.5secs due to MB-38978 with same test & testbed as that off 6.6.x. This potentially will affect existing customer deployments upgrading to 7.0. Cc Michael Nitschinger

            People

              ingenthr Matt Ingenthron
              jeelan.poola Jeelan Poola
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty