Couchbase Java Client / JCBC-896

Default backoff policy should backoff much more aggressively

Details

    • Improvement
    • Resolution: Duplicate
    • Blocker
    • 2.3.0
    • None
    • None
    • Security Level: Public

    Description

      Recently we've seen a number of issues that appear to have both a server-side and a client-side component: the server starts returning NOT_MY_VBUCKET responses, and the client retry policy seems to exacerbate the situation by overwhelming the network.

      I believe we need to back off much more aggressively than we do now. Currently we back off up to a maximum of 100 ms, which after a couple of seconds results in 10 retries per second for every failing request. If the error condition persists on the server for some time (which we've seen), the server's network interface is quickly overwhelmed. Essentially, what starts off as an exponential backoff quickly becomes a fixed-delay backoff of 100 ms.
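The arithmetic behind the "10 retries per second" figure can be sketched as follows. Only the 100 ms cap comes from this ticket; the 1 ms initial delay, the doubling factor, and the class and method names are assumptions made for illustration and are not taken from the client's code.

```java
// Illustrative sketch only: models a capped exponential backoff and counts
// how many retries a single failing request issues in a given time window.
final class CappedBackoffSketch {

    // Delay before retry number `attempt` (0-based): initial * 2^attempt, capped.
    // The doubling factor is an assumption; the ticket only states the 100 ms cap.
    static long delayMicros(int attempt, long initialMicros, long capMicros) {
        long delay = initialMicros;
        for (int i = 0; i < attempt && delay < capMicros; i++) {
            delay *= 2;
        }
        return Math.min(delay, capMicros);
    }

    // Counts retries whose scheduled time falls in [fromMicros, toMicros).
    static int retriesInWindow(long initialMicros, long capMicros,
                               long fromMicros, long toMicros) {
        long clock = 0;
        int count = 0;
        for (int attempt = 0; ; attempt++) {
            clock += delayMicros(attempt, initialMicros, capMicros);
            if (clock >= toMicros) {
                return count;
            }
            if (clock >= fromMicros) {
                count++;
            }
        }
    }

    public static void main(String[] args) {
        // Assumed 1 ms initial delay, 100 ms cap, window from 2 s to 3 s.
        int retries = retriesInWindow(1_000L, 100_000L, 2_000_000L, 3_000_000L);
        System.out.println("Retries in the third second: " + retries);
    }
}
```

Under these assumptions the cap is reached after seven retries (about 127 ms in), so every later second contains a fixed 10 retries per failing request, regardless of how long the error persists.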

      In the attached screenshots, the bytes written by ep-engine quickly race to 240 MB/second. This is an assumption on my part, but I imagine this likely overwhelms the NIC. The second graph shows the same nodes racing to 14k NOT_MY_VBUCKET (NMVB) responses per second.

      I suggest the following values:

      • 10 microsecond initial value
      • 10x multiplier each time (so the pattern looks like: 100 µs, 1 ms, 10 ms, 100 ms, 1 s, etc.)
      • no max value
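A minimal sketch of the suggested schedule, to show how quickly the retry rate falls off. The class and method names are hypothetical. The sketch assumes the 10 µs initial value is itself the first delay, which the ticket does not state unambiguously (the example pattern starts at 100 µs). The saturation at `Long.MAX_VALUE` is only an overflow guard, not a configured maximum.

```java
// Illustrative sketch only: uncapped exponential backoff with a 10 us
// initial delay and a 10x multiplier, as suggested in this ticket.
final class ProposedBackoffSketch {

    static final long INITIAL_MICROS = 10; // suggested initial value
    static final long MULTIPLIER = 10;     // suggested multiplier

    // Delay before retry number `attempt` (0-based): 10 us * 10^attempt.
    static long delayMicros(int attempt) {
        long delay = INITIAL_MICROS;
        for (int i = 0; i < attempt; i++) {
            if (delay > Long.MAX_VALUE / MULTIPLIER) {
                return Long.MAX_VALUE; // overflow guard, not a policy cap
            }
            delay *= MULTIPLIER;
        }
        return delay;
    }

    // Number of retries a single failing request issues within the window.
    static int retriesWithin(long windowMicros) {
        long clock = 0;
        int count = 0;
        for (int attempt = 0; ; attempt++) {
            long delay = delayMicros(attempt);
            if (delay > windowMicros - clock) {
                return count;
            }
            clock += delay;
            count++;
        }
    }

    public static void main(String[] args) {
        for (int attempt = 0; attempt < 6; attempt++) {
            System.out.println("retry " + attempt + ": " + delayMicros(attempt) + " us");
        }
        System.out.println("retries in first minute: " + retriesWithin(60_000_000L));
    }
}
```

With these values a failing request retries only a handful of times in its first minute, compared with roughly 10 per second under a 100 ms cap. The trade-off is that recovery after a long outage can be noticed late, since the next retry may be minutes away.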

      Of course, we could also solve this by implementing some kind of "retry aggregation" in the client: instead of retrying every request, we retry only one command type against one vbucket until things improve, and then we retry all the pending requests of the same command against the same vbucket. This is more complex, but under this type of scheme the backoff wouldn't always have to be exponential or so aggressive.
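One way the "retry aggregation" idea could look, as a rough sketch rather than a design for the client. All names here (`RetryAggregatorSketch`, `onFailure`, `onProbeSuccess`) are hypothetical, requests are represented by plain string IDs, and the sketch ignores timeouts and probe failures, which a real implementation would have to handle.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

// Illustrative sketch only: one "probe" request per (command, vbucket) pair is
// retried; other failing requests for the same pair are parked until it succeeds.
final class RetryAggregatorSketch {

    // Parked request IDs, keyed by "command/vbucket". A key being present
    // means a probe for that pair is already in flight.
    private final Map<String, Queue<String>> parked = new HashMap<>();

    private static String key(String command, int vbucket) {
        return command + "/" + vbucket;
    }

    // Called when a request fails with NOT_MY_VBUCKET.
    // Returns true if the caller should retry it as the probe,
    // false if the request was parked behind an existing probe.
    synchronized boolean onFailure(String command, int vbucket, String requestId) {
        String k = key(command, vbucket);
        Queue<String> waiting = parked.get(k);
        if (waiting == null) {
            parked.put(k, new ArrayDeque<>());
            return true;
        }
        waiting.add(requestId);
        return false;
    }

    // Called when the probe succeeds. Returns the parked requests, in arrival
    // order, which the caller should now retry.
    synchronized List<String> onProbeSuccess(String command, int vbucket) {
        Queue<String> waiting = parked.remove(key(command, vbucket));
        return waiting == null ? new ArrayList<>() : new ArrayList<>(waiting);
    }
}
```

In this scheme the number of retries on the wire scales with the number of distinct command and vbucket pairs that are failing, not with the number of failing requests, which is why the backoff for the probe could be gentler.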

      And of course I think we actually need this default across all the client implementations, but let's start the conversation here.

      I think this improvement is needed pretty urgently, which is why I'm proposing it for the next release of the Java client.

      Note that I think there is an issue on the server side, which is perhaps MB-12268 or an unfiled issue coming out of the current investigations. However, when things go haywire on the server side, the client should play its part in not making the situation worse.

            People

              daschl Michael Nitschinger
              dfinlay Dave Finlay