Uploaded image for project: 'Couchbase .NET client library'
  1. Couchbase .NET client library
  2. NCBC-1086

GetAndLock not returning Locked status but timed out when doc is locked

    XMLWordPrintable

Details

    • Bug
    • Status: Resolved
    • Major
    • Resolution: Fixed
    • 2.2.5
    • 2.2.6
    • docs
    • .Net with Couchbase 4.1

    Description

      Hi,
      The Bug is in the GetAndLock mechanism that return op time out instread the proper error and in a propper time (it's about 4sec).

      This is the code below:
      var hosts = new List<Uri>

      { new Uri("http://Rois-MacBook-Pro.local:8091/pools") }

      ;
      var config = new ClientConfiguration()

      { Servers = hosts }

      ;
      var cluster = new Cluster(config);

      var bucket = cluster.OpenBucket("workshop","test");

      var sw = new Stopwatch();
      sw.Start();
      var res1 = bucket.GetWithLock<string>("roikatz",30);
      sw.Stop();
      Console.WriteLine("first op:

      {0}, result: {1}", sw.ElapsedMilliseconds, res1.Status);
      var sw1 = new Stopwatch();
      sw1.Start();
      var res2 = bucket.GetWithLock<string>("roikatz", 30);
      sw1.Stop();
      Console.WriteLine("second op: {0}

      , result

      {1}

      ", sw1.ElapsedMilliseconds, res2.Status);

      And that is the output:
      First op 8ms, result Success
      Second op 4110ms, result OperationTimeout

      In java this code return exception in about 2ms.
      I suspect that is a bug in the SDK.

      Thanks,
      Roi

      Attachments

        For Gerrit Dashboard: NCBC-1086
        # Subject Branch Project Status CR V

        Activity

          jmorris Jeff Morris added a comment -

          Thanks Roi, I'll look into this today.

          jmorris Jeff Morris added a comment - Thanks Roi, I'll look into this today.
          jmorris Jeff Morris added a comment -

          Interesting...the server returns a status of temporary failure when the document is locked with a "LOCK_ERROR" as the message. I wonder why it was designed this way.

          jmorris Jeff Morris added a comment - Interesting...the server returns a status of temporary failure when the document is locked with a "LOCK_ERROR" as the message. I wonder why it was designed this way.
          roi.katz Roi Katz added a comment -

          So the server actually returned immediately but the SDK retried several times until backing off?
          does this fix actually fixing other possible bugs as well?

          roi.katz Roi Katz added a comment - So the server actually returned immediately but the SDK retried several times until backing off? does this fix actually fixing other possible bugs as well?
          jmorris Jeff Morris added a comment -

          The bug is actually on the server: temp fail is returned when the memcached process is initializing, not when you try to access a key that has been locked. I need to create an MB for that.

          does this fix actually fixing other possible bugs as well?

          I doubt it, this is a real corner case.

          -Jeff

          jmorris Jeff Morris added a comment - The bug is actually on the server: temp fail is returned when the memcached process is initializing, not when you try to access a key that has been locked. I need to create an MB for that. does this fix actually fixing other possible bugs as well? I doubt it, this is a real corner case. -Jeff
          roi.katz Roi Katz added a comment -

          So it was working "well" in other SDK's because they didn't retry the operation?

          roi.katz Roi Katz added a comment - So it was working "well" in other SDK's because they didn't retry the operation?
          jmorris Jeff Morris added a comment - - edited

          I cannot speak for the other SDK's (nor generalize), but based off the discussion I had with one of the server folks was that this specific response shouldn't be temp fail. Setting the retry flag to false, fixed this specific issue, however the binary memcached protocol (https://github.com/couchbase/memcached/blob/master/docs/BinaryProtocol.md) states that temp fails should be retried:

          A temporary failure is a problem that might go away by resending the operation. There is currently two status codes that may be returned for that:
           
          Busy - The server is too busy to handle your request, please back off
          Temporary failure - The server hit a problem that a retry might fix
          

          I think in this specific case, if the lock is held, the response status should be something like KEY_LOCKED (if it existed). In the code review, a reviewer indicated that temp failures shouldn't be retried, so I removed the logic , however based upon what's in the protocol, I'll likely revert that change . It still has to run through situational tests, so it might fail there anyways and have to be reverted.

          -Jeff

          jmorris Jeff Morris added a comment - - edited I cannot speak for the other SDK's (nor generalize), but based off the discussion I had with one of the server folks was that this specific response shouldn't be temp fail. Setting the retry flag to false, fixed this specific issue, however the binary memcached protocol ( https://github.com/couchbase/memcached/blob/master/docs/BinaryProtocol.md ) states that temp fails should be retried: A temporary failure is a problem that might go away by resending the operation. There is currently two status codes that may be returned for that:   Busy - The server is too busy to handle your request, please back off Temporary failure - The server hit a problem that a retry might fix I think in this specific case, if the lock is held, the response status should be something like KEY_LOCKED (if it existed). In the code review, a reviewer indicated that temp failures shouldn't be retried, so I removed the logic , however based upon what's in the protocol, I'll likely revert that change . It still has to run through situational tests, so it might fail there anyways and have to be reverted. -Jeff
          jmorris Jeff Morris added a comment -

          Per the 2.0 spec, the clients (SDKs) should not retry on temp_fail:

          Operations that yield a TMPFAIL from the server must not be repeated by the client. Instead, the temporary failure should be propagated to the user of the library. In most cases, TMPFAIL results indicate that the server is overloaded and this state should be propagated to the user rather than retried, which would simply contribute to the overload.

          Since the change has already been committed, there is nothing else really to do here.

          -Jeff

          jmorris Jeff Morris added a comment - Per the 2.0 spec, the clients (SDKs) should not retry on temp_fail: Operations that yield a TMPFAIL from the server must not be repeated by the client. Instead, the temporary failure should be propagated to the user of the library. In most cases, TMPFAIL results indicate that the server is overloaded and this state should be propagated to the user rather than retried, which would simply contribute to the overload. Since the change has already been committed, there is nothing else really to do here. -Jeff
          roi.katz Roi Katz added a comment -

          Working well now.
          Returning TemporaryFaliure - immediately (up to 1-2ms)

          roi.katz Roi Katz added a comment - Working well now. Returning TemporaryFaliure - immediately (up to 1-2ms)

          People

            jmorris Jeff Morris
            roi.katz Roi Katz
            Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty