Uploaded image for project: 'Couchbase .NET client library'
  1. Couchbase .NET client library
  2. NCBC-1393

async upsert hangs after removing entry point node

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 2.4.5
    • Component/s: None
    • Labels:
      None

      Description

      client version : couchbase-net-client 2.4.2

      server version : 4.5.1 GA

      Reproduction steps:

      1. run 2 or more cluster
      2. open bucket using one of the node host (entry point)
      3. keep upsert.
      4. while upsert, failover the entry point and leave it for more than 15 seconds.

      Expected : upsert should work with active nodes

      Current : upsert hangs

       

      Details : net client waits for lock which is not released by previous call to 

      https://github.com/couchbase/couchbase-net-client/blob/master/Src/Couchbase/Core/Server.cs#L414

       

      Here are two call stacks happened exactly 15 second after the other lock

      First lock right after node is failed over:
      >    Couchbase.NetClient.dll!Couchbase.Core.Server.CheckOnline(bool isDead) Line 416    C#
          Couchbase.NetClient.dll!Couchbase.IO.ConnectionPool<Couchbase.IO.MultiplexingConnection>.Release(Couchbase.IO.MultiplexingConnection connection) Line 214    C#
          Couchbase.NetClient.dll!Couchbase.IO.ConnectionPool<Couchbase.IO.MultiplexingConnection>.Couchbase.IO.IConnectionPool.Release(Couchbase.IO.IConnection connection) Line 315    C#
          Couchbase.NetClient.dll!Couchbase.IO.Services.MultiplexingIOService.CheckConnection() Line 378    C#
          Couchbase.NetClient.dll!Couchbase.IO.Services.MultiplexingIOService.ExecuteAsync<object>(Couchbase.IO.Operations.IOperation<object> operation, Couchbase.IO.IConnection connection) Line 197    C#
          Couchbase.NetClient.dll!Couchbase.IO.Services.MultiplexingIOService.ExecuteAsync<object>(Couchbase.IO.Operations.IOperation<object> operation) Line 209    C#
          Couchbase.NetClient.dll!Couchbase.Core.Server.SendAsync<object>(Couchbase.IO.Operations.IOperation<object> operation) Line 632    C#
          Couchbase.NetClient.dll!Couchbase.Core.Buckets.CouchbaseRequestExecuter.SendWithRetryAsync<object>(Couchbase.IO.Operations.IOperation<object> operation, System.Threading.Tasks.TaskCompletionSource<Couchbase.IOperationResult<object>> tcs, System.Threading.CancellationTokenSource cts) Line 740    C#
          Couchbase.NetClient.dll!Couchbase.Core.Buckets.CouchbaseRequestExecuter.SendWithDurabilityAsync<object>(Couchbase.IO.Operations.IOperation<object> operation, bool deletion, Couchbase.ReplicateTo replicateTo, Couchbase.PersistTo persistTo) Line 345    C#
          Couchbase.NetClient.dll!Couchbase.CouchbaseBucket.UpsertAsync<object>(string key, object value, ulong cas, uint expiration, Couchbase.ReplicateTo replicateTo, Couchbase.PersistTo persistTo) Line 3412    C#
          Couchbase.NetClient.dll!Couchbase.CouchbaseBucket.UpsertAsync<object>(string key, object value, Couchbase.ReplicateTo replicateTo, Couchbase.PersistTo persistTo) Line 3362    C#
          Sdkd.dll!Sdkd.Commands.SetCommand.ExecuteSingleAsync(System.Collections.Generic.Dictionary<string, string> batch) Line 67    C#
          Sdkd.dll!Sdkd.Commands.KVCommand.Run(int handleId) Line 42    C#
          SdkdConsole.exe!SdkdConsole.Daemon.Handle.DispatchCommand(Sdkd.Protocol.Request req, Couchbase.Core.IBucket cli) Line 271    C#
          SdkdConsole.exe!SdkdConsole.Daemon.Handle.Run() Line 141    C#
       
       
      Second lock after 15 seconds :
      Couchbase.NetClient.dll!Couchbase.Core.Server.CheckOnline(bool isDead) Line 416    C#
          Couchbase.NetClient.dll!Couchbase.IO.ConnectionPool<Couchbase.IO.MultiplexingConnection>.Release(Couchbase.IO.MultiplexingConnection connection) Line 214    C#
          Couchbase.NetClient.dll!Couchbase.IO.ConnectionPool<Couchbase.IO.MultiplexingConnection>.Couchbase.IO.IConnectionPool.Release(Couchbase.IO.IConnection connection) Line 315    C#
          Couchbase.NetClient.dll!Couchbase.IO.Services.MultiplexingIOService.CheckConnection() Line 393    C#
          Couchbase.NetClient.dll!Couchbase.IO.Services.MultiplexingIOService.ExecuteAsync<object>(Couchbase.IO.Operations.IOperation<object> operation, Couchbase.IO.IConnection connection) Line 197    C#
          Couchbase.NetClient.dll!Couchbase.IO.Services.MultiplexingIOService.ExecuteAsync<object>(Couchbase.IO.Operations.IOperation<object> operation) Line 209    C#
          Couchbase.NetClient.dll!Couchbase.Core.Server.SendAsync<object>(Couchbase.IO.Operations.IOperation<object> operation) Line 632    C#
          Couchbase.NetClient.dll!Couchbase.Core.Buckets.CouchbaseRequestExecuter.SendWithRetryAsync<object>(Couchbase.IO.Operations.IOperation<object> operation, System.Threading.Tasks.TaskCompletionSource<Couchbase.IOperationResult<object>> tcs, System.Threading.CancellationTokenSource cts) Line 740    C#
          Couchbase.NetClient.dll!Couchbase.Core.Buckets.CouchbaseRequestExecuter.SendWithDurabilityAsync<object>(Couchbase.IO.Operations.IOperation<object> operation, bool deletion, Couchbase.ReplicateTo replicateTo, Couchbase.PersistTo persistTo) Line 345    C#
          Couchbase.NetClient.dll!Couchbase.CouchbaseBucket.UpsertAsync<object>(string key, object value, ulong cas, uint expiration, Couchbase.ReplicateTo replicateTo, Couchbase.PersistTo persistTo) Line 3412    C#
          Couchbase.NetClient.dll!Couchbase.CouchbaseBucket.UpsertAsync<object>(string key, object value, Couchbase.ReplicateTo replicateTo, Couchbase.PersistTo persistTo) Line 3362    C#
          Sdkd.dll!Sdkd.Commands.SetCommand.ExecuteSingleAsync(System.Collections.Generic.Dictionary<string, string> batch) Line 67    C#
          Sdkd.dll!Sdkd.Commands.KVCommand.Run(int handleId) Line 42    C#
          SdkdConsole.exe!SdkdConsole.Daemon.Handle.DispatchCommand(Sdkd.Protocol.Request req, Couchbase.Core.IBucket cli) Line 271    C#
          SdkdConsole.exe!SdkdConsole.Daemon.Handle.Run() Line 141    C#
       

        Attachments

          Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

            Activity

            Hide
            jmorris Jeff Morris added a comment - - edited

            Jae -

            Can you point to the SDKD code that is executing the test? Thus far, it seems this is replicable only when executing asynchronous code synchronously:

            var result = bucket.UpsertAsync(key, val).Result;

            If i change my code to use await, it doesn't block:

            var result = await bucket.UpsertAsync(key, val);

            I am not sure that the blocking you are seeing is from the lock not being released, since threads would simply queue up waiting for their turn. Instead, I suspect that the Tasks themselves are never completing. I can see this by data being added to the bucket, but the Task itself never returns. 

            My test code was missing a few lines, so tasks are returning...however, since we are running synchronously, I suspect that the blocking happens when that code is executed (within the lock) and the operation must timeout before the next thread can enter the lock because its a failed over node. I'll keep looking...

            -Jeff

             

            Show
            jmorris Jeff Morris added a comment - - edited Jae - Can you point to the SDKD code that is executing the test? Thus far, it seems this is replicable only when executing asynchronous code synchronously: var result = bucket.UpsertAsync(key, val).Result; If i change my code to use await, it doesn't block: var result = await bucket.UpsertAsync(key, val); I am not sure that the blocking you are seeing is from the lock not being released, since threads would simply queue up waiting for their turn. Instead, I suspect that the Tasks themselves are never completing. I can see this by data being added to the bucket, but the Task itself never returns.   My test code was missing a few lines, so tasks are returning...however, since we are running synchronously, I suspect that the blocking happens when that code is executed (within the lock) and the operation must timeout before the next thread can enter the lock because its a failed over node. I'll keep looking... -Jeff  
            Hide
            jaekwon.park Jae Park [X] (Inactive) added a comment - - edited

            Here is the SDKD code

             

                    public override async void ExecuteSingleAsync(Dictionary<string, string> batch)
                    {
                        SetCommandOptions sco = (SetCommandOptions) Options;
                        var tasks = new List<Task<IOperationResult<Object>>>();
                        foreach (var item in batch)
                        {
                            tasks.Add(Client.UpsertAsync<Object>(item.Key, item.Value, sco.DurReplicateTo, sco.DurPersistTo));
                        }
                        var results = await Task.WhenAll(tasks);
                        int i = 0;
                        foreach (var item in batch)
                        {
                            opcoll.SubmitTaskResult(item.Key, item.Value, tasks[i].Result);
                            i++;
                        }
                    }
            

            Full source is

            https://github.com/couchbase/sdkd-net/blob/master/src/Sdkd/Commands/SetCommand.cs#L61

             

            Please note that the batch is 1, so the test adds one task and wait the batch (with 1 batch) to be done by 'await Task.WhenAll(tasks)'

            Also, note that the second hit to lock(
             Couchbase.NetClient.dll!Couchbase.IO.Services.MultiplexingIOService.CheckConnection() Line 393)
            happens exactly 15 seconds after first hit to the lock(
             Couchbase.NetClient.dll!Couchbase.IO.Services.MultiplexingIOService.CheckConnection() Line 378) right after entry point node failover. I'm wondering cause of this 15 seconds of interval would give some hint?

            Show
            jaekwon.park Jae Park [X] (Inactive) added a comment - - edited Here is the SDKD code   public override async void ExecuteSingleAsync(Dictionary<string, string> batch) { SetCommandOptions sco = (SetCommandOptions) Options; var tasks = new List<Task<IOperationResult<Object>>>(); foreach (var item in batch) { tasks.Add(Client.UpsertAsync<Object>(item.Key, item.Value, sco.DurReplicateTo, sco.DurPersistTo)); } var results = await Task.WhenAll(tasks); int i = 0 ; foreach (var item in batch) { opcoll.SubmitTaskResult(item.Key, item.Value, tasks[i].Result); i++; } } Full source is https://github.com/couchbase/sdkd-net/blob/master/src/Sdkd/Commands/SetCommand.cs#L61   Please note that the batch is 1, so the test adds one task and wait the batch (with 1 batch) to be done by 'await Task.WhenAll(tasks)' Also, note that the second hit to lock(  Couchbase.NetClient.dll!Couchbase.IO.Services.MultiplexingIOService.CheckConnection() Line 393) happens exactly 15 seconds after first hit to the lock(  Couchbase.NetClient.dll!Couchbase.IO.Services.MultiplexingIOService.CheckConnection() Line 378) right after entry point node failover. I'm wondering cause of this 15 seconds of interval would give some hint?
            Hide
            jmorris Jeff Morris added a comment -

            The 15 second interval matches the ClientConfiguration.SendTimeout default, so my guess is that operations are timing out and then entering the lock to check the status of the node. I doesn't appear to be deadlocking or anything here. This is expected when operations are mapped to a node that is down or otherwise non-functional.

            A couple of thoughts:

            • Perhaps there are queued up tasks and we are waiting for them to timeout?
            • Perhaps the client should be recognizing that the node is offline and providing a fail-fast mechanism for tasks that are mapped to it?
            Show
            jmorris Jeff Morris added a comment - The 15 second interval matches the ClientConfiguration.SendTimeout default, so my guess is that operations are timing out and then entering the lock to check the status of the node. I doesn't appear to be deadlocking or anything here. This is expected when operations are mapped to a node that is down or otherwise non-functional. A couple of thoughts: Perhaps there are queued up tasks and we are waiting for them to timeout? Perhaps the client should be recognizing that the node is offline and providing a fail-fast mechanism for tasks that are mapped to it?
            Hide
            jaekwon.park Jae Park [X] (Inactive) added a comment -

            Let me create a simple app so you can test against the cluster I have.

            Show
            jaekwon.park Jae Park [X] (Inactive) added a comment - Let me create a simple app so you can test against the cluster I have.
            Hide
            jmorris Jeff Morris added a comment - - edited

            Also, it might appear to block because the `ExecuteSingleAsync` method is be called synchronously in the `KVCommand.Run()`: https://github.com/couchbase/sdkd-net/blob/master/src/Sdkd/Commands/KVCommand.cs#L41

            The problem with async/await is that you need to do up/down the stack to work as expected. That being said, the node should go into a "node unavailable" status and fail fast any ops that are mapped to it if the config change hasn't happened yet.

            Show
            jmorris Jeff Morris added a comment - - edited Also, it might appear to block because the `ExecuteSingleAsync` method is be called synchronously in the `KVCommand.Run()`: https://github.com/couchbase/sdkd-net/blob/master/src/Sdkd/Commands/KVCommand.cs#L41 The problem with async/await is that you need to do up/down the stack to work as expected. That being said, the node should go into a "node unavailable" status and fail fast any ops that are mapped to it if the config change hasn't happened yet.
            Hide
            jaekwon.park Jae Park [X] (Inactive) added a comment -

            I have attached example app in the ticket which uses VM that I use for testing. you can download and run then try failover(graceful or hard doesn't matter) 172.23.123.176 , which is the enptry point node then you can see upsert stops. If you add break point at the lock(_syncObj) then you may see first lock passes when you hit step over, but second hit to lock step over hangs

            Show
            jaekwon.park Jae Park [X] (Inactive) added a comment - I have attached example app in the ticket which uses VM that I use for testing. you can download and run then try failover(graceful or hard doesn't matter) 172.23.123.176 , which is the enptry point node then you can see upsert stops. If you add break point at the lock(_syncObj) then you may see first lock passes when you hit step over, but second hit to lock step over hangs

              People

              • Assignee:
                jmorris Jeff Morris
                Reporter:
                jaekwon.park Jae Park [X] (Inactive)
              • Votes:
                0 Vote for this issue
                Watchers:
                2 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved:

                  Gerrit Reviews

                  There are no open Gerrit changes

                    PagerDuty

                    Error rendering 'com.pagerduty.jira-server-plugin:PagerDuty'. Please contact your Jira administrators.