Couchbase .NET client library / NCBC-3091

NRE GetDocumentFromReplicaAsync when EndPoint is null v3.2.X


Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 3.2.6
    • Fix Version/s: 3.2.8
    • None
    • None
    • 1

    Description

      From forums: https://forums.couchbase.com/t/net-v3-2-x-sdk-usage-of-getanyreplicaasync/32673

      We’re upgrading to the 3.2.x SDK and encountered an issue with the GetAnyReplicaAsync method (I didn’t check GetAllReplicaAsync).

      So this throws ArgumentNullException every time it’s called after an auto-failover event. This is the stack trace:

      StackTrace:
      Inner Exception:

      Value cannot be null. (Parameter 'endPoint')
      Type: System.ArgumentNullException
      Source: Couchbase.NetClient
      TargetSite: Void ThrowArgumentNullException(System.String)
      StackTrace:
         at Couchbase.Utils.ThrowHelper.ThrowArgumentNullException(String paramName)
         at Couchbase.CouchbaseBucket.SendAsync(IOperation op, CancellationTokenPair tokenPair)
         at Couchbase.Core.Retry.RetryOrchestrator.RetryAsync(BucketBase bucket, IOperation operation, CancellationTokenPair tokenPair)
         at Couchbase.KeyValue.CouchbaseCollection.GetReplica(String id, Int16 index, IRequestSpan span, CancellationToken cancellationToken, ITranscoderOverrideOptions options)
         at Couchbase.KeyValue.CouchbaseCollection.GetAnyReplicaAsync(String id, GetAnyReplicaOptions options)

      Note that this doesn’t occur before auto-failover, only after it. I think this is very similar to the related issue in the v2.7.x SDK: ".net 2.7.X NullReferenceException encountered on GetDocumentFromReplicaAsync<T>".


        Activity

          jmorris Jeff Morris added a comment -

          I believe this is a separate issue from https://issues.couchbase.com/browse/NCBC-3074. The NRE happens at two different locations in the code.

          eugeneshcherbo Eugene Shcherbo added a comment - - edited

          Some info from me, guys; I hope this helps to reproduce the issue.

          This is the minimal reproducible example (MRE) I’m using to test this. The SDK logs are in cb3-replica-reads-logs.txt, attached to the ticket.

          My environment

          • I set up a 4-node cluster on one host with Docker containers (the tutorial).
          • I have one empty bucket configured with 2 replicas
          • Auto-failover is configured to occur after 15 seconds

          Steps to reproduce

          1. With the web server from the MRE running, I generate load (1 request per second is sufficient).
             • timestamp in logs is 17:38:22
          2. After n seconds (any number can be chosen), I kill one of the Couchbase nodes (I killed the Docker container with the docker kill command).
             • timestamp in logs is about 17:38:37 (this timestamp might be inaccurate; it is the timestamp of the first timeout exception)
             • endpoint of the killed node, from the logs, is 172.16.101.14
          3. About 15 seconds after I killed the node, auto-failover occurred.
             • start timestamp is 17:38:50
             • finish timestamp is 17:38:53
          4. System.ArgumentNullException is thrown from Couchbase.KeyValue.CouchbaseCollection.GetAnyReplicaAsync(String id, GetAnyReplicaOptions options).
             • Line 5450 at 17:38:53
             • can be found either by timestamp or by searching for “Unhandled exception”.

          So to reproduce this, I just query the primary node with GetAsync`1 for a non-existent document and then, if any error occurs, I query GetAnyReplicaAsync`1. Actually, I think it's enough to query a replica without querying the primary node first.
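
The read path described above (try the active copy, fall back to a replica on any error) can be sketched roughly as follows. This is only an illustration, not the code from the MRE: the ReplicaFallback class and both delegate parameters are hypothetical names standing in for the SDK calls, so the sketch compiles without a cluster.

```csharp
using System;
using System.Threading.Tasks;

public static class ReplicaFallback
{
    // readActive and readAnyReplica are hypothetical stand-ins for the
    // primary read (GetAsync) and the replica read (GetAnyReplicaAsync).
    public static async Task<T> ReadAsync<T>(
        string id,
        Func<string, Task<T>> readActive,
        Func<string, Task<T>> readAnyReplica)
    {
        try
        {
            // First attempt: read from the node that owns the active copy.
            return await readActive(id).ConfigureAwait(false);
        }
        catch (Exception)
        {
            // Any failure (timeout, document not found, ...) falls through
            // to the replica read, as in the reproduction described above.
            return await readAnyReplica(id).ConfigureAwait(false);
        }
    }
}
```

With the 3.x SDK, the two delegates would presumably wrap collection.GetAsync(id) and collection.GetAnyReplicaAsync(id) and read the value with ContentAs<T>(). Catching every exception mirrors the "if any error occurs" wording; a real application would likely narrow the catch.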

          jmorris Jeff Morris added a comment -

          Thanks Eugene Shcherbo, I'll take a look!

           

          jmorris Jeff Morris added a comment - - edited

          Eugene Shcherbo 

          I still haven't replicated the issue, but based on the stack trace I am pretty sure I have a fix. I attached a VF (couchbase-net-client-3.2.8-pre-r5457.zip); please provide any feedback on how it works out.

          Note that this has not been thoroughly tested and/or vetted by our QE, so use it for verification purposes only.

          Jeff

          eugeneshcherbo Eugene Shcherbo added a comment - - edited

          Hello Jeff Morris 

          I tested with your updates and can confirm that the ArgumentNullException no longer occurs, so your fix seems to be the right one. Thank you very much for taking a look at this.

          Out of curiosity, could you please tell or show me what you changed (perhaps you have a Git branch published)?

          jmorris Jeff Morris added a comment -

          Eugene Shcherbo 

          If for any reason we cannot map to a node, then we throw a NodeUnavailableException, which puts the operation into the retry loop. The operation should then either succeed or continue to retry until it times out. The problem was that an ArgumentNullException was being thrown because the endpoint was null, and it was percolating up to the app layer.

          You can see the proposed change on the Gerrit link on the right side of this ticket.

          Jeff
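
The behavior described in the comment above can be illustrated with a small self-contained sketch. It is not the SDK's implementation or the Gerrit change: RetrySketch, SendWithRetryAsync, the resolveEndPoint delegate, and the locally defined NodeUnavailableException are all hypothetical stand-ins, assuming only that a missing endpoint should become a retryable error and not an ArgumentNullException.

```csharp
#nullable enable
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical stand-in for the SDK's retryable "no node mapped" error.
public sealed class NodeUnavailableException : Exception
{
    public NodeUnavailableException(string message) : base(message) { }
}

public static class RetrySketch
{
    // resolveEndPoint returns null while no node is mapped for the key,
    // for example shortly after an auto-failover.
    public static async Task<string> SendWithRetryAsync(
        Func<string?> resolveEndPoint, TimeSpan timeout, TimeSpan backoff)
    {
        var clock = Stopwatch.StartNew();
        while (true)
        {
            try
            {
                return Send(resolveEndPoint());
            }
            catch (NodeUnavailableException ex)
            {
                // Retryable: wait and try again until the timeout is reached.
                if (clock.Elapsed + backoff >= timeout)
                    throw new TimeoutException("Operation timed out while retrying.", ex);
                await Task.Delay(backoff).ConfigureAwait(false);
            }
        }
    }

    private static string Send(string? endPoint)
    {
        // A null endpoint is reported as a retryable error, so it stays
        // inside the retry loop and does not reach the application.
        if (endPoint is null)
            throw new NodeUnavailableException("No node is mapped for this operation.");
        return "sent to " + endPoint;
    }
}
```

In this sketch the caller only ever sees a result or a timeout, which matches the "either succeed or continue to retry until it times out" description; the fixed backoff is a simplification chosen for brevity.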


          eugeneshcherbo Eugene Shcherbo added a comment -

          Jeff Morris, thank you

          People

            jmorris Jeff Morris
            jmorris Jeff Morris
            Votes: 0
            Watchers: 4
