Uploaded image for project: 'Couchbase .NET client library'
  1. Couchbase .NET client library
  2. NCBC-2422

Error replacing dead connections on N1QL Failover rebalance

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 3.0.0-beta.4
    • 3.0.0
    • None

    Description

      Whilst running a situational test where we perform n1ql queries and perform a failover on 1 of the 3 n1ql nodes and rebalance, there is an issue where the dead connections are not being replaced with this error message:

      fail: Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool[0]      Error replacing dead connections for 10.112.194.102:11210.Couchbase.Core.Exceptions.KeyValue.DocumentNotFoundException: Exception of type 'Couchbase.Core.Exceptions.KeyValue.DocumentNotFoundException' was thrown.   at Couchbase.Core.ClusterNode.ExecuteOp(Func`4 sender, IOperation op, Object state, CancellationToken token, Nullable`1 timeout) in /Users/willbroadbelt/Documents/repos/couchbase/couchbase-net-client/src/Couchbase/Core/ClusterNode.cs:line 454   at Couchbase.Core.ClusterNode.Couchbase.Core.IO.Connections.IConnectionInitializer.SelectBucketAsync(IConnection connection, String bucketName, CancellationToken cancellationToken) in /Users/willbroadbelt/Documents/repos/couchbase/couchbase-net-client/src/Couchbase/Core/ClusterNode.cs:line 526   at Couchbase.Core.IO.Connections.ConnectionPoolBase.CreateConnectionAsync(CancellationToken cancellationToken) in /Users/willbroadbelt/Documents/repos/couchbase/couchbase-net-client/src/Couchbase/Core/IO/Connections/ConnectionPoolBase.cs:line 72   at Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool.<>c_DisplayClass28_0.<<AddConnectionsAsync>g_StartConnection|0>d.MoveNext() in /Users/willbroadbelt/Documents/repos/couchbase/couchbase-net-client/src/Couchbase/Core/IO/Connections/DataFlow/DataFlowConnectionPool.cs:line 251--- End of stack trace from previous location where exception was thrown --   at Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool.AddConnectionsAsync(Int32 count, CancellationToken cancellationToken) in /Users/willbroadbelt/Documents/repos/couchbase/couchbase-net-client/src/Couchbase/Core/IO/Connections/DataFlow/DataFlowConnectionPool.cs:line 287   at Couchbase.Core.IO.Connections.DataFlow.DataFlowConnectionPool.CleanupDeadConnectionsAsync() in /Users/willbroadbelt/Documents/repos/couchbase/couchbase-net-client/src/Couchbase/Core/IO/Connections/DataFlow/DataFlowConnectionPool.cs:line 358---------------------Context Info--------------------------{{

      {"DispatchedFrom":null,"DispatchedTo":null,"DocumentKey":"default","ClientContextId":"2714","Cas":0,"Status":0,"BucketName":"default","CollectionName":null,"ScopeName":null,"Message":"KV Error: \{Name=\"KEY_ENOENT\", Description=\"Not Found\", Attributes=\"item-only\"}

      "}}}

       

      I've attached the SDK (N1QL_Failover_Rebalance.log) and sdkd (N1QL_Failover_Rebalance_SDKD.log) logs. I believe this issue occurs for other services too.

      Attachments

        Issue Links

          For Gerrit Dashboard: NCBC-2422
          # Subject Branch Project Status CR V

          Activity

            jmorris Jeff Morris added a comment -

            Will Broadbelt -

            This isn't a query service error, its a K/V service error - so I am assuming the test is hybrid?

            -Jeff

            jmorris Jeff Morris added a comment - Will Broadbelt - This isn't a query service error, its a K/V service error - so I am assuming the test is hybrid? -Jeff

            Jeff Morris 

            Hmm no actually, after the docs are preloaded in at the start, it only runs queries. It does still initialise a connection for the bucket and default collection, although it only uses the cluster for the queries.

            The above results were done with a cluster of 

            1. kv
            2. kv,index,fts,n1ql <-failed over and rebalanced out
            3. kv,index,fts,n1ql
            4. kv,index,fts,n1ql

            But I just tried it with :

            1. kv
            2. n1ql <- failed over and rebalanced out
            3. index
            4. n1ql

            and didnt see any errors anymore.

            The reason I thought this was a query issue is because we have a bunch of these errors after the dead connection error msg :

            Sdkd.N1QL.N1QLQueryCommand|ERROR|Query failed The query was canceled.

            and

            Sdkd.N1QL.N1QLQueryCommand|ERROR|Query failed Service n1ql not available.

            Perhaps I should only ever be using the second cluster config? Although I normally see fine results for either of these cluster types when running other SDKs through these situations.

             

            will.broadbelt Will Broadbelt added a comment - Jeff Morris   Hmm no actually, after the docs are preloaded in at the start, it only runs queries. It does still initialise a connection for the bucket and default collection, although it only uses the cluster for the queries. The above results were done with a cluster of  kv kv,index,fts,n1ql <-failed over and rebalanced out kv,index,fts,n1ql kv,index,fts,n1ql But I just tried it with : kv n1ql <- failed over and rebalanced out index n1ql and didnt see any errors anymore. The reason I thought this was a query issue is because we have a bunch of these errors after the dead connection error msg : Sdkd.N1QL.N1QLQueryCommand|ERROR|Query failed The query was canceled. and Sdkd.N1QL.N1QLQueryCommand|ERROR|Query failed Service n1ql not available . Perhaps I should only ever be using the second cluster config? Although I normally see fine results for either of these cluster types when running other SDKs through these situations.  
            jmorris Jeff Morris added a comment -

            Thanks Will Broadbelt for the details

            jmorris Jeff Morris added a comment - Thanks Will Broadbelt for the details
            jmorris Jeff Morris added a comment -

            10.143.194.102 =>
            VW: 45msCouchbase.Core.Exceptions.UnambiguousTimeoutException: The request has timed out.
            ---> System.InvalidOperationException: Too many retries: 10.
            — End of inner exception stack trace —
            at Couchbase.Core.Exceptions.UnambiguousTimeoutException.ThrowWithRetryReasons(IRequest request, Exception innerException) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\Exceptions\UnambiguousTimeoutException.cs:line 30
            at Couchbase.Core.Retry.RetryOrchestrator.RetryAsync[T](Func`1 send, IRequest request) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\Retry\RetryOrchestrator.cs:line 100
            at Couchbase.CouchbaseBucket.ViewQueryAsync[TKey,TValue](String designDocument, String viewName, ViewOptions options) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\CouchbaseBucket.cs:line 191
            at LoadTester.Program.RunViews(IBucket bucket, Int32 timesToRun, Boolean displayTimes) in C:\Users\Jeff Morris\source\temp\LoadTester\Program.cs:line 92
            ----------------------Context Info--------------------------
            null

            VW: 2210msCouchbase.Core.Exceptions.RequestCanceledException: The query was canceled.
            ---> System.Net.Http.HttpRequestException: No connection could be made because the target machine actively refused it.
            ---> System.Net.Sockets.SocketException (10061): No connection could be made because the target machine actively refused it.
            at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
            — End of inner exception stack trace —
            at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken)
            at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean allowHttp2, CancellationToken cancellationToken)
            at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
            at System.Net.Http.HttpConnectionPool.GetHttpConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken)
            at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken)
            at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken)
            at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts)
            at Couchbase.Query.QueryClient.ExecuteQuery[T](QueryOptions options, ITypeSerializer serializer) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Query\QueryClient.cs:line 135
            — End of inner exception stack trace —
            at Couchbase.Query.QueryClient.ExecuteQuery[T](QueryOptions options, ITypeSerializer serializer) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Query\QueryClient.cs:line 228
            at Couchbase.Query.QueryClient.QueryAsync[T](String statement, QueryOptions options) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Query\QueryClient.cs:line 67
            at Couchbase.Cluster.<>c_DisplayClass27_0`1.<<QueryAsync>g_Func|0>d.MoveNext() in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Cluster.cs:line 257
            — End of stack trace from previous location where exception was thrown —
            at Couchbase.Core.Retry.RetryOrchestrator.RetryAsync[T](Func`1 send, IRequest request) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\Retry\RetryOrchestrator.cs:line 51
            at Couchbase.Cluster.QueryAsync[T](String statement, QueryOptions options) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Cluster.cs:line 260
            at LoadTester.Program.RunN1QL(ICluster cluster, Int32 timesToRun, Boolean displayTimes) in C:\Users\Jeff Morris\source\temp\LoadTester\Program.cs:line 149
            ----------------------Context Info--------------------------
            {"Statement":"[

            {\"statement\":\"SELECT * from `default`\",\"timeout\":\"75000ms\",\"client_context_id\":\"60f92b02-badf-4050-b1c1-1493a80819a9\"}

            ]","ClientContextId":"60f92b02-ba

            jmorris Jeff Morris added a comment - 10.143.194.102 => VW: 45msCouchbase.Core.Exceptions.UnambiguousTimeoutException: The request has timed out. ---> System.InvalidOperationException: Too many retries: 10. — End of inner exception stack trace — at Couchbase.Core.Exceptions.UnambiguousTimeoutException.ThrowWithRetryReasons(IRequest request, Exception innerException) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\Exceptions\UnambiguousTimeoutException.cs:line 30 at Couchbase.Core.Retry.RetryOrchestrator.RetryAsync [T] (Func`1 send, IRequest request) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\Retry\RetryOrchestrator.cs:line 100 at Couchbase.CouchbaseBucket.ViewQueryAsync [TKey,TValue] (String designDocument, String viewName, ViewOptions options) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\CouchbaseBucket.cs:line 191 at LoadTester.Program.RunViews(IBucket bucket, Int32 timesToRun, Boolean displayTimes) in C:\Users\Jeff Morris\source\temp\LoadTester\Program.cs:line 92 ---------------------- Context Info -------------------------- null VW: 2210msCouchbase.Core.Exceptions.RequestCanceledException: The query was canceled. ---> System.Net.Http.HttpRequestException: No connection could be made because the target machine actively refused it. ---> System.Net.Sockets.SocketException (10061): No connection could be made because the target machine actively refused it. at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken) — End of inner exception stack trace — at System.Net.Http.ConnectHelper.ConnectAsync(String host, Int32 port, CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean allowHttp2, CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.GetHttpConnectionAsync(HttpRequestMessage request, CancellationToken cancellationToken) at System.Net.Http.HttpConnectionPool.SendWithRetryAsync(HttpRequestMessage request, Boolean doRequestAuth, CancellationToken cancellationToken) at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, CancellationToken cancellationToken) at System.Net.Http.HttpClient.FinishSendAsyncBuffered(Task`1 sendTask, HttpRequestMessage request, CancellationTokenSource cts, Boolean disposeCts) at Couchbase.Query.QueryClient.ExecuteQuery [T] (QueryOptions options, ITypeSerializer serializer) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Query\QueryClient.cs:line 135 — End of inner exception stack trace — at Couchbase.Query.QueryClient.ExecuteQuery [T] (QueryOptions options, ITypeSerializer serializer) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Query\QueryClient.cs:line 228 at Couchbase.Query.QueryClient.QueryAsync [T] (String statement, QueryOptions options) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Query\QueryClient.cs:line 67 at Couchbase.Cluster.<>c_ DisplayClass27_0`1.<<QueryAsync>g _Func|0>d.MoveNext() in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Cluster.cs:line 257 — End of stack trace from previous location where exception was thrown — at Couchbase.Core.Retry.RetryOrchestrator.RetryAsync [T] (Func`1 send, IRequest request) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Core\Retry\RetryOrchestrator.cs:line 51 at Couchbase.Cluster.QueryAsync [T] (String statement, QueryOptions options) in C:\Users\Jeff Morris\source\couchbase-net-client\src\Couchbase\Cluster.cs:line 260 at LoadTester.Program.RunN1QL(ICluster cluster, Int32 timesToRun, Boolean displayTimes) in C:\Users\Jeff Morris\source\temp\LoadTester\Program.cs:line 149 ---------------------- Context Info -------------------------- {"Statement":"[ {\"statement\":\"SELECT * from `default`\",\"timeout\":\"75000ms\",\"client_context_id\":\"60f92b02-badf-4050-b1c1-1493a80819a9\"} ]","ClientContextId":"60f92b02-ba
            jmorris Jeff Morris added a comment - - edited

            Brett say's:

            "Removing a node from the cluster goes through a number of stages. Specifically, it has to ‘unload’ the node, which means removing all the services from that node and that happens asynchronously, then once all the services have been removed (which takes at least 30s for views), it finally removes the node from the cluster. If during that period the KV service is unloaded, but the whole node has not yet been removed from the cluster, then the node will act as if it is a non-KV node in the cluster map. If you recall from a while back, the way that ns_server handles this is that it places those non-KV nodes at the end of the nodesExt list, and the cluster map will no longer reference that far through the list. There was an issue back then where when we were bootstrapping we would attempt to connect to these non-KV nodes, and the only way to tell how far down the list you should go for KV is to inspect the length of the nodes entry, and only go that far through nodesExt for KV. I can try and find the .NET issue that would have been filed for it for more information."

            This is actually a server issue with an MB from way back when; more history can be found in NCBC-1052.

            jmorris Jeff Morris added a comment - - edited Brett say's: "Removing a node from the cluster goes through a number of stages. Specifically, it has to ‘unload’ the node, which means removing all the services from that node and that happens asynchronously, then once all the services have been removed (which takes at least 30s for views), it finally removes the node from the cluster. If during that period the KV service is unloaded, but the whole node has not yet been removed from the cluster, then the node will act as if it is a non-KV node in the cluster map. If you recall from a while back, the way that ns_server handles this is that it places those non-KV nodes at the end of the nodesExt list, and the cluster map will no longer reference that far through the list. There was an issue back then where when we were bootstrapping we would attempt to connect to these non-KV nodes, and the only way to tell how far down the list you should go for KV is to inspect the length of the nodes entry, and only go that far through nodesExt for KV. I can try and find the .NET issue that would have been filed for it for more information." This is actually a server issue with an MB from way back when; more history can be found in NCBC-1052 .

            Jeff Morris - This seems to have fixed the initial issue where with the cluster of 

            1. kv
            2. kv,index,fts,n1ql 
            3. kv,index,fts,n1ql
            4. kv,index,fts,n1ql

            But there is now an issue with the other cluster type:

            1. kv
            2. n1ql
            3. index
            4. n1ql

            where all queries fail, as the sdk thinks the n1ql service is unavailable the whole time, i.e. even with a stable, unchanging cluster. I've attached the logs - SdkdConsole.log.zip . (run was stopped before any cluster change was enacted).

            will.broadbelt Will Broadbelt added a comment - Jeff Morris  - This seems to have fixed the initial issue where with the cluster of  kv kv,index,fts,n1ql  kv,index,fts,n1ql kv,index,fts,n1ql But there is now an issue with the other cluster type: kv n1ql index n1ql where all queries fail, as the sdk thinks the n1ql service is unavailable the whole time, i.e. even with a stable, unchanging cluster. I've attached the logs -  SdkdConsole.log.zip  . (run was stopped before any cluster change was enacted).
            jmorris Jeff Morris added a comment -

            Will Broadbelt - this should be fixed now

            jmorris Jeff Morris added a comment - Will Broadbelt - this should be fixed now

            Yep not seeing the issue anymore.

            will.broadbelt Will Broadbelt added a comment - Yep not seeing the issue anymore.

            People

              jmorris Jeff Morris
              will.broadbelt Will Broadbelt
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty