Uploaded image for project: 'Couchbase .NET client library'
  1. Couchbase .NET client library
  2. NCBC-2308

N1QL query situational failure

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Fixed
    • 3.0.0-beta.4
    • 3.0.0
    • None

    Description

      All n1ql situational tests are failing/getting stuck due to 
      (An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full)
      I've attached the sdk logs (SdkdConsole.log.zip), and below is the message that is returned to the sdkd logs:

       
      [306.17 INFO] (SDKD log:137) Unexpected Exception on Run:One or more errors occurred. (An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full)
      [306.17 INFO] (SDKD log:137) System.AggregateException: One or more errors occurred. (An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full) ---> System.Net.Sockets.SocketException: An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full
      [306.18 INFO] (LineGobbler doFilter:115) +++ Following exception has internal ID: 1
      [306.18 INFO] (SDKD log:137) at System.Net.Sockets.Socket.UpdateStatusAfterSocketErrorAndThrowException(SocketError error, String callerName)
      at System.Net.Sockets.Socket.DoBind(EndPoint endPointSnapshot, SocketAddress socketAddress)
      at System.Net.Sockets.Socket.InternalBind(EndPoint localEP)
      at System.Net.Sockets.Socket.ConnectAsync(SocketAsyncEventArgs e)
      at Couchbase.Utils.IpEndPointExtensions.GetConnection(IPEndPoint endPoint, ClusterOptions options) in C:\jenkins\workspace\server-build-test-net\centos-net-sdk3.0-server-situational-tests\couchbase-net-client\src\Couchbase\Utils\IPEndPointExtensions.cs:line 86
      at Couchbase.Core.ClusterNode.<CreateAsync>d__100.MoveNext() in C:\jenkins\workspace\server-build-test-net\centos-net-sdk3.0-server-situational-tests\couchbase-net-client\src\Couchbase\Core\ClusterNode.cs:line 297
      [306.18 INFO] (SDKD log:137) — End of stack trace from previous location where exception was thrown —
      [306.18 INFO] (LineGobbler doFilter:115) +++ Following exception has internal ID: 3
      [306.18 INFO] (SDKD log:137) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
      at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
      at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
      at Couchbase.Core.ClusterContext.<InitializeAsync>d__43.MoveNext() in C:\jenkins\workspace\server-build-test-net\centos-net-sdk3.0-server-situational-tests\couchbase-net-client\src\Couchbase\Core\ClusterContext.cs:line 240
      [306.18 INFO] (SDKD log:137) — End of stack trace from previous location where exception was thrown —
      [306.18 INFO] (LineGobbler doFilter:115) +++ Following exception has internal ID: 5
      [306.18 INFO] (SDKD log:137) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
      at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
      at Couchbase.Cluster.<InitializeAsync>d__17.MoveNext() in C:\jenkins\workspace\server-build-test-net\centos-net-sdk3.0-server-situational-tests\couchbase-net-client\src\Couchbase\Cluster.cs:line 110
      [306.18 INFO] (SDKD log:137) — End of stack trace from previous location where exception was thrown —
      [306.18 INFO] (LineGobbler doFilter:115) +++ Following exception has internal ID: 7
      [306.18 INFO] (SDKD log:137) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
      at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
      at Couchbase.Cluster.<ConnectAsync>d__14.MoveNext() in C:\jenkins\workspace\server-build-test-net\centos-net-sdk3.0-server-situational-tests\couchbase-net-client\src\Couchbase\Cluster.cs:line 80
      [306.18 INFO] (SDKD log:137) — End of stack trace from previous location where exception was thrown —
      [306.18 INFO] (LineGobbler doFilter:115) +++ Following exception has internal ID: 9
      [306.18 INFO] (SDKD log:137) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
      at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
      at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
      at SdkdConsole.Daemon.Handle.<ClusterAsync>d__18.MoveNext() in C:\jenkins\workspace\server-build-test-net\centos-net-sdk3.0-server-situational-tests\sdkd-net\src\SdkdConsole\Daemon\Handle.cs:line 323
      [306.18 INFO] (SDKD log:137) — End of inner exception stack trace —
      [306.18 INFO] (LineGobbler doFilter:115) +++ Following exception has internal ID: 11
      [306.18 INFO] (SDKD log:137) at System.Threading.Tasks.Task`1.GetResultCore(Boolean waitCompletionNotification)
      at SdkdConsole.Daemon.Handle.Run() in C:\jenkins\workspace\server-build-test-net\centos-net-sdk3.0-server-situational-tests\sdkd-net\src\SdkdConsole\Daemon\Handle.cs:line 100
      [306.18 INFO] (SDKD log:137) ---> (Inner Exception #0) System.Net.Sockets.SocketException (0x80004005): An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full
      [306.18 INFO] (SDKD log:137) +++ Received exception of ID 0
      [306.18 INFO] (SDKD log:137) — End of stack trace from previous location where exception was thrown —
      [306.18 INFO] (SDKD log:137) +++ Received exception of ID 2
      [306.18 INFO] (SDKD log:137) — End of stack trace from previous location where exception was thrown —
      [306.18 INFO] (SDKD log:137) +++ Received exception of ID 4
      [306.18 INFO] (SDKD log:137) — End of stack trace from previous location where exception was thrown —
      [306.18 INFO] (SDKD log:137) +++ Received exception of ID 6
      [306.18 INFO] (SDKD log:137) — End of stack trace from previous location where exception was thrown —
      [306.18 INFO] (LineGobbler doFilter:115) +++ Following exception has internal ID: 13
      [306.18 INFO] (SDKD log:137) at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
      at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task)
      at System.Runtime.CompilerServices.TaskAwaiter`1.GetResult()
      at SdkdConsole.Daemon.Handle.<ClusterAsync>d__18.MoveNext() in C:\jenkins\workspace\server-build-test-net\centos-net-sdk3.0-server-situational-tests\sdkd-net\src\SdkdConsole\Daemon\Handle.cs:line 323<---
      [306.18 INFO] (SDKD log:137)
       

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            will.broadbelt Will Broadbelt created issue -
            jmorris Jeff Morris made changes -
            Field Original Value New Value
            Status New [ 10003 ] Open [ 1 ]
            jmorris Jeff Morris made changes -
            Status Open [ 1 ] In Progress [ 3 ]
            jmorris Jeff Morris made changes -
            Fix Version/s 3.0.0 [ 16606 ]
            jmorris Jeff Morris made changes -
            Affects Version/s 3.0.0-beta.4 [ 16712 ]
            jmorris Jeff Morris added a comment -

            Hi Will -

            Looking into this a bit:

            " An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full"

            This indicates a resource issue; likely too many sockets on a TIME_WAIT state (note this is can be caused by a large number of connections if they keep closing and TcpTimedWaitDelay is at it's default of ~4min on windows). In the log, all of these are being caused within the ConfigHandler: "Issue getting Cluster Map cluster! ". I noticed that around this time the circuit-breaker starts getting flipped open for K/V operations; I think the handler must also go through the circuit breaker so that it doesn't keep trying create new sockets when encounters a dead connection.

            I'll try that as a fix but we may need to tune environment variables as well to ensure that the sockets get cleaned up faster.

            -Jeff

            jmorris Jeff Morris added a comment - Hi Will - Looking into this a bit: " An operation on a socket could not be performed because the system lacked sufficient buffer space or because a queue was full" This indicates a resource issue; likely too many sockets on a TIME_WAIT state (note this is can be caused by a large number of connections if they keep closing and TcpTimedWaitDelay is at it's default of ~4min on windows). In the log, all of these are being caused within the ConfigHandler: "Issue getting Cluster Map cluster! " . I noticed that around this time the circuit-breaker starts getting flipped open for K/V operations; I think the handler must also go through the circuit breaker so that it doesn't keep trying create new sockets when encounters a dead connection. I'll try that as a fix but we may need to tune environment variables as well to ensure that the sockets get cleaned up faster. -Jeff
            jmorris Jeff Morris added a comment -

            There is a patch up in gerrit for testing/review - initially the config handler would directly use the connection and try to recreate it if was down. This would happen every 2.5s, without backoff even if the node itself was down - this is likely causing the issue. The patch makes the handler polling go through the circuit breaker which help.

            jmorris Jeff Morris added a comment - There is a patch up in gerrit for testing/review - initially the config handler would directly use the connection and try to recreate it if was down. This would happen every 2.5s, without backoff even if the node itself was down - this is likely causing the issue. The patch makes the handler polling go through the circuit breaker which help.
            jmorris Jeff Morris made changes -
            Link This issue relates to NCBC-2316 [ NCBC-2316 ]
            jmorris Jeff Morris added a comment -

            This ticket fixes the actual issue; view requests were timing out immediately because no timeout was supplied and 0 was being passed in as a timeout.

            jmorris Jeff Morris added a comment - This ticket fixes the actual issue; view requests were timing out immediately because no timeout was supplied and 0 was being passed in as a timeout.
            will.broadbelt Will Broadbelt made changes -
            Resolution Fixed [ 1 ]
            Status In Progress [ 3 ] Resolved [ 5 ]
            will.broadbelt Will Broadbelt made changes -
            Status Resolved [ 5 ] Closed [ 6 ]

            People

              jmorris Jeff Morris
              will.broadbelt Will Broadbelt
              Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty