Couchbase Server / MB-47943

XDCR - File descriptor leak in XDCR


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 6.5.0, 6.5.1, 6.5.2, 6.6.0, 6.6.1, 6.6.2, 7.0.0, 7.0.1, 7.1.0
    • Fix Version/s: 7.1.0
    • Component/s: XDCR
    • Labels: None
    • Triage: Untriaged
    • 1
    • Unknown

    Description

      In a recent case from the field we've seen XDCR holding 70,000 sockets that do not have a process on the other side of the connection. When lsof is run, these sockets show up as follows:

       
      COMMAND     PID    USER   FD      TYPE    DEVICE SIZE/OFF      NODE NAME
      ...
      goxdcr.bi 24440 cbadmin    5u     sock       0,7      0t0 303953540 protocol: TCP
      goxdcr.bi 24440 cbadmin    6u     sock       0,7      0t0 211821560 protocol: TCP
      goxdcr.bi 24440 cbadmin   10u     sock       0,7      0t0 216966092 protocol: TCP
      ...
      

      In this case the user had set the file descriptor limit to 70k, so at this point XDCR was unable to create new connections. This issue was previously tracked in MB-44182 and was believed to be fixed, but it appears the core issue was not completely resolved.
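
      As a side note, a process can watch for this condition itself. Below is a minimal, illustrative Go sketch (not part of goxdcr) that compares the number of open descriptors on Linux against the soft RLIMIT_NOFILE, the limit that had been raised to 70k in this case:

      // fdcheck.go - hypothetical diagnostic, not goxdcr code.
      package main

      import (
          "fmt"
          "os"
          "syscall"
      )

      func main() {
          // Soft limit on open file descriptors for this process (RLIMIT_NOFILE).
          var lim syscall.Rlimit
          if err := syscall.Getrlimit(syscall.RLIMIT_NOFILE, &lim); err != nil {
              fmt.Fprintln(os.Stderr, "getrlimit:", err)
              return
          }

          // On Linux, every entry in /proc/self/fd is one open descriptor
          // (the directory read itself briefly adds one, so the count is approximate).
          fds, err := os.ReadDir("/proc/self/fd")
          if err != nil {
              fmt.Fprintln(os.Stderr, "readdir:", err)
              return
          }
          fmt.Printf("open fds: %d / soft limit %d\n", len(fds), lim.Cur)
      }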

      Attachments

        Issue Links


          Activity

            Neil Huang added a comment -

            The analysis for the leaking file descriptors has been posted in another ticket. Closing this ticket.

            Neil Huang added a comment -

            Reopening for reinvestigation

            Neil Huang added a comment -

            I found the issue in goXDCR’s http server; it has been in place since goXDCR’s first implementation in 2014.
            Thanks to Dave Finlay’s hint in his previous comment that ns_server eventually gives up after 30 seconds, it makes sense why ns_server doesn’t have FD issues while XDCR does and has all these go-routines lying around, as shown in my previous analysis.

            I don’t have the right command on macOS to show the exact output seen on the Linux boxes, but what most likely happens is:

            1. ns_server sends periodic REST requests to XDCR, including replicationInfos, remoteCluster, and stats/buckets/<bucketName>
            2. XDCR accepts the connection and spawns a go-routine (per the previous stack trace)
            3. All the previously mentioned conditions stay true
            4. XDCR’s “single thread” still takes its sweet time getting back to a single “replicationInfos” call, per the stack trace.
            5. After 30 seconds, ns_server gives up and closes the connections
            6. XDCR, however, has no escape mechanism. The go-routine spawned from the initial accept call still exists and holds on to its end of the socket.
            7. While XDCR is still servicing the original “replicationInfos” call, ns_server tries again; goxdcr accept()s it and launches another go-routine.

            The correct thing for XDCR to do is to be aware of the timeout on ns_server’s side and to ensure that the spawned go-routine lets go of the socket once that timeout has passed.
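
            As an illustration of that pattern, here is a minimal, hypothetical Go sketch (not the actual goxdcr change; the handler path, port, and slowReplicationInfos helper are invented): the handler runs the slow work in its own goroutine and stops waiting when the request context is cancelled, which Go’s net/http server does once the client (ns_server) closes its end, and the server sets its own timeouts so a stuck connection cannot pin a file descriptor forever.

            package main

            import (
                "fmt"
                "log"
                "net/http"
                "time"
            )

            // slowReplicationInfos stands in for the expensive "replicationInfos" work
            // done by XDCR's busy "single thread".
            func slowReplicationInfos() string {
                time.Sleep(2 * time.Minute)
                return "{}"
            }

            func replicationInfosHandler(w http.ResponseWriter, r *http.Request) {
                done := make(chan string, 1) // buffered so the worker never blocks on send
                go func() { done <- slowReplicationInfos() }()

                select {
                case body := <-done:
                    fmt.Fprint(w, body)
                case <-r.Context().Done():
                    // ns_server gave up and closed the connection; return so the
                    // handler and its end of the socket are released.
                    return
                }
            }

            func main() {
                mux := http.NewServeMux()
                mux.HandleFunc("/replicationInfos", replicationInfosHandler)

                srv := &http.Server{
                    Addr:    ":9998", // illustrative port, not goxdcr's real admin port
                    Handler: mux,
                    // Server-side timeouts as a backstop so a stuck handler or slow
                    // client cannot hold a descriptor forever.
                    ReadTimeout:  30 * time.Second,
                    WriteTimeout: 30 * time.Second,
                    IdleTimeout:  60 * time.Second,
                }
                log.Fatal(srv.ListenAndServe())
            }

            Note that the worker goroutine in this sketch still runs slowReplicationInfos to completion; real code would also pass the request context into the slow call so the work itself can stop early.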

            Testing later with a small proposed fix for MB-47923, the exact same reproduction test case no longer shows the issue.

            The same lsof output now shows a consistently stable number of file descriptors, which means that all the “CLOSED” connections were actually closed and have gone away.

            $ while (( 1 == 1 ));do lsof -p `ps -ef | grep goxdcr | grep 9000 | awk '{print $2}'` | awk '{print $1}' | grep goxdcr | sort | uniq -c | sort -rn;done
              36 goxdcr
              37 goxdcr
              38 goxdcr
              38 goxdcr
              38 goxdcr
              38 goxdcr
              38 goxdcr
              38 goxdcr
              37 goxdcr
              32 goxdcr
              38 goxdcr
              38 goxdcr
            

            The goroutine dump no longer shows the admin port server as the top culprit either, and the number of go-routines stays healthy:

            goroutine profile: total 103
            15 @ 0x4032660 0x402d61a 0x402cbe5 0x40c33e5 0x41a634d 0x41a6328 0x41a6358 0x41bdb4f 0x41bd690 0x41b3831 0x41c2252 0x41c2147 0x419c2ae 0x419b8e1 0x419ab06 0x430ca65 0x431061e 0x430e02c 0x405f9e1
            #       0x402cbe4       internal/poll.runtime_pollWait+0x54             /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/runtime/netpoll.go:184
            #       0x40c33e4       internal/poll.(*pollDesc).wait+0x44             /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/internal/poll/fd_poll_runtime.go:87
            #       0x41a634c       internal/poll.(*pollDesc).waitWrite+0x29c       /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/internal/poll/fd_poll_runtime.go:96
            #       0x41a6327       internal/poll.(*FD).WaitWrite+0x277             /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/internal/poll/fd_unix.go:498
            #       0x41a6357       net.(*netFD).connect+0x2a7                      /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/fd_unix.go:152
            #       0x41bdb4e       net.(*netFD).dial+0xfe                          /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/sock_posix.go:149
            #       0x41bd68f       net.socket+0x1bf                                /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/sock_posix.go:70
            #       0x41b3830       net.internetSocket+0x140                        /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/ipsock_posix.go:141
            #       0x41c2251       net.(*sysDialer).doDialTCP+0xc1                 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/tcpsock_posix.go:65
            #       0x41c2146       net.(*sysDialer).dialTCP+0xd6                   /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/tcpsock_posix.go:61
            #       0x419c2ad       net.(*sysDialer).dialSingle+0x36d               /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/dial.go:578
            #       0x419b8e0       net.(*sysDialer).dialSerial+0x220               /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/dial.go:546
            #       0x419ab05       net.(*Dialer).DialContext+0x665                 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/dial.go:424
            #       0x430ca64       net/http.(*Transport).dial+0x1f4                /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/http/transport.go:1035
            #       0x431061d       net/http.(*Transport).dialConn+0x193d           /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/http/transport.go:1473
            #       0x430e02b       net/http.(*Transport).dialConnFor+0xdb          /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/http/transport.go:1303
             
            15 @ 0x4032660 0x4041deb 0x41cd7ba 0x405f9e1
            #       0x41cd7b9       net.(*netFD).connect.func2+0xb9 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/fd_unix.go:129
             
            15 @ 0x4032660 0x4041deb 0x431686f 0x405f9e1
            

             
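
            For reference, a dump in this exact text format can be produced with the standard runtime/pprof package (a generic sketch, not goxdcr’s own debug hook):

            package main

            import (
                "os"
                "runtime/pprof"
            )

            func main() {
                // debug=1 prints "goroutine profile: total N" followed by one entry
                // per unique stack with its count, like the output above.
                if err := pprof.Lookup("goroutine").WriteTo(os.Stdout, 1); err != nil {
                    panic(err)
                }
            }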


            Couchbase Build Team added a comment -

            Build couchbase-server-7.1.0-1212 contains goxdcr commit 14062a9 with commit message:
            MB-47943 - XDCR file descriptor leak when system is busy

            People

              Pavithra Mahamani
              Dave Finlay
              Votes: 0
              Watchers: 5

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes
