I found the issue in goXDCR’s http server that has been in place since goXDCR’s first implementation since 2014.
Thanks to Dave Finlay hint in his previous comment that ns_server eventually gives up after 30 seconds, it makes sense on why ns_server doesn’t have FD issues while XDCR does, and has all these go-routines lying around as shown from my previous analysis.
I don’t have the right command in macOS to show the exact output per Linux boxes, but what most likely happens is that:
- ns_server sends periodic REST request to XDCR that includes replicationInfos, remoteCluster, and stats/buckets/<bucketName>
- XDCR accepts the connection and spawns a go-routine (per previous stack trace)
- All the previously mentioned conditions stay true
- XDCR’s “single thread” still takes its sweet time getting back to a single “replicationInfos” call, per stack trace.
- After 30 seconds, ns_server gives up and closes the connections
- XDCR, however, doesn’t have an escape mechanism. The go-routine spawned from the initial accept call still exists and holds on to its end of the socket.
- Ns_server tries again, while XDCR is still servicing the original “replicationInfos” call, goxdcr accepts() it and launches another go-routine.
The correct thing for XDCR is to be aware of the timeout from ns_server’s side, and to ensure that the spawned go-routine lets go of it after the timeout.
Testing with a small proposed fix for MB-47923 later, the same exact reproduction test case no longer shows up anymore.
The same lsof output show a consistently stable number of file descriptors. This means that all the “CLOSED” connections were actually closed and has gone away.
$ while (( 1 == 1 ));do lsof -p `ps -ef | grep goxdcr | grep 9000 | awk '{print $2}'` | awk '{print $1}' | grep goxdcr | sort | uniq -c | sort -rn;done
|
36 goxdcr
|
37 goxdcr
|
38 goxdcr
|
38 goxdcr
|
38 goxdcr
|
38 goxdcr
|
38 goxdcr
|
38 goxdcr
|
37 goxdcr
|
32 goxdcr
|
38 goxdcr
|
38 goxdcr
|
Goroutine dump no longer shows the admin port server as the top culprit either, and the number of go-routines stay healthy:
goroutine profile: total 103
|
15 @ 0x4032660 0x402d61a 0x402cbe5 0x40c33e5 0x41a634d 0x41a6328 0x41a6358 0x41bdb4f 0x41bd690 0x41b3831 0x41c2252 0x41c2147 0x419c2ae 0x419b8e1 0x419ab06 0x430ca65 0x431061e 0x430e02c 0x405f9e1
|
# 0x402cbe4 internal/poll.runtime_pollWait+0x54 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/runtime/netpoll.go:184
|
# 0x40c33e4 internal/poll.(*pollDesc).wait+0x44 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/internal/poll/fd_poll_runtime.go:87
|
# 0x41a634c internal/poll.(*pollDesc).waitWrite+0x29c /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/internal/poll/fd_poll_runtime.go:96
|
# 0x41a6327 internal/poll.(*FD).WaitWrite+0x277 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/internal/poll/fd_unix.go:498
|
# 0x41a6357 net.(*netFD).connect+0x2a7 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/fd_unix.go:152
|
# 0x41bdb4e net.(*netFD).dial+0xfe /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/sock_posix.go:149
|
# 0x41bd68f net.socket+0x1bf /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/sock_posix.go:70
|
# 0x41b3830 net.internetSocket+0x140 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/ipsock_posix.go:141
|
# 0x41c2251 net.(*sysDialer).doDialTCP+0xc1 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/tcpsock_posix.go:65
|
# 0x41c2146 net.(*sysDialer).dialTCP+0xd6 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/tcpsock_posix.go:61
|
# 0x419c2ad net.(*sysDialer).dialSingle+0x36d /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/dial.go:578
|
# 0x419b8e0 net.(*sysDialer).dialSerial+0x220 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/dial.go:546
|
# 0x419ab05 net.(*Dialer).DialContext+0x665 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/dial.go:424
|
# 0x430ca64 net/http.(*Transport).dial+0x1f4 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/http/transport.go:1035
|
# 0x431061d net/http.(*Transport).dialConn+0x193d /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/http/transport.go:1473
|
# 0x430e02b net/http.(*Transport).dialConnFor+0xdb /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/http/transport.go:1303
|
|
15 @ 0x4032660 0x4041deb 0x41cd7ba 0x405f9e1
|
# 0x41cd7b9 net.(*netFD).connect.func2+0xb9 /Users/neil.huang/.cbdepscache/exploded/x86_64/go-1.13.3/go/src/net/fd_unix.go:129
|
|
15 @ 0x4032660 0x4041deb 0x431686f 0x405f9e1
|
The analysis for leaking file descriptor has been posted in another ticket. Closing this ticket.