  1. Couchbase Server
  2. MB-44128

XDCR TCP connection leak when host does not respond and XDCR retries

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: Cheshire-Cat, 6.5.1, 6.6.2
    • Fix Version/s: 7.0.0
    • Component/s: XDCR
    • None
    • 1

    Description

      Based on a customer setup, goxdcr can, in very rare cases, leak connections and end up consuming all of a system's file descriptors.

      (Below are the findings from the customer's case, with the customer reference removed.)
      This is the code that cleans up after any failed REST call, such as a call to ns_server:
      http://src.couchbase.org/source/xref/6.5.1/goproj/src/github.com/couchbase/goxdcr/utils/utils.go#2168-2173

      transport, ok := client.Transport.(*http.Transport)
      if ok {
          if u.IsSeriousNetError(err) {
              logger.Debugf("Encountered %v, close all idle connections for this http client.\n", err)
          }
          transport.CloseIdleConnections()
      }
      

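      The suspicious part is that the transport may not be set, in which case the type assertion above fails and CloseIdleConnections() is never called. This is the case for plain http calls (calls to the local ns_server are not encrypted), where goxdcr does not set the transport: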

      http://src.couchbase.org/source/xref/6.5.1/goproj/src/github.com/couchbase/goxdcr/utils/utils.go#2319

       		client = &http.Client{Timeout: base.DefaultHttpTimeout}
      

      If the transport is not set, we do not close idle connections ourselves and instead depend on golang to close them for us.

      It just so happens that golang has had an issue (https://github.com/golang/go/issues/28012) which shows that a TCP connection is not closed if the server doesn’t respond.
      In particular, one user posted a code snippet that creates the http client exactly the way XDCR does, and reports that the TCP connection isn’t closed. See https://github.com/golang/go/issues/28012#issuecomment-562290662.

      This issue was fixed on Dec 11, 2019 in golang 1.14, with the title: “net/http: don't wait indefinitely in Transport for proxy CONNECT response”.

      According to the CMakefile, XDCR for 6.5.1 ships with golang 1.11. The OP of the golang issue mentioned above was also using 1.11.

          Activity

            neil.huang Neil Huang created issue -
            neil.huang Neil Huang made changes -
            Field Original Value New Value
            Link This issue relates to CBSE-9597 [ CBSE-9597 ]
            neil.huang Neil Huang made changes -
            Affects Version/s 6.5.1 [ 16622 ]
            Affects Version/s 6.6.2 [ 17103 ]
            Description
            Original Value: Based on a customer set up, it is possible in a very rare case for goxdcr to leak connections and end up taking up all the file descriptors of a system.
            New Value: the text shown under Description above, followed by the sentence: "This most likely explains why XDCR eats up all the FD’s, but this doesn’t yet explain how the system’s networking got XDCR into this cycle in the first place."
            Issue Type: Task [ 3 ] -> Bug [ 1 ]
            Summary: XDCR socket leak investigation -> XDCR TCP connection leak when host does not respond and XDCR retries
            neil.huang Neil Huang made changes -
            Description
            Original Value: the text shown under Description above, followed by the sentence: "This most likely explains why XDCR eats up all the FD’s, but this doesn’t yet explain how the system’s networking got XDCR into this cycle in the first place."
            New Value: the same text with that closing sentence removed, as shown under Description above.
            neil.huang Neil Huang made changes -
            Resolution Fixed [ 1 ]
            Status: Open [ 1 ] -> Closed [ 6 ]
            neil.huang Neil Huang made changes -
            Link This issue is cloned by MB-44182 [ MB-44182 ]
            wayne Wayne Siu made changes -
            Link This issue backports to MB-44182 [ MB-44182 ]
            wayne Wayne Siu made changes -
            Link This issue is cloned by MB-44182 [ MB-44182 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s 7.0.0 [ 17233 ]
            lynn.straus Lynn Straus made changes -
            Fix Version/s Cheshire-Cat [ 15915 ]

            People

              neil.huang Neil Huang
              neil.huang Neil Huang