Details
-
Bug
-
Resolution: Fixed
-
Major
-
6.6.2, 6.6.3, 6.6.4, 6.6.5
-
Untriaged
-
1
-
Unknown
Description
Note: This issue has been spun out of MB-51077 / MB-26887 for tracking purposes - it has already been fixed in version 7.0.0 via the adoption of the libevent bufferevent API.
Prior to 7.0.0, KV-Engine had a simplistic handler for the initial TLS handshake - if SSL_accept returns a temporary error (needs to read / write more data) then it simply drains both input / output pipes and retries without any yield:
int Connection::sslAcceptWithRetry() {
    while (true) {
        int r = ssl.accept();
        if (r == 1) {
            // handshake completed.
            return r;
        }

        auto sslError = ssl.getError(r);
        if (sslError == SSL_ERROR_WANT_READ ||
            sslError == SSL_ERROR_WANT_WRITE) {
            // Drain send and receive pipes.
            ssl.drainBioSendPipe(socketDescriptor);
            if (ssl.hasError()) {
                cb::net::set_econnreset();
                return -1;
            }
            ssl.drainBioRecvPipe(socketDescriptor);
            if (ssl.hasError()) {
                cb::net::set_econnreset();
                return -1;
            }
            // Continue SSL accept handshake.
            continue;
        } else {
            logSslErrorInfo("SSL_accept", r);
            cb::net::set_econnreset();
            return -1;
        }
    }
    folly::assume_unreachable();
}
Note how this loops calling ssl.accept() (which is just a thin wrapper around SSL_accept), exiting the loop when the handshake is successful. If the handshake instead fails with SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE, we attempt to fulfill that request by draining the send and recv pipes (the underlying TCP/IP send / recv buffers); on any other error code we give up.
The issue is that after draining the send / recv pipes, the code immediately retries the loop. SSL_accept might still be waiting for more data to arrive over the network: while we have pushed data down to the underlying TCP/IP socket, the expected response may not have arrived yet. In effect we have a non-blocking socket but we are using it in a blocking manner, busy-waiting for the peer's data to arrive.
In theory the front-end thread could be blocked for an arbitrarily long period of time, for as long as the underlying connection has no data to read / write and the TCP/IP connection remains established. In practice we have only observed blocking times on the order of hundreds of milliseconds.
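The difference between retrying immediately and handing control back to the event loop can be illustrated with a small simulation. This is only a sketch under stated assumptions, not KV-Engine code: FakeTlsSession, HandshakeStatus, busyRetry and yieldToEventLoop are hypothetical names, time is a simple integer counter, and the "peer reply" is modelled as arriving at a fixed simulated time rather than over a real socket.

```cpp
#include <cassert>

// Hypothetical stand-ins for the SSL_accept outcomes discussed above.
enum class HandshakeStatus { Done, WantIo };

// Hypothetical fake TLS session: the handshake completes only once the
// peer's reply has "arrived" at simulated time replyArrivesAt.
struct FakeTlsSession {
    explicit FakeTlsSession(int arrival) : replyArrivesAt(arrival) {}

    HandshakeStatus accept(int now) {
        ++acceptCalls;
        return now >= replyArrivesAt ? HandshakeStatus::Done
                                     : HandshakeStatus::WantIo;
    }

    int replyArrivesAt;
    int acceptCalls = 0;
};

struct RunStats {
    int acceptCalls;   // how many times accept() was invoked
    int otherTasksRun; // units of unrelated work the thread managed to do
};

// Busy-retry pattern: on WantIo, retry straight away. Time passes while we
// spin, but the thread never leaves the loop, so no other work runs.
RunStats busyRetry(int replyArrivesAt) {
    assert(replyArrivesAt >= 0);
    FakeTlsSession tls(replyArrivesAt);
    int now = 0;
    while (tls.accept(now) == HandshakeStatus::WantIo) {
        ++now;
    }
    return {tls.acceptCalls, 0};
}

// Event-driven pattern: on WantIo, return to the (simulated) event loop,
// which runs other work until the socket becomes ready, then retries.
RunStats yieldToEventLoop(int replyArrivesAt) {
    assert(replyArrivesAt >= 0);
    FakeTlsSession tls(replyArrivesAt);
    int now = 0;
    int otherTasksRun = 0;
    HandshakeStatus status = tls.accept(now);
    while (status == HandshakeStatus::WantIo) {
        // Simulated event loop: one unit of other work per time unit
        // until the peer's reply is available.
        while (now < replyArrivesAt) {
            ++otherTasksRun;
            ++now;
        }
        status = tls.accept(now); // readiness callback fires
    }
    return {tls.acceptCalls, otherTasksRun};
}
```

In this model, with a reply arriving after 5 time units, busyRetry(5) calls accept() 6 times and runs no other work, whereas yieldToEventLoop(5) calls accept() twice and leaves room for 5 units of other work on the same thread.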
This issue can cause other tasks in the engine to block for the duration of the SSL accept. Those (known) tasks are:
1) DCP connection manager task
2) DCP connection notifier task
This code was added in 6.6.2 (see MB-42607) to handle cases where the TLS handshake did not complete in a single TCP/IP send / receive.
This code was removed in 7.0.0 and upwards - we restructured the entire connection management to use libevent's bufferevent API to support out-of-order responses - see MB-26887.