Details

Type: Bug
Resolution: Fixed
Priority: Major
Fix Version/s: 6.6.6, 7.0.0
Affects Version/s: 6.6.2, 6.6.3, 6.6.4, 6.6.5
Component/s: memcached
Labels:
- approved-for-6.6.6
- verified-6.6.6

Triage: Untriaged
Story Points: 1
Is this a Regression?: Unknown

Description

Note: This issue has been spun out of ~~MB-51077~~ / ~~MB-26887~~ for tracking purposes - it has already been fixed in version 7.0.0 via the adoption of the libevent bufferevent API.

Prior to 7.0.0, KV-Engine had a simplistic handler for the initial TLS handshake - if SSL_accept returns a temporary error (needs to read / write more data) then it simply drains both input / output pipes and retries without any yield:

int Connection::sslAcceptWithRetry() {
    while (true) {
        int r = ssl.accept();
        if (r == 1) {
            // handshake completed.
            return r;
        }

        auto sslError = ssl.getError(r);
        if (sslError == SSL_ERROR_WANT_READ ||
            sslError == SSL_ERROR_WANT_WRITE) {
            // Drain send and receive pipes.
            ssl.drainBioSendPipe(socketDescriptor);
            if (ssl.hasError()) {
                cb::net::set_econnreset();
                return -1;
            }
            ssl.drainBioRecvPipe(socketDescriptor);
            if (ssl.hasError()) {
                cb::net::set_econnreset();
                return -1;
            }
            // Continue SSL accept handshake.
            continue;
        } else {
            logSslErrorInfo("SSL_accept", r);
            cb::net::set_econnreset();
            return -1;
        }
    }
    folly::assume_unreachable();
}

Note how this loops calling ssl.accept() - which is just a thin wrapper around SSL_accept - exiting the loop when the handshake is successful. If SSL_accept instead fails with SSL_ERROR_WANT_READ or SSL_ERROR_WANT_WRITE, we attempt to fulfill that request by draining the send and recv pipes (the underlying TCP/IP send / recv buffers); on any other error code we give up.

The issue is that after draining the send / recv pipes, the code immediately retries the loop. SSL_accept might still be waiting for more data to transfer over the network: while we have pushed data down to the underlying TCP/IP socket, the expected response may not have arrived yet. In effect we have a non-blocking socket but we are using it in a blocking manner - by busy-waiting for the data to be transferred.

In theory the front-end thread could be blocked for an arbitrarily long period of time, for as long as the underlying connection has no data to read/write and the TCP/IP connection remains established. In practice we have only observed the thread being blocked on the order of hundreds of milliseconds.
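To make the cost of the immediate retry concrete, here is a minimal, self-contained simulation. It is only a sketch: FakeHandshake, Result, busyRetry and yieldToEventLoop are hypothetical names invented for illustration, not KV-Engine or OpenSSL APIs, and time is modelled as simple integer ticks. It contrasts retrying without a yield against returning to the event loop until the socket is reported ready.

```cpp
#include <cassert>

// Outcome of one handshake attempt: finished, or a temporary
// "want read / want write" condition.
enum class HandshakeStatus { Done, WantIo };

// Hypothetical stand-in for a TLS handshake whose peer reply becomes
// readable at a known simulated tick. Not an OpenSSL API.
struct FakeHandshake {
    int replyArrivesAt;

    HandshakeStatus tryAccept(int now) const {
        return now >= replyArrivesAt ? HandshakeStatus::Done
                                     : HandshakeStatus::WantIo;
    }
};

struct Result {
    int acceptCalls = 0;   // number of handshake attempts made
    int otherWorkDone = 0; // ticks the thread spent serving other connections
};

// Pattern described in this issue: retry immediately, so every tick until
// the reply arrives is spent inside the accept loop.
Result busyRetry(const FakeHandshake& hs) {
    assert(hs.replyArrivesAt >= 0);
    Result r;
    for (int now = 0;; ++now) {
        ++r.acceptCalls;
        if (hs.tryAccept(now) == HandshakeStatus::Done) {
            return r;
        }
        // No yield: the thread goes straight back to tryAccept.
    }
}

// Alternative: on WantIo, hand control back to the (simulated) event loop
// and retry only once the socket is ready.
Result yieldToEventLoop(const FakeHandshake& hs) {
    assert(hs.replyArrivesAt >= 0);
    Result r;
    int now = 0;
    ++r.acceptCalls;
    if (hs.tryAccept(now) == HandshakeStatus::Done) {
        return r;
    }
    // Until the readiness notification, the thread serves other connections.
    while (now < hs.replyArrivesAt) {
        ++r.otherWorkDone;
        ++now;
    }
    ++r.acceptCalls;
    HandshakeStatus status = hs.tryAccept(now);
    assert(status == HandshakeStatus::Done);
    (void)status;
    return r;
}
```

With a reply that arrives after 100 ticks, busyRetry spends all of that time in accept attempts and does no other work, while yieldToEventLoop makes two attempts and leaves the intervening ticks free for other connections.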

This issue can cause other tasks in the engine to block for the duration of the SSL accept. Those (known) tasks are:
1) DCP connection manager task
2) DCP connection notifier task

This code was added in 6.6.2 (see ~~MB-42607~~) to handle cases where the TLS handshake did not complete in a single TCP/IP send/receive.

This code was removed in 7.0.0 and upwards - we restructured the entire connection management to use libevent's bufferevent API to support out-of-order responses - see ~~MB-26887~~.