Details
Description
Replicate the behavior by doing the following:
1. Start the attached node app in debug mode (LCB_LOGLEVEL=5 node app.js). The app will attempt an upsert every 60 seconds during the swap rebalance.
2. Initiate a swap rebalance of the both nodes in a two node cluster, eg.
/opt/couchbase/bin/couchbase-cli server-add -c 10.112.162.101:8091 -u Administrator -p password --server-add=10.112.162.103:8091 --server-add=10.112.162.104:8091 --server-add-username=Administrator --server-add-password=password
|
/opt/couchbase/bin/couchbase-cli rebalance -c 10.112.162.101:8091 -u Administrator -p password --server-remove=10.112.162.101:8091 --server-remove=10.112.162.102:8091
|
3. Once the rebalance completes, the next attempted upsert will fail, and the following messages appear in the output:
178443ms [I0] {25296/307} [TRACE] (server - L:538) <10.112.162.101:11210> (SRV=0x10480be90,IX=0) Scheduling next timeout for 2500 ms. This is not an error
|
180013ms [I0] {25296/307} [DEBUG] (lcbio_mgr - L:408) <10.112.162.102:11210> (HE=0x10480de30) Creating new connection because none are available in the pool
|
180013ms [I0] {25296/307} [TRACE] (lcbio_mgr - L:323) <10.112.162.102:11210> (HE=0x10480de30) New pool entry: I=0x10480e070
|
180013ms [I0] {25296/307} [INFO] (connection - L:465) <10.112.162.102:11210> (SOCK=0x10480e2f0) Starting. Timeout=2500000us
|
180014ms [I0] {25296/307} [TRACE] (connection - L:360) <10.112.162.102:11210> (SOCK=0x10480e2f0) Received completion handler. Status=0. errno=0
|
180014ms [I0] {25296/307} [INFO] (connection - L:139) <10.112.162.102:11210> (SOCK=0x10480e2f0) Connected established
|
180014ms [I0] {25296/307} [INFO] (lcbio - L:112) Translating errno=0, lcb=0x2d to NETWORK_ERROR
|
180014ms [I0] {25296/307} [INFO] (connection - L:100) <10.112.162.102:11210> (SOCK=0x10480e2f0) Couldn't set TCP_NODELAY
|
180014ms [I0] {25296/307} [DEBUG] (connection - L:98) <10.112.162.102:11210> (SOCK=0x10480e2f0) Successfuly set TCP_KEEPALIVE
|
180014ms [I0] {25296/307} [DEBUG] (lcbio_mgr - L:282) <10.112.162.102:11210> (HE=0x10480de30) Received result for I=0x10480e070,C=0x0; E=0x0
|
180014ms [I0] {25296/307} [DEBUG] (lcbio_mgr - L:242) <10.112.162.102:11210> (HE=0x10480de30) Assigning R=0x10480bd60 SOCKET=0x10480e2f0
|
180014ms [I0] {25296/307} [TRACE] (server - L:595) <10.112.162.102:11210> (SRV=0x10480cce0) Session not yet negotiated. Negotiating
|
180014ms [I0] {25296/307} [DEBUG] (ioctx - L:101) <10.112.162.102:11210> (CTX=0x10270b760,unknown) Pairing with SOCK=0x10480e2f0
|
180014ms [I0] {25296/307} [DEBUG] (negotiation - L:397) <10.112.162.102:11210> (SASLREQ=0x10270a7b0) Server supports feature: 0x3 (TCP NODELAY)
|
180014ms [I0] {25296/307} [TRACE] (negotiation - L:542) <10.112.162.102:11210> (SASLREQ=0x10270a7b0) GET_ERRORMAP unsupported/disabled
|
180015ms [I0] {25296/307} [WARN] (negotiation - L:519) <10.112.162.102:11210> (SASLREQ=0x10270a7b0) SASL auth failed with STATUS=0x20
|
180015ms [I0] {25296/307} [ERROR] (negotiation - L:132) <10.112.162.102:11210> (SASLREQ=0x10270a7b0) Error: 0x2, SASL Step Failed
|
180015ms [I0] {25296/307} [ERROR] (server - L:583) <NOHOST:NOPORT> (SRV=0x10480cce0,IX=1) Connection attempt failed. Received LCB_AUTH_ERROR (0x02) from libcouchbase, received 0 from operating system
|
180015ms [I0] {25296/307} [DEBUG] (retryq - L:312) Adding PKT=0x10270ba60 to retry queue. Try count=1
|
180015ms [I0] {25296/307} [TRACE] (retryq - L:179) Next tick in 14 ms
|
180015ms [I0] {25296/307} [TRACE] (confmon - L:252) Refreshing current cluster map
|
180015ms [I0] {25296/307} [DEBUG] (ioctx - L:151) <10.112.162.102:11210> (CTX=0x10270b760,sasl) Destroying context. Pending Writes=0, Entered=true, Socket Refcount=1
|
The client seems to bounce back and forth between the 2 nodes that have been removed, and doesn’t attempt a connection to the new nodes. Interestingly, when continuously upserting every 1 second, I am not able to reproduce the issue. Appears to be a race condition.