A longer explanation is needed here from eventing's perspective, as this wasn't straightforward to spot. For LCB folks, TL;DR: only the last section of this comment is relevant.
Taking some examples from the timer_2 metadata.zip, we can see 10 timer partitions, from partition 688 to 760, for which the span start ("sta") has not shrunk (moved forward) at all. Example:
{
  "coll2": {
    "sta": 1615536930,
    "stp": 1615533563
  },
  "id": "eventing::2093318803:tm:688:sp"
},
Epoch 1615536930 = 12 March 2021 00:15:30 (UTC-08:00).
By contrast, note the timer span for a "good" partition, for example 680:
{
  "coll2": {
    "sta": 1615542460,
    "stp": 1615533563
  },
  "id": "eventing::2093318803:tm:680:sp"
},
Epoch 1615542460 = 12 March 2021 01:47:40 (UTC-08:00).
So the bottom line is that for these 10 partitions (688 - 760), the last successful timer scan was at 12 March 2021 00:15:30 (UTC-08:00). Put differently, the first failure happened while scanning root keys starting at 12 March 2021 00:15:37 (UTC-08:00), i.e. at the next 7-second scan boundary.
Checking the zip file above, we can see that all context keys, alarms and root keys belong to those 10 partitions.
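As a sanity check on the timestamps above, here is a minimal C sketch (illustrative only, not eventing code) that decodes the stuck partition's "sta" epoch and computes the next 7-second scan boundary, i.e. where the first failed scan would have started:

```c
#include <stdio.h>
#include <time.h>

int main(void) {
    /* "sta" from the stuck partition 688 span document shown above */
    time_t sta = 1615536930;

    char buf[64];
    struct tm tm_utc;

    /* Decode the epoch in UTC: 2021-03-12 08:15:30 UTC == 00:15:30 -08:00 */
    gmtime_r(&sta, &tm_utc);
    strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S UTC", &tm_utc);
    printf("last successful scan (sta): %s\n", buf);

    /* Timer scans run on 7-second boundaries (per the observation above), so
     * the first scan that could have failed starts at the next multiple of 7
     * after sta: 1615536937 == 00:15:37 -08:00 */
    time_t next_scan = (sta / 7 + 1) * 7;
    gmtime_r(&next_scan, &tm_utc);
    strftime(buf, sizeof buf, "%Y-%m-%d %H:%M:%S UTC", &tm_utc);
    printf("first possibly-failed scan boundary: %s\n", buf);
    return 0;
}
```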
—
On to the next section: Why is the timer scan not processing these root keys?
Partitions 688 - 760 are assigned to eventing node 172.23.121.165. Given that the prolonged effect of the issue is seen for the timers2_0 function, let's check the eventing-consumer logs only for this function: 693280967_timer2.log
The above is an extract of the logs for the timer2 worker thread, which uses the LCB instance with ID 693280967.
—
Here we can see that from 2021-03-12T00:15:55 onwards (reminder: right around the last time the timer scan for partition 688 and the others was successful), the libcouchbase instance used by the timer2 worker thread on eventing node 121.165 started reporting timeouts for get_cluster_config calls destined only to KV node 172.23.104.16 (calls to the other KV nodes were fine):
2021-03-12T00:15:55.132-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,cccp L:187 I:693280967] <NOHOST:NOPORT> (CTX=(nil),) Could not get configuration: LCB_ERR_TIMEOUT (201)
2021-03-12T00:16:51.131-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,server L:814 I:693280967] Failing command with error LCB_ERR_TIMEOUT (201): {"b":"metadata","i": "544d25282952a0c7/2ec3390c27d099ff/b023c","l":"172.23.121.165:44534","r":"172.23.104.16:11210","s":"kv:get_cluster_config","t":58000000}
...
2021-03-12T01:05:12.265-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,server L:814 I:693280967] Failing command with error LCB_ERR_TIMEOUT (201): {"b":"metadata","i": "544d25282952a0c7/2ec3390c27d099ff/b093b","l":"172.23.121.165:44534","r":"172.23.104.16:11210","s":"kv:get_cluster_config","t":58000000}
...
2021-03-12T01:05:13.279-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,server L:1108 I:693280967] <172.23.104.18:11210> (CTX=0x7f261c011d50,memcached, SRV=0x7f262c1afe60,IX=2) Got socket error LCB_ERR_SOCKET_SHUTDOWN (1025)
...
2021-03-12T01:05:13.280-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,connection L:153 I:693280967] <172.23.104.18:11210> (SOCK=9607dcb76e53922a) Connected established
...
2021-03-12T01:05:14.766-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,server L:814 I:693280967] Failing command with error LCB_ERR_TIMEOUT (201): {"b":"metadata","i": "544d25282952a0c7/2ec3390c27d099ff/b093d","l":"172.23.121.165:44534","r":"172.23.104.16:11210","s":"kv:get_cluster_config","t":58000000}
2021-03-12T01:05:15.280-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,cccp L:187 I:693280967] <NOHOST:NOPORT> (CTX=(nil),) Could not get configuration: LCB_ERR_TIMEOUT (201)
...
2021-03-12T01:53:40.799-08:00 [Info] eventing-consumer [worker_timers2_0_0:/tmp/127.0.0.1:8091_0_2093318803.sock:16472] [lcb,server L:814 I:693280967] Failing command with error LCB_ERR_TIMEOUT (201): {"b":"metadata","i": "544d25282952a0c7/2ec3390c27d099ff/b11a1","l":"172.23.121.165:44534","r":"172.23.104.16:11210","s":"kv:get_cluster_config","t":58000000}
|
Note the log snippet above: the lcb errors started at 2021-03-12T00:15:55 and continued until 2021-03-12T01:05:12. At that point, there was a socket shutdown at 2021-03-12T01:05:13.
A new connection was then established within the same second, at 2021-03-12T01:05:13.
However, even with the new connection, similar timeouts for get_cluster_config continued until the logs were collected.
From the perspective of memcached on node 172.23.104.16, there are no slow GET_CLUSTER_CONFIG calls; the slowest is 102 ms:
[ 10.75 - 102.40]ms (99.9995%) 1|
Additionally, eventing now passes a 58-second timeout to libcouchbase for all KV operations (consistent with the "t":58000000 carried by the failing commands above). Worth noting that libcouchbase appears to be retrying multiple times, roughly every 10 seconds.
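For LCB folks, a rough, compile-only C sketch of the client-side knobs involved, using the public lcb_cntl_string() interface; this is illustrative only and makes no claim about how eventing actually configures its lcb instances (the setting names below are the generic libcouchbase ones):

```c
/* Illustrative only: tuning libcouchbase timeouts via lcb_cntl_string().
 * This is not how eventing necessarily configures its client. */
#include <libcouchbase/couchbase.h>
#include <stdio.h>

static void tune_timeouts(lcb_INSTANCE *instance)
{
    /* Per-operation KV timeout. 58 s matches the "t":58000000 (microseconds)
     * carried by the failing commands in the log above. */
    lcb_STATUS rc = lcb_cntl_string(instance, "operation_timeout", "58");
    if (rc != LCB_SUCCESS) {
        fprintf(stderr, "operation_timeout: %s\n", lcb_strerror_short(rc));
    }

    /* Interval at which the client proactively refreshes the cluster config.
     * (Assumed knob for illustration; the ~10 s retry cadence observed in the
     * log may well come from different retry logic inside lcb.) */
    rc = lcb_cntl_string(instance, "config_poll_interval", "10");
    if (rc != LCB_SUCCESS) {
        fprintf(stderr, "config_poll_interval: %s\n", lcb_strerror_short(rc));
    }
}
```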
—
Effect of these timeouts on eventing:
With such continuous timeouts, none of the timer documents in the metadata bucket that are owned by KV node 104.16 can be accessed. This is why the timer scan ultimately gives up on those partitions.
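To make "owned by KV node 104.16" a bit more concrete: every document key in the metadata bucket hashes to one of the bucket's vBuckets, and each vBucket has exactly one active KV node, so all root keys, alarms and contexts of the affected partitions sit behind the unreachable node. Below is a hedged C sketch of the commonly documented Couchbase key-to-vBucket mapping (CRC32 of the key, top bits, modulo the vBucket count), using zlib's crc32() purely for illustration:

```c
/* Sketch of the commonly documented Couchbase key -> vBucket mapping:
 * CRC32 of the key, shifted/masked, modulo the number of vBuckets
 * (1024 on most platforms). The vBucket id is then looked up in the
 * bucket's vBucket map to find the active KV node owning the document. */
#include <stdio.h>
#include <string.h>
#include <zlib.h> /* crc32(); build with -lz */

static unsigned key_to_vbucket(const char *key, unsigned num_vbuckets)
{
    uLong crc = crc32(0L, (const Bytef *)key, (uInt)strlen(key));
    return (unsigned)((crc >> 16) & 0x7fff) % num_vbuckets;
}

int main(void)
{
    /* Span key of one of the stuck timer partitions, from the dump above */
    const char *key = "eventing::2093318803:tm:688:sp";
    printf("%s -> vb %u\n", key, key_to_vbucket(key, 1024));
    return 0;
}
```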
—
Worth noting that this exact issue has also been seen in MB-44913, where such get_cluster_config timeouts were preventing a handler from being paused.
Michael Nitschinger, given that the root cause in this case matches MB-44913, I have assigned this to you.
Please let me know if further clarification is needed.
Rerun on 7.0.0 build 4669
Scope_0 is the source, scope_1 is the destination.
volume_4669 → http://supportal.couchbase.com/snapshot/a5c987add8e2ba7de815826265fca1e4::0