Details
- Type: Bug
- Resolution: Fixed
- Priority: Major
- Fix Version: 7.6.2
- Triage: Untriaged
Description
Before this patch, when an eventing function was being deployed or resumed, if:
- the function had any destination bucket bindings whose LCB instances needed to be bootstrapped, AND
- there were ongoing KV service issues (e.g., a network partition, a firewall on the KV port, etc.),

we left the LCB instance in an unhealthy state and continued with deployment anyway.
Unfortunately, this LCB handle would never get repaired, even after the KV issues were resolved. As a result, any subsequent operation scheduled on this LCB instance would fail without returning control, leaving termination_lock_ held; from the customer's perspective, the eventing function appeared stuck.
With this patch, we track the status of each LCB instance and "lazily" retry bootstrapping any unhealthy LCB instance(s) until the operation timeout, which is derived from the script timeout. The process is "lazy" because the LCB bootstrap is retried only when the customer's JavaScript code actually uses the corresponding bucket binding(s).
With this change in approach, the customer's JavaScript code no longer gets stuck; instead, the mutation's processing times out.
Issue Links
- backports to MB-63014 [Trinity]: Retry LCB instance bootstrap till timeout (Resolved)