During internal Tools testing of backups/restores of data from other services, the eventing service restore test sometimes fails.
The setup of the relevant test suite:
1. Load data to all services
2. Do a backup
3. Remove data from all services
4. Do a restore for all other services individually one at a time
5. Do a restore for the eventing service last
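The backup and eventing-only restore steps above could look roughly like this with cbbackupmgr (a sketch only: the archive and repo names are placeholders, and the --disable-* service filters should be checked against `cbbackupmgr restore --help` for your version):

```shell
#!/usr/bin/env bash
# Sketch of the suite's backup and eventing-only restore; paths are examples.

ARCHIVE=/tmp/backup-archive
REPO=test-repo
CLUSTER="http://127.0.0.1:9000"   # cluster_run node (assumption)

run_backup() {
    cbbackupmgr backup -a "$ARCHIVE" -r "$REPO" \
        -c "$CLUSTER" -u Administrator -p asdasd
}

# Restore only the eventing service by disabling the others (flag names as
# in recent cbbackupmgr versions; verify with `cbbackupmgr restore --help`).
restore_eventing_only() {
    cbbackupmgr restore -a "$ARCHIVE" -r "$REPO" \
        -c "$CLUSTER" -u Administrator -p asdasd \
        --disable-data --disable-views --disable-gsi-indexes \
        --disable-ft-indexes --disable-analytics
}
```

The other services are restored one at a time with the analogous set of --disable-* flags.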
The issue is that, sometimes, after the restore for the eventing service finishes, only one of the backed-up functions is restored and appears on the cluster (we back up and restore one collection-aware function and one non-collection-aware function for backwards compatibility).
I have investigated the issue and found several things:
1. The per-function endpoint /api/v1/functions/function_name starts returning a 404 status code before the function is fully removed from the eventing service.
For example, the approximate time we got the 404 status code from that endpoint in one of our test runs was
However, in ns_server.eventing.log I see these relevant entries (all should be related to deletion of that function):
I am not sure if this is the intended behaviour. This is, however, not the main issue.
2. If we base our expectations on that endpoint and then do a restore, there appears to be a race condition in the eventing service that leads to the not-fully-deleted function not being restored (cbbackupmgr still treats the restore as successful). This is the issue I am mainly concerned about, as the expectation is that the eventing service correctly handles the case where a function is restored while it is still in the process of being deleted. A race condition is my current theory; I base it mainly on the fact that adding a delay between the point at which the endpoint starts returning 404 and the restore seems to fix the issue.
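To make the workaround concrete, here is a minimal bash sketch that polls the endpoint until it returns 404 and then waits before restoring. The host/port (the default eventing admin port; cluster_run uses different ports), the credentials, and the `wait_for_deletion` helper and settle delay are my own assumptions, not part of the test suite:

```shell
#!/usr/bin/env bash
# Hypothetical sketch: wait until the per-function endpoint reports 404,
# then allow an extra settle period before restoring, since a 404 alone
# does not yet mean the function is fully removed.

EVENTING="${EVENTING:-http://127.0.0.1:8096}"  # eventing admin port (assumption)
AUTH="Administrator:asdasd"
SETTLE_SECONDS="${SETTLE_SECONDS:-30}"         # extra wait after the first 404

wait_for_deletion() {
    local fn="$1" code
    while :; do
        code=$(curl -s -o /dev/null -w '%{http_code}' -u "$AUTH" \
            "$EVENTING/api/v1/functions/$fn")
        [ "$code" = "404" ] && break
        sleep 1
    done
    # Give the eventing service time to finish internal cleanup before the
    # restore is started; without this delay the race described above hits.
    sleep "$SETTLE_SECONDS"
}
```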
I wrote a bash script that should help with reproducing the issue; with it I can consistently get the case where only one function is restored, and sometimes even both functions end up missing. To reproduce:
1. Set up a cluster_run cluster with 1 node, with admin name 'Administrator' and password 'asdasd' (otherwise you will have to change the host and the ports in the script). I used the same memory settings as we do in our testing tool:
- Data: 256
- Query: On
- Index: 512
- Search: 512
- Analytics: 2152
- Eventing: 256
- Backup: On
2. Create 2 buckets of 128 MB each, "default" and "meta", and enable flush on both
3. Run the reproduce.sh script
- Positional parameter 1: backup archive directory
- Positional parameter 2: backup repository name
- Positional parameter 3: delay in seconds between the deployment of the functions and the backup (set this to 30 seconds; it is used to reproduce the issue in MB- )
- Positional parameter 4: delay in seconds between the deletion of the functions (after the backup) and the restore
4. (Optionally) Run the second script cleanup.sh to get back to the initial state so that the script can be run again; you might need to undeploy the functions first, though
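For reference, the core of reproduce.sh can be sketched roughly as below. The function names, cluster address, and eventing port are illustrative assumptions; the cbbackupmgr flags are the standard archive/repo/cluster ones:

```shell
#!/usr/bin/env bash
# Rough sketch of what reproduce.sh does; names and addresses are examples.

reproduce() {
    local archive="$1" repo="$2" pre_backup_delay="$3" pre_restore_delay="$4"
    local cluster="http://127.0.0.1:9000"    # cluster_run node (assumption)
    local eventing="http://127.0.0.1:8096"   # eventing admin port (assumption)
    local user=Administrator pass=asdasd

    sleep "$pre_backup_delay"    # delay between deployment and backup

    cbbackupmgr backup -a "$archive" -r "$repo" \
        -c "$cluster" -u "$user" -p "$pass"

    # Delete both functions via the eventing REST API (names are examples).
    for fn in collection_aware_function legacy_function; do
        curl -s -X DELETE -u "$user:$pass" "$eventing/api/v1/functions/$fn"
    done

    sleep "$pre_restore_delay"   # delay between deletion and restore

    cbbackupmgr restore -a "$archive" -r "$repo" \
        -c "$cluster" -u "$user" -p "$pass"
}
```

A small pre-restore delay reproduces the missing-function case; a larger one makes the restore succeed.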
With settings like this:
you should get 0-1 functions after the restore.
These settings should give you both functions (this is the expected behaviour).
The cluster logs archives exceed the size limit, so I will attach only the cbbackupmgr logs and the eventing service logs for example failed and successful runs of the relevant test suite (I can provide the full archives if necessary).