Loading...

Details

Type: Bug
Resolution: Unresolved
Priority: Major
Fix Version/s: Morpheus
Affects Version/s: 7.1.0
Component/s: eventing
Labels:
- request-dev-verify

Triage:
Untriaged
Story Points:
1
Is this a Regression?:
Unknown

Description

What is the issue?
When doing internal Tools testing of backups/restores of data from other services the eventing service restore test sometimes fails.

The setup of the relevant test suite:
1. Load data to all services
2. Do a backup
3. Remove data from all services
4. Do a restore for all other services individually one at a time
5. Do a restore for the eventing service last

The issue is that sometimes after the restore for the eventing service is finished only one of the backed up functions is restored and appears on the cluster (we backup and restore one collection-aware function and one non-collection-aware for backwards compatibility).

I have investigated the issue and I have managed to find several things:
1. The endpoint for a function /api/v1/functions/function_name starts returning 404 status code before the functions is fully removed from the eventing service.

For example, the approximate time we got the 404 status code from that endpoint in one of our test runs was

2022-03-03 15:40:21.950

However, in ns_server.eventing.log I see these relevant entries (all should be related to deletion of that function):

2022-03-03T15:40:21.847+00:00 [Info] ServiceMgr::deletePrimaryStore Function: eventing_func_legacy deleting from primary store

2022-03-03T15:40:21.852+00:00 [Info] ServiceMgr::deleteTempStore Function: eventing_func_legacy deleting drafts from temporary store

2022-03-03T15:40:22.368+00:00 [Info] SuperSupervisor::deleteFromCleanupApps [2] Function: eventing_func_legacy deleted

2022-03-03T15:40:37.455+00:00 [Info] SuperSupervisor::deleteFromLocallyDeployedApps [1] Function: eventing_func_legacy deleting from locally deployed apps map

2022-03-03T15:40:52.494+00:00 [Info] SuperSupervisor::deleteFromRunningProducers [0] Function: eventing_func_legacy deleted from running functions

2022-03-03T15:40:52.495+00:00 [Info] SuperSupervisor::deleteFromDeployedApps [0] Function: eventing_func_legacy deleting from deployed apps map

I am not sure if this is the intended behaviour. This is, however, not the main issue.

2. If we base our expectations on that endpoint and do a restore, there seems to be a race condition in the eventing service which leads to the not-fully-deleted function not being restored (cbbackupmgr still treats the restore as successful). This is the issue that I am mainly concerned about as the expectation is that the eventing service correctly handles this case where a function is in the process of being deleted and is being restored at the same time (this being caused by a race condition is my current theory, I am mainly basing it on the fact that adding a delay in-between the point at which the endpoint returns 404 status code and the restore seems to fix the issue).

Steps to reproduce:
I wrote a bash script that should help with reproducing the issue, I can consistently get the case where only one function is restored and sometimes even both functions are missing in the end.

1. Set up a cluster_run cluster with 1 node with admin name 'Administrator' and password 'asdasd' (otherwise you will have to change the host and the ports in the script), I used the same memory settings as we do in our testing tool:

Data: 256
Query: On
Index: 512
Search: 512
Analytics: 2152
Eventing: 256
Backup: On

2. Create 2 buckets of 128 MB each: "default" and "meta", enable flush on both
3. Run the reproduce.sh script

Positional parameter 1: backup archive directory
Positional parameter 2: backup repository name
Positional parameter 3: delay in second in-between deployment of functions and the backup (set this to 30 seconds, it is used to reproduce issue in MB- )
Positional parameter 4: delay in second in-between the deletion of functions after the backup and the restore

4. (Optionally) Run the second script cleanup.sh to get to the initial state to be able to run the script again, you might need to undeploy functions first though

With settings like this:

./reproduce.sh ~/temp/backups repo_0 30 0

you should get 0-1 functions after the restore.

./reproduce.sh ~/temp/backups repo_0 30 30

should give you both functions (this is the expected behaviour).

The cluster logs archives exceed the size limit so I will only attach the cbbackupmgr logs and the eventing service logs (can provide full archive if necessary as well) for example failed and successful runs of the relevant test suite.

Attachments

- Sort By Name
- Sort By Date
- Ascending
- Descending
- Thumbnails
- List
- Download All

backup-logs-fail.log
1.11 MB
03/Mar/22 8:53 AM
backup-logs-success.log
1.11 MB
03/Mar/22 8:53 AM
cleanup.sh
0.4 kB
04/Mar/22 9:58 AM
eventing-log-fail.log
40.99 MB
03/Mar/22 8:53 AM
eventing-log-success.log
46.06 MB
03/Mar/22 8:53 AM
reproduce.sh
3 kB
04/Mar/22 9:58 AM
reproduce2.sh
3 kB
05/Mar/22 9:19 PM

Issue Links

relates to

MB-51312 Functions that were backed up in the 'deploying' state change their state several times after a restore

Open

MB-51311 Functions in the 'deploying' state are not being backed up

Closed

Gerrit Reviews

- Issue Only
- Show All Reviews
- Show Open Reviews

No reviews matched the request. Check your Options in the drop-down menu of this sections header.

Eventing service doesn't correctly handle the delete-restore race condition

Details

Description

Attachments

Attachments

Issue Links

Gerrit Reviews

Activity

People

Dates

Gerrit Reviews

PagerDuty