Couchbase Server / MB-51301

Eventing service doesn't correctly handle the delete-restore race condition


    Description

      What is the issue?
      When doing internal Tools testing of backups/restores of data from the various services, the eventing service restore test sometimes fails.

      The setup of the relevant test suite:
      1. Load data to all services
      2. Do a backup
      3. Remove data from all services
      4. Do a restore for all other services individually one at a time
      5. Do a restore for the eventing service last

      The issue is that sometimes, after the restore for the eventing service has finished, only one of the backed-up functions is restored and appears on the cluster (we back up and restore one collection-aware function and one non-collection-aware function, for backwards compatibility).

      I have investigated the issue and found several things:
      1. The per-function endpoint /api/v1/functions/<function_name> starts returning a 404 status code before the function is fully removed from the eventing service.

      For example, the approximate time we got the 404 status code from that endpoint in one of our test runs was

      2022-03-03 15:40:21.950
      

      However, in ns_server.eventing.log I see these relevant entries (all should be related to deletion of that function):

      2022-03-03T15:40:21.847+00:00 [Info] ServiceMgr::deletePrimaryStore Function: eventing_func_legacy deleting from primary store
      2022-03-03T15:40:21.852+00:00 [Info] ServiceMgr::deleteTempStore Function: eventing_func_legacy deleting drafts from temporary store
      2022-03-03T15:40:22.368+00:00 [Info] SuperSupervisor::deleteFromCleanupApps [2] Function: eventing_func_legacy deleted
      2022-03-03T15:40:37.455+00:00 [Info] SuperSupervisor::deleteFromLocallyDeployedApps [1] Function: eventing_func_legacy deleting from locally deployed apps map
      2022-03-03T15:40:52.494+00:00 [Info] SuperSupervisor::deleteFromRunningProducers [0] Function: eventing_func_legacy deleted from running functions
      2022-03-03T15:40:52.495+00:00 [Info] SuperSupervisor::deleteFromDeployedApps [0] Function: eventing_func_legacy deleting from deployed apps map
      

      I am not sure if this is the intended behaviour. This is, however, not the main issue.

      2. If we base our expectations on that endpoint and then do a restore, there seems to be a race condition in the eventing service which results in the not-fully-deleted function not being restored (cbbackupmgr still treats the restore as successful). This is the issue I am mainly concerned about, as the expectation is that the eventing service correctly handles the case where a function is still being deleted while it is being restored at the same time. A race condition is my current theory, based mainly on the fact that adding a delay between the point at which the endpoint returns the 404 status code and the restore seems to fix the issue.
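      To make the racy check concrete, here is a minimal Python sketch of what the test suite effectively does before restoring. This is illustrative only: the port and credentials come from the cluster_run setup described below, and wait_for_404 is a hypothetical helper, not our actual test code.

```python
"""Sketch of the racy deletion check: poll the per-function endpoint
until it answers 404, then treat the function as 'gone'."""
import base64
import time
import urllib.error
import urllib.request

BASE = "http://localhost:9300/api/v1"  # eventing port under cluster_run (assumption)
AUTH = {"Authorization": "Basic " + base64.b64encode(b"Administrator:asdasd").decode()}

def get_status_code(url: str) -> int:
    """HTTP GET that returns the status code instead of raising on 4xx."""
    try:
        with urllib.request.urlopen(urllib.request.Request(url, headers=AUTH)) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code

def wait_for_404(name: str, get=get_status_code,
                 timeout: float = 60.0, poll: float = 1.0) -> bool:
    """Poll /api/v1/functions/<name> until it returns 404.

    As the log excerpt above shows, the 404 can arrive some 30 seconds
    before the SuperSupervisor has finished tearing the function down,
    so True here does NOT mean it is safe to restore yet.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get(f"{BASE}/functions/{name}") == 404:
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll)
```

      The `get` parameter exists only so the polling logic can be exercised without a live cluster.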

      Steps to reproduce:
      I wrote a bash script that should help with reproducing the issue; with it, I can consistently get the case where only one function is restored, and sometimes both functions end up missing.

      1. Set up a cluster_run cluster with 1 node, with admin name 'Administrator' and password 'asdasd' (otherwise you will have to change the host and the ports in the script). I used the same memory settings as we do in our testing tool (quotas in MB):

      • Data: 256
      • Query: On
      • Index: 512
      • Search: 512
      • Analytics: 2152
      • Eventing: 256
      • Backup: On

      2. Create 2 buckets of 128 MB each, "default" and "meta", and enable flush on both
      3. Run the reproduce.sh script

      • Positional parameter 1: backup archive directory
      • Positional parameter 2: backup repository name
      • Positional parameter 3: delay in seconds between the deployment of the functions and the backup (set this to 30 seconds; it is used to reproduce the issue in MB- )
      • Positional parameter 4: delay in seconds between the deletion of the functions after the backup and the restore

      4. (Optional) Run the second script, cleanup.sh, to return to the initial state so that the script can be run again; you might need to undeploy the functions first, though.

      With settings like this:

      ./reproduce.sh ~/temp/backups repo_0 30 0
      

      you should end up with 0 or 1 functions after the restore.

      ./reproduce.sh ~/temp/backups repo_0 30 30
      

      should give you both functions (this is the expected behaviour).

      The cluster log archives exceed the attachment size limit, so I will only attach the cbbackupmgr logs and the eventing service logs for example failed and successful runs of the relevant test suite (I can provide the full archives if necessary).

      Attachments

        1. backup-logs-fail.log
          1.11 MB
        2. backup-logs-success.log
          1.11 MB
        3. cleanup.sh
          0.4 kB
        4. eventing-log-fail.log
          40.99 MB
        5. eventing-log-success.log
          46.06 MB
        6. reproduce.sh
          3 kB
        7. reproduce2.sh
          3 kB


          Activity

            ankit.prabhu Ankit Prabhu added a comment -

            Hi Maksimiljans Januska,
            Looking at your test script (reproduce.sh), the endpoint used to check the status of the app is not correct.
            The REST endpoint to check the status of the app is: "api/v1/status/<function_name>?bucket=<BucketName>&scope=<scopeName>"

            curl -u Administrator:asdasd -X GET http://localhost:9300/api/v1/status/eventing_func\?bucket=default\&scope=_default                                       
            {
             "app": {
              "composite_status": "undeployed",
              "name": "eventing_func",
              "function_scope": {
               "bucket": "default",
               "scope": "_default"
              },
              "num_bootstrapping_nodes": 0,
              "num_deployed_nodes": 0,
              "deployment_status": false,
              "processing_status": false,
              "redeploy_required": false
             },
             "num_eventing_nodes": 1
            }

            "composite_status" will give the current status of the app.

            "api/v1/functions/<function_name>?bucket=<BucketName>&scope=<scopeName>" just returns what the user saved (for a deployment request, ".settings.deployment_status = true && .settings.processing_status = true"), not the actual state of the function across the eventing nodes.

            The script is undeploying the handler and immediately deleting the app without waiting for the undeployment to complete. We allow this for historical reasons, but it is not recommended. In this case the function will be in the undeploying state and will continue to remove its checkpoints in the background, so the restore won't allow importing this handler.

            Script with the changes to check the correct status of the app: reproduce2.sh
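            The sequence described above (undeploy, wait for composite_status to reach "undeployed", and only then delete and restore) can be sketched in Python as follows. This is a hypothetical helper, not reproduce2.sh itself; the port, credentials, and query parameters are taken from the curl example above.

```python
"""Wait for a function to finish undeploying before deleting it,
using the /api/v1/status endpoint's composite_status field."""
import base64
import json
import time
import urllib.request

BASE = "http://localhost:9300/api/v1"  # single-node cluster_run, as above
AUTH = {"Authorization": "Basic " + base64.b64encode(b"Administrator:asdasd").decode()}

def composite_status(payload: dict) -> str:
    """Pull composite_status out of a /api/v1/status/<name> response body."""
    return payload["app"]["composite_status"]

def fetch_status(name: str) -> dict:
    url = f"{BASE}/status/{name}?bucket=default&scope=_default"
    with urllib.request.urlopen(urllib.request.Request(url, headers=AUTH)) as resp:
        return json.load(resp)

def wait_until_undeployed(name: str, fetch=fetch_status,
                          timeout: float = 120.0, poll: float = 1.0) -> None:
    """Return once the function reports 'undeployed' on the eventing nodes.

    Only after this is it safe to DELETE /api/v1/functions/<name> and
    then restore the backup without racing the background cleanup.
    """
    deadline = time.monotonic() + timeout
    while composite_status(fetch(name)) != "undeployed":
        if time.monotonic() >= deadline:
            raise TimeoutError(f"{name} is still undeploying")
        time.sleep(poll)
```

            As in the earlier sketch, the `fetch` parameter is only there so the wait loop can be tested without a cluster.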

            ankit.prabhu Ankit Prabhu added a comment -

            Please reopen if this is not the case.

            james.lee James Lee added a comment -

            Hi Ankit Prabhu,

            In this case the function will be in undeploying state and continues to remove checkpoint in the background. So restore won't allow importing this handler.

            In our testing we're seeing the restore complete successfully, with the eventing restore endpoint returning a 200 status code; this isn't ideal UX, as we'd prefer to error out if the restore request isn't allowed. Could you please confirm whether this is expected? I'd expect this to return a 400 status with a useful error message.

            Thanks,
            James
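            Until the endpoint does return an error, a client-side guard is possible: after the restore, list the function names actually present on the cluster and diff them against the names in the backup. A minimal sketch of that check (the helper name is made up; fetching the on-cluster names, e.g. from the functions list endpoint, is left out):

```python
def missing_after_restore(expected, on_cluster):
    """Return the function names that were in the backup but did not
    reappear on the cluster after the restore.

    A non-empty result means the restore's 200 response was misleading
    and the run should be treated as failed.
    """
    return sorted(set(expected) - set(on_cluster))
```

            For the two functions in this report, calling missing_after_restore(["eventing_func", "eventing_func_legacy"], restored_names) would surface the silently skipped one.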


            People

              ankit.prabhu Ankit Prabhu
              maks.januska Maksimiljans Januska