Couchbase Server / MB-49163

[System Test][Backup Service] Backup archive directory goes missing during longevity test


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Critical
    • Affects Version: 7.1.0
    • Fix Version: 7.1.0
    • qe
    • Triage: Untriaged
    • 1
    • No

    Description

      Build: 7.1.0-1558

      In our longevity cluster, we use an Ansible script for setup and installation. This script creates the following directories on all the nodes: /data/archive, /data/s3, /data/gcp and /data/couchbase. /data/archive is created to be used as the local filesystem archive by the backup service. From /var/log/messages, we can see that these directories were created successfully before the test started running:

      Oct 25 14:04:49 localhost ansible-file: Invoked with directory_mode=None force=False remote_src=None path=/data/couchbase owner=couchbase follow=True group=couchbase unsafe_writes=None state=directory content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=None selevel=None original_basename=None regexp=None validate=None src=None seuser=None recurse=False delimiter=None mode=777 attributes=None backup=None
      Oct 25 14:04:52 localhost ansible-file: Invoked with directory_mode=None force=False remote_src=None path=/data/archive owner=couchbase follow=True group=couchbase unsafe_writes=None state=directory content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=None selevel=None original_basename=None regexp=None validate=None src=None seuser=None recurse=False delimiter=None mode=777 attributes=None backup=None
      Oct 25 14:04:56 localhost ansible-file: Invoked with directory_mode=None force=False remote_src=None path=/data/s3 owner=couchbase follow=True group=couchbase unsafe_writes=None state=directory content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=None selevel=None original_basename=None regexp=None validate=None src=None seuser=None recurse=False delimiter=None mode=777 attributes=None backup=None
      Oct 25 14:04:59 localhost ansible-file: Invoked with directory_mode=None force=False remote_src=None path=/data/gcp owner=couchbase follow=True group=couchbase unsafe_writes=None state=directory content=NOT_LOGGING_PARAMETER serole=None diff_peek=None setype=None selevel=None original_basename=None regexp=None validate=None src=None seuser=None recurse=False delimiter=None mode=777 attributes=None backup=None
      Oct 25 14:05:04 localhost ansible-command: Invoked with warn=True executable=None _uses_shell=True _raw_params=chown -R couchbase:couchbase /data removes=None creates=None chdir=None stdin=None
      Oct 25 14:05:04 localhost ansible-command: [WARNING] Consider using the file module with owner rather than running chown.  If you need to use command because file is insufficient you can add warn=False to this command task or set command_warnings=False in ansible.cfg to get rid of this message.
      Oct 25 14:05:09 localhost ansible-command: Invoked with warn=True executable=None _uses_shell=True _raw_params=chmod -R 777 /data removes=None creates=None chdir=None stdin=None
      Oct 25 14:05:09 localhost ansible-command: [WARNING] Consider using the file module with mode rather than running chmod.  If you need to use command because file is insufficient you can add warn=False to this command task or set command_warnings=False in ansible.cfg to get rid of this message.
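
      For reference, the Ansible tasks above amount to roughly the following shell commands (a minimal sketch; the paths, ownership and mode are taken from the log lines above):

      # Equivalent manual setup (sketch of what the playbook does)
      mkdir -p /data/couchbase /data/archive /data/s3 /data/gcp
      chown -R couchbase:couchbase /data    # same ownership the playbook applies
      chmod -R 777 /data                    # same (permissive) mode the playbook applies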
      

      The test starts running at 2021-10-25T14:20:29. From the console, we can see that the plan and the repo were created at 2021-10-25T14:31:04 and 2021-10-25T14:31:10 respectively:

      [2021-10-25T14:31:04-07:00, appropriate/curl:072e28] -X POST -u Administrator:password -H Content-Type:application/json http://172.23.97.74:8091/_p/backup/api/v1/plan/my_plan -d {"name":"my_plan","description":"This plan does backups every 2 day","services":["data","gsi","views","ft","eventing","cbas","query"],"default":false,"tasks":[{"name":"backup-1","task_type":"BACKUP","schedule":{"job_type":"BACKUP","frequency":6,"period":"HOURS","start_now":false},"full_backup":true},{"name":"merge","task_type":"MERGE","schedule":{"job_type":"MERGE","frequency":1,"period":"DAYS","time":"12:00","start_now":false},"merge_options":{"offset_start":0,"offset_end":2},"full_backup":true}]}
      [2021-10-25T14:31:10-07:00, appropriate/curl:a286b5] -u Administrator:password -X POST http://172.23.97.74:8091/_p/backup/api/v1/cluster/self/repository/active/my_repo -H Content-Type:application/json -d {"plan": "my_plan", "archive": "/data/archive"}
      [2021-10-25T14:31:15-07:00, appropriate/curl:88500a] -u Administrator:password -X POST 
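
      The backup service exposes matching GET endpoints under the same /_p/backup/api/v1 prefix, which can be used to confirm that the plan and repository were registered (a sketch, assuming these read endpoints are available on this build):

      # Sketch: read back the plan and the active repository (paths assumed from the POSTs above)
      curl -u Administrator:password http://172.23.97.74:8091/_p/backup/api/v1/plan/my_plan
      curl -u Administrator:password http://172.23.97.74:8091/_p/backup/api/v1/cluster/self/repository/active/my_repo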
      

      We can corroborate this from the backup service log on the backup node (.123.32):

      2021-10-25T14:30:04.005-07:00 DEBUG (REST) (Attempt 1) (GET) Dispatching request to 'http://172.23.123.32:8091/pools/default/buckets'
      2021-10-25T14:30:04.055-07:00 DEBUG (REST) (Attempt 1) (GET) (200) Received response from 'http://172.23.123.32:8091/pools/default'
      2021-10-25T14:30:04.056-07:00 DEBUG (REST) (Attempt 1) (GET) Dispatching request to 'http://172.23.123.32:8091/pools/default/buckets'
      2021-10-25T14:30:04.283-07:00 DEBUG (REST) (Attempt 1) (GET) (200) Received response from 'http://172.23.123.32:8091/pools/default/buckets'
      2021-10-25T14:30:04.292-07:00 INFO (REST) Successfully connected to cluster | {"enterprise":true,"uuid":"29d3ef8afafbc18ea12bc1445a22e0f0","version":{"min_version":"7.1.0","is_mixed_cluster":false},"max_vbuckets":1024,"uniform_vbuckets":true}
      2021-10-25T14:30:04.306-07:00 DEBUG (REST) (Attempt 1) (GET) (200) Received response from 'http://172.23.123.32:8091/pools/default/buckets'
      2021-10-25T14:30:04.314-07:00 INFO (REST) Successfully connected to cluster | {"enterprise":true,"uuid":"29d3ef8afafbc18ea12bc1445a22e0f0","version":{"min_version":"7.1.0","is_mixed_cluster":false},"max_vbuckets":1024,"uniform_vbuckets":true}
      2021-10-25T14:31:04.531-07:00 INFO (REST) POST /api/v1/plan/my_plan
      2021-10-25T14:31:04.547-07:00 INFO (HTTP Manager) Added plan {"name": "my_plan"}
      2021-10-25T14:31:04.548-07:00 DEBUG (REST) (Attempt 1) (POST) Dispatching request to 'http://172.23.123.32:8091/_event'
      2021-10-25T14:31:04.549-07:00 DEBUG (REST) (Attempt 1) (POST) (200) Received response from 'http://172.23.123.32:8091/_event'
      2021-10-25T14:31:10.122-07:00 INFO (REST) POST /api/v1/cluster/self/repository/active/my_repo
      2021-10-25T14:31:10.177-07:00 INFO (GRPC Manager) Global storage location check {"req": "location:\"/data/archive/.cbbs-29d3ef8afafbc18ea12bc1445a22e0f0/node-3880b945cf7acbefb6ea16ea44cb45c4-1635197470129817882\"  sourceNode:\"3880b945cf7acbefb6ea16ea44cb45c4\""}
      2021-10-25T14:31:10.182-07:00 INFO (Runner) Running command /opt/couchbase/bin/cbbackupmgr config --archive /data/archive --repo 140cd7e8-4f69-4a54-93ad-2f1a812ecf6f --cbbs --readme-author Backup Service, cluster: 29d3ef8afafbc18ea12bc1445a22e0f0
      2021-10-25T14:31:10.228-07:00 DEBUG (Leader Manager) Received store event {"eventType": 4}
      2021-10-25T14:31:10.228-07:00 INFO (Event Handler) (Repo Change) Repository change {"event": {"Type":4,"Path":"/cbbs/cluster/self/active/my_repo","Value":"eyJpZCI6Im15X3JlcG8iLCJwbGFuX25hbWUiOiJteV9wbGFuIiwic3RhdGUiOiJhY3RpdmUiLCJhcmNoaXZlIjoiL2RhdGEvYXJjaGl2ZSIsInJlcG8iOiIxNDBjZDdlOC00ZjY5LTRhNTQtOTNhZC0yZjFhODEyZWNmNmYiLCJ2ZXJzaW9uIjoxLCJjcmVhdGlvbl90aW1lIjoiMjAyMS0xMC0yNVQxNDozMToxMC4yMTYwNTA5NDEtMDc6MDAiLCJ1cGRhdGVfdGltZSI6IjIwMjEtMTAtMjVUMTQ6MzE6MTAuMjE2MDUwOTQxLTA3OjAwIn0=","Rev":"g2gCbgUAlJnr2g5sAAAAAWgCbQAAACAzODgwYjk0NWNmN2FjYmVmYjZlYTE2ZWE0NGNiNDVjNGgCYQFuBQAenuvaDmo="}}
      2021-10-25T14:31:10.228-07:00 INFO (Event Handler) (Repo Change) Resuming repository {"repositoryID": "my_repo", "planName": "my_plan"}
      2021-10-25T14:31:10.233-07:00 INFO (Manger) Added repository {"id": "my_repo", "cluster": "self", "archive": "/data/archive", "repo": "140cd7e8-4f69-4a54-93ad-2f1a812ecf6f", "plan": "my_plan"}
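
      Right after the repository is configured, the archive directory on the backup node should contain the newly created repo; a quick check one could run on .123.32 (a sketch, assuming the default cbbackupmgr install path):

      # Sketch: inspect the archive on the backup node
      ls -ld /data/archive                                  # directory created by the setup script
      ls /data/archive                                      # should contain the 140cd7e8-... repo directory
      /opt/couchbase/bin/cbbackupmgr info -a /data/archive  # summarise repositories in the archive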
      

      But then, at 2021-10-25T14:52:37, we start seeing errors indicating that /data/archive is empty on the backup node (.123.32) and missing entirely on all the other nodes in the cluster.

      2021-10-25T14:52:37.853-07:00 ERROR (Janitor) Storage access check failed {"err": "node 3880b945cf7acbefb6ea16ea44cb45c4 cannot access location /data/archive: stat /data/archive: no such file or directory", "repository": "my_repo", "archive": "/data/archive"}
      2021-10-25T14:52:37.859-07:00 INFO (Janitor) Running storage access check on state archived
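
      The janitor's storage access check is effectively a stat of the archive path from the service node, so the failure can be reproduced by hand on .123.32 (sketch):

      # Sketch: repeat the janitor's check by hand on the backup node
      stat /data/archive    # fails with 'No such file or directory', matching the janitor error
      ls -ld /data          # check whether the parent /data directory itself is still present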
      

      At this point, when I log in to the backup service node, I can see that /data/archive is missing. The test did the following between 2021-10-25T14:31:10 and 2021-10-25T14:53:51:

      [2021-10-25T14:31:22-07:00, sequoiatools/collections:1.0:1fefb0] -i 172.23.97.74:8091 -b bucket4 -o create_multi_scope_collection -s scope_ -c coll_ --scope_count=6 --collection_count=30 --collection_distribution=uniform
      [2021-10-25T14:31:30-07:00, sequoiatools/collections:1.0:17ee4b] -i 172.23.97.74:8091 -b bucket5 -o create_multi_scope_collection -s scope_ -c coll_ --scope_count=6 --collection_count=30 --collection_distribution=uniform
      [2021-10-25T14:31:38-07:00, sequoiatools/collections:1.0:3fa39c] -i 172.23.97.74:8091 -b bucket6 -o create_multi_scope_collection -s scope_ -c coll_ --scope_count=6 --collection_count=30 --collection_distribution=uniform
      [2021-10-25T14:31:45-07:00, sequoiatools/collections:1.0:e4be55] -i 172.23.97.74:8091 -b bucket7 -o create_multi_scope_collection -s scope_ -c coll_ --scope_count=6 --collection_count=30 --collection_distribution=uniform
      [2021-10-25T14:31:53-07:00, sequoiatools/collections:1.0:05c169] -i 172.23.97.74:8091 -b bucket8 -o create_multi_scope_collection -s scope_ -c coll_ --scope_count=3 --collection_count=30 --collection_distribution=uniform
      [2021-10-25T14:32:01-07:00, sequoiatools/collections:1.0:4ce395] -i 172.23.97.74:8091 -b bucket9 -o create_multi_scope_collection -s scope_ -c coll_ --scope_count=3 --collection_count=30 --collection_distribution=uniform
      [2021-10-25T14:32:08-07:00, sequoiatools/collections:1.0:f02887] -i 172.23.106.138:8091 -b bucket4 -o create_multi_scope_collection -s scope_ -c coll_ --scope_count=6 --collection_count=30 --collection_distribution=uniform
      [2021-10-25T14:32:16-07:00, sequoiatools/collections:1.0:69c642] -i 172.23.106.138:8091 -b bucket8 -o create_multi_scope_collection -s scope_ -c coll_ --scope_count=6 --collection_count=30 --collection_distribution=uniform
      [2021-10-25T14:32:24-07:00, sequoiatools/collections:1.0:81debf] -i 172.23.106.138:8091 -b bucket9 -o create_multi_scope_collection -s scope_ -c coll_ --scope_count=6 --collection_count=30 --collection_distribution=uniform
      [pull] sequoiatools/cmd
      [2021-10-25T14:32:32-07:00, sequoiatools/cmd:a33792] 300
      [2021-10-25T14:37:38-07:00, sequoiatools/couchbase-cli:bce350] xdcr-setup -c 172.23.97.74:8091 --create --xdcr-cluster-name remote --xdcr-hostname 172.23.106.138 --xdcr-username Administrator --xdcr-password password
      [2021-10-25T14:37:44-07:00, sequoiatools/couchbase-cli:a1b4a0] xdcr-replicate -c 172.23.97.74:8091 --create --xdcr-cluster-name remote --xdcr-from-bucket default --xdcr-to-bucket remote --enable-compression 1
      [2021-10-25T14:37:50-07:00, sequoiatools/couchbase-cli:be9c07] xdcr-replicate -c 172.23.97.74:8091 --create --xdcr-cluster-name remote --xdcr-from-bucket bucket4 --xdcr-to-bucket bucket4 --enable-compression 1
      [2021-10-25T14:37:56-07:00, sequoiatools/couchbase-cli:b2fc92] xdcr-replicate -c 172.23.97.74:8091 --create --xdcr-cluster-name remote --xdcr-from-bucket bucket8 --xdcr-to-bucket bucket8 --enable-compression 1
      [2021-10-25T14:38:02-07:00, sequoiatools/couchbase-cli:111a93] xdcr-replicate -c 172.23.97.74:8091 --create --xdcr-cluster-name remote --xdcr-from-bucket bucket9 --xdcr-to-bucket bucket9 --enable-compression 1
      [pull] sequoiatools/catapult
      [2021-10-25T14:38:09-07:00, sequoiatools/catapult:778cda] -i 172.23.97.74 -u Administrator -p password -b bucket4 -n 60000 -pc 100 -dt Hotel -ds 1000 -ac True
      [2021-10-25T14:38:14-07:00, sequoiatools/catapult:5e5893] -i 172.23.97.74 -u Administrator -p password -b bucket5 -n 60000 -pc 100 -dt Hotel -ds 1000 -ac True
      [2021-10-25T14:38:19-07:00, sequoiatools/catapult:c0fcf6] -i 172.23.97.74 -u Administrator -p password -b bucket6 -n 60000 -pc 100 -dt Hotel -ds 1000 -ac True
      [2021-10-25T14:38:24-07:00, sequoiatools/catapult:96e530] -i 172.23.97.74 -u Administrator -p password -b bucket7 -n 60000 -pc 100 -dt Hotel -ds 1000 -ac True
      [pull] sequoiatools/transactions
      [2021-10-25T14:38:31-07:00, sequoiatools/transactions:5dbb4f] 172.23.97.74 default 1000
      [2021-10-25T14:38:36-07:00, sequoiatools/collections:1.0:8391b5] -i 172.23.97.74:8091 -b bucket8 -o crud_mode --crud_interval=120 --max_scopes=10 --max_collections=100
      [2021-10-25T14:38:41-07:00, sequoiatools/collections:1.0:1a5831] -i 172.23.97.74:8091 -b bucket9 -o crud_mode --crud_interval=120 --max_scopes=10 --max_collections=100
      [pull] sequoiatools/pillowfight:7.0
      [2021-10-25T14:38:47-07:00, sequoiatools/pillowfight:7.0:5a0da9] -U couchbase://172.23.97.74/default?select_bucket=true -M 512 -I 6000 -B 600 -t 1 --rate-limit 6000 -P password --durability majority
      [2021-10-25T14:38:52-07:00, sequoiatools/cmd:b2483d] 600
      [2021-10-25T14:50:06-07:00, sequoiatools/couchbase-cli:7.0:035006] rebalance -c 172.23.97.74:8091 --server-remove 172.23.96.14:8091 -u Administrator -p password
      [2021-10-25T14:52:45-07:00, sequoiatools/cmd:81abb9] 60
      [2021-10-25T14:53:51-07:00, sequoiatools/cmd:3cc9df] 600
      

      In terms of steps, the test:
      Created some scopes and collections
      Set up XDCR
      Loaded data onto some buckets
      Ran CRUD operations on scopes and collections
      Did more data loading
      Rebalanced out a KV node

      Cluster config:

      ########## Cluster config ##################
      ######  cbas : 3 ===== > [172.23.120.74:8091 172.23.120.75:8091 172.23.120.81:8091]  ###########
      ######  backup : 1 ===== > [172.23.123.32:8091]  ###########
      ######  index : 6 ===== > [172.23.96.243:8091 172.23.96.254:8091 172.23.97.105:8091 172.23.97.110:8091 172.23.97.112:8091 172.23.97.148:8091]  ###########
      ######  n1ql : 2 ===== > [172.23.97.149:8091 172.23.97.150:8091]  ###########
      ######  fts : 2 ===== > [172.23.106.134:8091 172.23.97.151:8091]  ###########
      ######  eventing : 3 ===== > [172.23.120.58:8091 172.23.123.31:8091 172.23.123.33:8091]  ###########
      ######  kv : 12 ===== > [172.23.120.73:8091 172.23.120.77:8091 172.23.120.86:8091 172.23.121.77:8091 172.23.123.24:8091 172.23.123.25:8091 172.23.123.26:8091 172.23.96.122:8091 172.23.96.14:8091 172.23.96.48:8091 172.23.97.241:8091 172.23.97.74:8091]  ###########
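
      The same layout can be confirmed from any node with couchbase-cli server-list (a sketch, run against the .97.74 node used above):

      # Sketch: list nodes and their services
      /opt/couchbase/bin/couchbase-cli server-list -c 172.23.97.74:8091 -u Administrator -p password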
      

      Logs:
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.106.134.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.120.58.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.120.73.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.120.74.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.120.75.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.120.77.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.120.81.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.120.86.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.121.77.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.123.24.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.123.25.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.123.26.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.123.31.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.123.32.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.123.33.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.96.122.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.96.243.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.96.254.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.96.48.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.97.105.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.97.110.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.97.112.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.97.148.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.97.149.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.97.150.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.97.151.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.97.241.zip
      url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1635202730/collectinfo-2021-10-25T225851-ns_1%40172.23.97.74.zip


            People

              Assignee: Arunkumar Senthilnathan (Inactive)
              Reporter: Arunkumar Senthilnathan (Inactive)
