MB-45658 (Couchbase Server): [Backup Service] [Investigate] Investigate volume testing failures.

Details

    • Task
    • Resolution: Done
    • Major
    • 7.0.0
    • Cheshire-Cat
    • tools
    • None
    • Enterprise Edition 7.0.0 build 4907

    Description

      Description:

      Volume testing is quite chaotic; we observed 3 types of failing tasks during the run. The goal of this issue is to determine whether the following task failures are expected under these chaotic conditions:

      1. MB-45659 (Disk I/O Error)
      2. MB-45660 (Authentication failed)
      3. MB-45661 (Failed to sync lockfile)

      Cluster Setup:

      There are roughly 5 nodes present in the cluster at any given moment, with an extra 3 nodes used as spares for the swap rebalance.

      Services: 

      Each node runs the kv and backup services. 

      Testing:

      The exact steps performed in the test can be found here: https://hub.internal.couchbase.com/confluence/pages/viewpage.action?pageId=50135893

      The test made it to step 15 before I terminated it.

      Backup Service Configuration:

      There are 10 repositories: 'repo-plan1' .. 'repo-plan10' each with a plan: 'plan1' .. 'plan10' sharing identical tasks.

      Tasks: Backup every 15 minutes. Merge every 40 minutes, covering the period between 0 and 1 days.

      Each repository has a unique archive location, '/tmp/my-archive/archive-plan1' .. '/tmp/my-archive/archive-plan10', to avoid lock contention issues.
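
      The repository layout above can be sketched as follows. This is only an illustration: it assumes the archive suffix follows the plan number for all 10 repositories, and the `Repository` class, `build_repositories` helper and the key names in `TASKS` are hypothetical, not the backup service's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass

ARCHIVE_ROOT = "/tmp/my-archive"  # NFS mount point named in the ticket


@dataclass(frozen=True)
class Repository:
    name: str     # e.g. 'repo-plan1'
    plan: str     # e.g. 'plan1'
    archive: str  # e.g. '/tmp/my-archive/archive-plan1'


def build_repositories(count: int = 10) -> list[Repository]:
    """Return one repository per plan, each with its own archive directory."""
    return [
        Repository(
            name=f"repo-plan{i}",
            plan=f"plan{i}",
            archive=f"{ARCHIVE_ROOT}/archive-plan{i}",
        )
        for i in range(1, count + 1)
    ]


# Task schedule shared by every plan. The key names are illustrative only.
TASKS = [
    {"type": "backup", "every_minutes": 15},
    {"type": "merge", "every_minutes": 40,
     "offset_start_days": 0, "offset_end_days": 1},
]

repos = build_repositories()
# Every repository gets a distinct archive directory, so no two
# repositories point at the same archive.
assert len({r.archive for r in repos}) == len(repos)
```

      Giving each repository its own archive directory is what keeps the 10 plans from sharing an archive, even though all of them write to the same NFS share.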

      Shared folder:

      The shared folder '/data/share' is mounted at '/tmp/my-archive' on each machine using NFS.
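
      When triaging failures such as the disk I/O errors, it may help to first confirm that the mount is actually in place on each node. A minimal sketch, assuming Linux nodes and the standard /proc/mounts format; the helper names `mount_fstype` and `is_nfs_mounted` are hypothetical and not part of any Couchbase tool:

```python
from typing import Optional


def mount_fstype(mounts_text: str, mount_point: str) -> Optional[str]:
    """Return the filesystem type mounted at mount_point, or None if absent.

    mounts_text is expected in /proc/mounts format:
    <device> <mount point> <fstype> <options> <dump> <pass>
    """
    fstype = None
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] == mount_point:
            fstype = fields[2]  # a later entry shadows an earlier one
    return fstype


def is_nfs_mounted(mounts_text: str, mount_point: str = "/tmp/my-archive") -> bool:
    """True if mount_point is present and its type is nfs or nfs4."""
    fstype = mount_fstype(mounts_text, mount_point)
    return fstype is not None and fstype.startswith("nfs")


# Usage on a node (Linux only):
#     with open("/proc/mounts") as f:
#         print(is_nfs_mounted(f.read()))
```

      The check only confirms that the mount exists; it says nothing about whether the NFS server is reachable or responsive at the time a task runs.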

      Attached

      The cbbackupmgr logs for each repository can be found in: backup-logs.zip
       
      The server logs:

      https://cb-engineering.s3.amazonaws.com/CBQE-6782/tools-qe/collectinfo-2021-04-14T150204-ns_1%40172.23.105.175.zip
      https://cb-engineering.s3.amazonaws.com/CBQE-6782/tools-qe/collectinfo-2021-04-14T150204-ns_1%40172.23.106.233.zip
      https://cb-engineering.s3.amazonaws.com/CBQE-6782/tools-qe/collectinfo-2021-04-14T150204-ns_1%40172.23.106.238.zip
      https://cb-engineering.s3.amazonaws.com/CBQE-6782/tools-qe/collectinfo-2021-04-14T150204-ns_1%40172.23.106.251.zip
      https://cb-engineering.s3.amazonaws.com/CBQE-6782/tools-qe/collectinfo-2021-04-14T150204-ns_1%40172.23.121.74.zip
      https://cb-engineering.s3.amazonaws.com/CBQE-6782/tools-qe/172.23.121.78.zip
      https://cb-engineering.s3.amazonaws.com/CBQE-6782/tools-qe/172.23.106.250.zip
      https://cb-engineering.s3.amazonaws.com/CBQE-6782/tools-qe/172.23.106.236.zip

      The test logs:

      volumetest.log

      The task history:

      task-history.zip

      Side commentary:

      The testing is very chaotic: nodes performing backups are rebalanced out.

      There were other interesting tasks, but I have omitted them as they seem to fall into the expected category, mainly relating to orphans or to merge tasks which lacked a sufficient number of backups.

      Attempted Supportal upload: https://supportal.couchbase.com/snapshot/4f28ea5724c3bcddbefc2c1de8390e05::0

      Attachments

        1. backup-logs.zip
          6.47 MB
        2. task-history.zip
          39 kB
        3. volumetest.log
          272 kB

            People

              asad.zaidi Asad Zaidi (Inactive)