Couchbase Server / MB-25119

cbbackupmgr : No. of items in the merged incremental backups not matching the src cluster


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: User Error
    • Affects Version/s: 4.6.3
    • Fix Version/s: 4.6.3
    • Component/s: tools
    • Triage: Untriaged
    • Is this a Regression?: Unknown

    Description

      Build : 4.6.3-4047

      This is from the incremental backup system tests, where we run multiple iterations to generate incremental backups and merge them. The number of items in the final merged backup does not match the number of items in the src cluster.

      i. Two clusters: the src cluster has 3 nodes, the dest cluster has 1 node.
      ii. On the src cluster, KV ops are always in progress. The KV load is a mix of creates, updates, and deletes, and some documents also have an expiration set.
      iii. The test runs 5 iterations.
      iv. The following operations are performed in each iteration: rebalance out, rebalance in, failover and add-back, failover and rebalance out, bucket flush.
      v. An incremental backup is taken after each of these operations.
      vi. At the end of each iteration, we merge the incremental backups.
      vii. At the end of the test, we stop the KV ops, compact the buckets, and take one final incremental backup.
      viii. Run a merge again.
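      The per-iteration backup and merge cycle above can be sketched with the cbbackupmgr CLI. This is a dry-run sketch, not the test's actual harness: the archive path, repo name, credentials, and the FIRST/LAST backup names passed to merge are placeholders, and run() only prints each command.

      ```shell
      #!/bin/sh
      # Dry-run sketch of one backup + merge cycle. ARCHIVE, REPO, FIRST and
      # LAST are placeholder names; the cluster address is the src node from
      # this ticket. run() only echoes the command; swap the echo for "$@"
      # to actually execute against a live cluster.
      ARCHIVE=/data/backups
      REPO=incr_test
      CLUSTER=couchbase://52.26.27.121

      run() { echo "$@"; }

      # one incremental backup after each cluster operation
      run cbbackupmgr backup -a "$ARCHIVE" -r "$REPO" -c "$CLUSTER" \
          -u Administrator -p password

      # at the end of the iteration, merge the accumulated incrementals
      # (FIRST/LAST would be the names of the first and last backup dirs
      # in the archive)
      run cbbackupmgr merge -a "$ARCHIVE" -r "$REPO" \
          --start "$FIRST" --end "$LAST"
      ```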

      At this point, there should be just 1 backup, and the number of items in that backup should match those in the src cluster. In the test, # Items on src = 16896926, # Items on dest = 15421002

      The env is live for debugging. Please let me know once done, so that we can repurpose the env.

      Dest : http://54.68.177.137:8091
      Src : http://52.26.27.121:8091 (Other nodes in the cluster : 34.212.221.52, 35.167.207.174)

      Attachments


        Activity

          mihir.kamdar Mihir Kamdar (Inactive) added a comment - edited

          Reproduced the same issue on another env using the same test. The src cluster has 11,882,732 items; the backup has 10,499,434 items.

          I am using this new env for running another test. Please use the env shared earlier for debugging if required.


          mikew Mike Wiederhold [X] (Inactive) added a comment

          The last backup doesn't contain all of the items in the cluster. For example, I looked at vbucket 1023, and the backup high seqno in the range.json file was much lower than the high seqno on the server. Can you verify that the last backup was run after all operations were stopped on the cluster?

          I ran the backup one more time, the remaining items were backed up, and after I restored to the destination cluster all of the items were restored.
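          The re-run backup, restore, and count validation could be sketched as below. The restore flags match the cbbackupmgr CLI; the archive/repo names and credentials are placeholders, the dest address is from this ticket, and run() only prints the command (a dry run).

          ```shell
          #!/bin/sh
          # Sketch: restore the merged backup to the dest cluster, then
          # compare item counts. run() only echoes the command; the counts
          # themselves would come from the UI or the bucket-stats REST
          # endpoint on each cluster.
          run() { echo "$@"; }

          run cbbackupmgr restore -a /data/backups -r incr_test \
              -c couchbase://54.68.177.137 -u Administrator -p password

          # compare the src and dest bucket item counts
          compare() {
            if [ "$1" -eq "$2" ]; then
              echo match
            else
              echo "mismatch: src=$1 dest=$2"
            fi
          }
          compare 16896926 15421002   # the counts reported in this ticket
          ```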

          mihir.kamdar Mihir Kamdar (Inactive) added a comment

          Thanks, Mike, for looking into it. Here's what we do before we take the last incremental backup.

          1. Stop KV ops
          2. Sleep for 10 mins (an expiry of 10 mins is set on 15% of the docs, so this sleep makes them eligible for removal)
          3. Run bucket compaction on both buckets
          4. Sleep for 500 s (idle system)
          5. Take an incremental backup
          6. Merge all backups
          7. Restore to the dest cluster
          8. Sleep for 500 s
          9. Validate the # of items on the src and dest clusters

          Since you were able to match the counts, I will run the test again to see whether the KV ops were really stopped by the test or something was still running.

          For now, I am repurposing this env. Will update this bug with my findings.
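          Since the suspicion is that KV ops never fully stopped, a guard between stopping the ops and taking the final backup could poll the bucket's item count until it stops moving. A sketch, assuming the standard Couchbase bucket-stats REST endpoint; the host, bucket name, and credentials are placeholders:

          ```shell
          #!/bin/sh
          # Sketch: only proceed to the final backup once three consecutive
          # samples of the bucket item count are identical. itemCount is
          # extracted from the bucket-stats JSON with a crude sed pattern.
          item_count() {
            curl -s -u Administrator:password \
              "http://52.26.27.121:8091/pools/default/buckets/default" |
              sed -n 's/.*"itemCount":\([0-9][0-9]*\).*/\1/p'
          }

          stable() {  # succeed when all three samples are equal
            [ "$1" = "$2" ] && [ "$2" = "$3" ]
          }

          # polling loop (uncomment against a live cluster):
          # a=$(item_count); sleep 30; b=$(item_count); sleep 30; c=$(item_count)
          # while ! stable "$a" "$b" "$c"; do
          #   a=$b; b=$c; sleep 30; c=$(item_count)
          # done
          ```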

          mihir.kamdar Mihir Kamdar (Inactive) added a comment

          KV ops weren't completely stopped by the test, which caused this error. Encountered the same situation again today. Manually stopped the KV ops, and the backup/restore worked well, with all items correctly restored.

          People

            Assignee: mihir.kamdar Mihir Kamdar (Inactive)
            Reporter: mihir.kamdar Mihir Kamdar (Inactive)

