
MB-48615: cbbackupmgr restore failed in analytics performance runs on build 7.1.0-1345



    Description

      2021-09-27T10:27:53 [INFO] Running: ./opt/couchbase/bin/cbbackupmgr restore --force-updates --archive /backups --repo bigfun20M --threads 8 --host http://172.23.96.5 --username Administrator --password password

      Fatal error: local() encountered an error (return code 1) while executing './opt/couchbase/bin/cbbackupmgr restore --force-updates --archive /backups --repo bigfun20M --threads 8 --host http://172.23.96.5 --username Administrator --password password'

      Aborting.

      In Analytics runs, cbbackupmgr restore failed on build 7.1.0-1345. The regression is reproducible. The latest good run was on build 7.1.0-1250.
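      If the harness error alone is not enough to triage this, one option is to re-run the restore by hand on the backup client and check cbbackupmgr's own log inside the archive (the log directory and file name below are assumptions about the archive layout):

      # Re-run the restore the harness invoked (same flags, absolute path)
      /opt/couchbase/bin/cbbackupmgr restore --force-updates --archive /backups \
          --repo bigfun20M --threads 8 --host http://172.23.96.5 \
          --username Administrator --password password

      # Assumed location of cbbackupmgr's own logs inside the archive
      ls /backups/logs/
      tail -n 100 /backups/logs/backup-0.log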

      Job: http://perf.jenkins.couchbase.com/job/oceanus/7058/

      Logs:

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-7058/172.23.96.205.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-7058/172.23.96.57.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-7058/172.23.96.5.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-7058/172.23.96.7.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-7058/172.23.96.8.zip

      https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-oceanus-7058/172.23.96.9.zip

          Activity

            Bo-Chun Wang added a comment:

            Patrick Varley

            We use the same cluster spec in all analytics tests. The spec lists 6 nodes, and the Jenkins jobs run cbcollect_info on every node listed in it. When we set up the cluster in this test, we use only 4 nodes (2 kv + 2 cbas). Nodes .57 and .205 are not configured in this test, but the Jenkins jobs still run cbcollect_info on those two nodes.

            For the disks on the data nodes, I will check. Our records show each data node uses one 1TB disk for /data.


            Patrick Varley added a comment:

            > We use the same cluster spec in all analytics tests. The spec lists 6 nodes, and the Jenkins jobs run cbcollect_info on every node listed in it. When we set up the cluster in this test, we use only 4 nodes (2 kv + 2 cbas). Nodes .57 and .205 are not configured in this test, but the Jenkins jobs still run cbcollect_info on those two nodes.

            The cluster spec is always the same but not all tests will use all nodes. Is that information in the console output for the test? Just so we can download the logs we need rather than all 3.

            > For the disks on the data nodes, I will check. Our records show each data node uses one 1TB disk for /data.

            The disk setup on node 172.23.96.5 is as follows:

            sdc                8:32   0 447.1G  0 disk 
            └─sdc1             8:33   0 447.1G  0 part 
              └─vg_data-data 253:0    0   894G  0 lvm  /data
            sdd                8:48   0 447.1G  0 disk 
            └─sdd1             8:49   0 447.1G  0 part 
              └─vg_data-data 253:0    0   894G  0 lvm  /data
            

            The key point is that the logical volume is in linear mode: it writes to sdc, and only once sdc is full does it start writing to sdd. In other words, it uses one disk at a time, so you are not getting the performance benefit of the two disks working together.
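            A quick way to confirm the allocation policy (assuming the standard LVM tools are available on the node) is to ask lvs for the segment type and the devices backing each logical volume:

            # Report segment type (linear vs striped) and backing devices for each LV
            lvs -o lv_name,vg_name,lv_size,segtype,devices
            # A linear LV spanning two PVs shows one "linear" segment per device,
            # e.g. /dev/sdc1(0) and /dev/sdd1(0); a striped LV reports segtype "striped".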

            This can be seen further in the total number of LBAs written to each device:

            sdc

            241 Total_LBAs_Written      0x0032   098   098   000    Old_age   Always       -       4 273 055 915 232
            242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       1 309 903 977 954
            

            sdd

            241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       86 947 431 992
            242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       66 875 808 992
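            Those rows read like smartctl attribute output; assuming smartmontools is installed, the same counters can be pulled and compared directly:

            # SMART attribute 241 (Total_LBAs_Written) and 242 (Total_LBAs_Read) per device
            smartctl -A /dev/sdc | grep -E 'Total_LBAs_(Written|Read)'
            smartctl -A /dev/sdd | grep -E 'Total_LBAs_(Written|Read)'
            # Write imbalance: 4,273,055,915,232 / 86,947,431,992 ≈ 49x more LBAs written to sdc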
            

            sdc has done roughly 50 times more writes than sdd, which means sdc will wear out and fail before sdd. I think changing the LV to be striped rather than linear will improve the lifetime of the system and, more importantly, improve the speed of loading data into the cluster.
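            For illustration only, rebuilding the volume as a 2-way stripe could look roughly like the sketch below. This destroys the existing LV and everything under /data, the VG/PV names are taken from the lsblk output above, and the filesystem type is an assumption, so treat it as a sketch rather than a procedure:

            # WARNING: destructive - back up /data and stop Couchbase Server first
            umount /data
            lvremove vg_data/data
            # Recreate the LV striped across both PVs (-i 2 = two stripes, -I 64 = 64 KiB stripe size)
            lvcreate -i 2 -I 64 -l 100%FREE -n data vg_data /dev/sdc1 /dev/sdd1
            mkfs.xfs /dev/vg_data/data          # filesystem type assumed
            mount /dev/vg_data/data /data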


            Couchbase Build Team added a comment:

            Build couchbase-server-7.1.0-1390 contains backup commit fb8373a with commit message:
            MB-48615 Use a priority queue for the archive/recovery source

            Bo-Chun Wang added a comment:

            Patrick Varley

            The test configuration shows how many nodes are used in the run:

            [cluster]
            mem_quota = 20480
            analytics_mem_quota = 20480
            initial_nodes = 4
            num_buckets = 1

            I will discuss the disk settings with the team and see whether we should reconfigure them.
            Bo-Chun Wang added a comment:

            I have a good run on build 7.1.0-1390, so I am closing this ticket.

            http://perf.jenkins.couchbase.com/job/oceanus/7066/ 


            People

              Assignee: Bo-Chun Wang
              Reporter: Bo-Chun Wang
