Couchbase Server / MB-39476

Rebalance-in operation with large volume of doc loading fails

Details

    Description

      Summary:

      After updating the replica count of all buckets to 2, a rebalance-in operation (with doc loading in parallel) failed.
      (The full steps to reproduce are long because this is a volume test. For all the steps performed in volume testing, see https://hub.internal.couchbase.com/confluence/pages/viewpage.action?pageId=50135893 )

      Script to repro:

       

      /testrunner -i /tmp/durability_volume.ini -t volumetests.Collections.volume.test_volume_taf,nodes_init=4,replicas=1,num_failed_nodes=1,new_replica=1,graceful=True,bucket_spec=multi_bucket.buckets_for_volume_test,iterations=1,sdk_client_pool=True,quota_percent=100,rerun=False,log_level=debug,skip_cleanup=True
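
The `-t` argument above packs the test path and all of its parameters into one comma-separated string. A minimal Python sketch for reading it, assuming the first field is the test path and every later field is a `key=value` pair. `parse_test_spec` is a hypothetical helper written for illustration; it is not how testrunner itself parses the argument.

```python
# The -t value copied verbatim from the command above.
TEST_SPEC = (
    "volumetests.Collections.volume.test_volume_taf,nodes_init=4,replicas=1,"
    "num_failed_nodes=1,new_replica=1,graceful=True,"
    "bucket_spec=multi_bucket.buckets_for_volume_test,iterations=1,"
    "sdk_client_pool=True,quota_percent=100,rerun=False,log_level=debug,"
    "skip_cleanup=True"
)


def parse_test_spec(spec):
    """Return (test_path, params) for a comma-separated -t value.

    Assumption: the first field is the test path and the remaining
    fields are key=value pairs. Values are kept as strings.
    """
    test_path, *pairs = spec.split(",")
    params = dict(pair.split("=", 1) for pair in pairs)
    return test_path, params


test_path, params = parse_test_spec(TEST_SPEC)
print(test_path)
for key in sorted(params):
    print(f"{key:>18} = {params[key]}")
```

Listing the parameters one per line makes it easier to compare this run against other volume-test runs.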

       

       Steps to Reproduce:

      1. Create a 4-node cluster
        ----------------------------------------
        Nodes           Services  Status
        ----------------------------------------
        172.23.121.81   kv        Cluster node
        172.23.121.83   None      <--- IN ---
        172.23.121.85   None      <--- IN ---
        172.23.121.105  None      <--- IN ---
        ----------------------------------------

      2. Create buckets + initial data load into buckets
        -----------------------------------------------------------------------------------+
        Bucket   Type       Replicas  TTL  Items     RAM Quota     RAM Used     Disk Used
        -----------------------------------------------------------------------------------+
        bucket1  membase    1         0    1500      419430400     66762912     92893938
        bucket2  membase    2         0    1500      1258291200    95204544     138104830
        bucket3  ephemeral  1         0    1500      419430400     46664208     136
        default  membase    1         0    50000000  71303168000   19754308784  20303753169
        -----------------------------------------------------------------------------------+
      3. Rebalance-in with doc loading in parallel
        2020-05-19 11:11:11,193 | test | INFO | MainProcess | pool-2-thread-29 | [table_view:display:72] Rebalance Overview
        ----------------------------------------
        Nodes           Services  Status
        ----------------------------------------
        172.23.121.81   kv        Cluster node
        172.23.121.83   kv        Cluster node
        172.23.121.105  kv        Cluster node
        172.23.121.85   kv        Cluster node
        172.23.121.138  None      <--- IN ---
        ----------------------------------------

      4. Rebalance-out with doc loading in parallel
        2020-05-19 11:29:46,194 | test | INFO | MainProcess | pool-2-thread-13 | [table_view:display:72] Rebalance Overview
        ----------------------------------------
        Nodes           Services  Status
        ----------------------------------------
        172.23.121.81   kv        Cluster node
        172.23.121.83   kv        Cluster node
        172.23.121.105  kv        Cluster node
        172.23.121.85   kv        Cluster node
        172.23.121.138  [u'kv']   --- OUT --->
        ----------------------------------------
      5. Rebalance in-out with doc loading in parallel
        2020-05-19 11:46:15,180 | test | INFO | MainProcess | pool-2-thread-18 | [table_view:display:72] Rebalance Overview
        ----------------------------------------
        Nodes           Services  Status
        ----------------------------------------
        172.23.121.81   kv        Cluster node
        172.23.121.83   [u'kv']   --- OUT --->
        172.23.121.105  kv        Cluster node
        172.23.121.85   kv        Cluster node
        172.23.121.138  None      <--- IN ---
        172.23.121.114  None      <--- IN ---
        ----------------------------------------

      6. Swap rebalance with doc loading in parallel
        2020-05-19 12:08:52,194 | test | INFO | MainProcess | pool-2-thread-24 | [table_view:display:72] Rebalance Overview
        ----------------------------------------
        Nodes           Services  Status
        ----------------------------------------
        172.23.121.81   kv        Cluster node
        172.23.121.105  [u'kv']   --- OUT --->
        172.23.121.85   kv        Cluster node
        172.23.121.138  kv        Cluster node
        172.23.121.114  kv        Cluster node
        172.23.121.83   None      <--- IN ---
        ----------------------------------------

      7. Update the replica count of all buckets to 2, then start a rebalance-in operation with doc loading in parallel
        2020-05-19 12:31:31,700 | test | INFO | MainProcess | pool-2-thread-18 | [table_view:display:72] Rebalance Overview
        ----------------------------------------
        Nodes           Services  Status
        ----------------------------------------
        172.23.121.81   kv        Cluster node
        172.23.121.83   kv        Cluster node
        172.23.121.85   kv        Cluster node
        172.23.121.138  kv        Cluster node
        172.23.121.114  kv        Cluster node
        172.23.121.105  None      <--- IN ---
        ----------------------------------------
        This rebalance-in operation fails at around 7.15% completion.
        Bucket statistics at this point:
        2020-05-19 12:48:34,982 | test | INFO | MainProcess | MainThread | [table_view:display:72] Bucket statistics
        -----------------------------------------------------------------------------------+
        Bucket   Type       Replicas  TTL  Items     RAM Quota     RAM Used     Disk Used
        -----------------------------------------------------------------------------------+
        bucket1  membase    2         0    1860      524288000     75554496     96242150
        bucket2  membase    2         0    1860      1572864000    108580632    146815318
        bucket3  ephemeral  2         0    1860      524288000     57324992     170
        default  membase    2         0    60832000  106954752000  33822394976  45713868656
        -----------------------------------------------------------------------------------+
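
The node membership changes in steps 1 to 7 can be replayed as set operations to check which nodes should be in the cluster at each point. This is a sketch for reading the report, not part of the test framework: the `STEPS` list is transcribed by hand from the Rebalance Overview tables, `replay` is a hypothetical helper, and it assumes each rebalance completes, so the step 7 entry is the intended target topology rather than the state actually reached.

```python
# (step number, nodes rebalanced in, nodes rebalanced out),
# transcribed from the Rebalance Overview tables in this report.
STEPS = [
    (1, {"172.23.121.83", "172.23.121.85", "172.23.121.105"}, set()),
    (3, {"172.23.121.138"}, set()),
    (4, set(), {"172.23.121.138"}),
    (5, {"172.23.121.138", "172.23.121.114"}, {"172.23.121.83"}),
    (6, {"172.23.121.83"}, {"172.23.121.105"}),
    (7, {"172.23.121.105"}, set()),  # this rebalance failed at ~7.15%
]


def replay(initial, steps):
    """Yield (step, membership) assuming every rebalance completes."""
    nodes = set(initial)
    for step, nodes_in, nodes_out in steps:
        nodes = (nodes | nodes_in) - nodes_out
        yield step, frozenset(nodes)


# 172.23.121.81 is the initial cluster node in step 1.
history = dict(replay({"172.23.121.81"}, STEPS))
for step, nodes in history.items():
    print(f"step {step}: {len(nodes)} nodes -> {sorted(nodes)}")
```

Under these assumptions the cluster has five kv nodes going into step 7, and the failed rebalance would have brought it to six.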

      Attachments

        1. rebalance_fail_1.png (471 kB)
        2. rebalance_fail_2.png (464 kB)
        3. test_logs.zip (8.25 MB)

            People

              sumedh.basarkod Sumedh Basarkod (Inactive)