Couchbase Server / MB-11403

KV+XDCR System test : Race between compaction and rebalance, rebalance-out stuck for 22 hrs, compaction never ran

Details

    • Bug
    • Resolution: Fixed
    • Test Blocker
    • 3.0
    • 3.0
    • couchbase-bucket
    • Security Level: Public
    • None
    • CentOS 6.x
      8 * 8 clusters, 1 bi-xdcr, 1 uni-xdcr. Each node : 15GB RAM, 419GB HDD for /data
    • Untriaged
    • Unknown

    Description

      Build
      --------
      3.0.0-786 (xdcr on upr, internal replication on upr)

      Clusters
      -----------
      Source : http://172.23.105.44:8091/
      Destination : http://172.23.105.54:8091/
      The clusters are available for investigation, and there is no urgency to reclaim them. Please let me know if you need me to collect logs.

      Steps
      --------
      1. Load both clusters until vb_active_resident_items_ratio < 30.
      2. Run the access phase with 98% gets, 2% sets for 3 hours.
      3. Rebalance out 1 node at cluster1 under workload [high DGM, ~4%] ===> Rebalance-out completes only after 22 hrs. During that time ns_server keeps retrying compaction, but ep-engine reports that the couch file is not present although it is.

      Additional information
      ----------------------------------

      1. When ns_server triggers compaction on all nodes, ep-engine wrongly returns "Warning: failed to compact database with name=/data/standardbucket1/x.couch.1 error=no such file errno=none", although the couch file for vbucket x is present at the exact location ep-engine is searching:

      On .44

      memcached.log.9.txt-Tue Jun 10 17:40:53.160154 PDT 3: (standardbucket1) Warning: failed to compact database with name=/data/standardbucket1/0.couch.1 error=no such file errno=none
      memcached.log.9.txt:Tue Jun 10 17:40:53.160626 PDT 3: (standardbucket1) VBucket compaction failed failed!!!
      memcached.log.9.txt-Tue Jun 10 17:40:53.174682 PDT 3: (standardbucket) Warning: failed to compact database with name=/data/standardbucket/0.couch.1 error=no such file errno=none
      memcached.log.9.txt:Tue Jun 10 17:40:53.175114 PDT 3: (standardbucket) VBucket compaction failed failed!!!
      memcached.log.9.txt-Tue Jun 10 17:40:53.188307 PDT 3: (saslbucket) Warning: failed to compact database with name=/data/saslbucket/0.couch.1 error=no such file errno=none
      memcached.log.9.txt:Tue Jun 10 17:40:53.188657 PDT 3: (saslbucket) VBucket compaction failed failed!!!

      [root@soursop-s11201 logs]# ls -al /data/saslbucket/0.couch.1
      rw-rw---. 1 couchbase couchbase 7278683 Jun 10 19:12 /data/saslbucket/0.couch.1

      Note: there is more than enough disk space on the nodes. /data can hold up to 419GB of data, and at the time of the rebalance usage was only 8-9%.
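
The cross-check done by hand above (a "no such file" warning in memcached.log versus `ls` on the same path) can be scripted. The following is a minimal sketch, not part of the original report: the regular expression is derived only from the log lines quoted above, and the function name `find_false_missing` and its `exists` parameter are hypothetical.

```python
import os
import re

# Matches the compaction failure line format seen in the memcached.log
# excerpts quoted in this report (assumed stable across lines).
FAILURE_RE = re.compile(
    r"\((?P<bucket>[^)]+)\) Warning: failed to compact database with "
    r"name=(?P<path>\S+) error=(?P<error>.+?) errno=(?P<errno>\S+)"
)


def find_false_missing(log_lines, exists=os.path.exists):
    """Return (bucket, path) pairs logged as 'no such file' that do exist.

    `exists` is injectable so the check can be exercised without a real
    /data directory; by default it consults the local filesystem.
    """
    hits = []
    seen = set()
    for line in log_lines:
        match = FAILURE_RE.search(line)
        if not match or match.group("error") != "no such file":
            continue
        key = (match.group("bucket"), match.group("path"))
        if key in seen:
            continue  # report each bucket/path pair only once
        seen.add(key)
        if exists(key[1]):
            hits.append(key)
    return hits
```

Run on the node itself, for example with `find_false_missing(open("memcached.log.9.txt"))`, any pair returned is a file that ep-engine reported missing but that is present on disk at the time of the check.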

      2. This EP_ENGINE_FAILED error is sent back to ns_server, which nevertheless keeps retrying compaction of the vbuckets for close to 22 hrs. Rebalance is stuck during this time.

      3. The contention is then somehow resolved, and the vbuckets to be rebalanced out get deleted from .47. Rebalance-out completes after 22 hrs.

      4. If I force-compact any bucket on this cluster, the versions of the couchstore files do not change, yet the UI reports that compaction completed and shows no error.
      The logs show the same error:
      Wed Jun 11 17:03:48.531685 PDT 3: (standardbucket) Warning: failed to compact database with name=/data/standardbucket/0.couch.1 error=no such file errno=none
      Wed Jun 11 17:03:48.532435 PDT 3: (standardbucket) VBucket compaction failed failed!!!
      Wed Jun 11 17:03:48.755110 PDT 3: (saslbucket) Warning: failed to compact database with name=/data/saslbucket/0.couch.1 error=no such file errno=none
      Wed Jun 11 17:03:48.755682 PDT 3: (saslbucket) VBucket compaction failed failed!!!
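
The observation in item 4 (file versions unchanged after a forced compaction) can be checked by comparing directory listings taken before and after. This is a rough sketch under a stated assumption: that the numeric suffix in names such as `0.couch.1` is the file version the report refers to, and that a successful compaction would increase it. The helper names `snapshot_revisions` and `unchanged_vbuckets` are hypothetical.

```python
import re

# File names like "0.couch.1": <vbucket id>.couch.<version suffix>.
# Assumption: the trailing number is the version expected to change
# when compaction succeeds.
COUCH_FILE_RE = re.compile(r"^(\d+)\.couch\.(\d+)$")


def snapshot_revisions(filenames):
    """Map vbucket id to the highest version suffix found for it."""
    revisions = {}
    for name in filenames:
        match = COUCH_FILE_RE.match(name)
        if match:
            vbucket, revision = int(match.group(1)), int(match.group(2))
            revisions[vbucket] = max(revision, revisions.get(vbucket, 0))
    return revisions


def unchanged_vbuckets(before, after):
    """Return sorted vbucket ids still present whose version did not increase."""
    return sorted(
        vbucket
        for vbucket, revision in before.items()
        if vbucket in after and after[vbucket] <= revision
    )
```

A possible usage: call `snapshot_revisions(os.listdir("/data/standardbucket"))` before and after the forced compaction and pass both results to `unchanged_vbuckets`. Vbuckets that were deleted in between are deliberately not reported.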

            People

              apiravi Aruna Piravi (Inactive)