Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-48449

[System Test] - Eventing rebalance fails after 1200secs consistently post enforce-tls unless functions are paused/resumed

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Duplicate
    • 7.0.2
    • 7.0.2
    • eventing
    • 6.6.3-9808 -> 7.0.2-6668
    • Untriaged
    • Centos 64-bit
    • 1
    • No

    Description

      Steps to Repro
      1. Run the following longevity script on 6.6.3 for 5 days.

      ./sequoia -client 172.23.104.254:2375 -provider file:centos_second_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.3-9808 -skip_setup=true -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      At this point it should have a 27 node cluster ( 9 Kv, 6 Index, 3 analytics, 3 fts, 3 eventing and 3 n1ql)
      2. Create 10k metakv tombstones. This has been part of our testing since MB-44838 was fixed. We used to have a total of around 25k for CC, have reduced it here to around 12k.

       #!/bin/sh
      for i in {0..10000}
          do
              `curl -X PUT -u Administrator:password http://localhost:8091/_metakv/key{$i} -d 'value=foo1'`
              `curl -X DELETE -v -u Administrator:password http://localhost:8091/_metakv/key{$i}`
          done       
      

      3. Swap rebalance 6 nodes , 1 of each service with that of 7.0.2 nodes. Rebalance goes through successfully.
      4. Failover 6 nodes(6.6.3 nodes)1 of each service(kv is graceful failover), Upgrade these nodes to 7.0.2, do a recovery of all the 6 node(kv is delta recovery) and rebalance.
      5. Repeat step no 4 until all the nodes in cluster are upgraded to 7.0.2.

      After upgrade I validated that all the metakv tombstones are purged and enabled IPv4 only + enforce tls using the below commands.

      [root@localhost logs]# grep 'ns_config tombstone' debug.log
      [ns_server:debug,2021-09-14T04:26:34.722-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 11869 ns_config tombstone(s) up to timestamp 63798837690. Tombstones:
      [ns_server:debug,2021-09-14T04:30:38.296-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 1 ns_config tombstone(s) up to timestamp 63798837934. Tombstones:
      [ns_server:debug,2021-09-14T04:31:41.731-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 127 ns_config tombstone(s) up to timestamp 63798837998. Tombstones:
      [ns_server:debug,2021-09-14T04:39:42.419-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 1 ns_config tombstone(s) up to timestamp 63798838481. Tombstones:
      [ns_server:debug,2021-09-14T04:40:43.040-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 1 ns_config tombstone(s) up to timestamp 63798838542. Tombstones:
      [root@localhost logs]# 
      [root@localhost logs]#  /opt/couchbase/bin/couchbase-cli ip-family -c http://localhost:8091 -u Administrator -p password --set --ipv4only
      Switched IP family for node: http://172.23.106.134:8091
      Switched IP family for node: http://172.23.106.136:8091
      Switched IP family for node: http://172.23.106.137:8091
      Switched IP family for node: http://172.23.106.138:8091
      Switched IP family for node: http://172.23.120.58:8091
      Switched IP family for node: http://172.23.120.73:8091
      Switched IP family for node: http://172.23.120.74:8091
      Switched IP family for node: http://172.23.120.75:8091
      Switched IP family for node: http://172.23.120.77:8091
      Switched IP family for node: http://172.23.120.81:8091
      Switched IP family for node: http://172.23.120.86:8091
      Switched IP family for node: http://172.23.121.118:8091
      Switched IP family for node: http://172.23.121.77:8091
      Switched IP family for node: http://172.23.123.24:8091
      Switched IP family for node: http://172.23.123.25:8091
      Switched IP family for node: http://172.23.123.26:8091
      Switched IP family for node: http://172.23.123.31:8091
      Switched IP family for node: http://172.23.123.32:8091
      Switched IP family for node: http://172.23.123.33:8091
      Switched IP family for node: http://172.23.96.122:8091
      Switched IP family for node: http://172.23.96.14:8091
      Switched IP family for node: http://172.23.96.243:8091
      Switched IP family for node: http://172.23.97.105:8091
      Switched IP family for node: http://172.23.97.148:8091
      Switched IP family for node: http://172.23.97.149:8091
      Switched IP family for node: http://172.23.97.150:8091
      Switched IP family for node: http://172.23.97.151:8091
      SUCCESS: Switched IP family of the cluster
      [root@localhost logs]# /opt/couchbase/bin/couchbase-cli node-to-node-encryption -c http://localhost:8091 -u Administrator -p password --enable
      Turned on encryption for node: http://172.23.106.134:8091
      Turned on encryption for node: http://172.23.106.136:8091
      Turned on encryption for node: http://172.23.106.137:8091
      Turned on encryption for node: http://172.23.106.138:8091
      Turned on encryption for node: http://172.23.120.58:8091
      Turned on encryption for node: http://172.23.120.73:8091
      Turned on encryption for node: http://172.23.120.74:8091
      Turned on encryption for node: http://172.23.120.75:8091
      Turned on encryption for node: http://172.23.120.77:8091
      Turned on encryption for node: http://172.23.120.81:8091
      Turned on encryption for node: http://172.23.120.86:8091
      Turned on encryption for node: http://172.23.121.118:8091
      Turned on encryption for node: http://172.23.121.77:8091
      Turned on encryption for node: http://172.23.123.24:8091
      Turned on encryption for node: http://172.23.123.25:8091
      Turned on encryption for node: http://172.23.123.26:8091
      Turned on encryption for node: http://172.23.123.31:8091
      Turned on encryption for node: http://172.23.123.32:8091
      Turned on encryption for node: http://172.23.123.33:8091
      Turned on encryption for node: http://172.23.96.122:8091
      Turned on encryption for node: http://172.23.96.14:8091
      Turned on encryption for node: http://172.23.96.243:8091
      Turned on encryption for node: http://172.23.97.105:8091
      Turned on encryption for node: http://172.23.97.148:8091
      Turned on encryption for node: http://172.23.97.149:8091
      Turned on encryption for node: http://172.23.97.150:8091
      Turned on encryption for node: http://172.23.97.151:8091
      SUCCESS: Switched node-to-node encryption on
      [root@localhost logs]#  /opt/couchbase/bin/couchbase-cli setting-security -c http://localhost:8091 -u Administrator -p password --set --cluster-encryption-level strict
      SUCCESS: Security settings updated
      [root@localhost logs]# 
      

      At this point I noticed that Rebalance button was enabled. Not sure why this was enabled as we had nothing to rebalance afaik(this is tracked through MB-48448). However, when I did rebalance it failed as shown below.

      ns_1@172.23.106.136 5:47:32 AM 14 Sep, 2021

      Rebalance exited with reason {service_rebalance_failed,eventing,
                                    {worker_died,
                                     {'EXIT',<0.29362.951>,
                                      {rebalance_failed,
                                       {service_error,
                                        <<"eventing rebalance hasn't made progress for past 1200 secs">>}}}}}.
      Rebalance Operation Id = f38dda25a6ccfe31114529f5d88b0753
      

      cbcollect_info attached. This the first time we are running this system test upgrade on 7.0.2, hence there is no baseline as such and no last working build.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          No work has yet been logged on this issue.

          People

            Balakumaran.Gopal Balakumaran Gopal
            Balakumaran.Gopal Balakumaran Gopal
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty