Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-48449

[System Test] - Eventing rebalance fails after 1200secs consistently post enforce-tls unless functions are paused/resumed

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Duplicate
    • 7.0.2
    • 7.0.2
    • eventing
    • 6.6.3-9808 -> 7.0.2-6668
    • Untriaged
    • Centos 64-bit
    • 1
    • No

    Description

      Steps to Repro
      1. Run the following longevity script on 6.6.3 for 5 days.

      ./sequoia -client 172.23.104.254:2375 -provider file:centos_second_cluster.yml -test tests/integration/test_allFeatures_madhatter_durability.yml -scope tests/integration/scope_Xattrs_Madhatter.yml -scale 3 -repeat 0 -log_level 0 -version 6.6.3-9808 -skip_setup=true -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      At this point it should have a 27 node cluster ( 9 Kv, 6 Index, 3 analytics, 3 fts, 3 eventing and 3 n1ql)
      2. Create 10k metakv tombstones. This has been part of our testing since MB-44838 was fixed. We used to have a total of around 25k for CC, have reduced it here to around 12k.

       #!/bin/sh
      for i in {0..10000}
          do
              `curl -X PUT -u Administrator:password http://localhost:8091/_metakv/key{$i} -d 'value=foo1'`
              `curl -X DELETE -v -u Administrator:password http://localhost:8091/_metakv/key{$i}`
          done       
      

      3. Swap rebalance 6 nodes , 1 of each service with that of 7.0.2 nodes. Rebalance goes through successfully.
      4. Failover 6 nodes(6.6.3 nodes)1 of each service(kv is graceful failover), Upgrade these nodes to 7.0.2, do a recovery of all the 6 node(kv is delta recovery) and rebalance.
      5. Repeat step no 4 until all the nodes in cluster are upgraded to 7.0.2.

      After upgrade I validated that all the metakv tombstones are purged and enabled IPv4 only + enforce tls using the below commands.

      [root@localhost logs]# grep 'ns_config tombstone' debug.log
      [ns_server:debug,2021-09-14T04:26:34.722-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 11869 ns_config tombstone(s) up to timestamp 63798837690. Tombstones:
      [ns_server:debug,2021-09-14T04:30:38.296-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 1 ns_config tombstone(s) up to timestamp 63798837934. Tombstones:
      [ns_server:debug,2021-09-14T04:31:41.731-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 127 ns_config tombstone(s) up to timestamp 63798837998. Tombstones:
      [ns_server:debug,2021-09-14T04:39:42.419-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 1 ns_config tombstone(s) up to timestamp 63798838481. Tombstones:
      [ns_server:debug,2021-09-14T04:40:43.040-07:00,ns_1@172.23.106.134:tombstone_agent<0.984.0>:tombstone_agent:purge:195]Purged 1 ns_config tombstone(s) up to timestamp 63798838542. Tombstones:
      [root@localhost logs]# 
      [root@localhost logs]#  /opt/couchbase/bin/couchbase-cli ip-family -c http://localhost:8091 -u Administrator -p password --set --ipv4only
      Switched IP family for node: http://172.23.106.134:8091
      Switched IP family for node: http://172.23.106.136:8091
      Switched IP family for node: http://172.23.106.137:8091
      Switched IP family for node: http://172.23.106.138:8091
      Switched IP family for node: http://172.23.120.58:8091
      Switched IP family for node: http://172.23.120.73:8091
      Switched IP family for node: http://172.23.120.74:8091
      Switched IP family for node: http://172.23.120.75:8091
      Switched IP family for node: http://172.23.120.77:8091
      Switched IP family for node: http://172.23.120.81:8091
      Switched IP family for node: http://172.23.120.86:8091
      Switched IP family for node: http://172.23.121.118:8091
      Switched IP family for node: http://172.23.121.77:8091
      Switched IP family for node: http://172.23.123.24:8091
      Switched IP family for node: http://172.23.123.25:8091
      Switched IP family for node: http://172.23.123.26:8091
      Switched IP family for node: http://172.23.123.31:8091
      Switched IP family for node: http://172.23.123.32:8091
      Switched IP family for node: http://172.23.123.33:8091
      Switched IP family for node: http://172.23.96.122:8091
      Switched IP family for node: http://172.23.96.14:8091
      Switched IP family for node: http://172.23.96.243:8091
      Switched IP family for node: http://172.23.97.105:8091
      Switched IP family for node: http://172.23.97.148:8091
      Switched IP family for node: http://172.23.97.149:8091
      Switched IP family for node: http://172.23.97.150:8091
      Switched IP family for node: http://172.23.97.151:8091
      SUCCESS: Switched IP family of the cluster
      [root@localhost logs]# /opt/couchbase/bin/couchbase-cli node-to-node-encryption -c http://localhost:8091 -u Administrator -p password --enable
      Turned on encryption for node: http://172.23.106.134:8091
      Turned on encryption for node: http://172.23.106.136:8091
      Turned on encryption for node: http://172.23.106.137:8091
      Turned on encryption for node: http://172.23.106.138:8091
      Turned on encryption for node: http://172.23.120.58:8091
      Turned on encryption for node: http://172.23.120.73:8091
      Turned on encryption for node: http://172.23.120.74:8091
      Turned on encryption for node: http://172.23.120.75:8091
      Turned on encryption for node: http://172.23.120.77:8091
      Turned on encryption for node: http://172.23.120.81:8091
      Turned on encryption for node: http://172.23.120.86:8091
      Turned on encryption for node: http://172.23.121.118:8091
      Turned on encryption for node: http://172.23.121.77:8091
      Turned on encryption for node: http://172.23.123.24:8091
      Turned on encryption for node: http://172.23.123.25:8091
      Turned on encryption for node: http://172.23.123.26:8091
      Turned on encryption for node: http://172.23.123.31:8091
      Turned on encryption for node: http://172.23.123.32:8091
      Turned on encryption for node: http://172.23.123.33:8091
      Turned on encryption for node: http://172.23.96.122:8091
      Turned on encryption for node: http://172.23.96.14:8091
      Turned on encryption for node: http://172.23.96.243:8091
      Turned on encryption for node: http://172.23.97.105:8091
      Turned on encryption for node: http://172.23.97.148:8091
      Turned on encryption for node: http://172.23.97.149:8091
      Turned on encryption for node: http://172.23.97.150:8091
      Turned on encryption for node: http://172.23.97.151:8091
      SUCCESS: Switched node-to-node encryption on
      [root@localhost logs]#  /opt/couchbase/bin/couchbase-cli setting-security -c http://localhost:8091 -u Administrator -p password --set --cluster-encryption-level strict
      SUCCESS: Security settings updated
      [root@localhost logs]# 
      

      At this point I noticed that Rebalance button was enabled. Not sure why this was enabled as we had nothing to rebalance afaik(this is tracked through MB-48448). However, when I did rebalance it failed as shown below.

      ns_1@172.23.106.136 5:47:32 AM 14 Sep, 2021

      Rebalance exited with reason {service_rebalance_failed,eventing,
                                    {worker_died,
                                     {'EXIT',<0.29362.951>,
                                      {rebalance_failed,
                                       {service_error,
                                        <<"eventing rebalance hasn't made progress for past 1200 secs">>}}}}}.
      Rebalance Operation Id = f38dda25a6ccfe31114529f5d88b0753
      

      cbcollect_info attached. This the first time we are running this system test upgrade on 7.0.2, hence there is no baseline as such and no last working build.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          Marking this closed.

          Balakumaran.Gopal Balakumaran Gopal added a comment - Marking this closed.

          Dup of MB-48334

          jeelan.poola Jeelan Poola added a comment - Dup of MB-48334
          jeelan.poola Jeelan Poola added a comment - - edited

          Balakumaran Gopal

          This issue is not specific to upgrade, but a known issue/limitation with turning on 'enforce-tls' when >=1 eventing functions are in 'deployed' state. It was agreed upon during design phase that users will have to pause their eventing functions prior to changing their encryption configuration and resume them after. The same is captured at https://issues.couchbase.com/browse/MB-47707?focusedCommentId=534961&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-534961. And at https://issues.couchbase.com/browse/MB-47707?focusedCommentId=535066&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-535066.

          We have also raised a documentation ticket to track the same https://issues.couchbase.com/browse/DOC-9146.

          And we have an existing Neo ticket https://issues.couchbase.com/browse/MB-48334 which was logged earlier to track fixing this limitation. Hence resolving this MB duplicate.

          However, we are now thinking if this limitation is going to annoy our customers & bite us back in the form of a CBSE in future. So currently exploring ways to fix this with minimal changes to stay true to our "Always On" claim. Thank you!

          Could you please confirm if below works post upgrade?
          1. Pause the functions
          2. Change encryption level
          3. Resume functions
          4. Rebalance
          Expectation is that above should work.

          jeelan.poola Jeelan Poola added a comment - - edited Balakumaran Gopal This issue is not specific to upgrade, but a known issue/limitation with turning on 'enforce-tls' when >=1 eventing functions are in 'deployed' state. It was agreed upon during design phase that users will have to pause their eventing functions prior to changing their encryption configuration and resume them after. The same is captured at https://issues.couchbase.com/browse/MB-47707?focusedCommentId=534961&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-534961 . And at https://issues.couchbase.com/browse/MB-47707?focusedCommentId=535066&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-535066 . We have also raised a documentation ticket to track the same https://issues.couchbase.com/browse/DOC-9146 . And we have an existing Neo ticket https://issues.couchbase.com/browse/MB-48334 which was logged earlier to track fixing this limitation. Hence resolving this MB duplicate. However, we are now thinking if this limitation is going to annoy our customers & bite us back in the form of a CBSE in future. So currently exploring ways to fix this with minimal changes to stay true to our "Always On" claim. Thank you! Could you please confirm if below works post upgrade? 1. Pause the functions 2. Change encryption level 3. Resume functions 4. Rebalance Expectation is that above should work.
          jeelan.poola Jeelan Poola added a comment -

          From the above logs, it appears like gocb calls to read (and subsequently update) checkPoint data from KV are failing continuously. That sort of explains why rebalance is not proceeding further. Eventing needs to update checkpoint data during rebalance. Requesting Ankit Prabhu to take a deeper look.

          jeelan.poola Jeelan Poola added a comment - From the above logs, it appears like gocb calls to read (and subsequently update) checkPoint data from KV are failing continuously. That sort of explains why rebalance is not proceeding further. Eventing needs to update checkpoint data during rebalance. Requesting Ankit Prabhu to take a deeper look.

          Noticed errors like these on the eventing log

          172.23.96.122

          2021-09-14T07:15:56.942-07:00 [Error] Consumer::getOpCallback [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1550114910.sock:9871] Bucket fetch failed for key: <ud>eventing::1550114910::timer_op::vb::797</ud>, err: unambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"unambiguous timeout"}},"OperationID":"Get","Opaque":"0x0","TimeObserved":2500111201,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"","LastDispatchedFrom":"","LastConnectionID":""}
          2021-09-14T07:15:56.942-07:00 [Error] Consumer::getOpCallback [worker_timer_op_0:/tmp/127.0.0.1:8091_0_1550114910.sock:9887] Bucket fetch failed for key: <ud>eventing::1550114910::timer_op::vb::683</ud>, err: unambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"unambiguous timeout"}},"OperationID":"Get","Opaque":"0x0","TimeObserved":2500136088,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"","LastDispatchedFrom":"","LastConnectionID":""}
          2021-09-14T07:15:56.942-07:00 [Error] Consumer::addOwnershipHistorySECallback [worker_bucket_op_curl_2:/tmp/127.0.0.1:8091_2_1030245617.sock:9793] Key: eventing::1030245617::bucket_op_curl::vb::944, subdoc operation failed while performing ownership entry app post STREAMEND, err: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"MutateIn","Opaque":"0x0","TimeObserved":2500086758,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"","LastDispatchedFrom":"","LastConnectionID":""}
          

          Balakumaran.Gopal Balakumaran Gopal added a comment - Noticed errors like these on the eventing log 172.23.96.122 2021-09-14T07:15:56.942-07:00 [Error] Consumer::getOpCallback [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1550114910.sock:9871] Bucket fetch failed for key: <ud>eventing::1550114910::timer_op::vb::797</ud>, err: unambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"unambiguous timeout"}},"OperationID":"Get","Opaque":"0x0","TimeObserved":2500111201,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"","LastDispatchedFrom":"","LastConnectionID":""} 2021-09-14T07:15:56.942-07:00 [Error] Consumer::getOpCallback [worker_timer_op_0:/tmp/127.0.0.1:8091_0_1550114910.sock:9887] Bucket fetch failed for key: <ud>eventing::1550114910::timer_op::vb::683</ud>, err: unambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"unambiguous timeout"}},"OperationID":"Get","Opaque":"0x0","TimeObserved":2500136088,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"","LastDispatchedFrom":"","LastConnectionID":""} 2021-09-14T07:15:56.942-07:00 [Error] Consumer::addOwnershipHistorySECallback [worker_bucket_op_curl_2:/tmp/127.0.0.1:8091_2_1030245617.sock:9793] Key: eventing::1030245617::bucket_op_curl::vb::944, subdoc operation failed while performing ownership entry app post STREAMEND, err: ambiguous timeout | {"InnerError":{"InnerError":{"InnerError":{},"Message":"ambiguous timeout"}},"OperationID":"MutateIn","Opaque":"0x0","TimeObserved":2500086758,"RetryReasons":null,"RetryAttempts":0,"LastDispatchedTo":"","LastDispatchedFrom":"","LastConnectionID":""}

          People

            Balakumaran.Gopal Balakumaran Gopal
            Balakumaran.Gopal Balakumaran Gopal
            Votes:
            0 Vote for this issue
            Watchers:
            9 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty