Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-51078

Backport 6.6.x - [System Test] Rebalance exited with reason "eventing rebalance hasn't made progress for past 1200 secs"

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Critical
    • Resolution: Fixed
    • 6.6.5
    • 6.6.6
    • eventing
    • Enterprise Edition 7.1.0 build 2211

    Description

      QE TEST

      -test tests/integration/neo/test_neo_couchstore_milestone4.yml -scope tests/integration/neo/scope_couchstore.yml
      

      Day - 2
      Cycle - 2
      Scale - 3

      STEP
      Swap rebalance of eventing nodes.

      [2022-02-07T10:09:59-08:00, sequoiatools/couchbase-cli:7.1:fe7ead] server-add -c 172.23.108.103:8091 --server-add https://172.23.104.5 -u Administrator -p password --server-add-username Administrator --server-add-password password --services eventing
      [2022-02-07T10:10:15-08:00, sequoiatools/couchbase-cli:7.1:ebd976] rebalance -c 172.23.108.103:8091 --server-remove 172.23.97.239 -u Administrator -p password
       
      Error occurred on container - sequoiatools/couchbase-cli:7.1:[rebalance -c 172.23.108.103:8091 --server-remove 172.23.97.239 -u Administrator -p password]
       
      docker logs ebd976
      docker start ebd976
       
      *Unable to display progress bar on this os
      JERROR: Rebalance failed. See logs for detailed reason. You can try again.
      

      FAILURE

      2022-02-07T10:57:20.008-08:00, ns_orchestrator:0:critical:message(ns_1@172.23.104.155) - Rebalance exited with reason {service_rebalance_failed,eventing,
                                    {worker_died,
                                     {'EXIT',<0.17710.4361>,
                                      {rebalance_failed,
                                       {service_error,
                                        <<"eventing rebalance hasn't made progress for past 1200 secs">>}}}}}.
      Rebalance Operation Id = dafc248ded2ab4c3a76f1403ebd0f143
      

      List of eventing nodes in the cluster -

      eventing : 4 ===== > [172.23.104.5:8091 172.23.104.67:8091 172.23.97.122:8091 172.23.97.239:8091]
      

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            sujay.gad Sujay Gad added a comment -

            Reopening as 1 rebalance failure was observed on day 7 of longevity run on 6.6.5-MP3.

            Rebalance failure

            2022-05-04T00:21:02.508-07:00, ns_orchestrator:0:critical:message(ns_1@172.23.97.74) - Rebalance exited with reason {service_rebalance_failed,eventing,
                                          {worker_died,
                                           {'EXIT',<0.13640.4665>,
                                            {rebalance_failed,
                                             {service_error,
                                              <<"eventing rebalance hasn't made progress for past 600 secs">>}}}}}.
            Rebalance Operation Id = 85a16a6a719f426cd85d25c7e763b770
            

            List of eventing nodes -

            eventing : 2 ===== > [172.23.121.77:8091 172.23.97.241:8091]
            

            Logs -
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.106.134.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.106.136.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.120.73.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.120.74.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.120.75.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.120.77.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.120.86.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.121.77.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.123.26.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.96.122.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.96.14.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.96.48.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.97.112.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.97.149.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.97.150.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.97.241.zip
            url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.97.74.zip

            sujay.gad Sujay Gad added a comment - Reopening as 1 rebalance failure was observed on day 7 of longevity run on 6.6.5-MP3. Rebalance failure 2022 - 05 -04T00: 21 : 02.508 - 07 : 00 , ns_orchestrator: 0 :critical:message(ns_1 @172 .23. 97.74 ) - Rebalance exited with reason {service_rebalance_failed,eventing, {worker_died, { 'EXIT' ,< 0.13640 . 4665 >, {rebalance_failed, {service_error, << "eventing rebalance hasn't made progress for past 600 secs" >>}}}}}. Rebalance Operation Id = 85a16a6a719f426cd85d25c7e763b770 List of eventing nodes - eventing : 2 ===== > [ 172.23 . 121.77 : 8091 172.23 . 97.241 : 8091 ] Logs - url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.106.134.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.106.136.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.120.73.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.120.74.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.120.75.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.120.77.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.120.86.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.121.77.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.123.26.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.96.122.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.96.14.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.96.48.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.97.112.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.97.149.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.97.150.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.97.241.zip url : https://cb-jira.s3.us-east-2.amazonaws.com/logs/systestmon-1651654226/collectinfo-2022-05-04T085027-ns_1%40172.23.97.74.zip
            ankit.prabhu Ankit Prabhu added a comment - - edited

            Eventing trying to open the stream with vbuckets but its getting rollback message since purge seq number > start seq number. purge interval is 1hr which causes frequent rollback of streams

            1336636:2022-05-04T00:16:16.141-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 274 got STREAMREQ status: ROLLBACK
            1336658:2022-05-04T00:16:16.398-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 275 got STREAMREQ status: ROLLBACK
            1336680:2022-05-04T00:16:16.663-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 276 got STREAMREQ status: ROLLBACK
            1337602:2022-05-04T00:17:39.049-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 251 got STREAMREQ status: ROLLBACK
            1337631:2022-05-04T00:17:39.306-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 252 got STREAMREQ status: ROLLBACK
            1337654:2022-05-04T00:17:39.562-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 253 got STREAMREQ status: ROLLBACK
            1337678:2022-05-04T00:17:39.820-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 273 got STREAMREQ status: ROLLBACK
            1337699:2022-05-04T00:17:40.078-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 274 got STREAMREQ status: ROLLBACK
            1337743:2022-05-04T00:17:40.335-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 275 got STREAMREQ status: ROLLBACK
            1337765:2022-05-04T00:17:40.592-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 276 got STREAMREQ status: ROLLBACK
            1805473:2022-05-04T01:17:29.702-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_2:/tmp/127.0.0.1:8091_2_1548328557.sock:87394] vb: 365 got STREAMREQ status: ROLLBACK
            1806436:2022-05-04T01:18:31.140-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_2:/tmp/127.0.0.1:8091_2_1548328557.sock:87394] vb: 365 got STREAMREQ status: ROLLBACK
            1806814:2022-05-04T01:19:04.688-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_2:/tmp/127.0.0.1:8091_2_1548328557.sock:87394] vb: 365 got STREAMREQ status: ROLLBACK
            1807467:2022-05-04T01:20:07.215-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_2:/tmp/127.0.0.1:8091_2_1548328557.sock:87394] vb: 365 got STREAMREQ status: ROLLBACK

            This lead to rebalance failure since rebalance progress collector didn't see any changes in the vb owning.

            ankit.prabhu Ankit Prabhu added a comment - - edited Eventing trying to open the stream with vbuckets but its getting rollback message since purge seq number > start seq number. purge interval is 1hr which causes frequent rollback of streams 1336636:2022-05-04T00:16:16.141-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 274 got STREAMREQ status: ROLLBACK 1336658:2022-05-04T00:16:16.398-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 275 got STREAMREQ status: ROLLBACK 1336680:2022-05-04T00:16:16.663-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 276 got STREAMREQ status: ROLLBACK 1337602:2022-05-04T00:17:39.049-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 251 got STREAMREQ status: ROLLBACK 1337631:2022-05-04T00:17:39.306-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 252 got STREAMREQ status: ROLLBACK 1337654:2022-05-04T00:17:39.562-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 253 got STREAMREQ status: ROLLBACK 1337678:2022-05-04T00:17:39.820-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 273 got STREAMREQ status: ROLLBACK 1337699:2022-05-04T00:17:40.078-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 274 got STREAMREQ status: ROLLBACK 1337743:2022-05-04T00:17:40.335-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 275 got STREAMREQ status: ROLLBACK 1337765:2022-05-04T00:17:40.592-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_1:/tmp/127.0.0.1:8091_1_1548328557.sock:87381] vb: 276 got STREAMREQ status: ROLLBACK 1805473:2022-05-04T01:17:29.702-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_2:/tmp/127.0.0.1:8091_2_1548328557.sock:87394] vb: 365 got STREAMREQ status: ROLLBACK 1806436:2022-05-04T01:18:31.140-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_2:/tmp/127.0.0.1:8091_2_1548328557.sock:87394] vb: 365 got STREAMREQ status: ROLLBACK 1806814:2022-05-04T01:19:04.688-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_2:/tmp/127.0.0.1:8091_2_1548328557.sock:87394] vb: 365 got STREAMREQ status: ROLLBACK 1807467:2022-05-04T01:20:07.215-07:00 [Info] Consumer::processDCPEvents [worker_timer_op_2:/tmp/127.0.0.1:8091_2_1548328557.sock:87394] vb: 365 got STREAMREQ status: ROLLBACK This lead to rebalance failure since rebalance progress collector didn't see any changes in the vb owning.
            ritam.sharma Ritam Sharma added a comment -

            Discussed with Jeelan - Wiith MP specific to Trendyol, and the customer is not with this purge interval, plus subsequent rebalance goes fine, we okay to ignore the current failure for MP3.

            ritam.sharma Ritam Sharma added a comment - Discussed with Jeelan - Wiith MP specific to Trendyol, and the customer is not with this purge interval, plus subsequent rebalance goes fine, we okay to ignore the current failure for MP3.
            ankit.prabhu Ankit Prabhu added a comment -

            Issue is with low purge seq number which causes frequent rollback and re opening of the stream. Keeping such low purge interval can cause issues and unlikely in production.

            Resolving it since main issue is fixed.

            ankit.prabhu Ankit Prabhu added a comment - Issue is with low purge seq number which causes frequent rollback and re opening of the stream. Keeping such low purge interval can cause issues and unlikely in production. Resolving it since main issue is fixed.
            sujay.gad Sujay Gad added a comment -

            Closing based on component system test run on 6.6.6-10573.

            sujay.gad Sujay Gad added a comment - Closing based on component system test run on 6.6.6-10573.

            People

              sujay.gad Sujay Gad
              abhishek.jindal Abhishek Jindal
              Votes:
              0 Vote for this issue
              Watchers:
              6 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                PagerDuty