  1. Couchbase Server
  2. MB-37039

Swap Rebalance failed at 60% when stop/start rebalance is done multiple times at 20%, 40%, 60%. choose_action_not_compaction error is observed in logs.

Details

    Description

      Steps:
      1. Initialize a cluster with 2 nodes
      2. Create a bucket with replica=1 and load 250 docs
      3. Start swap rebalance with 1 node coming in and 1 going out
      4. Start doc loading in parallel to step 3
      5. Stop rebalance at 20% and start again
      6. Stop rebalance at 40% and start again
      7. Stop rebalance at 60% and start again. Rebalance fails.
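
The stop/start loop in steps 3 and 5-7 can be sketched as follows. This is a minimal illustration, not the testrunner code: `RebalanceClient`, `FakeClient`, and `start_stop_rebalance` are hypothetical names, and the fake client only pretends that progress advances by 5% on each poll so that the sketch runs without a cluster.

```python
from typing import Iterable, List, Protocol


class RebalanceClient(Protocol):
    """Hypothetical interface; not a real Couchbase SDK or testrunner class."""

    def start_rebalance(self) -> None: ...
    def stop_rebalance(self) -> None: ...
    def progress(self) -> float: ...


def start_stop_rebalance(client: RebalanceClient,
                         stop_points: Iterable[float] = (20, 40, 60)) -> None:
    """Start a rebalance, then stop and restart it at each progress threshold."""
    client.start_rebalance()
    for threshold in stop_points:
        # Poll until the rebalance reaches the threshold (steps 5-7).
        while client.progress() < threshold:
            pass  # a real test would sleep between polls
        client.stop_rebalance()
        client.start_rebalance()


class FakeClient:
    """In-memory stand-in so the sketch runs without a cluster."""

    def __init__(self) -> None:
        self.events: List[str] = []
        self._progress = 0.0

    def start_rebalance(self) -> None:
        self.events.append("start")

    def stop_rebalance(self) -> None:
        self.events.append(f"stop@{self._progress:.0f}")

    def progress(self) -> float:
        self._progress += 5.0  # pretend each poll advances progress by 5%
        return self._progress
```

Against a real cluster, the three methods would wrap whatever rebalance start, stop, and progress calls the test framework provides; in this ticket the failure shows up on the restart after the 60% stop.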

      NOTE: The test passed on 6.5.0-4821. The failure is easily reproducible with the test given below.

      QE Note:

      get-cbcollect-info=False,num_items=250000,GROUP=P0;durability,EXCLUDE_GROUP=not_for_majority,durability=MAJORITY,infra_log_level=debug -t rebalance_new.swaprebalancetests.SwapRebalanceStartStopTests.do_test,replicas=1,nodes_init=2,standard_buckets=1,num-swap=1,new_replica=2,GROUP=P0;durability,skip_cleanup=True
      

          Activity

            owend Daniel Owen added a comment -

            The changelog (of possibly relevant changes) between 6.5.0-4821 and 6.5.0-4874 for KV is as follows:

            CHANGELOG for kv_engine
             
             * Commit: 117d7a94723b62d71f3be94e4d72cbb73f0a5797 in build: 6.5.0-4860
               MB-36973: Don't use ThreadLocalPtr for CouchKVStore::pendingFileDeletions
               
               
             * Commit: e1da4853859b19125ac7babba6643e0b44fe0484 in build: 6.5.0-4857
               MB-36765: Fix vbucket_state::operator==
               
             * Commit: db516869ffc3d4fa7405f0f11a00542d273b62be in build: 6.5.0-4850
               MB-36133: Persist highPreparedSeqno
               
               
             * Commit: ead473d8324e96eb909938b51b61cfa7e243a919 in build: 6.5.0-4848
               MB-36940: Handle tombstones properly in couchfile_upgrade
               
               
             * Commit: b83516a2c6d9513cac30093d070cb9d530cde287 in build: 6.5.0-4827
               MB-36923: Add support for num_reader/writer_threads
             
               
             * Commit: ad3a4a3f5a93af6bba5829a669bdcd3430bd8147 in build: 6.5.0-4826
               MB-36915: Avoid lock-inversion at set-vbstate and new-producer
               

            owend Daniel Owen added a comment -

            Error reported in ns_server error log on node 172.23.105.220

            [user:error,2019-11-23T02:42:39.903-08:00,ns_1@172.23.105.220:<0.2239.0>:ns_orchestrator:log_rebalance_completion:1445]Rebalance exited with reason {mover_crashed,
                                          {badarg,
                                           [{dict,fetch,
                                             ['ns_1@172.23.105.223',
                                              {dict,2,16,16,8,80,48,
                                               {[],[],[],[],[],[],[],[],[],[],[],[],[],[],
                                                [],[]},
                                               {{[['ns_1@172.23.105.220'|0]],
                                                 [['ns_1@172.23.105.221'|0]],
                                                 [],[],[],[],[],[],[],[],[],[],[],[],[],
                                                 []}}}],
                                             [{file,"dict.erl"},{line,131}]},
                                            {vbucket_move_scheduler,
                                             '-move_is_possible/7-fun-0-',3,
                                             [{file,"src/vbucket_move_scheduler.erl"},
                                              {line,305}]},
                                            {lists,all,2,[{file,"lists.erl"},{line,1213}]},
                                            {vbucket_move_scheduler,
                                             '-choose_action_not_compaction/1-fun-0-',7,
                                             [{file,"src/vbucket_move_scheduler.erl"},
                                              {line,349}]},
                                            {lists,flatmap,2,
                                             [{file,"lists.erl"},{line,1250}]},
                                            {vbucket_move_scheduler,
                                             choose_action_not_compaction,1,
                                             [{file,"src/vbucket_move_scheduler.erl"},
                                              {line,348}]},
                                            {vbucket_move_scheduler,choose_action,1,
                                             [{file,"src/vbucket_move_scheduler.erl"},
                                              {line,284}]},
                                            {ns_vbucket_mover,spawn_workers,1,
                                             [{file,"src/ns_vbucket_mover.erl"},
                                              {line,372}]}]}}.
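
The trace shows `dict:fetch` being called with the key 'ns_1@172.23.105.223' on a dict that holds only 'ns_1@172.23.105.220' and 'ns_1@172.23.105.221', each mapped to 0. A rough Python analogue of that failure mode is sketched below. It is not the ns_server code: `init_counters`, `move_is_possible`, and `limit` are hypothetical names chosen for illustration, and the assumption that the map holds per-node counters is inferred only from the values in the trace.

```python
from typing import Dict, Iterable


def init_counters(nodes: Iterable[str]) -> Dict[str, int]:
    """Build a per-node counter map with every counter starting at 0."""
    return {node: 0 for node in nodes}


def move_is_possible(counters: Dict[str, int], nodes: Iterable[str], limit: int) -> bool:
    """True if every node involved in a move is below the limit.

    counters[node] raises KeyError for an unknown node, much as dict:fetch/2
    fails with badarg in the Erlang trace.
    """
    return all(counters[node] < limit for node in nodes)


old_nodes = ["ns_1@172.23.105.220", "ns_1@172.23.105.221"]
incoming = "ns_1@172.23.105.223"

# Map seeded only with the two nodes present in the dict from the trace.
counters = init_counters(old_nodes)
try:
    move_is_possible(counters, [old_nodes[0], incoming], limit=1)
    crashed = False
except KeyError:
    crashed = True  # analogous to the badarg from dict:fetch

# Seeding the map with every node taking part in the rebalance avoids the failure.
counters_all = init_counters(old_nodes + [incoming])
ok = move_is_possible(counters_all, [old_nodes[0], incoming], limit=1)
```

In this sketch the lookup fails only because the incoming node was never added to the map; whether the actual code path matches that reading is for the ns_server change itself to confirm.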
            
            

            owend Daniel Owen added a comment -

            The changelog between 6.5.0-4821 and 6.5.0-4874 for ns_server is as follows:

             * Commit: 883ee011fd4ff7ac08f0f62bf7b4a1c3a6e99532 in build: 6.5.0-4849
               MB-36975: Re-add developer preview warning if server in preview mode
               
             * Commit: 9e8ae59110136c1e9be875f1f42fbcd9ea9e2161 in build: 6.5.0-4844
               MB-35437 Don't terminate all activities on losing local lease.
               
             * Commit: cec170b131d96c6ccdf7a9e3307a144bf637f639 in build: 6.5.0-4844
               Don't allow mixing follower and leader quorums.
             
             * Commit: e428df9e6b07ad7a864957bcb2797d768ed584a2 in build: 6.5.0-4844
               Fix formatting.
             
             * Commit: 11dd45095608b24acad817559b9746d0d9dbd93d in build: 6.5.0-4844
               Add a comment describing when acquired_at can be undefined.
               
             * Commit: 479702c3ba33593dfee4e2e38f9a0d1b5b57c47f in build: 6.5.0-4844
               MB-36754 Use indexer provided stats in calculations
               
             * Commit: 00ec279ac4a55d162dad847a185eee39109a41ed in build: 6.5.0-4843
               Merge remote-tracking branch 'couchbase/alice' into master
               
               * couchbase/alice:
                 MB-30526 cbcollect - collect distro information
                 MB-36107: Update copyright to 2019 for 6.0.3
               
             * Commit: 7aac7560f6432ca4b9f3f5558af88069866a88ec in build: 6.5.0-4843
               MB-30526 cbcollect - collect distro information
               
             * Commit: f4c55e6fec80dbd6213d009bcafebb391740d40b in build: 6.5.0-4837
               MB-35515 Apply regular backfill limit to replica backfills.
             
             * Commit: b01b89007c8dc75ae85f091fce87c7f81340d015 in build: 6.5.0-4832
               MB-36923: Add num_reader_threads and num_writer_threads to OOTB memcached configuration
               
             * Commit: dd41af722d3f5f018b0682e876916cc211e1ae29 in build: 6.5.0-4831
               MB-36749: Reload cert api should return error in case if ...
               
             * Commit: f204681b7c9e72b9f989b2f39364bea3b2ae5082 in build: 6.5.0-4831
               MB-36749: Restart tls dist server & disconnect all the nodes ...
               
             * Commit: 0a0736f1fdae8b2b9455909631059c5471a76b06 in build: 6.5.0-4831
               Make cb_dist keep track of all dist connections
               
             * Commit: 89b7d3baaed4b138f2ff824f645b93948d7677f6 in build: 6.5.0-4831
               MB-36749: Don't crash cb_dist if asked to stop unknown listener
               
             * Commit: 844a5050d66bdd0a586a2e7f1c4da9ebaaa39832 in build: 6.5.0-4831
               MB-36749: cb_dist: retry 'listen' if it failed before
               
             * Commit: 3eec846981db98c8907f74328fdf8394151e112a in build: 6.5.0-4831
               MB-36749: cb_dist:close fun should be the opposite of listen
               
             * Commit: c264dc356cc183cf640a52f38e312a06ba9b8894 in build: 6.5.0-4831
               MB-36749: cb_dist should wait for acceptor when stopping a listener in order to avoid addrinuse error on start
               
             * Commit: 70c3079e48eddb6db2093c87b42b304b2704c2ec in build: 6.5.0-4830
               Do not display afamily in UI if it's not defined in ns_config
               
             * Commit: 8117c5307974f397ad831a2b870fb3ee0fd8364d in build: 6.5.0-4843
               MB-36107: Update copyright to 2019 for 6.0.3
               

            owend Daniel Owen added a comment -

            The stack trace points at the following locations:
            src/vbucket_move_scheduler.erl - line 305
            src/vbucket_move_scheduler.erl - line 349
            src/vbucket_move_scheduler.erl - line 348
            src/vbucket_move_scheduler.erl - line 284
            src/ns_vbucket_mover.erl - line 372

            For build 6.5.0-4874, these correspond to the following patches:

            src/vbucket_move_scheduler.erl - line 305 - f4c55e6fec
            src/vbucket_move_scheduler.erl - line 349 - f4c55e6fec
            src/vbucket_move_scheduler.erl - line 348 - f4c55e6fec
            src/vbucket_move_scheduler.erl - line 284 - 73eaac9f76
            src/ns_vbucket_mover.erl - line 372 - 73eaac9f76

            f4c55e6fec - http://review.couchbase.org/110701 - introduced in 6.5.0-4837
            73eaac9f76 - http://review.couchbase.org/23317 - introduced in 2012.

            I therefore suspect the issue is with

            • Commit: f4c55e6fec80dbd6213d009bcafebb391740d40b in build: 6.5.0-4837
              MB-35515 Apply regular backfill limit to replica backfills.

            Assigning to Aliaksey Artamonau.
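
The file and line pairs used in this analysis can be pulled out of the crash reason mechanically. The sketch below is one way to do that with a regular expression; `FRAME_RE` and `extract_frames` are hypothetical helper names, the `src/` prefix filter is an assumption used to drop OTP library frames such as lists.erl, and the sample is an excerpt of the trace from the ns_server error log.

```python
import re
from typing import List, Tuple

# Matches Erlang location info such as [{file,"src/x.erl"}, {line,305}],
# allowing whitespace or a line break between the two tuples.
FRAME_RE = re.compile(r'\{file,"([^"]+)"\},\s*\{line,(\d+)\}')


def extract_frames(trace: str, prefix: str = "src/") -> List[Tuple[str, int]]:
    """Return (file, line) pairs from the trace whose path starts with prefix."""
    return [(path, int(line))
            for path, line in FRAME_RE.findall(trace)
            if path.startswith(prefix)]


sample = '''
{vbucket_move_scheduler,
 '-move_is_possible/7-fun-0-',3,
 [{file,"src/vbucket_move_scheduler.erl"},
  {line,305}]},
{lists,all,2,[{file,"lists.erl"},{line,1213}]},
{ns_vbucket_mover,spawn_workers,1,
 [{file,"src/ns_vbucket_mover.erl"},
  {line,372}]}
'''

frames = extract_frames(sample)
```

Each resulting pair could then be checked against the source history at the failing build, for example with `git blame -L <line>,<line> <file>`, to see which commit last touched that line.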

            dfinlay Dave Finlay added a comment -

            Daniel Owen: thanks for investigating and pinpointing the problem change.

            Approved for Mad Hatter as this regression can break rebalance.


            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.5.0-4890 contains ns_server commit 1907a7e with commit message:
            MB-37039 Initialize in_flight_backfills_per_node correctly.

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.0-1096 contains ns_server commit 1907a7e with commit message:
            MB-37039 Initialize in_flight_backfills_per_node correctly.

            People

              Aliaksey Artamonau Aliaksey Artamonau (Inactive)
              ritesh.agarwal Ritesh Agarwal