Couchbase Server / MB-30443

Rebalance-in failure after online upgrade using swap rebalance with high ops

Details

    • Bug
    • Resolution: User Error
    • Major
    • None
    • 5.5.0
    • test-execution
    • None

    Description

      Script to Repro

      ./testrunner -i /tmp/upgrade.ini -p get-cbcollect-info=True,upgrade_version=5.5.0-2958,loader=high_doc_ops -t newupgradetests.MultiNodesUpgradeTests.online_upgrade_swap_rebalance_with_pillowfight,initial_version=4.6.0-3573,items=3000000,nodes_init=3,run_with_views=False
      

      Steps
      Created a 3-node cluster on 4.6.0-3573.
      Upgraded all nodes to 5.5.0-2958 using swap rebalance while the high-ops load was running.
      Added a 5.5.0-2958 node and rebalanced it in while the high-ops load was still running. The rebalance failed.

      Failure stack trace

      Rebalance exited with reason {mover_crashed,
      {unexpected_exit,
      {'EXIT',<0.31286.7>,
      {{badmatch,{error,closed}},
      {gen_server,call,
      [{'janitor_agent-default',
      'ns_1@172.23.107.205'},
      {if_rebalance,<0.8754.7>,
      {wait_dcp_data_move,
      ['ns_1@172.23.107.201',
      'ns_1@172.23.107.207'],
      337}},
      infinity]}}}}}
      

      memcached also exited (status 137):

      Service 'memcached' exited with status 137. Restarting. Messages:
      2018-07-11T23:29:55.995003Z WARNING (default) Slow runtime for 'Running a flusher loop: shard 2' on thread writer_worker_3: 1212 ms
      2018-07-11T23:29:56.320928Z WARNING (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.107.206->ns_1@172.23.107.205:default - (vb 504) End stream received but no such stream for this vBucket
      2018-07-11T23:29:57.121430Z WARNING (default) Slow runtime for 'Removing (dead) vb:340 from memory and disk' on thread auxIO_worker_0: 206 ms
      2018-07-11T23:29:57.396319Z WARNING (default) Slow runtime for 'Running a flusher loop: shard 2' on thread writer_worker_1: 1388 ms
      2018-07-11T23:29:57.593589Z WARNING (default) Slow runtime for 'Running a flusher loop: shard 3' on thread writer_worker_3: 1017 ms
      2018-07-11T23:29:57.598123Z WARNING (default) Slow runtime for 'Running a flusher loop: shard 1' on thread writer_worker_2: 1494 ms
      2018-07-11T23:29:58.056985Z WARNING (default) Slow runtime for 'Paging out items.' on thread nonIO_worker_2: 47 ms
      2018-07-11T23:29:58.693953Z WARNING (default) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_0: 551 ms
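
      For triage, the memcached messages above can be summarized with a small script. This is only a sketch: the function names (decode_exit_status, summarize_slow_tasks) are hypothetical, the regular expression is derived solely from the "Slow runtime" lines quoted in this report, and reading status 137 as 128 + signal number is an assumption based on the usual Unix convention. Under that convention 137 corresponds to signal 9 (SIGKILL), which would mean memcached was killed from outside rather than crashing on its own.

```python
import re
import signal
from collections import defaultdict

# Matches the "Slow runtime" warnings in the form quoted in this report.
SLOW_RE = re.compile(
    r"Slow runtime for '(?P<task>[^']+)' on thread (?P<thread>\S+): (?P<ms>\d+) ms"
)


def decode_exit_status(status):
    """Return the signal name if status looks like 128 + N, else None.

    Assumption: the reported status follows the common Unix convention
    of 128 + signal number for a process terminated by a signal.
    """
    if status <= 128:
        return None
    try:
        return signal.Signals(status - 128).name
    except ValueError:
        return None


def summarize_slow_tasks(lines):
    """Map each task name to (warning count, longest runtime in ms)."""
    stats = defaultdict(lambda: [0, 0])
    for line in lines:
        match = SLOW_RE.search(line)
        if match is None:
            continue
        # Drop a trailing ": shard N" so all flusher warnings group together.
        task = re.sub(r": shard \d+$", "", match.group("task"))
        entry = stats[task]
        entry[0] += 1
        entry[1] = max(entry[1], int(match.group("ms")))
    return {task: (count, longest) for task, (count, longest) in stats.items()}


if __name__ == "__main__":
    print(decode_exit_status(137))
```

      Feeding the lines above to summarize_slow_tasks groups the four flusher warnings under one task name, which makes it easier to see that disk writes were the slowest activity before the exit.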
      

      Live cluster: http://172.23.107.205:8091/

          People

            Balakumaran.Gopal Balakumaran Gopal