Couchbase Server / MB-6953

[system test] Rebalance hangs during swap rebalance


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: ns_server
    • Security Level: Public
    • Environment: CentOS 6.2 64-bit, build 2.0.0-1862

    Description

      Cluster information:

      • 8 CentOS 6.2 64-bit servers, each with a 4-core CPU.
      • Each server has 32 GB RAM and a 400 GB SSD.
      • The SSD is formatted as ext4 and mounted on /data.
      • Each server has its own SSD; no disk is shared between servers.
      • Create a 6-node cluster running Couchbase Server 2.0.0-1862 (nodes listed below).
      • The cluster has 2 buckets: default (12 GB) and saslbucket (12 GB).
      • Each bucket has one design doc with 2 views (d1 on default, d11 on saslbucket).
      • Consistent views are enabled on the cluster (the default).
      • In the couchbase-server start script, change the Erlang VM flag from +A 16 to +S 128:128.

      10.6.2.37
      10.6.2.38
      10.6.2.39
      10.6.2.40
      10.6.2.42
      10.6.2.43

      • Load 15 million items into each bucket, with item sizes from 512 to 1024 bytes.
      • Query all 4 views (2 per design doc).
      • Mutate the 15 million items, with item sizes from 1024 to 1500 bytes.
      • Do a swap rebalance: add nodes 10.6.2.44 and 10.6.2.45, remove nodes 10.6.2.39 and 10.6.2.40.
      • The rebalance moves some items, then hangs for hours.
      • Tap stats from all nodes show that replication building for vBucket 171 is done.
        • Tap stats from node 10.6.2.38:

      eq_tapq:replication_building_171_'ns_1@10.6.2.42':ack_log_size: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':ack_seqno: 14645
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':ack_window_full: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':backfill_completed: true
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':backfill_start_timestamp: 1350517822
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':bg_jobs_completed: 72
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':bg_jobs_issued: 72
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':bg_result_size: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':connected: true
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':created: 4347
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':flags: 85 (ack,backfill,vblist,checkpoints)
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':has_queued_item: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':idle: true
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':paused: 1
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':pending_backfill: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':pending_disconnect: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':pending_disk_backfill: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':qlen: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':qlen_high_pri: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':qlen_low_pri: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':queue_backfillremaining: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':queue_backoff: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':queue_drain: 14631
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':queue_fill: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':queue_itemondisk: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':queue_memory: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':rec_fetched: 14568
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':recv_ack_seqno: 14644
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':reserved: 1
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':seqno_ack_requested: 14644
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':supports_ack: true
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':suspended: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':total_backlog_size: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':total_noops: 30
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':type: producer
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':vb_filter: { 171 }
      eq_tapq:replication_building_171_'ns_1@10.6.2.42':vb_filters: 1
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':ack_log_size: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':ack_seqno: 14645
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':ack_window_full: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':backfill_completed: true
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':backfill_start_timestamp: 1350517827
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':bg_jobs_completed: 72
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':bg_jobs_issued: 72
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':bg_result_size: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':connected: true
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':created: 4352
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':flags: 85 (ack,backfill,vblist,checkpoints)
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':has_queued_item: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':idle: true
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':paused: 1
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':pending_backfill: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':pending_disconnect: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':pending_disk_backfill: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':qlen: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':qlen_high_pri: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':qlen_low_pri: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':queue_backfillremaining: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':queue_backoff: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':queue_drain: 14631
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':queue_fill: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':queue_itemondisk: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':queue_memory: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':rec_fetched: 14568
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':recv_ack_seqno: 14644
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':reserved: 1
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':seqno_ack_requested: 14644
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':supports_ack: true
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':suspended: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':total_backlog_size: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':total_noops: 30
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':type: producer
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':vb_filter: { 171 }
      eq_tapq:replication_building_171_'ns_1@10.6.2.44':vb_filters: 1
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':ack_log_size: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':ack_seqno: 14645
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':ack_window_full: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':backfill_completed: true
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':backfill_start_timestamp: 1350517824
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':bg_jobs_completed: 72
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':bg_jobs_issued: 72
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':bg_result_size: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':connected: true
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':created: 4349
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':flags: 85 (ack,backfill,vblist,checkpoints)
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':has_queued_item: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':idle: true
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':paused: 1
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':pending_backfill: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':pending_disconnect: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':pending_disk_backfill: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':qlen: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':qlen_high_pri: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':qlen_low_pri: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':queue_backfillremaining: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':queue_backoff: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':queue_drain: 14631
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':queue_fill: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':queue_itemondisk: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':queue_memory: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':rec_fetched: 14568
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':recv_ack_seqno: 14644
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':reserved: 1
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':seqno_ack_requested: 14644
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':supports_ack: true
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':suspended: false
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':total_backlog_size: 0
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':total_noops: 30
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':type: producer
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':vb_filter: { 171 }
      eq_tapq:replication_building_171_'ns_1@10.6.2.45':vb_filters: 1
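
The conclusion that replication building for vBucket 171 is done rests on a handful of these fields. The Python sketch below encodes that reading: it groups `eq_tapq:<connection>:<stat>: <value>` lines by connection and treats a stream as drained when backfill has completed, the queues are empty, and every issued background fetch has finished. The criteria in `looks_drained` are an assumption inferred from the stats above, not ep-engine's own definition of completion, and `parse_tap_stats` / `looks_drained` are hypothetical helper names.

```python
"""Sketch: decide whether TAP "replication building" streams look drained.

Assumption: the completion criteria in looks_drained() are inferred from the
stats in this report; they are not ep-engine's own definition.
"""
import re
from collections import defaultdict

# eq_tapq:<connection name>:<stat>: <value>
# e.g. eq_tapq:replication_building_171_'ns_1@10.6.2.42':qlen: 0
STAT_LINE = re.compile(r"^eq_tapq:(?P<name>[^:]+):(?P<key>[^:]+):\s*(?P<value>.*)$")


def parse_tap_stats(text):
    """Group tap stat lines by connection name -> {stat: value}."""
    conns = defaultdict(dict)
    for raw in text.splitlines():
        match = STAT_LINE.match(raw.strip())
        if match:  # lines that are not 'eq_tapq:...' stats (e.g. a bare '{ 171 }') are skipped
            conns[match.group("name")][match.group("key")] = match.group("value").strip()
    return dict(conns)


def looks_drained(stats):
    """Heuristic: backfill finished, nothing queued, all background fetches done."""
    issued = stats.get("bg_jobs_issued")
    return (
        stats.get("backfill_completed") == "true"
        and stats.get("idle") == "true"
        and stats.get("qlen") == "0"
        and stats.get("queue_backfillremaining") == "0"
        and stats.get("pending_backfill") == "false"
        and stats.get("pending_disk_backfill") == "false"
        and issued is not None
        and issued == stats.get("bg_jobs_completed")
    )


# A subset of the stats reported for the stream from 10.6.2.38 to 10.6.2.42.
SAMPLE = """\
eq_tapq:replication_building_171_'ns_1@10.6.2.42':backfill_completed: true
eq_tapq:replication_building_171_'ns_1@10.6.2.42':bg_jobs_completed: 72
eq_tapq:replication_building_171_'ns_1@10.6.2.42':bg_jobs_issued: 72
eq_tapq:replication_building_171_'ns_1@10.6.2.42':idle: true
eq_tapq:replication_building_171_'ns_1@10.6.2.42':pending_backfill: false
eq_tapq:replication_building_171_'ns_1@10.6.2.42':pending_disk_backfill: false
eq_tapq:replication_building_171_'ns_1@10.6.2.42':qlen: 0
eq_tapq:replication_building_171_'ns_1@10.6.2.42':queue_backfillremaining: 0
eq_tapq:replication_building_171_'ns_1@10.6.2.42':vb_filter: { 171 }
"""

if __name__ == "__main__":
    for name, stats in parse_tap_stats(SAMPLE).items():
        print(name, "drained" if looks_drained(stats) else "still busy")
```

Applied to the full dumps above, all three streams from 10.6.2.38 (to 10.6.2.42, 10.6.2.44 and 10.6.2.45) pass this check.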

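      • Checkpoints for vBucket 171 on nodes 10.6.2.42, 10.6.2.44 and 10.6.2.45 are all in sync (last closed checkpoint 8, open checkpoint 9, persisted checkpoint 8):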

      [root@localhost ~]# /opt/couchbase/bin/cbstats 10.6.2.42:11210 checkpoint 171 -b saslbucket -p password
      vb_171:checkpoint_extension: false
      vb_171:last_closed_checkpoint_id: 8
      vb_171:num_checkpoint_items: 15
      vb_171:num_checkpoints: 1
      vb_171:num_items_for_persistence: 0
      vb_171:num_open_checkpoint_items: 14
      vb_171:num_tap_cursors: 0
      vb_171:open_checkpoint_id: 9
      vb_171:persisted_checkpoint_id: 8
      vb_171:state: replica
      [root@localhost ~]# /opt/couchbase/bin/cbstats 10.6.2.44:11210 checkpoint 171 -b saslbucket -p password
      vb_171:checkpoint_extension: false
      vb_171:last_closed_checkpoint_id: 8
      vb_171:num_checkpoint_items: 15
      vb_171:num_checkpoints: 1
      vb_171:num_items_for_persistence: 0
      vb_171:num_open_checkpoint_items: 14
      vb_171:num_tap_cursors: 0
      vb_171:open_checkpoint_id: 9
      vb_171:persisted_checkpoint_id: 8
      vb_171:state: replica
      [root@localhost ~]# /opt/couchbase/bin/cbstats 10.6.2.45:11210 checkpoint 171 -b saslbucket -p password
      vb_171:checkpoint_extension: false
      vb_171:last_closed_checkpoint_id: 8
      vb_171:num_checkpoint_items: 17
      vb_171:num_checkpoints: 1
      vb_171:num_items_for_persistence: 0
      vb_171:num_open_checkpoint_items: 16
      vb_171:num_tap_cursors: 0
      vb_171:open_checkpoint_id: 9
      vb_171:persisted_checkpoint_id: 8
      vb_171:state: replica
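
The "in sync" claim can also be checked mechanically. This Python sketch assumes that matching checkpoint IDs plus an empty persistence queue is what "synchronized" means here; it parses the per-vBucket checkpoint output shown above and compares the last closed, open and persisted checkpoint IDs across nodes. It deliberately ignores `num_open_checkpoint_items`, which differs on 10.6.2.45 (16 versus 14 on the other two nodes). `parse_checkpoint_stats` and `checkpoints_in_sync` are hypothetical helper names.

```python
"""Sketch: compare per-vBucket checkpoint stats across nodes.

Assumption: "in sync" means identical checkpoint IDs and nothing waiting to be
persisted; per-checkpoint item counts are not compared.
"""

# Checkpoint IDs that must match on every node (assumed criterion).
SYNC_KEYS = ("last_closed_checkpoint_id", "open_checkpoint_id", "persisted_checkpoint_id")


def parse_checkpoint_stats(text, vbucket):
    """Return {stat: value} for lines like 'vb_171:open_checkpoint_id: 9'."""
    prefix = f"vb_{vbucket}:"
    stats = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line.startswith(prefix):
            continue  # skip shell prompts and other vBuckets
        key, sep, value = line[len(prefix):].partition(":")
        if sep:
            stats[key.strip()] = value.strip()
    return stats


def checkpoints_in_sync(per_node):
    """True if all nodes report the same checkpoint IDs and an empty persistence queue."""
    reference = None
    for stats in per_node.values():
        if stats.get("num_items_for_persistence") != "0":
            return False
        ids = tuple(stats.get(key) for key in SYNC_KEYS)
        if None in ids:
            return False  # a required stat is missing
        if reference is None:
            reference = ids
        elif ids != reference:
            return False
    return reference is not None  # an empty input proves nothing


# Values copied from the report; 10.6.2.44 reported the same numbers as 10.6.2.42.
NODE_42 = """\
vb_171:last_closed_checkpoint_id: 8
vb_171:num_items_for_persistence: 0
vb_171:num_open_checkpoint_items: 14
vb_171:open_checkpoint_id: 9
vb_171:persisted_checkpoint_id: 8
"""
NODE_45 = """\
vb_171:last_closed_checkpoint_id: 8
vb_171:num_items_for_persistence: 0
vb_171:num_open_checkpoint_items: 16
vb_171:open_checkpoint_id: 9
vb_171:persisted_checkpoint_id: 8
"""

if __name__ == "__main__":
    per_node = {
        "10.6.2.42": parse_checkpoint_stats(NODE_42, 171),
        "10.6.2.44": parse_checkpoint_stats(NODE_42, 171),
        "10.6.2.45": parse_checkpoint_stats(NODE_45, 171),
    }
    print("in sync" if checkpoints_in_sync(per_node) else "diverged")
```

With the values reported above, all three replicas agree on checkpoint IDs 8 (last closed), 9 (open) and 8 (persisted), so they pass this check even though their open-checkpoint item counts differ.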


          People

            alkondratenko Aleksey Kondratenko (Inactive)
            thuan Thuan Nguyen