  1. Couchbase Server
  2. MB-4864

Rebalance can get stuck due to a bug in detecting backfill completion during vbucket takeover

    Details

      Description

      Install Couchbase Server 1.8.0 with hotfix MB-4738 on a 4-node cluster (12 GB RAM per node) in EC2.
      Load 38 million items into the cluster.
      Resident ratio: 41%
      Data size on disk: 78 GB
      Remove a node.
      Rebalance: OK.
      Add a different node (not the one that was removed).
      Rebalance hangs at a little over 80%.

      1. 50.17.157.98_tap.txt
        15 kB
        Thuan Nguyen
      2. 50.17.157.98_stat.txt
        7 kB
        Thuan Nguyen
      3. 23.20.50.242_tap.txt
        20 kB
        Thuan Nguyen
      4. 23.20.50.242_tap.txt
        24 kB
        Thuan Nguyen
      5. 23.20.50.242_stat.txt
        8 kB
        Thuan Nguyen
      6. 23.20.45.23_tap.txt
        16 kB
        Thuan Nguyen
      7. 23.20.45.23_tap.txt
        17 kB
        Thuan Nguyen
      8. 23.20.45.23_stat.txt
        8 kB
        Thuan Nguyen
      9. 107.22.84.123_tap.txt
        18 kB
        Thuan Nguyen
      10. 107.22.84.123_tap.txt
        22 kB
        Thuan Nguyen
      11. 107.22.84.123_stat.txt
        8 kB
        Thuan Nguyen
      12. 107.22.70.136_tap.txt
        20 kB
        Thuan Nguyen
      13. 107.22.11.161_tap.txt
        4 kB
        Thuan Nguyen

        Activity

        thuan Thuan Nguyen added a comment -

        Rebalanced out 2 nodes and the rebalance hung:

        eq_tapq:rebalance_169:ack_log_size: 0
        eq_tapq:rebalance_169:ack_playback_size: 0
        eq_tapq:rebalance_169:ack_seqno: 41291
        eq_tapq:rebalance_169:ack_window_full: false
        eq_tapq:rebalance_169:backfill_completed: false
        eq_tapq:rebalance_169:bg_backlog_size: 0
        eq_tapq:rebalance_169:bg_jobs_completed: 37724
        eq_tapq:rebalance_169:bg_jobs_issued: 37724
        eq_tapq:rebalance_169:bg_queued: 37724
        eq_tapq:rebalance_169:bg_result_size: 0
        eq_tapq:rebalance_169:bg_results: 0
        eq_tapq:rebalance_169:bg_wait_for_results: false
        eq_tapq:rebalance_169:complete: false
        eq_tapq:rebalance_169:connected: true
        eq_tapq:rebalance_169:created: 1019807
        eq_tapq:rebalance_169:empty: false
        eq_tapq:rebalance_169:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
        eq_tapq:rebalance_169:has_item: false
        eq_tapq:rebalance_169:has_queued_item: true
        eq_tapq:rebalance_169:idle: false
        eq_tapq:rebalance_169:num_tap_nack: 0
        eq_tapq:rebalance_169:num_tap_tmpfail_survivors: 0
        eq_tapq:rebalance_169:paused: 1
        eq_tapq:rebalance_169:pending_backfill: false
        eq_tapq:rebalance_169:pending_disconnect: false
        eq_tapq:rebalance_169:pending_disk_backfill: false
        eq_tapq:rebalance_169:qlen: 0
        eq_tapq:rebalance_169:qlen_high_pri: 0
        eq_tapq:rebalance_169:qlen_low_pri: 1
        eq_tapq:rebalance_169:queue_backfillremaining: 0
        eq_tapq:rebalance_169:queue_backoff: 0
        eq_tapq:rebalance_169:queue_drain: 41329
        eq_tapq:rebalance_169:queue_fill: 0
        eq_tapq:rebalance_169:queue_itemondisk: 0
        eq_tapq:rebalance_169:queue_memory: 0
        eq_tapq:rebalance_169:rec_fetched: 3609
        eq_tapq:rebalance_169:recv_ack_seqno: 41290
        eq_tapq:rebalance_169:reserved: 1
        eq_tapq:rebalance_169:seqno_ack_requested: 41290
        eq_tapq:rebalance_169:supports_ack: true
        eq_tapq:rebalance_169:suspended: false
        eq_tapq:rebalance_169:total_backlog_size: 1
        eq_tapq:rebalance_169:total_noops: 7935
        eq_tapq:rebalance_169:type: producer
        eq_tapq:rebalance_169:vb_filter: { 169 }
        eq_tapq:rebalance_169:vb_filters: 1

        chiyoung Chiyoung Seo added a comment -

        Fixed in the 1.8.1 branch.

        farshid Farshid Ghods (Inactive) added a comment -

        The cause was a bug in detecting backfill completion for a vbucket takeover during rebalance.

        For example, the following are the TAP stats for vbucket 685 takeover:

        eq_tapq:rebalance_685:ack_window_full: false
        eq_tapq:rebalance_685:backfill_completed: false
        ...
        eq_tapq:rebalance_685:pending_backfill: false
        eq_tapq:rebalance_685:pending_disconnect: false
        eq_tapq:rebalance_685:pending_disk_backfill: false
        eq_tapq:rebalance_685:queue_backfillremaining: 0

        You can see that there are no items remaining for backfill, but the "backfill_completed" flag is still false, which caused the takeover operation to get stuck.
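        The symptom described above can be checked mechanically against a TAP stats dump. The following is a minimal sketch, not the fix that went into 1.8.1: it assumes the stats are available as plain text in the "eq_tapq:<stream>:<stat>: <value>" form shown in this ticket, and the helper names parse_tap_stats and stuck_backfills are hypothetical. It flags any stream that reports no backfill work left while "backfill_completed" is still false.

```python
# Sample taken from the vbucket 685 stats quoted in this ticket.
SAMPLE = """\
eq_tapq:rebalance_685:ack_window_full: false
eq_tapq:rebalance_685:backfill_completed: false
eq_tapq:rebalance_685:pending_backfill: false
eq_tapq:rebalance_685:pending_disconnect: false
eq_tapq:rebalance_685:pending_disk_backfill: false
eq_tapq:rebalance_685:queue_backfillremaining: 0
"""


def parse_tap_stats(text):
    """Parse 'eq_tapq:<stream>:<stat>: <value>' lines into {stream: {stat: value}}."""
    streams = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("eq_tapq:"):
            continue  # skip blank lines, "..." elisions, and unrelated stats
        # Separate the value from the key, then split the key into its parts.
        key, _, value = line.partition(": ")
        _, stream, stat = key.split(":", 2)
        streams.setdefault(stream, {})[stat.rstrip(":")] = value.strip()
    return streams


def stuck_backfills(streams):
    """Return streams with nothing left to backfill but backfill_completed still false."""
    stuck = []
    for stream, stats in streams.items():
        nothing_left = (
            stats.get("pending_backfill") == "false"
            and stats.get("pending_disk_backfill") == "false"
            and stats.get("queue_backfillremaining") == "0"
        )
        if nothing_left and stats.get("backfill_completed") == "false":
            stuck.append(stream)
    return sorted(stuck)


if __name__ == "__main__":
    print(stuck_backfills(parse_tap_stats(SAMPLE)))
```

        A stream that is still backfilling would report a non-zero "queue_backfillremaining" or a pending backfill flag and would not be listed, so a single snapshot matching this condition is only a hint; re-checking a few seconds later helps rule out a stream that is about to flip the flag.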


          People

          • Assignee:
            chiyoung Chiyoung Seo
            Reporter:
            thuan Thuan Nguyen