  1. Couchbase Server
  2. MB-4828

Rebalancing multiple nodes can hang if a bucket has fewer than 100k items, due to a race condition in tap takeover

    Details

      Description

      This was observed by one of our users, who had 10 buckets. Some buckets had fewer than 10k items, and tap takeover got stuck.

      tap stats:

      6179: vb_1014:cursor_checkpoint_id:eq_tapq:rebalance_1014: 1
      97312: eq_tapq:rebalance_1014:ack_log_size: 0
      97313: eq_tapq:rebalance_1014:ack_playback_size: 0
      97314: eq_tapq:rebalance_1014:ack_seqno: 10
      97315: eq_tapq:rebalance_1014:ack_window_full: false
      97316: eq_tapq:rebalance_1014:backfill_completed: false
      97317: eq_tapq:rebalance_1014:bg_backlog_size: 0
      97318: eq_tapq:rebalance_1014:bg_jobs_completed: 0
      97319: eq_tapq:rebalance_1014:bg_jobs_issued: 0
      97320: eq_tapq:rebalance_1014:bg_queued: 0
      97321: eq_tapq:rebalance_1014:bg_result_size: 0
      97322: eq_tapq:rebalance_1014:bg_results: 0
      97323: eq_tapq:rebalance_1014:bg_wait_for_results: false
      97324: eq_tapq:rebalance_1014:complete: false
      97325: eq_tapq:rebalance_1014:connected: true
      97326: eq_tapq:rebalance_1014:created: 1272317
      97327: eq_tapq:rebalance_1014:empty: false
      97328: eq_tapq:rebalance_1014:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
      97329: eq_tapq:rebalance_1014:has_item: false
      97330: eq_tapq:rebalance_1014:has_queued_item: true
      97331: eq_tapq:rebalance_1014:idle: false
      97332: eq_tapq:rebalance_1014:num_tap_nack: 0
      97333: eq_tapq:rebalance_1014:num_tap_tmpfail_survivors: 0
      97334: eq_tapq:rebalance_1014:paused: 1
      97335: eq_tapq:rebalance_1014:pending_backfill: false
      97336: eq_tapq:rebalance_1014:pending_disconnect: false
      97337: eq_tapq:rebalance_1014:pending_disk_backfill: false
      97338: eq_tapq:rebalance_1014:qlen: 0
      97339: eq_tapq:rebalance_1014:qlen_high_pri: 0
      97340: eq_tapq:rebalance_1014:qlen_low_pri: 1
      97341: eq_tapq:rebalance_1014:queue_backfillremaining: 0
      97342: eq_tapq:rebalance_1014:queue_backoff: 0
      97343: eq_tapq:rebalance_1014:queue_drain: 0
      97344: eq_tapq:rebalance_1014:queue_fill: 0
      97345: eq_tapq:rebalance_1014:queue_itemondisk: 0
      97346: eq_tapq:rebalance_1014:queue_memory: 0
      97347: eq_tapq:rebalance_1014:rec_fetched: 5
      97348: eq_tapq:rebalance_1014:recv_ack_seqno: 8
      97349: eq_tapq:rebalance_1014:reserved: 1
      97350: eq_tapq:rebalance_1014:seqno_ack_requested: 9
      97351: eq_tapq:rebalance_1014:supports_ack: true
      97352: eq_tapq:rebalance_1014:suspended: false
      97353: eq_tapq:rebalance_1014:total_backlog_size: 10
      97354: eq_tapq:rebalance_1014:total_noops: 20036
      97355: eq_tapq:rebalance_1014:type: producer
      97356: eq_tapq:rebalance_1014:vb_filter: { 1014 }
      97357: eq_tapq:rebalance_1014:vb_filters: 1
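
      The pattern in the stats above (backfill not marked complete, yet nothing pending or remaining) can be checked mechanically. The following is a minimal sketch in Python, assuming the stats are available as plain text in the same "key: value" form as the paste. The function names `parse_tap_stats` and `looks_stuck` are hypothetical, and the heuristic is an assumption drawn from this one report, not an official ep-engine diagnostic.

```python
import re

# Optional "NNNNN: " prefix (as in the paste above), then "key: value".
LINE_RE = re.compile(r"^\s*(?:\d+:\s+)?(?P<key>\S+):\s*(?P<value>.*)$")


def parse_tap_stats(text, prefix):
    """Return {stat_name: value} for every line whose key starts with prefix."""
    stats = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match and match.group("key").startswith(prefix):
            name = match.group("key")[len(prefix):]
            stats[name] = match.group("value").strip()
    return stats


def looks_stuck(stats):
    """Heuristic only (an assumption based on this report):
    backfill is not marked complete although no backfill work is left."""
    return (
        stats.get("backfill_completed") == "false"
        and stats.get("pending_backfill") == "false"
        and stats.get("pending_disk_backfill") == "false"
        and stats.get("queue_backfillremaining") == "0"
        and stats.get("bg_jobs_issued") == stats.get("bg_jobs_completed")
    )


# A few lines copied from the stats in this ticket.
SAMPLE = """\
97316: eq_tapq:rebalance_1014:backfill_completed: false
97318: eq_tapq:rebalance_1014:bg_jobs_completed: 0
97319: eq_tapq:rebalance_1014:bg_jobs_issued: 0
97335: eq_tapq:rebalance_1014:pending_backfill: false
97337: eq_tapq:rebalance_1014:pending_disk_backfill: false
97341: eq_tapq:rebalance_1014:queue_backfillremaining: 0
"""

stats = parse_tap_stats(SAMPLE, "eq_tapq:rebalance_1014:")
print(looks_stuck(stats))
```

      A single snapshot cannot distinguish a hung takeover from one that is about to finish, so in practice the check would be repeated over time before drawing a conclusion.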


        Activity

        chiyoung Chiyoung Seo added a comment - http://review.couchbase.org/#change,13562
        thuan Thuan Nguyen added a comment -

        Integrated in github-ep-engine-2-0 #205 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/205/)
        MB-4828 Check backfill completion in TapProducer::nextFgFetched() (Revision 4140edecc57912851998f30e9bbe076ddff96fc5)

        Result = SUCCESS
        Chiyoung Seo :
        Files :

        • tapconnection.cc
        chiyoung Chiyoung Seo added a comment -

        There is a very small time window that causes a race condition in detecting backfill completion for a vbucket takeover with a small number of items (e.g., 10 items per vbucket). The fix for this issue is now in Gerrit for review:

        http://review.couchbase.org/#change,13562

        Farshid plans to reproduce this issue on a Windows cluster.

        farshid Farshid Ghods (Inactive) added a comment -

        The workaround is to pad the bucket with more items (100k).
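
        The padding workaround could be scripted roughly as follows. This is a sketch only: `store` stands in for whatever set operation your client library provides, and `padding_items`, `pad_bucket`, the `pad_` key prefix, and the value size are all hypothetical choices made for illustration.

```python
def padding_items(count, prefix="pad_", value_size=32):
    """Yield `count` throwaway key/value pairs (key prefix is hypothetical)."""
    for i in range(count):
        yield f"{prefix}{i:07d}", "x" * value_size


def pad_bucket(store, current_items, target_items=100_000):
    """Call store(key, value) until the bucket would hold target_items.

    `store` is any callable supplied by the caller, for example a thin
    wrapper around a client's set operation. Returns the number added.
    """
    missing = max(0, target_items - current_items)
    for key, value in padding_items(missing):
        store(key, value)
    return missing


# Usage with an in-memory dict standing in for a real bucket.
fake_bucket = {}
added = pad_bucket(fake_bucket.__setitem__, current_items=99_990)
print(added, len(fake_bucket))
```

        Using a recognizable key prefix makes it straightforward to delete the padding items again once a build containing the fix is deployed.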


          People

          • Assignee:
            chiyoung Chiyoung Seo
            Reporter:
            farshid Farshid Ghods (Inactive)