Couchbase Server / MB-4517

Rebalance Stuck on 1.8 after adding nodes (rebalance gets stuck even if ack_seqno is correct and has_queued_item is true and total_backlog_size > 1000)

    Details

      Description

      Added 5 nodes to a 10-node cluster.

      TAP stats from the node that has one tap connection stuck.
      /opt/couchbase/bin/cbstats localhost:11210 tap |grep rebalance

      eq_tapq:rebalance_725:ack_log_size: 248
      eq_tapq:rebalance_725:ack_playback_size: 248
      eq_tapq:rebalance_725:ack_seqno: 58249
      eq_tapq:rebalance_725:ack_window_full: false
      eq_tapq:rebalance_725:backfill_completed: false
      eq_tapq:rebalance_725:bg_backlog_size: 0
      eq_tapq:rebalance_725:bg_jobs_completed: 43790
      eq_tapq:rebalance_725:bg_jobs_issued: 43790
      eq_tapq:rebalance_725:bg_queue_size: 0
      eq_tapq:rebalance_725:bg_queued: 43790
      eq_tapq:rebalance_725:bg_result_size: 0
      eq_tapq:rebalance_725:bg_results: 0
      eq_tapq:rebalance_725:bg_wait_for_results: false
      eq_tapq:rebalance_725:complete: false
      eq_tapq:rebalance_725:connected: true
      eq_tapq:rebalance_725:created: 5423
      eq_tapq:rebalance_725:empty: false
      eq_tapq:rebalance_725:flags: 93 (ack,backfill,vblist,takeover,checkpoints)
      eq_tapq:rebalance_725:has_item: false
      eq_tapq:rebalance_725:has_queued_item: true
      eq_tapq:rebalance_725:idle: false
      eq_tapq:rebalance_725:num_tap_nack: 0
      eq_tapq:rebalance_725:num_tap_tmpfail_survivors: 0
      eq_tapq:rebalance_725:paused: 1
      eq_tapq:rebalance_725:pending_backfill: false
      eq_tapq:rebalance_725:pending_disconnect: false
      eq_tapq:rebalance_725:pending_disk_backfill: false
      eq_tapq:rebalance_725:qlen: 0
      eq_tapq:rebalance_725:qlen_high_pri: 0
      eq_tapq:rebalance_725:qlen_low_pri: 1
      eq_tapq:rebalance_725:queue_backfillremaining: 0
      eq_tapq:rebalance_725:queue_backoff: 0
      eq_tapq:rebalance_725:queue_drain: 58240
      eq_tapq:rebalance_725:queue_fill: 0
      eq_tapq:rebalance_725:queue_itemondisk: 0
      eq_tapq:rebalance_725:queue_memory: 0
      eq_tapq:rebalance_725:rec_fetched: 14710
      eq_tapq:rebalance_725:recv_ack_seqno: 58000
      eq_tapq:rebalance_725:reserved: 1
      eq_tapq:rebalance_725:seqno_ack_requested: 58000
      eq_tapq:rebalance_725:supports_ack: true
      eq_tapq:rebalance_725:suspended: false
      eq_tapq:rebalance_725:total_backlog_size: 10327
      eq_tapq:rebalance_725:total_noops: 836
      eq_tapq:rebalance_725:type: producer
      eq_tapq:rebalance_725:vb_filter: { 725 }
      eq_tapq:rebalance_725:vb_filters: 1
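
To make the stuck state easier to spot in a dump like the one above, here is a minimal Python sketch that parses saved `cbstats ... tap` output and flags connections matching the conditions in the issue title. The function names `parse_tap_stats` and `looks_stuck` are hypothetical, and the check is only a heuristic built from this ticket (not Couchbase tooling); the 1000 threshold is taken from the title.

```python
def parse_tap_stats(text):
    """Group 'connection:stat: value' lines by connection name.

    Lines without a ': ' separator are skipped.
    """
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if not sep:
            continue
        # The stat name is the part after the last ':' in the key.
        conn, _, name = key.rpartition(":")
        stats.setdefault(conn, {})[name] = value
    return stats


def looks_stuck(conn_stats, backlog_threshold=1000):
    """Heuristic (assumption): incomplete, paused, items queued, large backlog."""
    return (
        conn_stats.get("complete") == "false"
        and conn_stats.get("has_queued_item") == "true"
        and conn_stats.get("paused") == "1"
        and int(conn_stats.get("total_backlog_size", "0")) > backlog_threshold
    )
```

Applied to the stats in this report, `looks_stuck(parse_tap_stats(output)["eq_tapq:rebalance_725"])` would evaluate to true, since the connection is paused with `has_queued_item: true` and `total_backlog_size: 10327`.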

      /opt/couchbase/bin/cbstats localhost:11210 all |grep mem
      ep_diskqueue_memory: 0
      ep_mem_high_wat: 7864320000
      ep_mem_low_wat: 6291456000
      mem_used: 6270355533
      vb_active_ht_memory: 25611040
      vb_active_itm_memory: 4777690775
      vb_active_perc_mem_resident: 32
      vb_active_queue_memory: 0
      vb_pending_ht_memory: 0
      vb_pending_itm_memory: 0
      vb_pending_perc_mem_resident: 0
      vb_pending_queue_memory: 0
      vb_replica_ht_memory: 17336480
      vb_replica_itm_memory: 1221834713
      vb_replica_perc_mem_resident: 11
      vb_replica_queue_memory: 0
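
For reference, the memory figures above put `mem_used` just under `ep_mem_low_wat`. A short Python sketch of that arithmetic, using only the values reported here (the helper name `headroom_bytes` is hypothetical):

```python
# Values copied from the cbstats output in this report.
mem_used = 6270355533
ep_mem_low_wat = 6291456000
ep_mem_high_wat = 7864320000


def headroom_bytes(used, watermark):
    """Bytes remaining before `used` reaches `watermark`."""
    return watermark - used


low_headroom = headroom_bytes(mem_used, ep_mem_low_wat)    # 21100467 bytes
high_headroom = headroom_bytes(mem_used, ep_mem_high_wat)  # 1593964467 bytes
print(f"below low watermark by {low_headroom} bytes "
      f"({mem_used / ep_mem_low_wat:.2%} of ep_mem_low_wat)")
```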

      1. Screen Shot 2011-12-07 at 11.01.17 AM.png
        370 kB
        Karan Kumar
      2. Screen Shot 2011-12-07 at 10.59.10 AM.png
        365 kB
        Karan Kumar
      3. Screen Shot 2011-12-07 at 10.58.41 AM.png
        410 kB
        Karan Kumar

          Activity

          karan Karan Kumar (Inactive) added a comment -

          I have not been able to reproduce this with the 1.8.0r-51 build. I will keep running more rebalance tests on this build.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          We believe this was caused by ebucketmigrator not buffering enough nacks in its upstream direction, which led to a deadlock: ebucketmigrator was waiting for the upstream memcached/ep-engine to consume the nacks it had sent, while memcached/ep-engine was waiting for ebucketmigrator to consume the tap messages sent to it.

          The fix is here: http://review.couchbase.org/11859 and is already merged.
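
The deadlock described above can be illustrated with a toy model. This is a minimal Python sketch, not ebucketmigrator or ep-engine code: the function name, the pipe capacities, and the simplifications (every tap message triggers one nack, and the producer reads nacks only once it has nothing left to send) are all assumptions made for illustration.

```python
from collections import deque


def rebalance_completes(nack_buffer_limit, messages=20, pipe_capacity=2):
    """Return True if every nack is eventually read, False on deadlock."""
    tap_pipe = deque()    # tap messages in flight: producer -> migrator
    nack_pipe = deque()   # nacks in flight: migrator -> producer
    buffered_nacks = 0    # nacks held inside the migrator
    sent = nacks_read = 0

    while nacks_read < messages:
        progressed = False

        # Producer: blocked while the tap pipe is full; it reads nacks
        # only after all tap messages have been written (assumption).
        if sent < messages:
            if len(tap_pipe) < pipe_capacity:
                tap_pipe.append(sent)
                sent += 1
                progressed = True
        elif nack_pipe:
            nack_pipe.popleft()
            nacks_read += 1
            progressed = True

        # Migrator: flush buffered nacks upstream while the pipe has room.
        while buffered_nacks and len(nack_pipe) < pipe_capacity:
            nack_pipe.append("nack")
            buffered_nacks -= 1
            progressed = True

        # Migrator: consume a tap message only if the resulting nack
        # still fits in its internal buffer.
        if tap_pipe and buffered_nacks < nack_buffer_limit:
            tap_pipe.popleft()
            buffered_nacks += 1
            progressed = True

        if not progressed:
            return False  # both sides are waiting on each other
    return True
```

In this model, `rebalance_completes(1)` stalls once both pipes and the small nack buffer are full, while `rebalance_completes(20)` finishes because the migrator can always keep consuming tap messages.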

          karan Karan Kumar (Inactive) added a comment -

          I have not been able to hit this issue again.

          karan Karan Kumar (Inactive) added a comment -

          Closing old tickets.

          perry Perry Krug added a comment -

          Reopening temporarily


            People

            • Assignee:
              karan Karan Kumar (Inactive)
              Reporter:
              karan Karan Kumar (Inactive)