Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-6084

ep-engine is not marking checkpoint as persisted in stats

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: None
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Labels:
      None

      Description

      When running my wip commit that ensures things are persisted on replica and new master before initiating takeover (http://review.couchbase.org/18798) I sometimes get rebalance 'stuck'.

      Seemingly because items are replicated & persisted (verified by looking at .couch files), but affected checkpoint is not marked as persisted.

      In this particular case it happened to vbucket 13.

      What my change is doing is I'm creating checkpoint (using create checkpoint command) after establishing replication stream. And then I'm waiting until checkpoint prior to that is persisted. But in this case stats tell me that checkpoint 0 is persisted instead of 2.

      Relevant stats:

      root@beta:~/src/altoros/moxi/ns_server/couch/0/default# ~/src/altoros/moxi/repo20/install/bin/cbstats lh:12000 raw 'checkpoint 13'
      vb_13:checkpoint_extension: false
      vb_13:eq_tapq:replication_building_13_'n_1@10.17.2.163':cursor_checkpoint_id: 3
      vb_13:last_closed_checkpoint_id: 2
      vb_13:num_checkpoint_items: 1
      vb_13:num_checkpoints: 1
      vb_13:num_items_for_persistence: 0
      vb_13:num_open_checkpoint_items: 0
      vb_13:num_tap_cursors: 1
      vb_13:open_checkpoint_id: 3
      vb_13:persisted_checkpoint_id: 2
      vb_13:state: active
      root@beta:~/src/altoros/moxi/ns_server/couch/0/default# ~/src/altoros/moxi/repo20/install/bin/cbstats lh:12002 raw 'checkpoint 13'
      vb_13:checkpoint_extension: false
      vb_13:last_closed_checkpoint_id: 2
      vb_13:num_checkpoint_items: 1
      vb_13:num_checkpoints: 1
      vb_13:num_items_for_persistence: 0
      vb_13:num_open_checkpoint_items: 0
      vb_13:num_tap_cursors: 0
      vb_13:open_checkpoint_id: 3
      vb_13:persisted_checkpoint_id: 0
      vb_13:state: replica
      root@beta:~/src/altoros/moxi/ns_server/couch/0/default# ~/src/altoros/moxi/repo20/install/bin/cbstats lh:12002 raw 'tap'
      ep_tap_ack_grace_period: 300
      ep_tap_ack_interval: 1000
      ep_tap_ack_window_size: 10
      ep_tap_backoff_period: 5
      ep_tap_bg_fetch_requeued: 0
      ep_tap_bg_fetched: 0
      ep_tap_bg_max_pending: 500
      ep_tap_count: 2
      ep_tap_deletes: 0
      ep_tap_fg_fetched: 0
      ep_tap_noop_interval: 20
      ep_tap_queue_backfillremaining: 0
      ep_tap_queue_backoff: 0
      ep_tap_queue_drain: 0
      ep_tap_queue_fill: 0
      ep_tap_queue_itemondisk: 0
      ep_tap_throttle_queue_cap: 1000000
      ep_tap_throttle_threshold: 90
      ep_tap_throttled: 0
      ep_tap_total_backlog_size: 0
      ep_tap_total_fetched: 0
      ep_tap_total_queue: 0
      eq_tapq:anon_1:connected: true
      eq_tapq:anon_1:created: 14
      eq_tapq:anon_1:num_checkpoint_end: 0
      eq_tapq:anon_1:num_checkpoint_end_failed: 0
      eq_tapq:anon_1:num_checkpoint_start: 13
      eq_tapq:anon_1:num_checkpoint_start_failed: 0
      eq_tapq:anon_1:num_delete: 0
      eq_tapq:anon_1:num_delete_failed: 0
      eq_tapq:anon_1:num_flush: 0
      eq_tapq:anon_1:num_flush_failed: 0
      eq_tapq:anon_1:num_mutation: 0
      eq_tapq:anon_1:num_mutation_failed: 0
      eq_tapq:anon_1:num_opaque: 2
      eq_tapq:anon_1:num_opaque_failed: 0
      eq_tapq:anon_1:num_unknown: 0
      eq_tapq:anon_1:num_vbucket_set: 0
      eq_tapq:anon_1:num_vbucket_set_failed: 0
      eq_tapq:anon_1:pending_disconnect: false
      eq_tapq:anon_1:reserved: 0
      eq_tapq:anon_1:supports_ack: true
      eq_tapq:anon_1:type: consumer
      eq_tapq:anon_2:connected: true
      eq_tapq:anon_2:created: 422
      eq_tapq:anon_2:num_checkpoint_end: 0
      eq_tapq:anon_2:num_checkpoint_end_failed: 0
      eq_tapq:anon_2:num_checkpoint_start: 1
      eq_tapq:anon_2:num_checkpoint_start_failed: 0
      eq_tapq:anon_2:num_delete: 0
      eq_tapq:anon_2:num_delete_failed: 0
      eq_tapq:anon_2:num_flush: 0
      eq_tapq:anon_2:num_flush_failed: 0
      eq_tapq:anon_2:num_mutation: 119
      eq_tapq:anon_2:num_mutation_failed: 0
      eq_tapq:anon_2:num_opaque: 4
      eq_tapq:anon_2:num_opaque_failed: 0
      eq_tapq:anon_2:num_unknown: 0
      eq_tapq:anon_2:num_vbucket_set: 0
      eq_tapq:anon_2:num_vbucket_set_failed: 0
      eq_tapq:anon_2:pending_disconnect: false
      eq_tapq:anon_2:reserved: 0
      eq_tapq:anon_2:supports_ack: true
      eq_tapq:anon_2:type: consumer
      root@beta:~/src/altoros/moxi/ns_server/couch/0/default#
      root@beta:~/src/altoros/moxi/ns_server/couch/0/default# ~/src/altoros/moxi/repo20/install/bin/cbstats lh:12000 raw 'tap'
      ep_tap_ack_grace_period: 300
      ep_tap_ack_interval: 1000
      ep_tap_ack_window_size: 10
      ep_tap_backoff_period: 5
      ep_tap_bg_fetch_requeued: 0
      ep_tap_bg_fetched: 0
      ep_tap_bg_max_pending: 500
      ep_tap_count: 2
      ep_tap_deletes: 0
      ep_tap_fg_fetched: 119
      ep_tap_noop_interval: 20
      ep_tap_queue_backfillremaining: 0
      ep_tap_queue_backoff: 0
      ep_tap_queue_drain: 119
      ep_tap_queue_fill: 0
      ep_tap_queue_itemondisk: 0
      ep_tap_throttle_queue_cap: 1000000
      ep_tap_throttle_threshold: 90
      ep_tap_throttled: 0
      ep_tap_total_backlog_size: 0
      ep_tap_total_fetched: 139
      ep_tap_total_queue: 0
      eq_tapq:replication_building_13_'n_1@10.17.2.163':ack_log_size: 0
      eq_tapq:replication_building_13_'n_1@10.17.2.163':ack_seqno: 129
      eq_tapq:replication_building_13_'n_1@10.17.2.163':ack_window_full: false
      eq_tapq:replication_building_13_'n_1@10.17.2.163':backfill_completed: true
      eq_tapq:replication_building_13_'n_1@10.17.2.163':bg_jobs_completed: 0
      eq_tapq:replication_building_13_'n_1@10.17.2.163':bg_jobs_issued: 0
      eq_tapq:replication_building_13_'n_1@10.17.2.163':bg_result_size: 0
      eq_tapq:replication_building_13_'n_1@10.17.2.163':connected: true
      eq_tapq:replication_building_13_'n_1@10.17.2.163':created: 422
      eq_tapq:replication_building_13_'n_1@10.17.2.163':flags: 85 (ack,backfill,vblist,checkpoints)
      eq_tapq:replication_building_13_'n_1@10.17.2.163':has_queued_item: false
      eq_tapq:replication_building_13_'n_1@10.17.2.163':idle: true
      eq_tapq:replication_building_13_'n_1@10.17.2.163':paused: 1
      eq_tapq:replication_building_13_'n_1@10.17.2.163':pending_backfill: false
      eq_tapq:replication_building_13_'n_1@10.17.2.163':pending_disconnect: false
      eq_tapq:replication_building_13_'n_1@10.17.2.163':pending_disk_backfill: false
      eq_tapq:replication_building_13_'n_1@10.17.2.163':qlen: 0
      eq_tapq:replication_building_13_'n_1@10.17.2.163':qlen_high_pri: 0
      eq_tapq:replication_building_13_'n_1@10.17.2.163':qlen_low_pri: 0
      eq_tapq:replication_building_13_'n_1@10.17.2.163':queue_backfillremaining: 0
      eq_tapq:replication_building_13_'n_1@10.17.2.163':queue_backoff: 0
      eq_tapq:replication_building_13_'n_1@10.17.2.163':queue_drain: 119
      eq_tapq:replication_building_13_'n_1@10.17.2.163':queue_fill: 0
      eq_tapq:replication_building_13_'n_1@10.17.2.163':queue_itemondisk: 0
      eq_tapq:replication_building_13_'n_1@10.17.2.163':queue_memory: 0
      eq_tapq:replication_building_13_'n_1@10.17.2.163':rec_fetched: 124
      eq_tapq:replication_building_13_'n_1@10.17.2.163':recv_ack_seqno: 128
      eq_tapq:replication_building_13_'n_1@10.17.2.163':reserved: 1
      eq_tapq:replication_building_13_'n_1@10.17.2.163':seqno_ack_requested: 128
      eq_tapq:replication_building_13_'n_1@10.17.2.163':supports_ack: true
      eq_tapq:replication_building_13_'n_1@10.17.2.163':suspended: false
      eq_tapq:replication_building_13_'n_1@10.17.2.163':total_backlog_size: 0
      eq_tapq:replication_building_13_'n_1@10.17.2.163':total_noops: 17
      eq_tapq:replication_building_13_'n_1@10.17.2.163':type: producer
      eq_tapq:replication_building_13_'n_1@10.17.2.163':vb_filter:

      { 13 }

      eq_tapq:replication_building_13_'n_1@10.17.2.163':vb_filters: 1
      eq_tapq:replication_n_1@10.17.2.163:ack_log_size: 0
      eq_tapq:replication_n_1@10.17.2.163:ack_seqno: 18
      eq_tapq:replication_n_1@10.17.2.163:ack_window_full: false
      eq_tapq:replication_n_1@10.17.2.163:backfill_completed: true
      eq_tapq:replication_n_1@10.17.2.163:bg_jobs_completed: 0
      eq_tapq:replication_n_1@10.17.2.163:bg_jobs_issued: 0
      eq_tapq:replication_n_1@10.17.2.163:bg_result_size: 0
      eq_tapq:replication_n_1@10.17.2.163:connected: true
      eq_tapq:replication_n_1@10.17.2.163:created: 14
      eq_tapq:replication_n_1@10.17.2.163:flags: 85 (ack,backfill,vblist,checkpoints)
      eq_tapq:replication_n_1@10.17.2.163:has_queued_item: false
      eq_tapq:replication_n_1@10.17.2.163:idle: true
      eq_tapq:replication_n_1@10.17.2.163:paused: 1
      eq_tapq:replication_n_1@10.17.2.163:pending_backfill: false
      eq_tapq:replication_n_1@10.17.2.163:pending_disconnect: false
      eq_tapq:replication_n_1@10.17.2.163:pending_disk_backfill: false
      eq_tapq:replication_n_1@10.17.2.163:qlen: 0
      eq_tapq:replication_n_1@10.17.2.163:qlen_high_pri: 0
      eq_tapq:replication_n_1@10.17.2.163:qlen_low_pri: 0
      eq_tapq:replication_n_1@10.17.2.163:queue_backfillremaining: 0
      eq_tapq:replication_n_1@10.17.2.163:queue_backoff: 0
      eq_tapq:replication_n_1@10.17.2.163:queue_drain: 0
      eq_tapq:replication_n_1@10.17.2.163:queue_fill: 0
      eq_tapq:replication_n_1@10.17.2.163:queue_itemondisk: 0
      eq_tapq:replication_n_1@10.17.2.163:queue_memory: 0
      eq_tapq:replication_n_1@10.17.2.163:rec_fetched: 15
      eq_tapq:replication_n_1@10.17.2.163:recv_ack_seqno: 17
      eq_tapq:replication_n_1@10.17.2.163:reserved: 1
      eq_tapq:replication_n_1@10.17.2.163:seqno_ack_requested: 17
      eq_tapq:replication_n_1@10.17.2.163:supports_ack: true
      eq_tapq:replication_n_1@10.17.2.163:suspended: false
      eq_tapq:replication_n_1@10.17.2.163:total_backlog_size: 0
      eq_tapq:replication_n_1@10.17.2.163:total_noops: 36
      eq_tapq:replication_n_1@10.17.2.163:type: producer
      eq_tapq:replication_n_1@10.17.2.163:vb_filter:

      { [0,12] }

      eq_tapq:replication_n_1@10.17.2.163:vb_filters: 13

      1. diag.gz
        691 kB
        Aleksey Kondratenko
      2. diag1.gz
        547 kB
        Aleksey Kondratenko
      No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

        Hide
        peter peter added a comment -

        Hi Chiyoung, can you look into this symptom which is required for the consistent views feature complete? Thank you.

        Show
        peter peter added a comment - Hi Chiyoung, can you look into this symptom which is required for the consistent views feature complete? Thank you.
        Hide
        chiyoung Chiyoung Seo added a comment -

        I don't think I should fix this issue by this sprint. Persistence and checkpoint are working fine at this time. I don't care any new issues introduced by external new features at this time. Please forward this comment to the entire team if you guys want to broadcast.

        Show
        chiyoung Chiyoung Seo added a comment - I don't think I should fix this issue by this sprint. Persistence and checkpoint are working fine at this time. I don't care any new issues introduced by external new features at this time. Please forward this comment to the entire team if you guys want to broadcast.
        Show
        chiyoung Chiyoung Seo added a comment - http://review.couchbase.org/#change,19355
        Hide
        thuan Thuan Nguyen added a comment -

        Integrated in github-ep-engine-2-0 #389 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/389/)
        MB-6084 Reset checkpoint cursors after receiving backfill streams (Revision bb22966b48eeb2e16256cf3b40ea379ab8e72a00)

        Result = SUCCESS
        Chiyoung Seo :
        Files :

        • checkpoint.cc
        Show
        thuan Thuan Nguyen added a comment - Integrated in github-ep-engine-2-0 #389 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/389/ ) MB-6084 Reset checkpoint cursors after receiving backfill streams (Revision bb22966b48eeb2e16256cf3b40ea379ab8e72a00) Result = SUCCESS Chiyoung Seo : Files : checkpoint.cc

          People

          • Assignee:
            chiyoung Chiyoung Seo
            Reporter:
            alkondratenko Aleksey Kondratenko (Inactive)
          • Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes