Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-2676

Data loss after startup

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 1.7.0
    • Sprint ending 2010-11-18
    • couchbase-bucket
    • Security Level: Public
    • None

    Description

      Steps to reproduce:

      • load a 2 node cluster with 10M items on a default bucket of 10G
      • Observe the write queue grow to few million items, item count reach 8.2M
      • Shut down one server from command line
      • item count and write queue dropped by half, server was down immediately (no memcached process)
      • start the server, the item count when up only to ~7.3M write queue still had a few million items
      • six hours later, write queue is empty, system is stable (no load), item count is still 7.3 Million
        Diag logs from the server are attached. I reproduced it twice, so seems easy to reproduce.

      From: Sharon Barr
      Sent: Thursday, November 04, 2010 11:53 PM
      To: Dustin Sallings
      Cc: Chiyoung Seo
      Subject: RE: Loosing data during shut down

      Ok, it just happened again.
      I was loading my 2 node cluster with 10M items. had 8.45M according to stats, and about 8M disk write queue.
      Shut down 10.2.1.13, it went down immediately, memcached process was not alive.
      Number of items went down to ~4.5M.
      Starting 10.2.1.13, start quite fast, but total items is now 7.3M (lost 1.2M items).
      Attached are the diags, feel free to log in to the node, I am leaving the cluster as is for tonight.

      Btw, the cluster is idle now, http://10.2.1.14:8091 (password is password), but the item persist per second is around 30, which seems very slow for idle cluster. Write queue size is 8M, all of it is on 10.2.1.14.

      Sharon.

      Here are the stats from both nodes as localhost won't work (if this is how diag is collecting it).

      [root@localhost ~]# for i in

      {10.2.1.13,10.2.1.14}

      ; do echo $i; /opt/membase/bin/ep_engine/management/stats $i:11210 all ; done10.2.1.13
      auth_cmds: 2
      auth_errors: 0
      bucket_conns: 4
      bytes_read: 284126626
      bytes_written: 5350495
      cas_badval: 0
      cas_hits: 0
      cas_misses: 0
      cmd_flush: 0
      cmd_get: 0
      cmd_set: 0
      conn_yields: 13727
      connection_structures: 24
      curr_connections: 24
      curr_items: 3082117
      curr_items_tot: 3438428
      daemon_connections: 10
      decr_hits: 0
      decr_misses: 0
      delete_hits: 0
      delete_misses: 0
      ep_bg_fetched: 0
      ep_commit_num: 29
      ep_commit_time: 0
      ep_commit_time_total: 11
      ep_data_age: 2
      ep_data_age_highwat: 5
      ep_db_strategy: multiDB
      ep_dbinit: 69
      ep_dbname: /mnt/opt/membase/1.6.2/data/ns_1/default
      ep_dbshards: 4
      ep_expired: 0
      ep_flush_duration: 1
      ep_flush_duration_highwat: 1
      ep_flush_duration_total: 21
      ep_flush_preempts: 0
      ep_flusher_state: running
      ep_flusher_todo: 1691
      ep_io_num_read: 3106853
      ep_io_num_write: 348973
      ep_io_read_bytes: 2356869802
      ep_io_write_bytes: 264676455
      ep_item_begin_failed: 0
      ep_item_commit_failed: 0
      ep_item_flush_expired: 0
      ep_item_flush_failed: 0
      ep_kv_size: 2900487034
      ep_max_data_size: 7340032000
      ep_max_txn_size: 250000
      ep_mem_high_wat: 5505024000
      ep_mem_low_wat: 4404019200
      ep_min_data_age: 0
      ep_num_eject_failures: 0
      ep_num_expiry_pager_runs: 0
      ep_num_non_resident: 0
      ep_num_not_my_vbuckets: 0
      ep_num_pager_runs: 0
      ep_num_value_ejects: 0
      ep_oom_errors: 0
      ep_overhead: 26023752
      ep_pending_ops: 0
      ep_pending_ops_max: 0
      ep_pending_ops_max_duration: 0
      ep_pending_ops_total: 0
      ep_queue_age_cap: 900
      ep_queue_size: 5892
      ep_storage_age: 1
      ep_storage_age_highwat: 2
      ep_storage_type: featured
      ep_tap_keepalive: 0
      ep_tmp_oom_errors: 0
      ep_too_old: 0
      ep_too_young: 0
      ep_total_cache_size: 2900488201
      ep_total_del_items: 0
      ep_total_enqueued: 356326
      ep_total_new_items: 348745
      ep_total_persisted: 348745
      ep_vbucket_del: 225
      ep_vbucket_del_avg_walltime: 29025804
      ep_vbucket_del_fail: 0
      ep_vbucket_del_max_walltime: 53811481
      ep_vbucket_del_total_walltime: 6530806001
      ep_version: 1.6.2_31_g4fc276c
      ep_warmed_up: 3105829
      ep_warmup: true
      ep_warmup_dups: 0
      ep_warmup_oom: 0
      ep_warmup_thread: complete
      ep_warmup_time: 25246091
      get_hits: 0
      get_misses: 0
      incr_hits: 0
      incr_misses: 0
      libevent: 1.4.13-stable
      limit_maxbytes: 67108864
      mem_used: 2926510786
      pid: 20375
      pointer_size: 64
      rejected_conns: 0
      rusage_system: 25.794078
      rusage_user: 29.512513
      tap_connect_received: 1
      tap_mutation_received: 356337
      tap_mutation_sent: 3082117
      tap_opaque_sent: 1
      threads: 4
      time: 1288939897
      total_connections: 39
      uptime: 670
      version: 1.4.4_305_g1126681
      10.2.1.14
      auth_cmds: 1445
      auth_errors: 0
      bucket_conns: 4
      bytes_read: 58453418460
      bytes_written: 38465409625
      cas_badval: 0
      cas_hits: 0
      cas_misses: 0
      cmd_flush: 0
      cmd_get: 47973755
      cmd_set: 27957775
      conn_yields: 1513895
      connection_structures: 33
      curr_connections: 14
      curr_items: 4225958
      curr_items_tot: 8451922
      daemon_connections: 10
      decr_hits: 0
      decr_misses: 0
      delete_hits: 0
      delete_misses: 0
      ep_bg_fetched: 1517061
      ep_bg_load_avg: 70
      ep_bg_max_load: 88068
      ep_bg_max_wait: 13081324
      ep_bg_min_load: 24
      ep_bg_min_wait: 15
      ep_bg_num_samples: 1517061
      ep_bg_wait_avg: 1364
      ep_commit_num: 411
      ep_commit_time: 9
      ep_commit_time_total: 1926
      ep_data_age: 214
      ep_data_age_highwat: 6821
      ep_db_strategy: multiDB
      ep_dbinit: 1
      ep_dbname: /opt/membase/1.6.2/data/ns_1/default
      ep_dbshards: 4
      ep_expired: 0
      ep_flush_duration: 17
      ep_flush_duration_highwat: 5306
      ep_flush_duration_total: 17414
      ep_flush_preempts: 5821
      ep_flusher_state: running
      ep_flusher_todo: 34165
      ep_io_num_read: 7036420
      ep_io_num_write: 15825616
      ep_io_read_bytes: 5340284472
      ep_io_write_bytes: 12006557302
      ep_item_begin_failed: 0
      ep_item_commit_failed: 0
      ep_item_flush_expired: 0
      ep_item_flush_failed: 0
      ep_kv_size: 7125130246
      ep_max_data_size: 7340032000
      ep_max_txn_size: 250000
      ep_mem_high_wat: 5505024000
      ep_mem_low_wat: 4404019200
      ep_min_data_age: 0
      ep_num_eject_failures: 407646963
      ep_num_expiry_pager_runs: 5
      ep_num_non_resident: 456403
      ep_num_not_my_vbuckets: 1889
      ep_num_pager_runs: 211
      ep_num_value_ejects: 5380201
      ep_oom_errors: 0
      ep_overhead: 286479509
      ep_pending_ops: 0
      ep_pending_ops_max: 1
      ep_pending_ops_max_duration: 538
      ep_pending_ops_total: 2
      ep_queue_age_cap: 900
      ep_queue_size: 8410908
      ep_storage_age: 2154
      ep_storage_age_highwat: 7092
      ep_storage_type: featured
      ep_tap_keepalive: 0
      ep_tmp_oom_errors: 2292
      ep_too_old: 9323623
      ep_too_young: 0
      ep_total_cache_size: 14823359086
      ep_total_del_items: 0
      ep_total_enqueued: 39301350
      ep_total_new_items: 10753641
      ep_total_persisted: 15823557
      ep_vbucket_del: 973
      ep_vbucket_del_avg_walltime: 565799013
      ep_vbucket_del_fail: 0
      ep_vbucket_del_max_walltime: 1620564892
      ep_vbucket_del_total_walltime: 550522440000
      ep_version: 1.6.2_31_g4fc276c
      ep_warmed_up: 0
      ep_warmup: true
      ep_warmup_dups: 0
      ep_warmup_oom: 0
      ep_warmup_thread: complete
      ep_warmup_time: 188
      get_hits: 47973755
      get_misses: 0
      incr_hits: 0
      incr_misses: 0
      libevent: 1.4.13-stable
      limit_maxbytes: 67108864
      mem_used: 7411609755
      pid: 25274
      pointer_size: 64
      rejected_conns: 0
      rusage_system: 3922.763649
      rusage_user: 1913.485106
      tap_connect_received: 708
      tap_mutation_received: 44095079
      tap_mutation_sent: 27013403
      tap_opaque_received: 7
      tap_opaque_sent: 23
      tap_vbucket_set_received: 1364
      tap_vbucket_set_sent: 1364
      threads: 4
      time: 1288939895
      total_connections: 2199
      uptime: 21054
      version: 1.4.4_305_g1126681
      [root@localhost ~]#

      From: Dustin Sallings
      Sent: Thursday, November 04, 2010 11:13 PM
      To: Sharon Barr
      Cc: Chiyoung Seo
      Subject: Re: Loosing data during shut down

      On Nov 4, 2010, at 23:07, Sharon Barr wrote:

      I had a 2 node cluster with a lot of items on the write queue (~4 million), with a painful 30item_persist_per_sec, with bursts of few thousands every now and then.
      I shut down the server, went very fast, and restarted it.
      After restart, half of my items were gone. After clicking failover of that node, my items came back because they were available on the replica.

      I thought that during shut down you are still trying to flush all the items to disk, what is the expected behavior for this scenario (I hope that losing items is not it..)?

      No, that's a daemon task and it's got a special shutdown sequence. It's supposed to complete the whole sequence before shutdown. If it didn't, I'd suspect a crash. I've not seen it not do that in a very long time.

      Many of the unit tests rely on this behavior, for one. I'd certainly want to look into that if it's reproducible (hopefully you can find a crash in browse_logs)


      Dustin Sallings

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            dustin Dustin Sallings (Inactive)
            sharon Sharon Barr (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty