  1. Couchbase Server
  2. MB-10844

KV+XDCR (TAP) : system test - item count mismatch (memcached connection requests time out)


Details

    • Bug
    • Resolution: Duplicate
    • Blocker
    • 3.0
    • 3.0
    • couchbase-bucket, XDCR
    • Security Level: Public
    • CentOS, 64-bit, 15 GB RAM nodes, 4 cores
    • Untriaged
    • No
    • June 30 - July 18

    Description

      Setup
      --------
      8 * 8 clusters with XDCR
      3 buckets - 2 non-sasl buckets and 1 saslbucket
      1. standardbucket (bi-directional - 6GB) <---the bucket under discussion
      2. standardbucket1 (uni-directional 5GB)
      3. saslbucket (no xdcr 1 GB)

      Source cluster - http://172.23.105.44:8091/index.html
      Destination cluster - http://172.23.105.55:8091/index.html

      Scenario
      -------------

      • Did the following:

      1. Load on both clusters till vb_active_resident_items_ratio < 50.
      2. Access phase with 98% gets, 2% sets, running for 3 hours
      3. Rebalance-out 1 node at cluster1 with workload
      4. Rebalance-in 1 node at cluster1 with workload
      5. Failover and remove node at cluster1 with workload
      6. Failover and add-back node at cluster1 with workload
      7. Rebalance-out 1 node at cluster2 with workload
      8. Rebalance-in 1 node at cluster2 with workload
      9. Failover and remove node at cluster2 with workload
      10. Failover and add-back node at cluster2 with workload
      11. Soft restart all nodes in cluster1 one by one with workload
      12. Soft restart all nodes in cluster2 one by one with workload

      1.5 days after the completion of the test:
      number of items in 'standardbucket' in C1 - 67309284
      number of items in 'standardbucket' in C2 - 67513466

      Missing items : 204182 at C1
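
      The gap can be double-checked with a few lines of Python. This is only a sketch of the arithmetic using the two counts reported above; the helper name `count_gap` is made up for illustration.

```python
# Item counts reported for 'standardbucket' in this ticket.
c1_items = 67309284  # cluster C1
c2_items = 67513466  # cluster C2


def count_gap(a, b):
    """Return (absolute difference, difference as a percentage of the larger count)."""
    diff = abs(a - b)
    return diff, 100.0 * diff / max(a, b)


missing, pct = count_gap(c1_items, c2_items)
print(f"missing items: {missing} ({pct:.3f}% of the larger count)")
```

      The difference is a small fraction of the bucket, which is why it only shows up when the totals are compared directly.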

      There are no outbound replications on either cluster, XDCR queue size = 0, and XDCR percent complete = 100%. Many checkpointing errors are seen on both C1 and C2.
      C1 - http://172.23.105.44:8091/index.html#sec=buckets
      C2 - http://172.23.105.54:8091/index.html#sec=buckets

      The item counts for standardbucket on the source and destination clusters do not match, yet there are no items in the replication queue on either cluster.

      Views with revid info (sorry, I ran out of disk space trying to download and compare 120M items; trying other machines)
      -------------------------------
      http://172.23.105.44:8092/standardbucket/_design/ddoc/_view/_all_doc?descending=false&stale=false&connection_timeout=60000&limit=100000000&skip=0

      http://172.23.105.54:8092/standardbucket/_design/ddco1/_view/54_docs?descending=false&stale=false&connection_timeout=60000&limit=100000000&skip=0
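
      One way to compare the two view dumps without holding 120M rows at once is a streaming merge over key-sorted rows. The sketch below assumes each dump has already been reduced to (key, revid) pairs sorted by key in the same order on both sides; the function name `diff_sorted` and the sample rows are hypothetical, not taken from the clusters.

```python
def diff_sorted(left, right):
    """Yield (key, left_rev, right_rev) for keys that differ between two sides.

    Both inputs must be iterables of (key, rev) tuples sorted by key in the
    same order. A rev of None means the key is absent on that side.
    """
    left_it, right_it = iter(left), iter(right)
    l = next(left_it, None)
    r = next(right_it, None)
    while l is not None or r is not None:
        if r is None or (l is not None and l[0] < r[0]):
            # Key exists only on the left side.
            yield (l[0], l[1], None)
            l = next(left_it, None)
        elif l is None or r[0] < l[0]:
            # Key exists only on the right side.
            yield (r[0], None, r[1])
            r = next(right_it, None)
        else:
            # Same key on both sides: report only if the revisions differ.
            if l[1] != r[1]:
                yield (l[0], l[1], r[1])
            l = next(left_it, None)
            r = next(right_it, None)


# Made-up sample rows standing in for the C1 and C2 view output.
c1_rows = [("a", "1-x"), ("b", "2-y"), ("d", "1-z")]
c2_rows = [("a", "1-x"), ("b", "3-y"), ("c", "1-w"), ("d", "1-z")]
for row in diff_sorted(c1_rows, c2_rows):
    print(row)
```

      Because it consumes both inputs lazily, memory use stays constant. The view's own collation may not match Python's string ordering, so the rows may need to be re-sorted with one common tool before comparing.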

      Checkpointing errors seen in logs
      ---------------------------------------------------
      [xdcr:error,2014-04-10T14:19:43.911,ns_1@172.23.105.44:<0.12433.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:218]Checkpointing failed due to remote vbopaque mismatch:
      {mismatch, [<<"264435226106585">>, <<"1397159965">>]}

      [xdcr:error,2014-04-10T14:19:43.912,ns_1@172.23.105.44:<0.12433.0>:xdc_vbucket_rep:start_replication:1000]checkpoint commit failure at start of replication for vb 264
      [xdcr:error,2014-04-10T14:19:43.912,ns_1@172.23.105.44:<0.12433.0>:xdc_vbucket_rep:terminate:534]Replication (XMem mode) `455e4a452e237ea5f4d86c543303b49c/standardbucket1/standardbucket1` (`standardbucket1/264` -> `http://*****@172.23.105.57:8092/standardbucket1%2f264%3be6756656d287a83af17925c49bd9c6e0`) failed.Please see ns_server debug log for complete state dump
      [xdcr:error,2014-04-10T14:19:47.443,ns_1@172.23.105.44:<0.12836.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:220]Checkpointing failed unexpectedly (or could be network problem): {error,500,
      <<"{\"error\":\"unexpected_reason\",\"reason\":\"killed\"}\n">>}
      [xdcr:error,2014-04-10T14:19:47.443,ns_1@172.23.105.44:<0.12836.0>:xdc_vbucket_rep:start_replication:1000]checkpoint commit failure at start of replication for vb 124
      [xdcr:error,2014-04-10T14:19:47.444,ns_1@172.23.105.44:<0.12836.0>:xdc_vbucket_rep:terminate:534]Replication (XMem mode) `455e4a452e237ea5f4d86c543303b49c/standardbucket1/standardbucket1` (`standardbucket1/124` -> `http://*****@172.23.105.54:8092/standardbucket1%2f124%3be6756656d287a83af17925c49bd9c6e0`) failed.Please see ns_server debug log for complete state dump
      [xdcr:error,2014-04-10T14:00:00.006,ns_1@172.23.105.44:<0.5889.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:218]Checkpointing failed due to remote vbopaque mismatch:
      {mismatch, [<<"11294974043546">>, <<"1397159968">>]}

      [xdcr:error,2014-04-10T12:34:55.818,ns_1@172.23.105.44:<0.6033.0>:xdc_vbucket_rep:handle_info:118]Error initializing vb replicator ({init_state,
      {rep,
      <<"455e4a452e237ea5f4d86c543303b49c/standardbucket1/standardbucket1">>,
      <<"standardbucket1">>,
      <<"/remoteClusters/455e4a452e237ea5f4d86c543303b49c/buckets/standardbucket1">>,
      "xmem",
      [{optimistic_replication_threshold,256},
      {worker_batch_size,500},
      {failure_restart_interval,30},
      {doc_batch_size_kb,2048},
      {checkpoint_interval,1800},
      {max_concurrent_reps,32},
      {connection_timeout,180},
      {worker_processes,4},
      {http_connections,20},
      {retries_per_request,2},
      {socket_options,
      [{keepalive,true},{nodelay,false}]},
      {pause_requested,false},
      {supervisor_max_r,25},
      {supervisor_max_t,5},
      {trace_dump_invprob,1000}]},
      113,"xmem",<0.6012.0>,<0.6013.0>,
      <0.6009.0>}):{exit,
      {shutdown,
      {gen_server,call,
      ['ns_memcached-standardbucket1',
      {stats, [118,98,117,99,107,101, 116,45,115,101,113,110, 111,32,"113"]},
      180000]}}}
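
      The `stats` argument in the gen_server call above is printed as an Erlang list of character codes followed by a string. A small Python sketch (the helper name `decode_iolist` is made up here) turns it back into readable text, which suggests the call was a `vbucket-seqno` stats request for vbucket 113:

```python
def decode_iolist(term):
    """Flatten an Erlang-style iolist: ints are character codes, strings are literal text."""
    parts = []
    for item in term:
        if isinstance(item, int):
            parts.append(chr(item))
        elif isinstance(item, str):
            parts.append(item)
        else:
            # Nested list: decode recursively.
            parts.append(decode_iolist(item))
    return "".join(parts)


# The argument exactly as it appears in the log entry above.
stats_arg = [118, 98, 117, 99, 107, 101, 116, 45, 115, 101, 113, 110, 111, 32, "113"]
print(decode_iolist(stats_arg))  # vbucket-seqno 113
```

      The trailing number matches the vbucket id (113) that appears in the replicator state of the same entry.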

      [xdcr:error,2014-04-10T12:59:12.767,ns_1@172.23.105.54:<0.5861.0>:xdc_vbucket_rep:terminate:507]Shutting xdcr vb replicator ({init_state,
      {rep,
      <<"e4e06be20b146e347092d4fb78ba36f5/standardbucket/standardbucket">>,
      <<"standardbucket">>,
      <<"/remoteClusters/e4e06be20b146e347092d4fb78ba36f5/buckets/standardbucket">>,
      "xmem",
      [{optimistic_replication_threshold,256},
      {worker_batch_size,500},
      {failure_restart_interval,30},
      {doc_batch_size_kb,2048},
      {checkpoint_interval,1800},
      {max_concurrent_reps,32},
      {connection_timeout,180},
      {worker_processes,4},
      {http_connections,20},
      {retries_per_request,2},
      {socket_options,
      [{keepalive,true},{nodelay,false}]},
      {pause_requested,false},
      {supervisor_max_r,25},
      {supervisor_max_t,5},
      {trace_dump_invprob,1000}]},
      24,"xmem",<0.5839.0>,<0.5840.0>,<0.5831.0>}) down without ever successfully initializing: {{{badmatch,
      {error,
      timeout}},
      [{mc_client_binary, cmd_vocal_recv, 5},
      {mc_client_binary, create_bucket, 4},
      {ns_memcached, ensure_bucket, 2},
      {ns_memcached, complete_connection_phase, 2},
      {ns_memcached, handle_cast, 2},
      {gen_server, 5},
      {ns_memcached, init, 1},
      {gen_server, init_it, 6}]},
      {gen_server,
      call,
      ['ns_memcached-standardbucket',
      {stats, [118, 98, 117, 99, 107, 101, 116, 45, 115, 101, 113, 110, 111, 32, "24"]},
      180000]}}

      To give a sense of how many "Checkpointing failed unexpectedly" errors are seen on one node:
      [root@172.23.105.44 logs]# grep -c "Checkpointing failed unexpectedly" xdcr_errors.1
      256
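
      To break the errors down by where they were raised rather than counting a single message, the entry headers can be tallied with a short Python script. This is a sketch that assumes every entry starts with a header shaped like the ones quoted in this ticket; the regular expression and the `tally` helper are illustrative, not part of any Couchbase tool.

```python
import re
from collections import Counter

# Header of a log entry as quoted in this ticket, e.g.
# [xdcr:error,2014-04-10T14:19:43.911,ns_1@172.23.105.44:<0.12433.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:218]
HEADER = re.compile(
    r"^\s*\[(?P<cat>\w+):(?P<level>\w+),(?P<ts>[^,]+),(?P<node>[^:]+):<[^>]+>:"
    r"(?P<module>\w+):(?P<function>\w+):(?P<line>\d+)\]"
)


def tally(lines):
    """Count log entries per (module, function, source line); non-header lines are skipped."""
    counts = Counter()
    for line in lines:
        m = HEADER.match(line)
        if m:
            counts[(m["module"], m["function"], int(m["line"]))] += 1
    return counts


# Three headers copied from the log excerpts in this ticket.
sample = [
    "[xdcr:error,2014-04-10T14:19:43.911,ns_1@172.23.105.44:<0.12433.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:218]Checkpointing failed due to remote vbopaque mismatch:",
    "[xdcr:error,2014-04-10T14:19:47.443,ns_1@172.23.105.44:<0.12836.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:220]Checkpointing failed unexpectedly (or could be network problem): {error,500,",
    "[xdcr:error,2014-04-10T14:00:00.006,ns_1@172.23.105.44:<0.5889.0>:xdc_vbucket_rep_ckpt:do_checkpoint_old:218]Checkpointing failed due to remote vbopaque mismatch:",
]
counts = tally(sample)
for site, n in counts.most_common():
    print(n, site)
```

      On a real node the same function could be fed an open file handle, for example `tally(open("xdcr_errors.1"))`, which separates the vbopaque mismatches (line 218) from the unexpected failures (line 220).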

      This could be a duplicate of MB-10792. If the attached logs confirm that, this ticket can be closed; in that case, since there is data loss, this problem should be a blocker.

      Cbcollect info
      --------------------

      Source - https://s3.amazonaws.com/bugdb/jira/MB-10844/source.tar.gz
      Destination - https://s3.amazonaws.com/bugdb/jira/MB-10844/dest.tar.gz

            People

              apiravi Aruna Piravi (Inactive)
              apiravi Aruna Piravi (Inactive)