Couchbase Server / MB-6919

Possibly incorrect OUTBOUND XDCR OPERATIONS seen on source cluster; very slow replication (very low number of sets) observed with number of items > 200M

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: XDCR
    • Security Level: Public
    • Labels: None
    • Environment:
      64bit ec2-ubuntu-12.04 LTS
      Build 1844

      Description

      Cluster set up:
      c1 : c2 :: 10 : 10 (10 nodes in each cluster)

      sbucket: c1 -> c2
      default: c2 -> c1

      >> Replication set up with continuous front end load
      >> Front end load for default = ~10K ops per sec
      >> Front end load for sbucket = ~4-5K ops per sec
      >> Average replication seen on c1 (for default): ~12-14K ops per sec
      >> Average replication seen on c2 (for sbucket): ~15-18K ops per sec

      At a particular snapshot, on C1:

      {With same amount of load (mixed), on bucket "sbucket"}

      No. of items: 214M
      No. of items in replication queue: 136M (way too high)
      Secs in replicating = 0 (!?)
      Secs in checkpointing = 385 (!?)
      Checkpoints issued = 79 (!?)

      These stats are for a cluster that has been up and running with continuous load and replication for ~65 hours.

      Also seen on the destination C2:

      {on bucket "sbucket"}

      Gets per sec: 19.2K
      Sets per sec: 347 (seems very low)

      Also seeing a number of these errors on the XDCR tab on the source:

      2012-10-15 19:17:50 - Error replicating vbucket 397: {http_request_failed, "POST", "http://Administrator:*****@ec2-175-41-177-173.ap-southeast-1.compute.amazonaws.com:8092/sbucket%2f397%3bc8731525718bcbdd0bf0382e420c453f/_revs_diff", {error, {error,timeout}}}

      2012-10-15 19:17:50 - Error replicating vbucket 381: {http_request_failed, "POST", "http://Administrator:*****@ec2-175-41-177-173.ap-southeast-1.compute.amazonaws.com:8092/sbucket%2f381%3bc8731525718bcbdd0bf0382e420c453f/_revs_diff", {error, {error,timeout}}}

      ....

      Load on sbucket with mcsoda:
      lib/perf_engines/mcsoda.py sbucket@ec2-50-18-140-172.us-west-1.compute.amazonaws.com:11211 vbuckets=1024 doc-gen=0 doc-cache=0 ratio-creates=1 ratio-sets=1 ratio-expirations=0.03 expiration=60 ratio-deletes=0.5 min-value-size=1000 threads=30 max-items=100000000 exit-after-creates=2 prefix=KEY1_ max-creates=100000000

      Load on default with cbworkloadgen:
      /opt/couchbase/bin/tools/cbworkloadgen -n ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:8091 -r .7 -i 400000000 -s 128 -t 30 -p KEY3_

      Killed the front-end load on both buckets just now; replication appears to be catching up.
      Live clusters:
      c1: http://ec2-50-18-140-172.us-west-1.compute.amazonaws.com:8091/
      c2: http://ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:8091/
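
      For reference, the per-bucket stats shown in the UI can also be pulled over the REST API; a minimal sketch against c1 (the admin password stays elided, and no specific stat names are assumed since they vary by build):

      # <password> is a placeholder for the elided admin password
      curl -s -u Administrator:<password> \
        "http://ec2-50-18-140-172.us-west-1.compute.amazonaws.com:8091/pools/default/buckets/sbucket/stats"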

      Attached diags grabbed from the orchestrator node on c1.

      1. Screen Shot 2012-10-15 at 11.49.00 AM.png
        96 kB
      2. Screen Shot 2012-10-15 at 11.49.29 AM.png
        98 kB
      3. Screen Shot 2012-10-16 at 2.23.33 PM.png
        114 kB
      4. Screen Shot 2012-10-16 at 2.25.04 PM.png
        57 kB

        Activity

        Junyi Xie (Inactive) added a comment -

        1. The stat "docs to replicate" is not the number of items accumulated in the changes queue; it is the stat reported by CouchDB (109M docs in your screenshot).
        The stat "docs in the queue" is the real number of docs in the queue, which is 1.33M. I think I explained this in my earlier email to xdcr-eng.

        Today, to reduce the memory overhead of XDCR, the queue is limited to a maximum of 4K items and 400KB per active vbucket. By some simple math, you can compute that
        the number of items in the queue per cluster is about 4K * 32 * 10 = ~1.3M items, which is consistent with your observation. This is expected behavior.
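
        A quick sanity check on that arithmetic (a shell one-liner, using the stated assumptions of 4K items per replicator, 32 concurrent replications per node, and 10 nodes):

        echo $(( 4000 * 32 * 10 ))   # prints 1280000, i.e. roughly 1.3M items queued cluster-wide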

        In your case, since you have large clusters of 10 nodes, you may try raising the number of concurrent replications from 32 to 100 to see whether it makes any difference.

        2. Your workload includes expiring items with a very short expiration time (60 seconds). Given the size of your workload, most items will likely expire before they are replicated to the remote cluster, which is why you see high XDCR ops but low sets/deletes. I strongly suggest you exclude expiring items from your test, since they just create a lot of confusion.
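
        For example, the mcsoda command from the description could be re-run with expirations disabled; a sketch, assuming ratio-expirations=0 suppresses expiring items (the expiration flag is dropped, all other flags unchanged):

        lib/perf_engines/mcsoda.py sbucket@ec2-50-18-140-172.us-west-1.compute.amazonaws.com:11211 \
          vbuckets=1024 doc-gen=0 doc-cache=0 ratio-creates=1 ratio-sets=1 \
          ratio-expirations=0 ratio-deletes=0.5 min-value-size=1000 threads=30 \
          max-items=100000000 exit-after-creates=2 prefix=KEY1_ max-creates=100000000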

        Without further information, this is pretty much all I can say. Please let me know how to log onto the EC2 node and how to access the logs.

        Abhinav Dangeti added a comment -

        Understood. To log into the EC2 nodes: for the west coast cluster you can use QAKey.pem:
        ssh -i QAKey.pem ubuntu@ec2-50-18-140-172.us-west-1.compute.amazonaws.com
        and for the southeast cluster, SingaporeQAKey.pem:
        ssh -i SingaporeQAKey.pem ubuntu@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com

        I will email you both keys.
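
        Once logged in, one way to grab the logs asked about above is cbcollect_info; a sketch, assuming the default Linux install path on these nodes:

        ssh -i QAKey.pem ubuntu@ec2-50-18-140-172.us-west-1.compute.amazonaws.com
        # on the node, collect a full diagnostics/log archive into /tmp
        sudo /opt/couchbase/bin/cbcollect_info /tmp/c1-node-diag.zip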

        Junyi Xie (Inactive) added a comment -

        On the southeast cluster, where there is an ongoing XDCR with high gets but low sets/deletes, I saw many cases of the source losing conflict resolution, consistent with the stats on the UI.

        Please remove the expiring items from your workload; otherwise we cannot determine whether this is a bug or not.

        [xdcr:debug,2012-10-15T22:16:07.067,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.24978.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
        [xdcr:debug,2012-10-15T22:16:07.071,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.25039.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
        [xdcr:debug,2012-10-15T22:16:07.074,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.24910.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
        [xdcr:debug,2012-10-15T22:16:07.078,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.24916.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
        [xdcr:debug,2012-10-15T22:16:07.081,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.25327.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
        [xdcr:debug,2012-10-15T22:16:07.225,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.24889.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
        [xdcr:debug,2012-10-15T22:16:07.230,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.25085.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
        [xdcr:debug,2012-10-15T22:16:14.582,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.25015.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
        [xdcr:debug,2012-10-15T22:16:22.882,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.25084.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
        [xdcr:debug,2012-10-15T22:16:29.451,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:<0.24913.2227>:capi_replication:get_missing_revs:43]after conflict resolution for 500 docs, num of remote winners is 0 and number of local winners is 500.
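
        To quantify how often the source loses conflict resolution, these lines can be grepped directly on a destination node; a sketch, assuming the default log directory and that the xdcr debug output lands in xdcr.log:

        # count conflict-resolution lines; adjust the file name if xdcr debug output goes elsewhere
        grep -c "after conflict resolution" /opt/couchbase/var/lib/couchbase/logs/xdcr.log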

        Ketaki Gangal added a comment -

        Seeing this with a bidirectional setup as well:

        Num checkpoints = 0
        Num seconds in checkpointing = 0

        Adding screenshots from both clusters.

        Junyi Xie (Inactive) added a comment -

        http://review.couchbase.org/#/c/21613/

          People

          • Assignee: Junyi Xie (Inactive)
          • Reporter: Abhinav Dangeti