Couchbase Server / MB-6563

Replication ops/sec drops to 0 on stop/start of load from the source cluster. [Load with only creates looks okay; load with expired items causes the replication rate to drop very low.]

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Duplicate
    • Affects Version/s: 2.0-beta
    • Fix Version/s: 2.0-beta-2
    • Component/s: XDCR
    • Security Level: Public
    • Labels:
      None
    • Environment:
      Build - 2.0-1696
      vbuckets 1024
      unidirectional replication.
      4G, 4 Core machines.
      No Swap

      Description

      • Set up a 2:3 unidirectional replication between two clusters.
      • Start a mixed load on the source cluster.

      On the initial start of replication, XDC ops/sec and creates/sec look very good, ranging from 4k-8k ops/sec.
      Stop the load on the source.

      Start the load on the source again (note: these writes are now treated as updates).

      • Now seeing 8-10k XDC ops/sec on the destination cluster, but almost no updates/creates being applied there (only 0-30 items updated).

      This can be reproduced with:
      nohup lib/perf_engines/mcsoda.py localhost:41208 vbuckets=1024 doc-gen=0 doc-cache=0 ratio-creates=1 ratio-sets=1 ratio-expirations=0.03 expiration=30 ratio-deletes=0.04 min-value-size=2,3 max-items=2000000 exit-after-creates=0 prefix=k_two&
      nohup lib/perf_engines/mcsoda.py localhost:41208 vbuckets=1024 doc-gen=0 doc-cache=0 ratio-creates=1 ratio-sets=1 ratio-expirations=0.03 expiration=30 ratio-deletes=0.04 min-value-size=2,3 max-items=2000000 exit-after-creates=0 prefix=k_one&

      (It reproduces both with and without updates/deletes; a stripped-down sketch of the same load shape follows.)
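
      The mcsoda flags above describe a create-heavy load in which a small fraction of items get a 30-second TTL (ratio-expirations=0.03, expiration=30) and some keys are deleted again (ratio-deletes=0.04). A minimal sketch of an equivalent load, assuming the source bucket is reachable through a moxi/memcached port (host, port, item count and key prefix below are placeholders, not taken from this setup):

      # Not mcsoda itself: a stripped-down load with the same shape --
      # plain sets, a small fraction of which expire 30 seconds later.
      import random
      import memcache  # python-memcached

      SOURCE = "10.3.121.31:11211"   # assumed moxi/memcached port for the bucket
      NUM_ITEMS = 10000              # far smaller than the 2M-item mcsoda runs
      RATIO_EXPIRATIONS = 0.03       # ~3% of sets get a TTL, as in the repro flags
      RATIO_DELETES = 0.04           # ~4% of keys are deleted again
      EXPIRATION_SECS = 30

      mc = memcache.Client([SOURCE])
      for i in range(NUM_ITEMS):
          key = "k_two_%d" % i
          ttl = EXPIRATION_SECS if random.random() < RATIO_EXPIRATIONS else 0
          mc.set(key, "x" * random.randint(2, 3), time=ttl)  # tiny values, like min-value-size=2,3
          if random.random() < RATIO_DELETES:
              mc.delete(key)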

      Note: For about 10-15 minutes there were no crash reports on the source side. After that, "unable to POST" error messages started appearing on the source side.

      Source logs show

        Reason for termination ==
        {http_request_failed,"POST",
            "http://Administrator:*****@10.3.121.38:8092/default%2f907%3bdc8cfd8bf825ca8adece5b7387af2afd/_bulk_docs",
            {error,{error,timeout}}}

          [error_logger:error,2012-09-07T13:11:25.697,ns_1@10.3.121.32:error_logger:ale_error_logger_handler:log_report:72]
          =========================CRASH REPORT=========================
          crasher:
          initial call: xdc_vbucket_rep:init/1
          pid: <0.7311.2>
          registered_name: []
          exception exit: {http_request_failed,"POST",
              "http://Administrator:*****@10.3.121.38:8092/default%2f907%3bdc8cfd8bf825ca8adece5b7387af2afd/_bulk_docs",
              {error,{error,timeout}}}
          in function gen_server:terminate/6
          ancestors: [<0.6821.2>,<0.6816.2>,xdc_replication_sup,ns_server_sup,
          ns_server_cluster_sup,<0.60.0>]
          messages: []
          links: [<0.6821.2>]
          dictionary: []
          trap_exit: true
          status: running
          heap_size: 28657
          stack_size: 24
          reductions: 247791
          neighbours:
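
      A quick way to gauge how widespread these failures are, rather than eyeballing individual crash reports, is to count the http_request_failed POST timeouts per CAPI endpoint in the collected source-side logs. A small sketch, assuming the logs are available as plain text files (the default file name is a placeholder):

      # Count XDCR POST timeouts in the source ns_server logs, grouped by
      # the CAPI endpoint they were hitting (_bulk_docs vs _revs_diff).
      import re
      import sys
      from collections import Counter

      pattern = re.compile(
          r'http_request_failed,"POST",\s*"[^"]*/(_bulk_docs|_revs_diff)"')

      counts = Counter()
      for path in sys.argv[1:] or ["ns_server.error.log"]:   # placeholder file name
          with open(path) as f:
              counts.update(pattern.findall(f.read()))

      for endpoint, n in counts.most_common():
          print("%s timeouts: %d" % (endpoint, n))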

      Adding atop output from the source; it shows no major CPU contention.

      Adding a screenshot from the destination.

      Live cluster: 10.3.121.31 (source), 10.3.121.38 (destination).

      Adding ns_server logs.


        Activity

        Ketaki Gangal added a comment -

        Tested with an expired-items load; seeing very high XDC ops/sec (12-13k) and extremely low XDC sets/sec (200-400) across the destination cluster.

        • CPU on the destination is moderate, ~60-80% across the nodes.
        • Load is stopped at the source.
        • Restarted every node on the destination incrementally (one by one); saw a spike in sets/sec, which then settles back to 300-400 XDC sets/sec, with a consistent 12-13k XDC ops/sec throughout.
        • Restarted nodes on the source; no major improvement in XDC sets/sec.
        • Rebooted a node on the destination; no major improvement in XDC sets/sec.

        If there are 2 gets for every set, why do we see very high XDC ops/sec versus very low XDC sets/sec, especially with expired items? (A way to watch this gap on the destination is sketched after the workload commands below.)

        Workload used
        -nohup lib/perf_engines/mcsoda.py localhost:23205 vbuckets=1024 doc-gen=0 doc-cache=0 ratio-creates=1 ratio-sets=1 ratio-expirations=0.1 expiration=60 ratio-deletes=0.02 min-value-size=3,5 max-items=2000000 exit-after-creates=1 prefix=a_two&
        -nohup lib/perf_engines/mcsoda.py localhost:23204 vbuckets=1024 doc-gen=0 doc-cache=0 ratio-creates=0.8 ratio-sets=0.3 ratio-expirations=0.1 expiration=60 ratio-deletes=0.02 min-value-size=3,5 max-items=2000000 exit-after-creates=0 prefix=a_one&
        -lib/perf_engines/mcsoda.py localhost:23204 vbuckets=1024 doc-gen=0 doc-cache=0 ratio-creates=1 ratio-sets=1 ratio-expirations=0.5 expiration=60 min-value-size=3,5 max-items=2000000 exit-after-creates=1 prefix=a_one&
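
        One way to watch the gap described above is to poll the destination bucket's stats REST endpoint and print the incoming-XDCR total next to the metadata reads and sets actually applied. This is only a sketch: the host, credentials, the stats payload shape (op.samples from /pools/default/buckets/<bucket>/stats) and the exact stat names ("xdc_ops", "ep_num_ops_get_meta", "ep_num_ops_set_meta") are assumptions, not confirmed from this cluster.

        # Poll destination bucket stats and print incoming XDCR ops vs.
        # metadata gets and sets applied (stat names assumed, see above).
        import time
        import requests

        DEST = "http://10.3.121.38:8091"          # destination node
        AUTH = ("Administrator", "password")      # placeholder credentials
        URL = "%s/pools/default/buckets/default/stats" % DEST

        while True:
            samples = requests.get(URL, auth=AUTH).json()["op"]["samples"]
            latest = lambda name: samples.get(name, [0])[-1]
            print("xdc_ops=%s get_meta=%s set_meta=%s" % (
                latest("xdc_ops"),
                latest("ep_num_ops_get_meta"),
                latest("ep_num_ops_set_meta")))
            time.sleep(5)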

        Ketaki Gangal added a comment -

        Screenshot of XDC ops from the XDCR section on the destination.

        Ketaki Gangal added a comment -

        There are continuous timeouts in the source logs; a reachability check against the destination CAPI port is sketched after the crash report below.

        =========================CRASH REPORT=========================
        crasher:
        initial call: xdc_vbucket_rep:init/1
        pid: <0.13992.2>
        registered_name: []
        exception exit: {http_request_failed,"POST",
            "http://Administrator:*****@10.3.121.36:8092/saslbucket%2f454%3b4dba968ce61b9e1f6f39cf7c796bc50b/_revs_diff",
            {error,{error,timeout}}}
        in function gen_server:terminate/6
        ancestors: [<0.7831.0>,<0.7767.0>,xdc_replication_sup,ns_server_sup,
        ns_server_cluster_sup,<0.60.0>]
        messages: []
        links: [<0.14232.2>,<0.14234.2>,<0.14235.2>,<0.14233.2>,<0.14230.2>,
        <0.14231.2>,<0.7831.0>]
        dictionary: []
        trap_exit: true
        status: running
        heap_size: 75025
        stack_size: 24
        reductions: 280322
        neighbours:
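
        Given that both _bulk_docs and _revs_diff POSTs are timing out, it may be worth checking whether the destination's CAPI port (8092) answers at all within the XDCR timeout window while the crashes occur. The sketch below only times a bare HTTP request per node; it does not speak the _revs_diff protocol, so even an error response still shows the endpoint is reachable and responsive. The node list and credentials are placeholders.

        # Time a bare request against each destination node's CAPI port (8092).
        import time
        import requests

        DEST_NODES = ["10.3.121.36", "10.3.121.38"]   # destination nodes seen in the logs
        AUTH = ("Administrator", "password")          # placeholder credentials

        for node in DEST_NODES:
            url = "http://%s:8092/" % node
            start = time.time()
            try:
                resp = requests.get(url, auth=AUTH, timeout=30)
                print("%s -> HTTP %d in %.1fs" % (node, resp.status_code, time.time() - start))
            except requests.exceptions.RequestException as exc:
                print("%s -> failed after %.1fs: %s" % (node, time.time() - start, exc))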

        Ketaki Gangal added a comment -

        Screenshot, by the hour.

        Ketaki Gangal added a comment -

        Closing this one. More information at http://www.couchbase.com/issues/browse/MB-6662


          People

          • Assignee:
            Ketaki Gangal
            Reporter:
            Ketaki Gangal
          • Votes:
            0
            Watchers:
            0

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes