Couchbase Server / MB-7321

XDCR: constant crashes/time-outs/mb_master restarts during perf. tests on Windows

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0.1
    • Component/s: ns_server, XDCR
    • Security Level: Public
    • Labels:
      None
    • Environment:
      VMs, Windows 64-bit, 24GB, 4 cores
      build 1969

      Description

      2 <-> 2 nodes, 2 buckets per cluster, unidirectional replication
      4K ops/sec/cluster (50/50 gets/sets), no views


        Activity

        pavelpaulau Pavel Paulau created issue -
        pavelpaulau Pavel Paulau added a comment - diags:
        https://s3.amazonaws.com/bugdb/jira/MB-7321/d052ea5a/192.168.162.30-1222012-849-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-7321/d052ea5a/192.168.162.31-1222012-855-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-7321/d052ea5a/192.168.162.32-1222012-852-diag.zip
        https://s3.amazonaws.com/bugdb/jira/MB-7321/d052ea5a/192.168.162.33-1222012-858-diag.zip
        pavelpaulau Pavel Paulau made changes -
        Field Original Value New Value
        Assignee Pavel Paulau [ pavelpaulau ] Junyi Xie [ junyi ]
        farshid Farshid Ghods (Inactive) added a comment -

        Pavel,

        to understand the severity of this issue:

        Does this impact the rate of replication? Is the cluster usable?
        Does this problem go away when you reduce the load on the cluster?

        pavelpaulau Pavel Paulau added a comment -

        It does have an impact, but only slightly and for a short period of time; for the given light workload it results in small queue spikes.

        Symptoms are similar to the issues seen with scheduler threads, even though async threads were enabled in recent builds.

        abhinav Abhinav Dangeti added a comment -

        Do you see these timeouts right at the start, when you have just set up the replication?
        For example, there would be XDCR errors if replication is set up immediately after creating a replication reference, and you would be seeing "Failures in grabbing vbucket stats".

        I tried reproducing your scenario, but in my case I started replication a couple of minutes after I set up the replication reference, and I noticed no crashes or drop in the replication rate at any point until the finish. My load had a similar sets-to-gets ratio as well.

        I am not sure what fixed delay we need to give the cluster between setting up the replication reference and actually starting the replication, but if we do give it a couple of minutes, I am pretty sure we shouldn't see any XDCR errors in grabbing vbucket stats.
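
        Rather than a fixed sleep, a test harness could poll for readiness with an upper bound before starting replication. The following is only a minimal sketch of that idea: `is_ready` is a hypothetical callable supplied by the harness (for instance, one that checks whether vbucket stats can be fetched), and the 120-second default merely mirrors the "couple of minutes" mentioned above. No real Couchbase API is used or implied.

```python
import time


def wait_until_ready(is_ready, timeout_s=120.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic):
    """Poll is_ready() until it returns True or timeout_s has elapsed.

    is_ready is a hypothetical, caller-supplied check; sleep and clock are
    injectable so the helper can be exercised without real waiting.
    Returns True if the check passed, False if the deadline was reached.
    """
    deadline = clock() + timeout_s
    while True:
        if is_ready():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

        A harness would call `wait_until_ready(check)` after creating the replication reference and start replication only if it returns True.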

        farshid Farshid Ghods (Inactive) made changes -
        Fix Version/s 2.0.1 [ 10399 ]
        Fix Version/s 2.0 [ 10114 ]
        pavelpaulau Pavel Paulau added a comment -

        No, it happens after 4-5 hours of test run time.

        The last run was the most troubling so far: 2 nodes were still marked as down even after the test. There are logs here if you are curious:
        http://qa.hq.northscale.net/job/xperf-win/32/

        junyi Junyi Xie (Inactive) added a comment - - edited

        XDCR saw a lot of timeouts from the underlying ns_server and ep_engine, and therefore a lot of vb replicators crashed, as expected. Such timeouts are not expected at this scale of test.

        Pavel mentioned this happened for a short period of time. What happened to ns_server or ep_engine during that time to make them so slow, and even too busy to serve XDCR requests?

        I am not sure what I can fix on the XDCR side. It looks to me like the ep_engine or ns_server team needs to triage the issue.

        [xdcr:error,2012-12-02T8:16:20.578,ns_1@10.2.3.33:<0.15416.0>:xdc_vbucket_rep:terminate:298]Replication `41d1101e89da1d590261faef5067a4e8/bucket-1/bucket-1` (`bucket-1/412` -> `http://Administrator:password@10.2.3.31:8092/bucket-1%2f412%3b2b6f9272ff82c9cc0dfcdd22ce77b9d6`) failed: {timeout,{gen_server,call,[ns_config,get]}}

        [xdcr:error,2012-12-02T8:20:01.874,ns_1@10.2.3.33:<0.1527.128>:capi_replication:update_replicated_docs:100][Bucket:"bucket-0", Vb:505]: update 170 docs takes too long to finish!(total time spent: 189 secs, default connection time out: 180 secs)
        [xdcr:error,2012-12-02T8:20:11.406,ns_1@10.2.3.33:<0.1531.128>:capi_replication:update_replicated_docs:100][Bucket:"bucket-0", Vb:432]: update 172 docs takes too long to finish!(total time spent: 199 secs, default connection time out: 180 secs)

        [xdcr:error,2012-12-02T8:33:13.109,ns_1@10.2.3.33:<0.14775.128>:xdc_vbucket_rep:terminate:284]Shutting xdcr vb replicator ({init_state,
            {rep,
                <<"41d1101e89da1d590261faef5067a4e8/bucket-1/bucket-1">>,
                <<"bucket-1">>,
                <<"/remoteClusters/41d1101e89da1d590261faef5067a4e8/buckets/bucket-1">>,
                [{connection_timeout,180000},
                 {continuous,true},
                 {http_connections,20},
                 {retries,2},
                 {socket_options,[{keepalive,true},{nodelay,false}]},
                 {worker_batch_size,500},
                 {worker_processes,4}]},
            488,<0.15362.0>,<0.15363.0>,<0.15357.0>}) down without ever successfully initializing: {badmatch,
            {error,
                all_nodes_failed,
                <<"Failed to grab remote bucket info from any of known nodes">>}}

        junyi Junyi Xie (Inactive) made changes -
        Assignee Junyi Xie [ junyi ] Pavel Paulau [ pavelpaulau ]
        junyi Junyi Xie (Inactive) added a comment -

        From the log above, XDCR saw lots of timeouts at different stages of the replicator, e.g., fetching the ns_config parameter during initialization, posting data during replication, and even fetching the remote vbucket map.
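
        For a longer diag log, the same per-stage breakdown could be tallied mechanically. The sketch below is an illustration only: the matching rules are inferred from the few xdcr:error entries quoted in this ticket, not from any official log grammar, and it assumes that lines wrapped with a trailing backslash have already been rejoined. The stage labels and function names are made up for this example.

```python
import re
from collections import Counter

# Inferred from the "takes too long to finish" entries quoted in this ticket.
SLOW_UPDATE = re.compile(
    r"total time spent: (?P<spent>\d+) secs, "
    r"default connection time out: (?P<limit>\d+) secs"
)
ENTRY_PREFIX = "[xdcr:error,"


def split_entries(log_text):
    """Split raw log text into entries, each starting at an xdcr:error header."""
    return [ENTRY_PREFIX + chunk for chunk in log_text.split(ENTRY_PREFIX)[1:]]


def classify(entry):
    """Map one log entry to a stage label (labels are illustrative only)."""
    if "{timeout,{gen_server,call,[ns_config,get]}}" in entry:
        return "ns_config fetch timeout"
    if SLOW_UPDATE.search(entry):
        return "slow document update"
    if "all_nodes_failed" in entry:
        return "remote bucket info unavailable"
    return "other"


def summarize(log_text):
    """Count xdcr:error entries per stage label."""
    return Counter(classify(entry) for entry in split_entries(log_text))


def overrun_seconds(entry):
    """Seconds by which a slow update exceeded the connection timeout, else None."""
    match = SLOW_UPDATE.search(entry)
    if match is None:
        return None
    return int(match.group("spent")) - int(match.group("limit"))
```

        Running `summarize` over a node's log would show at a glance whether one stage dominates, and `overrun_seconds` shows how far past the 180-second connection timeout each slow update went.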

        pavelpaulau Pavel Paulau made changes -
        Assignee Pavel Paulau [ pavelpaulau ] Aleksey Kondratenko [ alkondratenko ]
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Please, be more specific about what exactly you want me to help with.

        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Assignee Aleksey Kondratenko [ alkondratenko ] Pavel Paulau [ pavelpaulau ]
        pavelpaulau Pavel Paulau added a comment -

        From Junyi:

        "Looks to me ep_engine or ns_server team need to triage the issue."

        Both the XDCR and ns_server teams gave the runaround. You are the last candidate.

        pavelpaulau Pavel Paulau made changes -
        Assignee Pavel Paulau [ pavelpaulau ] Chiyoung Seo [ chiyoung ]
        farshid Farshid Ghods (Inactive) made changes -
        Assignee Chiyoung Seo [ chiyoung ] Rahim Yaseen [ yaseen ]
        farshid Farshid Ghods (Inactive) added a comment -

        per bug scrub
        please rerun with the latest 2.0.1 build

        farshid Farshid Ghods (Inactive) made changes -
        Assignee Rahim Yaseen [ yaseen ] Pavel Paulau [ pavelpaulau ]
        farshid Farshid Ghods (Inactive) added a comment -

        per bug scrub

        Ronnie,

        are there results available from XDCR performance testing?

        farshid Farshid Ghods (Inactive) made changes -
        Assignee Pavel Paulau [ pavelpaulau ] Ronnie Sun [ ronnie ]
        ronnie Ronnie Sun (Inactive) added a comment -

        I don't think so. Reassigning to Pavel.

        Hi Pavel,

        Is there a place we summarize xdcr results?

        Thanks,
        Ronnie

        ronnie Ronnie Sun (Inactive) made changes -
        Assignee Ronnie Sun [ ronnie ] Pavel Paulau [ pavelpaulau ]
        pavelpaulau Pavel Paulau added a comment -

        Not reproduced in 2.0.1 so far.

        pavelpaulau Pavel Paulau made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Cannot Reproduce [ 5 ]
        pavelpaulau Pavel Paulau made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            pavelpaulau Pavel Paulau
            Reporter:
            pavelpaulau Pavel Paulau