  1. Couchbase Server
  2. MB-7756

[XDCR only] Node crash on bidirectional replication on Windows 2.0.1

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.1
    • Component/s: couchbase-bucket, ns_server
    • Security Level: Public
    • Labels:
    • Environment:
      2.0.1-153-rel
      1 bucket, Bidirectional replication between 2 clusters.

      Description

      • Unidirectional replication of 10M items on a 4:4 node Windows cluster: OK for 1+ day.
      • Front end loads are 2k, 5k on cluster1 and cluster2.
      • After some initial replication, one of the nodes on cluster2 is down.

      Adding logs from the clusters.

      • This is an XDCR-only test case.

        Activity

        junyi Junyi Xie (Inactive) added a comment -

        According to the ns_server and ep_engine folks, the node crash in ns_server and ep_engine is somehow "expected".

        I do not know if there is anything I can do from the XDCR perspective.

        jin Jin Lim (Inactive) added a comment -

        Based on Mike's comments above and Alk's debugging, memcached complained about enomem for requests being sent:

        [couchdb:error,2013-02-15T0:59:17.950,ns_1@10.3.2.12:<0.5411.9>:couch_log:error:42] Uncaught error in HTTP request: {error,{case_clause,{memcached_error,enomem,undefined}}}

        We may want to rerun the test with more RAM, or have one of our engineers take a look at the live cluster that is still showing the node crash. QE, please advise on this. Thanks.
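
To make a rerun easier to compare against this one, the enomem errors could be tallied per node from the collected logs. The following is only a minimal sketch, not part of any Couchbase tooling: the function name `count_enomem` and the header regular expression are assumptions derived solely from the single log line quoted above, and real log entries may vary.

```python
import re
from collections import Counter

# Assumed entry header shape, taken from the one log line quoted above:
# [component:level,timestamp,node:<pid>:module:function:line]
HEADER = re.compile(
    r"\[(?P<component>[^:\]]+):(?P<level>[^,\]]+),"
    r"(?P<timestamp>[^,\]]+),(?P<node>[^:\]]+):"
)

# The Erlang error term seen in the quoted entry.
ENOMEM = re.compile(r"memcached_error,\s*enomem")


def count_enomem(lines):
    """Count memcached enomem errors per node (hypothetical helper).

    An entry may span several lines, so the node from the most recent
    header is remembered and used for any enomem term that follows it.
    """
    counts = Counter()
    current_node = None
    for line in lines:
        header = HEADER.search(line)
        if header:
            current_node = header.group("node")
        if ENOMEM.search(line):
            counts[current_node] += 1
    return counts
```

The function accepts any iterable of lines, so an open log file can be passed directly. A count that grows with the front-end load would point toward memory pressure, but it does not by itself rule out other causes of the timeouts.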

        Aliaksey Artamonau Aliaksey Artamonau added a comment -

        CC Siri

        Siri, can you suggest what QE should look at while running the test to discriminate between memory pressure and other causes of the timeouts? It would be even better if we could collect this information in ns_server.

        farshid Farshid Ghods (Inactive) added a comment -

        QE is going to rerun this test on an EC2 cluster and see if this issue is reproducible there as well.

        farshid Farshid Ghods (Inactive) added a comment -

        Unable to reproduce on EC2.


          People

          • Assignee: Ketaki Gangal
          • Reporter: Ketaki Gangal
          • Votes: 0
          • Watchers: 6
