Couchbase Server: MB-7129

Multiple nodes go down with an Erlang crash (crash.dump available) in a 10:10 XDCR setup; Erlang possibly hung on a couple of other nodes as well, all in the same cluster

    Details

      Description

      • Front-end loads running for biXDCR_bucket on C1 and C2, and for uniXDCR_src on C1, with replication in progress
      • On C2:
        • 3 nodes down, with erl_crash.dump files generated (will be attached)
        • 2 nodes with Erlang possibly hung, in pend state (in top, beam.smp keeps appearing and disappearing, using up 1.0G of resident memory; no cores generated, no erl_crash.dump files; memcached seems to still be running)
        • Unable to grab diags off any of these nodes
      • Result: all items in biXDCR_bucket on C2 lost; half the items in uniXDCR_dest on C2 lost

      Noticed a whole bunch of these crash reports on one of the "Pending" nodes on C2:

        • Reason for termination ==
          {noproc,
           {gen_server,call,
            [remote_clusters_info,
             {get_remote_bucket,
              [{hostname,"ec2-177-71-147-19.sa-east-1.compute.amazonaws.com:8091"},
               {uuid,<<"0b3a63d5d8805e0c6670c619cc346299">>},
               {name,"SANPAULO (C2)"},
               {username,"Administrator"},
               {password,"password"}],
              "biXDCR_bucket",false,30000},
             infinity]}}

          [error_logger:error,2012-11-07T5:57:56.025,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
          =========================CRASH REPORT=========================
          crasher:
          initial call: xdc_vbucket_rep:init/1
          pid: <0.28161.8>
          registered_name: []
          exception exit: {noproc,
          {gen_server,call,
          [remote_clusters_info,
          {get_remote_bucket,
          [{hostname, "ec2-177-71-147-19.sa-east-1.compute.amazonaws.com:8091"},
          {uuid, <<"0b3a63d5d8805e0c6670c619cc346299">>},
          {name,"SANPAULO (C2)"}, {username,"Administrator"}, {password,"password"}],
          "biXDCR_bucket",false,30000},
          infinity]}}
          in function gen_server:terminate/6
          ancestors: [<0.3608.5>,<0.3603.5>,xdc_replication_sup,ns_server_sup,
          ns_server_cluster_sup,<0.64.0>]
          messages: []
          links: [<0.3608.5>]
          dictionary: []
          trap_exit: true
          status: running
          heap_size: 514229
          stack_size: 24
          reductions: 35035
          neighbours:
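The noproc exit above means the registered remote_clusters_info gen_server was no longer alive when xdc_vbucket_rep tried to call it. A minimal sketch (illustrative only, not Couchbase source; the module and request term are made up) of how gen_server:call to a dead or unregistered name produces exactly this exit shape:

%% Illustrative only -- not Couchbase source. Calling a gen_server by a
%% registered name that is not alive exits the caller with
%% {noproc, {gen_server,call,[Name, Request, Timeout]}}, the same shape
%% seen in the crash report above.
-module(noproc_demo).
-export([run/0]).

run() ->
    %% remote_clusters_info is not registered in this demo, so the call
    %% raises a noproc exit, which we catch and return for inspection.
    try
        gen_server:call(remote_clusters_info, {get_remote_bucket, []}, 30000)
    catch
        exit:{noproc, _} = Reason ->
            {caught, Reason}
    end.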

        • Reason for termination ==
        • killed

      [error_logger:error,2012-11-07T5:58:41.704,ns_1@ec2-54-251-5-97.ap-southeast-1.compute.amazonaws.com:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: couch_db:init/1
      pid: <0.19405.4>
      registered_name: []
      exception exit: killed
      in function gen_server:terminate/6
      ancestors: [couch_server,couch_primary_services,couch_server_sup,
      cb_couch_sup,ns_server_cluster_sup,<0.64.0>]
      messages: []
      links: []
      dictionary: []
      trap_exit: true
      status: running
      heap_size: 1597
      stack_size: 24
      reductions: 11968
      neighbours:

      Attached are the grabbed diags from one of the non-down nodes on C2.

      1. ec2-122-248-217-156.ap-southeast-1.compute.amazonaws.com-8091-diag.txt.gz
        4.27 MB
        Abhinav Dangeti

        Activity

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        In MB-7115 a seemingly identical issue is causing the system to run out of fds before it could die, perhaps because there are more fds spent on vbuckets, as it was apparently just 2 nodes.
        steve Steve Yen added a comment -

        From bug-scrub: this also reproduces on 2:2.
        steve Steve Yen added a comment -

        The bug fix for MB-7133 should make this very unlikely.
        steve Steve Yen added a comment -

        Per bug-scrub: for repro.
        junyi Junyi Xie (Inactive) added a comment -

        Talked to Abhinav; the issue was not reproduced in his latest large-scale 10:10 test (bidir + unidir XDCR).

          People

          • Assignee:
            abhinav Abhinav Dangeti
            Reporter:
            abhinav Abhinav Dangeti
          • Votes:
            0
            Watchers:
            4


              Gerrit Reviews

              There are no open Gerrit changes