Couchbase Server - MB-7697

[system test] Severe timeouts on the destination cluster after a remote cluster reference to it is created. The destination node goes into pending state, and buckets shut down intermittently.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Incomplete
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0.1, 2.1.0
    • Component/s: None
    • Security Level: Public
    • Labels:
    • Environment:
      CentOS 5.x 64-bit

      Description

      Environment:

      • Source:
        Each node has a 4-core CPU and 4 GB RAM.
        Install Couchbase Server 2.0.0-1976 on a 2-node cluster.
        The 2-node cluster uses host names (not IPs); one node uses the default data path and the other a custom data path.
        Create 2 buckets: one default (2 GB) bucket with one replica and one sasl (1.1 GB) bucket with two replicas.
      • Destination:
        Each node has a 4-core CPU and 4 GB RAM.
        Install Couchbase Server 2.0.0-1976 on a 2-node cluster.
        The 2-node cluster uses IPs; one node uses the default data path and the other a custom data path.
        Create 2 buckets: one default (2 GB) bucket with one replica and one sasl (1.1 GB) bucket with two replicas.

      At the source cluster, create one doc with 3 views.
      Buckets at source and destination are empty.
      At the source cluster, create replication from source to destination; buckets at the destination were then seen shutting down (a scripted version of this setup follows below).
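
      For reference, the setup above can also be scripted against the REST API. The sketch below (Python with requests) is illustrative only: the source host name, admin credentials, quotas, and bucket parameters are assumptions and not taken from the actual test harness; only 10.3.3.7 comes from the logs.

      import requests

      SRC = "http://src-node1.example.com:8091"   # source cluster entry point (assumed host name)
      DST = "http://10.3.3.7:8091"                # destination cluster entry point (IP from the logs)
      AUTH = ("Administrator", "password")        # assumed admin credentials

      def create_bucket(base, name, quota_mb, replicas, password=""):
          # Bucket-create parameters mirroring the environment described above.
          params = {"name": name,
                    "ramQuotaMB": quota_mb,
                    "replicaNumber": replicas,
                    "bucketType": "couchbase",
                    "authType": "sasl",
                    "saslPassword": password}
          r = requests.post(base + "/pools/default/buckets", data=params, auth=AUTH)
          r.raise_for_status()

      # Same two buckets on source and destination:
      # default 2 GB / 1 replica, sasl 1.1 GB / 2 replicas.
      for base in (SRC, DST):
          create_bucket(base, "default", 2048, 1)
          create_bucket(base, "sasl", 1126, 2, password="password")

      # Remote cluster reference on the source pointing at the destination,
      # then continuous XDCR replication for each bucket.
      requests.post(SRC + "/pools/default/remoteClusters",
                    data={"name": "dest", "hostname": "10.3.3.7:8091",
                          "username": AUTH[0], "password": AUTH[1]},
                    auth=AUTH).raise_for_status()
      for bucket in ("default", "sasl"):
          requests.post(SRC + "/controller/createReplication",
                        data={"fromBucket": bucket, "toCluster": "dest",
                              "toBucket": bucket, "replicationType": "continuous"},
                        auth=AUTH).raise_for_status()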

      [user:info,2013-02-06T13:11:25.236,ns_1@10.3.3.9:ns_memcached-sasl<0.2984.1>:ns_memcached:terminate:661]Shutting down bucket "sasl" on 'ns_1@10.3.3.9' for server shutdown
      [ns_server:info,2013-02-06T13:11:25.236,ns_1@10.3.3.9:ns_memcached-sasl<0.2984.1>:ns_memcached:terminate:672]This bucket shutdown is not due to bucket deletion. Doing nothing
      [couchdb:error,2013-02-06T13:11:25.265,ns_1@10.3.3.9:<0.12058.1>:couch_log:error:42]Uncaught error in HTTP request: {exit,
          {shutdown,
              {gen_server,call,
                  ['ns_memcached-sasl',
                   {stats,<<>>},
                   180000]}}}

      Stacktrace: [{gen_server,call,3},
                   {ns_memcached,do_call,3},
                   {capi_frontend,get_db_info,1},
                   {couch_httpd_db,db_req,2},
                   {couch_db_frontend,do_db_req,2},
                   {couch_httpd,handle_request,6},
                   {mochiweb_http,headers,5},
                   {proc_lib,init_p_do_apply,3}]

      [couchdb:error,2013-02-06T13:11:25.269,ns_1@10.3.3.9:<0.12050.1>:couch_log:error:42]Uncaught error in HTTP request: {exit,
          {shutdown,
              {gen_server,call,
                  ['ns_memcached-sasl',
                   {stats,<<>>},
                   180000]}}}

      Link to collect info files at source cluster:
      https://s3.amazonaws.com/packages.couchbase/collect_info/2_0_1/201302/2nodes-200GA-src-xdcr-bucket-shutdown-at-des-20130206-135530.tgz

      Link to collect info files at destination cluster:
      https://s3.amazonaws.com/packages.couchbase/collect_info/2_0_1/201302/2nodes-200GA-des-xdcr-bucket-shutdown-at-des-20130206-135530.tgz


        Activity

        ketaki Ketaki Gangal added a comment -

        Hi Tony,

        Can you add CPU delay information from the destination cluster?

        -Ketaki

        thuan Thuan Nguyen added a comment -

        measure-sched-delay is installed under /root/measure-sched-delays.

        Output is written to text.cpu0, text.cpu1, text.cpu2, and text.cpu3.
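
        A quick way to summarize those files is sketched below. It assumes each text.cpuN file holds one numeric sample per line with the delay in the last column; that format is an assumption about the tool's output, not something verified here.

        import glob

        # Summarize the measure-sched-delays output on a node.
        # Format assumption: one sample per line, delay in the last column.
        for path in sorted(glob.glob("/root/measure-sched-delays/text.cpu*")):
            samples = []
            with open(path) as f:
                for line in f:
                    fields = line.split()
                    if not fields:
                        continue
                    try:
                        samples.append(float(fields[-1]))
                    except ValueError:
                        continue          # skip headers or malformed lines
            if samples:
                print("%s: samples=%d max=%.2f avg=%.2f"
                      % (path, len(samples), max(samples), sum(samples) / len(samples)))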

        ketaki Ketaki Gangal added a comment -

        Hi Aliaksey,

        The additional CPU-delay info is on nodes 10.3.3.9 and 10.3.3.7, under /root/measure-sched-delays.

        junyi Junyi Xie (Inactive) added a comment -

        I looked at the source log briefly, and it seems to me the issue is more likely on the destination side. Some crash info was seen at the destination cluster (10.3.3.7).

        At the destination:

        [error_logger:error,2013-02-06T13:10:54.183,ns_1@10.3.3.7:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server remote_clusters_info terminating
        ** Last message in was gc
        ** When Server state == {state,"/opt/couchbase/var/lib/couchbase/remote_clusters_cache",
                                 {set,0,16,16,8,80,48,
                                  {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                                  {{[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]}}}}
        ** Reason for termination ==
        ** {noproc,
            {gen_server,call,
             [xdc_rdoc_replication_srv,
              {foreach_doc,#Fun<xdc_rdoc_replication_srv.2.51124823>},
              infinity]}}

        [error_logger:error,2013-02-06T13:10:54.185,ns_1@10.3.3.7:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
        =========================CRASH REPORT=========================
        crasher:
        initial call: remote_clusters_info:init/1
        pid: <0.3710.0>
        registered_name: remote_clusters_info
        exception exit: {noproc,
                         {gen_server,call,
                          [xdc_rdoc_replication_srv,
                           {foreach_doc,#Fun<xdc_rdoc_replication_srv.2.51124823>},
                           infinity]}}
        in function gen_server:terminate/6
        ancestors: [ns_server_sup,ns_server_cluster_sup,<0.59.0>]
        messages: []
        links: [<0.3678.0>]
        dictionary: []
        trap_exit: false
        status: running
        heap_size: 121393
        stack_size: 24
        reductions: 25380
        neighbours:

        Unable to open local db file
        [ns_server:debug,2013-02-06T13:10:08.309,ns_1@10.3.3.7:couch_stats_reader-sasl<0.21266.0>:couch_stats_reader:vbuckets_aggregation_loop:126]Failed to open vbucket: 187 ({not_found,no_db_file}). Ignoring
        [menelaus:warn,2013-02-06T13:10:08.309,ns_1@10.3.3.7:<0.21293.0>:menelaus_web:loop:430]Server error during processing: ["web request failed",
            {path,"/pools/default"},
            {type,exit},
            {what,
             {timeout,
              {gen_server,call,[ns_node_disco,nodes_wanted]}}},
            {trace,
             [{gen_server,call,2},
              {ns_orchestrator,needs_rebalance,0},
              {ns_cluster_membership,is_balanced,0},
              {menelaus_web,build_pool_info,4},
              {menelaus_web,handle_pool_info,2},
              {menelaus_web,loop,3},
              {mochiweb_http,headers,5},
              {proc_lib,init_p_do_apply,3}]}]
        [error_logger:error,2013-02-06T13:10:08.313,ns_1@10.3.3.7:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]
        =========================CRASH REPORT=========================
        crasher:
        initial call: mb_master:init/1
        pid: <0.12030.0>
        registered_name: mb_master
        exception exit: {timeout,{gen_server,call,[ns_node_disco,nodes_wanted]}}
        in function gen_fsm:terminate/7
        ancestors: [ns_server_sup,ns_server_cluster_sup,<0.59.0>]
        messages: [{'$gen_event',
                    {heartbeat,
                     {{[2,0],release,0},'ns_1@10.3.3.9'},
                     candidate,
                     [{peers,['ns_1@10.3.3.7','ns_1@10.3.3.9']},
                      {versioning,true}]}},
                   send_heartbeat,
                   {'$gen_event',
                    {heartbeat,
                     {{[2,0],release,0},'ns_1@10.3.3.9'},
                     candidate,
                     [{peers,['ns_1@10.3.3.7','ns_1@10.3.3.9']},
                      {versioning,true}]}},

            {what,
             {timeout,
              {gen_server,call,[ns_node_disco,nodes_wanted]}}},
            {trace,
             [{gen_server,call,2},
              {menelaus_web,build_pool_info,4},
              {menelaus_web,handle_pool_info,2},
              {menelaus_web,loop,3},
              {mochiweb_http,headers,5},
              {proc_lib,init_p_do_apply,3}]}]
        [ns_server:debug,2013-02-06T13:10:13.357,ns_1@10.3.3.7:ns_config_rep<0.25786.0>:ns_config_rep:init:66]init pulling

        [error_logger:error,2013-02-06T13:10:13.355,ns_1@10.3.3.7:error_logger<0.5.0>:ale_error_logger_handler:log_msg:76]** Generic server ns_config_rep terminating
        ** Last message in was sync_random
        ** When Server state == {state}
        ** Reason for termination ==
        ** {timeout,
            {gen_server,call,[ns_node_disco,nodes_actual_proper]}}

        [error_logger:error,2013-02-06T13:10:13.428,ns_1@10.3.3.7:error_logger<0.5.0>:ale_error_logger_handler:log_report:72]

        ketaki Ketaki Gangal added a comment -

        Per discussion with Aliaksey and Junyi:

        These timeouts are likely expected with 2.0 and are known issues, now fixed in 2.0.1.

        The aim of this test is to test offline upgrade. As long as we are able to load and XDCR-replicate data, we should be OK to tolerate these timeouts.

        In the event we are not able to run the setup with 2.0, we would need to modify/patch 2.0.0 with the timeout changes and test it.

        -Ketaki
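
        One way to check the "data still replicates despite the timeouts" condition is to compare item counts between the clusters, as in the sketch below. This is illustrative only: the source host name, credentials, timeout, and polling interval are assumptions, and it reads itemCount from the bucket's basicStats rather than using the actual test harness.

        import time
        import requests

        SRC = "http://src-node1.example.com:8091"   # assumed source host name
        DST = "http://10.3.3.7:8091"                # destination node from the logs
        AUTH = ("Administrator", "password")        # assumed admin credentials

        def item_count(base, bucket):
            # Bucket details include basicStats with the current item count.
            r = requests.get("%s/pools/default/buckets/%s" % (base, bucket), auth=AUTH)
            r.raise_for_status()
            return r.json()["basicStats"]["itemCount"]

        # Poll until the destination has caught up with the source (or we give up).
        for bucket in ("default", "sasl"):
            deadline = time.time() + 1800
            while time.time() < deadline:
                src, dst = item_count(SRC, bucket), item_count(DST, bucket)
                print("%s: source=%d destination=%d" % (bucket, src, dst))
                if dst >= src:
                    break
                time.sleep(30)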

        Aliaksey Artamonau Aliaksey Artamonau added a comment -

        Per discussion, another option is to enable asynchronous threads manually. But then you might hit the bug in the Erlang VM that we fixed.
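
        If that route is taken, the async-thread setting can be sanity-checked from outside the node. A minimal sketch, assuming the node's /diag/eval endpoint is reachable and the credentials below (both are assumptions):

        import requests

        # Ask the node's Erlang VM for its async thread pool size via /diag/eval;
        # 0 means asynchronous I/O threads are not enabled.
        r = requests.post("http://10.3.3.7:8091/diag/eval",
                          data="erlang:system_info(thread_pool_size).",
                          auth=("Administrator", "password"))
        r.raise_for_status()
        print("async thread pool size:", r.text)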

        maria Maria McDuff (Inactive) added a comment -

        Please verify/close. Open a new bug if you hit the bug in the Erlang VM.


          People

          • Assignee: Thuan Nguyen
          • Reporter: Thuan Nguyen
          • Votes: 0
          • Watchers: 6

            Dates

            • Created:
            • Updated:
            • Resolved:

              Gerrit Reviews

              There are no open Gerrit changes