Couchbase Server / MB-7760

[system test][xdcr+views] rebalance failed due to timeout

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.1.0
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
    • Environment:
      CentOS 5.x, 64-bit

      Description

      Environment:

      • Both source and destination clusters are on 2.0.0 GA
      • 2-node cluster at source with 2 buckets, one doc and 1 view for each doc
      • 2-node cluster at destination with 2 buckets, one doc and 1 view for each doc

      Load 200K items into both buckets.
      Create XDCR replication from the source to the destination cluster.

      Do an offline upgrade of the source and destination clusters from 2.0.0-1976 to 2.0.1-156.

      Add node ubu-2509 with build 2.0.1-156 to the source cluster and rebalance with a load of about 1K running on both clusters.
      The rebalance failed with a timeout near the end of rebalancing the first bucket (the sasl bucket).

      [stats:error,2013-02-15T2:48:46.891,ns_1@cen-2501.hq.couchbase.com:<0.26578.6>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@cen-2503.hq.couchbase.com']
      [ns_server:error,2013-02-15T2:48:46.909,ns_1@cen-2501.hq.couchbase.com:<0.26448.6>:ns_single_vbucket_mover:spawn_and_wait:87]Got unexpected exit signal {'EXIT',<0.20537.5>,
      {timeout,
      {gen_server,call,
      [ns_config,

      {update_with_changes, #Fun<ns_config.4.26158082>}

      ]}}}

      [stats:error,2013-02-15T2:48:46.944,ns_1@cen-2501.hq.couchbase.com:<0.26584.6>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@cen-2503.hq.couchbase.com']
      [stats:error,2013-02-15T2:48:46.944,ns_1@cen-2501.hq.couchbase.com:<0.26588.6>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@cen-2503.hq.couchbase.com']
      [error_logger:error,2013-02-15T2:48:46.982,ns_1@cen-2501.hq.couchbase.com:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** gen_event handler

      {ns_pubsub,#Ref<0.0.0.33541>}

      crashed.

        • Was installed in master_activity_events_ingress
        • Last event was: {submit_custom_master_event, #Fun<master_activity_events.1.65826123>}
        • When handler state == {state,#Fun<master_activity_events.2.6034187>,[]}
        • Reason == {timeout,
          {gen_fsm,sync_send_all_state_event, [mb_master,master_node]}}

      [ns_server:debug,2013-02-15T2:48:48.887,ns_1@cen-2501.hq.couchbase.com:<0.12931.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription

      {ns_node_disco_events,<0.12894.0>}

      exited with reason {timeout,
      {gen_server,
      call,
      [ns_config,

      {eval, #Fun<ns_bucket.0.52407284>}

      ]}}
      [ns_server:debug,2013-02-15T2:48:48.888,ns_1@cen-2501.hq.couchbase.com:<0.12896.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription

      {ns_config_events,<0.12894.0>}

      exited with reason {timeout,
      {gen_server,
      call,
      [ns_config,

      {eval, #Fun<ns_bucket.0.52407284>}

      ]}}
      [ns_server:debug,2013-02-15T2:48:49.628,ns_1@cen-2501.hq.couchbase.com:<0.12934.0>:ns_pubsub:do_subscribe_link:132]Parent process of subscription

      {mc_couch_events,<0.12894.0>}

      exited with reason {timeout,
      {gen_server,
      call,
      [ns_config,

      {eval, #Fun<ns_bucket.0.52407284>}

      ]}}
      [error_logger:error,2013-02-15T2:48:49.630,ns_1@cen-2501.hq.couchbase.com:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_pubsub:do_subscribe_link/4
      pid: <0.12826.0>
      registered_name: []
      exception exit: {handler_crashed,master_activity_events_ingress,
      {'EXIT',
      {timeout,
      {gen_fsm,sync_send_all_state_event,
      [mb_master,master_node]}}}}
      in function ns_pubsub:do_subscribe_link/4
      ancestors: [ns_server_sup,ns_server_cluster_sup,<0.66.0>]
      messages: []
      links: [<0.12781.0>,<0.12825.0>]
      dictionary: []
      trap_exit: true
      status: running
      heap_size: 233
      stack_size: 24
      reductions: 115
      neighbours:

      [ns_server:error,2013-02-15T2:48:49.646,ns_1@cen-2501.hq.couchbase.com:<0.26665.6>:ns_orchestrator:rebalance_progress:176]Couldn't talk to orchestrator: {exit,
      {timeout,
      {gen_fsm,sync_send_event,
      [

      {global,ns_orchestrator}

      ,
      rebalance_progress,2000]}}}
      [ns_server:info,2013-02-15T2:48:50.142,ns_1@cen-2501.hq.couchbase.com:<0.26236.6>:diag_handler:log_all_tap_and_checkpoint_stats:132]end of logging tap & checkpoint stats
      [ns_server:error,2013-02-15T2:48:50.248,ns_1@cen-2501.hq.couchbase.com:<0.13095.0>:ns_memcached:verify_report_long_call:297]call

      {stats,<<>>}

      took too long: 44142120 us
      [ns_server:info,2013-02-15T2:48:50.439,ns_1@cen-2501.hq.couchbase.com:ns_port_memcached<0.12863.0>:ns_port_server:log:171]memcached<0.12863.0>: Fri Feb 15 02:48:49.788622 PST 3: TAP (Producer) eq_tapq:replication_building_95_'ns_1@10.3.3.29' - disconnected, keep alive for 300 seconds
      memcached<0.12863.0>: Fri Feb 15 02:48:49.925897 PST 3: TAP (Producer) eq_tapq:replication_building_95_'ns_1@cen-2503.hq.couchbase.com' - disconnected, keep alive for 300 seconds

      [ns_server:debug,2013-02-15T2:48:50.209,ns_1@cen-2501.hq.couchbase.com:capi_set_view_manager-default<0.26696.6>:capi_set_view_manager:init:218]Usable vbuckets:
      [933,622,311,0,856,545,490,179,779,724,413,102,958,647,336,25,881,570,259,204,
      804,749,438,127,983,672,50,361,906,595,284,229,829,518,463,152,75,697,386,
      1008,931,620,309,254,854,543,488,177,777,722,411,100,956,645,334,23,879,568,
      257,202,802,747,436,125,981,670,48,359,904,593,282,227,827,516,461,150,73,
      695,384,1006,929,618,307,252,852,541,486,175,98,775,720,409,954,643,332,21,
      877,566,511,200,800,745,43

      [error_logger:error,2013-02-15T2:48:50.834,ns_1@cen-2501.hq.couchbase.com:error_logger<0.6.0>:ale_error_logger_handler:log_msg:76]** Generic server <0.12894.0> terminating

        • Last message in was replicate_newnodes_docs
        • When Server state == {state,"default",'capi_ddoc_replication_srv-default',
          ['ns_1@cen-2503.hq.couchbase.com'],
          [{doc,<<"_design/d3">>,
          {1,<<142,152,152,32>>}

          ,
          {[{<<"views">>,
          {[{<<"v1">>,
          {[

          {<<"map">>, <<"function(doc,meta){\nemit(doc.num,null);\n}">>}

          ]}}]}}]},
          0,false,[]}],
          1024,false,undefined,
          [active,active,active,active,active,active,
          active,active,active,active,active,active,
          active,active,active,active,active,active,
          active,active,active,active,active,active,

      {[],[],[],[],[],[],[],[],[],[],[],[],[],
      [],[],[]}}}}

        • Reason for termination ==
        • {timeout,{gen_server,call,[ns_config, {eval,#Fun<ns_bucket.0.52407284>}]}}

          [ns_server:error,2013-02-15T2:48:51.270,ns_1@cen-2501.hq.couchbase.com:ns_memcached-sasl<0.12947.0>:ns_memcached:handle_info:630]handle_info(ensure_bucket,..) took too long: 4332674 us
          [ns_server:info,2013-02-15T2:48:51.328,ns_1@cen-2501.hq.couchbase.com:mb_master<0.12822.0>:mb_master:candidate:365]Changing master from 'ns_1@cen-2501.hq.couchbase.com' to 'ns_1@10.3.3.29'
          [ns_server:error,2013-02-15T2:48:51.360,ns_1@cen-2501.hq.couchbase.com:ns_memcached-default<0.12948.0>:ns_memcached:handle_info:630]handle_info(ensure_bucket,..) took too long: 2473280 us
          [ns_server:error,2013-02-15T2:48:52.214,ns_1@cen-2501.hq.couchbase.com:<0.13093.0>:ns_memcached:verify_report_long_call:297]call {stats,<<"tapagg _">>} took too long: 561239 us
          [stats:error,2013-02-15T2:48:54.830,ns_1@cen-2501.hq.couchbase.com:<0.26666.6>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.3.3.29']
          [error_logger:error,2013-02-15T2:48:54.781,ns_1@cen-2501.hq.couchbase.com:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
          =========================CRASH REPORT=========================
          crasher:
          initial call: capi_set_view_manager:init/1
          pid: <0.12894.0>
          registered_name: []
          exception exit: {timeout,
          {gen_server,call,
          [ns_config,{eval,#Fun<ns_bucket.0.52407284>}

          ]}}
          in function gen_server:init_it/6
          ancestors: ['single_bucket_sup-default',<0.12879.0>]
          messages: [{#Ref<0.0.173.118623>,
          {ok,[

          {uuid,<<"7bace5ad7988f92d0263e613c872aefd">>}

          ,

          {sasl_password,[]}

          ,

          {num_replicas,1}

          ,

          {replica_index,false}

          ,

          {ram_quota,1572864000}

          ,

          {auth_type,sasl}

          ,

          {autocompaction,false}

          ,

          {flush_enabled,false}

          ,

          {type,membase}

          ,

          {num_vbuckets,1024}

          ,

          {servers,['ns_1@cen-2501.hq.couchbase.com', 'ns_1@cen-2503.hq.couchbase.com']}

          ,
          {map,[['ns_1@cen-2501.hq.couchbase.com',
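
The "took too long" entries in the log above report durations in microseconds, which makes the slowest calls easy to overlook. A minimal Python sketch for pulling them out of a pasted excerpt and converting them to seconds is given here. The function name `slow_calls` and the regular expression are illustrative assumptions based only on the line format visible in this ticket; they are not part of ns_server or any Couchbase tool.

```python
import re

# Assumed line shape, taken from the excerpt in this ticket:
#   [...:ns_memcached:handle_info:630]handle_info(ensure_bucket,..) took too long: 4332674 us
# The call text is whatever sits between the closing "]" of the log prefix
# and the phrase "took too long".
SLOW_CALL = re.compile(r"\](?P<call>[^\[\]]*?) took too long: (?P<us>\d+) us")


def slow_calls(log_text):
    """Return (call, seconds) pairs for each 'took too long' entry, slowest first."""
    # Entries pasted into the ticket are sometimes wrapped across several
    # lines, so collapse all whitespace runs before matching.
    flat = re.sub(r"\s+", " ", log_text)
    found = [
        (m.group("call").strip(), int(m.group("us")) / 1_000_000)
        for m in SLOW_CALL.finditer(flat)
    ]
    return sorted(found, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    import sys

    # Usage (hypothetical file name): python slow_calls.py < excerpt.log
    for call, seconds in slow_calls(sys.stdin.read()):
        print(f"{seconds:10.3f} s  {call}")
```

Applied to the excerpt in this ticket, the `{stats,<<>>}` call at 44142120 us works out to roughly 44 seconds, which is the slowest entry shown.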


        Activity

        maria Maria McDuff (Inactive) added a comment -

        As of today, QE is still blocked from verifying this because rebalance is hanging (MB-8231).

        ketaki Ketaki Gangal added a comment -

        XDCR needs to be stable before we can move on to xdcr+views testing; currently there are no stable builds for that.

        maria Maria McDuff (Inactive) added a comment -

        Waiting for a stable 2.0.2 build.

        ketaki Ketaki Gangal added a comment -

        Retesting this with build 202-823 at http://10.3.3.60:8091/index.html#sec=buckets (waiting for DWQ to drop below 1M before starting the rebalance).

        Cluster Config :

        2 XDCR replications,
        3 Views
        Multiple buckets

        ketaki Ketaki Gangal added a comment -

        Not able to reproduce this with build 823.

        Closing this bug.


          People

          • Assignee:
            ketaki Ketaki Gangal
            Reporter:
            thuan Thuan Nguyen