Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-6481

Rebalance failed due to the timeout in bulk_set_vbucket_state operation with continuous view queries

    XMLWordPrintable

Details

    Description

      Install couchbase server 2.0.0-1645 on 11 nodes
      10.3.121.13
      10.3.121.14
      10.3.121.15
      10.3.121.16
      10.3.121.17
      10.3.121.20
      10.3.121.22
      10.3.121.26
      10.3.121.24
      10.3.121.25
      10.3.121.23
      Create cluster 10 nodes
      Create 2 buckets, default and saslbucket
      Create 1 docs with 2 views per doc for each bucket
      Load 7+ million items to default bucket and 1+ million items to saslbucket
      Maintain load about 12 K ops on default bucket and 3 K ops on saslbucket
      Continuously querying views on both buckets
      Rebalance in and out node
      Then do swap rebalance by add 3 nodes ( 24, 25 and 26) to cluster and remove 3 nodes (20, 22 and 23) out of cluster
      After 3 minutes rebalance, rebalance failed with error

      2012-08-29 01:03:09.578 ns_orchestrator:4:info:message(ns_1@10.3.121.13) - Starting rebalance, KeepNodes = ['ns_1@10.3.121.13','ns_1@10.3.121.14',
      'ns_1@10.3.121.15','ns_1@10.3.121.16',
      'ns_1@10.3.121.17','ns_1@10.3.121.24',
      'ns_1@10.3.121.25','ns_1@10.3.121.26'], EjectNodes = ['ns_1@10.3.121.20',
      'ns_1@10.3.121.22',
      'ns_1@10.3.121.23']

      2012-08-29 01:03:09.893 ns_storage_conf:0:info:message(ns_1@10.3.121.24) - Deleting old data files of bucket "saslbucket"
      2012-08-29 01:03:09.893 ns_storage_conf:0:info:message(ns_1@10.3.121.24) - Deleting old data files of bucket "default"
      2012-08-29 01:03:09.896 ns_storage_conf:0:info:message(ns_1@10.3.121.26) - Deleting old data files of bucket "saslbucket"
      2012-08-29 01:03:09.898 ns_storage_conf:0:info:message(ns_1@10.3.121.26) - Deleting old data files of bucket "default"
      2012-08-29 01:03:09.942 ns_storage_conf:0:info:message(ns_1@10.3.121.25) - Deleting old data files of bucket "saslbucket"
      2012-08-29 01:03:09.942 ns_storage_conf:0:info:message(ns_1@10.3.121.25) - Deleting old data files of bucket "default"
      2012-08-29 01:03:13.771 ns_rebalancer:0:info:message(ns_1@10.3.121.13) - Started rebalancing bucket saslbucket
      2012-08-29 01:03:14.794 ns_memcached:1:info:message(ns_1@10.3.121.26) - Bucket "saslbucket" loaded on node 'ns_1@10.3.121.26' in 0 seconds.
      2012-08-29 01:03:14.820 ns_memcached:1:info:message(ns_1@10.3.121.24) - Bucket "saslbucket" loaded on node 'ns_1@10.3.121.24' in 0 seconds.
      2012-08-29 01:03:14.831 ns_memcached:1:info:message(ns_1@10.3.121.25) - Bucket "saslbucket" loaded on node 'ns_1@10.3.121.25' in 0 seconds.
      2012-08-29 01:03:16.674 ns_vbucket_mover:0:info:message(ns_1@10.3.121.13) - Bucket "saslbucket" rebalance appears to be swap rebalance
      2012-08-29 01:06:21.275 ns_orchestrator:2:info:message(ns_1@10.3.121.13) - Rebalance exited with reason {{bulk_set_vbucket_state_failed,
      [{'ns_1@10.3.121.20',
      {'EXIT',
      {{timeout,
      {gen_server,call,
      [timeout_diag_logger,
      {diag,
      {timeout,

      {gen_server,call,[ns_config,get]}

      }}]}},
      {gen_server,call,
      [

      {'janitor_agent-saslbucket', 'ns_1@10.3.121.20'}

      ,
      {if_rebalance,<0.12846.432>,
      {update_vbucket_state,294,replica,
      undefined,undefined}},
      infinity]}}}}]},
      [

      {janitor_agent,bulk_set_vbucket_state,4}

      ,

      {ns_vbucket_mover, update_replication_post_move,3}

      ,

      {ns_vbucket_mover,handle_info,2}

      ,

      {gen_server,handle_msg,5}

      ,

      {proc_lib,init_p_do_apply,3}

      ]}

      Look at log on node 20 around 2012-08-29 01:06:21 before rebalance failed, see memcached connection closed by force

      [ns_server:info,2012-08-29T1:05:49.620,ns_1@10.3.121.20:ns_config_rep:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.3.121.22'

      [ns_server:debug,2012-08-29T1:05:50.500,ns_1@10.3.121.20:'capi_ddoc_replication_srv-default':cb_generic_replication_srv:handle_info:144]doing replicate_newnodes_docs
      [ns_server:debug,2012-08-29T1:05:50.656,ns_1@10.3.121.20:ns_config_rep:ns_config_rep:handle_call:113]Got full synchronization request from 'ns_1@10.3.121.13'
      [ns_server:debug,2012-08-29T1:05:50.660,ns_1@10.3.121.20:ns_config_log:ns_config_log:log_common:111]config change:
      buckets ->
      [{configs,[{"default",
      [

      {map,[]}

      ,

      {fastForwardMap,[]},
      {uuid,<<"cdf4999f6920e2d4c3526e17f36c843f">>},
      {sasl_password,[]},
      {num_replicas,2},
      {replica_index,true},
      {ram_quota,3145728000},
      {auth_type,sasl},
      {autocompaction,false},
      {type,membase},
      {num_vbuckets,1024},
      {servers,['ns_1@10.3.121.13','ns_1@10.3.121.14', 'ns_1@10.3.121.15','ns_1@10.3.121.16', 'ns_1@10.3.121.17','ns_1@10.3.121.20', 'ns_1@10.3.121.22','ns_1@10.3.121.23']}]},
      {"saslbucket",
      [{map,[{202, ['ns_1@10.3.121.15','ns_1@10.3.121.22','ns_1@10.3.121.16'], ['ns_1@10.3.121.15','ns_1@10.3.121.25', 'ns_1@10.3.121.16']}]},
      {fastForwardMap,[]}

      ,

      {uuid,<<"9f1fdb211536936b49c8a392e079ecb0">>}

      ,

      {num_replicas,2}

      ,

      {replica_index,true}

      ,

      {ram_quota,3145728000}

      ,

      {auth_type,sasl}

      ,

      {sasl_password,"password"}

      ,

      {autocompaction,false}

      ,

      {type,membase}

      ,

      {num_vbuckets,1024}

      ,

      {servers,['ns_1@10.3.121.13','ns_1@10.3.121.14', 'ns_1@10.3.121.15','ns_1@10.3.121.16', 'ns_1@10.3.121.17','ns_1@10.3.121.24', 'ns_1@10.3.121.25','ns_1@10.3.121.26', 'ns_1@10.3.121.20','ns_1@10.3.121.22', 'ns_1@10.3.121.23']}

      ]}]}]
      [ns_server:debug,2012-08-29T1:05:50.660,ns_1@10.3.121.20:ns_config_rep:ns_config_rep:handle_call:119]Fully synchronized config in 458 us
      [ns_server:info,2012-08-29T1:05:50.857,ns_1@10.3.121.20:ns_port_memcached:ns_port_server:log:169]memcached<0.506.0>: Wed Aug 29 08:05:50.552220 3: TAP (Producer) eq_tapq:replication_building_272_'ns_1@10.3.121.17' - disconnected, keep alive for 300 seconds
      memcached<0.506.0>: Wed Aug 29 08:05:50.623620 3: TAP (Producer) eq_tapq:replication_building_272_'ns_1@10.3.121.13' - disconnected, keep alive for 300 seconds

      [ns_server:debug,2012-08-29T1:05:50.902,ns_1@10.3.121.20:'capi_ddoc_replication_srv-default':cb_generic_replication_srv:handle_info:144]doing replicate_newnodes_docs
      [ns_server:error,2012-08-29T1:05:51.136,ns_1@10.3.121.20:<0.16154.0>:ns_memcached:verify_report_long_call:274]call

      {stats,<<>>}

      took too long: 1276102 us
      [ns_server:error,2012-08-29T1:05:51.136,ns_1@10.3.121.20:'ns_memcached-saslbucket':ns_memcached:handle_info:594]handle_info(ensure_bucket,..) took too long: 809713 us
      [stats:warn,2012-08-29T1:05:51.387,ns_1@10.3.121.20:<0.16160.0>:stats_collector:latest_tick:201]Dropped 1 ticks
      [ns_server:info,2012-08-29T1:05:52.779,ns_1@10.3.121.20:ns_port_memcached:ns_port_server:log:169]memcached<0.506.0>: Wed Aug 29 08:05:52.492646 3: TAP (Producer) eq_tapq:replication_building_272_'ns_1@10.3.121.24' - Connection is closed by force.

      Link to diags of all nodes
      https://s3.amazonaws.com/packages.couchbase/diag-logs/large_cluster_2_0/11nodes-1645-memchaced-close-connection-by-force-20120829.tgz

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            thuan Thuan Nguyen
            thuan Thuan Nguyen
            Votes:
            0 Vote for this issue
            Watchers:
            0 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty