Couchbase Server
MB-6481

Rebalance failed due to the timeout in bulk_set_vbucket_state operation with continuous view queries

    Details

      Description

      Install Couchbase Server 2.0.0-1645 on 11 nodes:
      10.3.121.13
      10.3.121.14
      10.3.121.15
      10.3.121.16
      10.3.121.17
      10.3.121.20
      10.3.121.22
      10.3.121.26
      10.3.121.24
      10.3.121.25
      10.3.121.23
      Create a cluster of 10 nodes
      Create 2 buckets, default and saslbucket
      Create 1 design doc with 2 views for each bucket
      Load 7+ million items into the default bucket and 1+ million items into saslbucket
      Maintain a load of about 12K ops on the default bucket and 3K ops on saslbucket
      Continuously query views on both buckets (a sketch of this query load is at the end of this description)
      Rebalance a node in and out
      Then do a swap rebalance by adding 3 nodes (24, 25 and 26) to the cluster and removing 3 nodes (20, 22 and 23) from the cluster
      After about 3 minutes of rebalancing, the rebalance failed with this error:

      2012-08-29 01:03:09.578 ns_orchestrator:4:info:message(ns_1@10.3.121.13) - Starting rebalance, KeepNodes = ['ns_1@10.3.121.13','ns_1@10.3.121.14',
      'ns_1@10.3.121.15','ns_1@10.3.121.16',
      'ns_1@10.3.121.17','ns_1@10.3.121.24',
      'ns_1@10.3.121.25','ns_1@10.3.121.26'], EjectNodes = ['ns_1@10.3.121.20',
      'ns_1@10.3.121.22',
      'ns_1@10.3.121.23']

      2012-08-29 01:03:09.893 ns_storage_conf:0:info:message(ns_1@10.3.121.24) - Deleting old data files of bucket "saslbucket"
      2012-08-29 01:03:09.893 ns_storage_conf:0:info:message(ns_1@10.3.121.24) - Deleting old data files of bucket "default"
      2012-08-29 01:03:09.896 ns_storage_conf:0:info:message(ns_1@10.3.121.26) - Deleting old data files of bucket "saslbucket"
      2012-08-29 01:03:09.898 ns_storage_conf:0:info:message(ns_1@10.3.121.26) - Deleting old data files of bucket "default"
      2012-08-29 01:03:09.942 ns_storage_conf:0:info:message(ns_1@10.3.121.25) - Deleting old data files of bucket "saslbucket"
      2012-08-29 01:03:09.942 ns_storage_conf:0:info:message(ns_1@10.3.121.25) - Deleting old data files of bucket "default"
      2012-08-29 01:03:13.771 ns_rebalancer:0:info:message(ns_1@10.3.121.13) - Started rebalancing bucket saslbucket
      2012-08-29 01:03:14.794 ns_memcached:1:info:message(ns_1@10.3.121.26) - Bucket "saslbucket" loaded on node 'ns_1@10.3.121.26' in 0 seconds.
      2012-08-29 01:03:14.820 ns_memcached:1:info:message(ns_1@10.3.121.24) - Bucket "saslbucket" loaded on node 'ns_1@10.3.121.24' in 0 seconds.
      2012-08-29 01:03:14.831 ns_memcached:1:info:message(ns_1@10.3.121.25) - Bucket "saslbucket" loaded on node 'ns_1@10.3.121.25' in 0 seconds.
      2012-08-29 01:03:16.674 ns_vbucket_mover:0:info:message(ns_1@10.3.121.13) - Bucket "saslbucket" rebalance appears to be swap rebalance
      2012-08-29 01:06:21.275 ns_orchestrator:2:info:message(ns_1@10.3.121.13) - Rebalance exited with reason {{bulk_set_vbucket_state_failed,
                                     [{'ns_1@10.3.121.20',
                                       {'EXIT',
                                        {{timeout,
                                          {gen_server,call,
                                           [timeout_diag_logger,
                                            {diag,
                                             {timeout,
                                              {gen_server,call,[ns_config,get]}}}]}},
                                         {gen_server,call,
                                          [{'janitor_agent-saslbucket',
                                            'ns_1@10.3.121.20'},
                                           {if_rebalance,<0.12846.432>,
                                            {update_vbucket_state,294,replica,
                                             undefined,undefined}},
                                           infinity]}}}}]},
                                    [{janitor_agent,bulk_set_vbucket_state,4},
                                     {ns_vbucket_mover,
                                      update_replication_post_move,3},
                                     {ns_vbucket_mover,handle_info,2},
                                     {gen_server,handle_msg,5},
                                     {proc_lib,init_p_do_apply,3}]}

      Looking at the log on node 20 around 2012-08-29 01:06:21, just before the rebalance failed, the memcached connection is closed by force:

      [ns_server:info,2012-08-29T1:05:49.620,ns_1@10.3.121.20:ns_config_rep:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.3.121.22'

      [ns_server:debug,2012-08-29T1:05:50.500,ns_1@10.3.121.20:'capi_ddoc_replication_srv-default':cb_generic_replication_srv:handle_info:144]doing replicate_newnodes_docs
      [ns_server:debug,2012-08-29T1:05:50.656,ns_1@10.3.121.20:ns_config_rep:ns_config_rep:handle_call:113]Got full synchronization request from 'ns_1@10.3.121.13'
      [ns_server:debug,2012-08-29T1:05:50.660,ns_1@10.3.121.20:ns_config_log:ns_config_log:log_common:111]config change:
      buckets ->
      [{configs,[{"default",
                  [{map,[]},
                   {fastForwardMap,[]},
                   {uuid,<<"cdf4999f6920e2d4c3526e17f36c843f">>},
                   {sasl_password,[]},
                   {num_replicas,2},
                   {replica_index,true},
                   {ram_quota,3145728000},
                   {auth_type,sasl},
                   {autocompaction,false},
                   {type,membase},
                   {num_vbuckets,1024},
                   {servers,['ns_1@10.3.121.13','ns_1@10.3.121.14',
                             'ns_1@10.3.121.15','ns_1@10.3.121.16',
                             'ns_1@10.3.121.17','ns_1@10.3.121.20',
                             'ns_1@10.3.121.22','ns_1@10.3.121.23']}]},
                 {"saslbucket",
                  [{map,[{202,
                          ['ns_1@10.3.121.15','ns_1@10.3.121.22','ns_1@10.3.121.16'],
                          ['ns_1@10.3.121.15','ns_1@10.3.121.25',
                           'ns_1@10.3.121.16']}]},
                   {fastForwardMap,[]},
                   {uuid,<<"9f1fdb211536936b49c8a392e079ecb0">>},
                   {num_replicas,2},
                   {replica_index,true},
                   {ram_quota,3145728000},
                   {auth_type,sasl},
                   {sasl_password,"password"},
                   {autocompaction,false},
                   {type,membase},
                   {num_vbuckets,1024},
                   {servers,['ns_1@10.3.121.13','ns_1@10.3.121.14',
                             'ns_1@10.3.121.15','ns_1@10.3.121.16',
                             'ns_1@10.3.121.17','ns_1@10.3.121.24',
                             'ns_1@10.3.121.25','ns_1@10.3.121.26',
                             'ns_1@10.3.121.20','ns_1@10.3.121.22',
                             'ns_1@10.3.121.23']}]}]}]
      [ns_server:debug,2012-08-29T1:05:50.660,ns_1@10.3.121.20:ns_config_rep:ns_config_rep:handle_call:119]Fully synchronized config in 458 us
      [ns_server:info,2012-08-29T1:05:50.857,ns_1@10.3.121.20:ns_port_memcached:ns_port_server:log:169]memcached<0.506.0>: Wed Aug 29 08:05:50.552220 3: TAP (Producer) eq_tapq:replication_building_272_'ns_1@10.3.121.17' - disconnected, keep alive for 300 seconds
      memcached<0.506.0>: Wed Aug 29 08:05:50.623620 3: TAP (Producer) eq_tapq:replication_building_272_'ns_1@10.3.121.13' - disconnected, keep alive for 300 seconds

      [ns_server:debug,2012-08-29T1:05:50.902,ns_1@10.3.121.20:'capi_ddoc_replication_srv-default':cb_generic_replication_srv:handle_info:144]doing replicate_newnodes_docs
      [ns_server:error,2012-08-29T1:05:51.136,ns_1@10.3.121.20:<0.16154.0>:ns_memcached:verify_report_long_call:274]call {stats,<<>>} took too long: 1276102 us
      [ns_server:error,2012-08-29T1:05:51.136,ns_1@10.3.121.20:'ns_memcached-saslbucket':ns_memcached:handle_info:594]handle_info(ensure_bucket,..) took too long: 809713 us
      [stats:warn,2012-08-29T1:05:51.387,ns_1@10.3.121.20:<0.16160.0>:stats_collector:latest_tick:201]Dropped 1 ticks
      [ns_server:info,2012-08-29T1:05:52.779,ns_1@10.3.121.20:ns_port_memcached:ns_port_server:log:169]memcached<0.506.0>: Wed Aug 29 08:05:52.492646 3: TAP (Producer) eq_tapq:replication_building_272_'ns_1@10.3.121.24' - Connection is closed by force.

      Link to diags of all nodes
      https://s3.amazonaws.com/packages.couchbase/diag-logs/large_cluster_2_0/11nodes-1645-memchaced-close-connection-by-force-20120829.tgz
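
      For reference, a minimal sketch of the kind of continuous view querying used here. This is not the actual test harness: the host, design doc name "d1", view name "v1" and query interval are assumptions; the stale=ok&limit=10 parameters match the view requests seen in the logs, and 8092 is the Couchbase Server 2.0 view REST port.

      -module(view_query_loop).
      -export([run/0]).

      %% Repeatedly query one view with stale=ok, roughly matching the
      %% "POST /_view_merge/?stale=ok&limit=10" entries in the couchdb logs.
      run() ->
          inets:start(),
          Url = "http://10.3.121.13:8092/default/_design/d1/_view/v1?stale=ok&limit=10",
          loop(Url).

      loop(Url) ->
          case httpc:request(get, {Url, []}, [{timeout, 30000}], []) of
              {ok, {{_, 200, _}, _Headers, _Body}} ->
                  ok;
              Other ->
                  io:format("view query failed: ~p~n", [Other])
          end,
          timer:sleep(100),
          loop(Url).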


        Activity

        thuan Thuan Nguyen created issue -
        thuan Thuan Nguyen added a comment -

        About 4 minutes after this rebalance failed, I did a swap rebalance of 3 nodes again. That rebalance failed after 13 minutes with this error:
        Rebalance exited with reason {ns_node_disco_events_exited,
                                      {'EXIT',<0.26298.432>,normal}}
        ns_orchestrator002    ns_1@10.3.121.13    01:23:41 - Wed Aug 29, 2012

        Got exit from node disco events subscription
        ns_vbucket_mover000    ns_1@10.3.121.13    01:23:39 - Wed Aug 29, 2012

        I looked at the diags on node 13 and saw the same error as in this bug: the memcached connection is closed by force.

          • At node 13:
            [ns_server:info,2012-08-29T1:23:40.678,ns_1@10.3.121.13:<0.20579.433>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.121.20': [<<"replication_building_548_'ns_1@10.3.121.24'">>,
            <<"replication_building_548_'ns_1@10.3.121.16'">>,
            <<"replication_building_548_'ns_1@10.3.121.17'">>]
            [error_logger:error,2012-08-29T1:23:40.713,ns_1@10.3.121.13:error_logger:ale_error_logger_handler:log_report:72]
            =========================CRASH REPORT=========================
            crasher:
            initial call: ns_single_vbucket_mover:mover/6
            pid: <0.20570.433>
            registered_name: []
            exception exit: {exited,
            {'EXIT',<0.26285.432>,
            {ns_node_disco_events_exited,
            {'EXIT',<0.26298.432>,normal}}}}
            in function ns_single_vbucket_mover:mover_inner_old_style/6
            in call from misc:try_with_maybe_ignorant_after/2
            in call from ns_single_vbucket_mover:mover/6
            ancestors: [<0.26285.432>,<0.26141.432>]
            messages: [{'EXIT',<0.26285.432>,
                        {ns_node_disco_events_exited,
                         {'EXIT',<0.26298.432>,normal}}}]
            links: [<0.20578.433>,<0.26285.432>]
            dictionary: [{cleanup_list,[<0.20579.433>]}]
            trap_exit: true
            status: running
            heap_size: 1597
            stack_size: 24
            reductions: 3010
            neighbours:

        [couchdb:info,2012-08-29T1:23:40.760,ns_1@10.3.121.13:<0.29969.390>:couch_log:info:39]10.3.121.14 - - POST /_view_merge/?stale=ok&limit=10 200
        [ns_server:debug,2012-08-29T1:23:40.933,ns_1@10.3.121.13:<0.20580.433>:ns_pubsub:do_subscribe_link:149]Deleting {ns_stats_event,<0.15525.374>} event handler: ok
        [ns_server:info,2012-08-29T1:23:38.869,ns_1@10.3.121.13:<0.20467.433>:diag_handler:log_all_tap_and_checkpoint_stats:128]checkpoint:saslbucket:
        [{<<"vb_1017:persisted_checkpoint_id">>,<<"65">>},

          • At node 20:
            [couchdb:info,2012-08-29T1:23:40.724,ns_1@10.3.121.20:<0.13706.405>:couch_log:info:39]Updater reading changes from passive partitions to update replica set view group `_design/d1` from set `default`
            [couchdb:info,2012-08-29T1:23:40.734,ns_1@10.3.121.20:<0.31308.390>:couch_log:info:39]10.3.121.14 - - POST /_view_merge/?stale=ok&limit=10 200
            [ns_server:info,2012-08-29T1:23:40.737,ns_1@10.3.121.20:<0.16149.0>:mc_connection:do_delete_vbucket:123]Deleting vbucket: saslbucket/547
            [couchdb:info,2012-08-29T1:23:40.759,ns_1@10.3.121.20:couch_server:couch_log:info:39]Deleting database saslbucket/547
            [couchdb:info,2012-08-29T1:23:40.808,ns_1@10.3.121.20:couch_server:couch_log:info:39]Deleting file /data/saslbucket/547.couch.313
            [ns_server:info,2012-08-29T1:23:40.883,ns_1@10.3.121.20:ns_port_memcached:ns_port_server:log:169]memcached<0.506.0>: Wed Aug 29 08:23:40.677138 3: TAP (Producer) eq_tapq:replication_building_548_'ns_1@10.3.121.24' - Connection is closed by force.
        thuan Thuan Nguyen made changes -
        Summary memcached closes the connection on source with error Rebalance exited with reason {{bulk_set_vbucket_state_failed, [{'ns_1@10.3.121.20', {'EXIT', {{timeout, {gen_server,call, [timeout_diag_logger, {diag, {timeout, rebalance failed due to memcached connection is closed by force
        chiyoung Chiyoung Seo made changes -
        Sprint Status Current Sprint
        karan Karan Kumar (Inactive) added a comment - - edited

        We looked at the memcached logs.
        @Tony is going to attach the memcached logs from node 13.
        The memcached log file, named memcached-log-1663-20120905.tgz, is in the attachment location above.

        On node 13, the rebalance failure around this time was:
        Rebalance exited with reason {{bulk_set_vbucket_state_failed,
                                       [{'ns_1@10.3.121.16',
                                         {'EXIT',
                                          {{normal,
                                            {gen_server,call,
                                             [<15201.5655.244>,
                                              {start_vbucket_filter_change,
                                               [171,197,224,225,226,227,228,475,476,
                                                477,478,479,531,532,533,655,656,657]},
                                              30000]}},
                                           {gen_server,call,
                                            [{'janitor_agent-saslbucket',
                                              'ns_1@10.3.121.16'},
                                             {if_rebalance,<0.6019.435>,
                                              {update_vbucket_state,657,replica,
                                               undefined,'ns_1@10.3.121.23'}},
                                             infinity]}}}}]},
                                      [{janitor_agent,bulk_set_vbucket_state,4},
                                       {ns_vbucket_mover,
                                        update_replication_post_move,3},
                                       {ns_vbucket_mover,handle_info,2},
                                       {gen_server,handle_msg,5},
                                       {proc_lib,init_p_do_apply,3}]}

        Node 13 memcached logs:-
        Wed Sep 5 16:30:35.608213 3: 101 Closing connection due to read error: Connection reset by peer

        From memcached.c code:-

        if (res == -1) {
            if (errno == EAGAIN || errno == EWOULDBLOCK) {
                break;
            }
            settings.extensions.logger->log(EXTENSION_LOG_WARNING, c,
                                            "%d Closing connection due to read error: %s",
                                            c->sfd, strerror(errno));
            return READ_ERROR;
        }

        thuan Thuan Nguyen made changes -
        Attachment memcached-log-1663-20120905.tgz [ 14770 ]
        chiyoung Chiyoung Seo added a comment -

        "Connection is closed by force" is a normal log message from ep-engine, emitted whenever ep-engine receives a connection close request from ns_server after the replication building phase completes. In other words, it is not an error case.

        The rebalance originally failed due to the timeout from 10.3.121.20 in processing {update_vbucket_state,294,replica} on the sasl bucket. We were continuously querying views on both buckets during the rebalance.

        It would be more useful to capture any paging and swap activity during the rebalance, but I think we already have bugs open for very similar timeout issues with continuous view updates and queries.

        karan Karan Kumar (Inactive) added a comment -

        I looked at the ns_server code:

        -define(SET_VBUCKET_STATE_TIMEOUT, infinity).

        The vbucket state timeout is already set to infinity, so I'm not sure what more we could raise it to.
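
        The infinity timeout applies only to the outer janitor_agent call. The failure reason above shows the crash came from nested gen_server calls made while that request was being handled (to ns_config, and then to timeout_diag_logger while reporting that timeout), and those calls use their own, finite timeouts (gen_server:call defaults to 5000 ms). Below is a minimal, self-contained sketch of that mechanism; this is not ns_server code, and the module and the stuck_config name are made up.

        %% Sketch: an outer gen_server:call with an infinity timeout still dies
        %% with a nested {timeout,{gen_server,call,...}} reason when the server
        %% handling the request makes its own call that times out.
        -module(nested_timeout_sketch).
        -behaviour(gen_server).
        -export([start_link/0, outer_call/0]).
        -export([init/1, handle_call/3, handle_cast/2, handle_info/2]).

        start_link() ->
            gen_server:start_link({local, ?MODULE}, ?MODULE, [], []).

        %% The caller waits forever, like the janitor_agent call above.
        outer_call() ->
            gen_server:call(?MODULE, do_work, infinity).

        init([]) ->
            %% Stand-in for a process that is too busy to answer (e.g. ns_config
            %% on an overloaded node): registered, but it never replies to calls.
            Stuck = spawn_link(fun() -> receive never -> ok end end),
            register(stuck_config, Stuck),
            {ok, #{}}.

        handle_call(do_work, _From, State) ->
            %% Default gen_server:call timeout is 5000 ms; this exits after 5 s
            %% with {timeout,{gen_server,call,[stuck_config,get]}}, crashing the
            %% server, so the infinitely patient outer caller gets an 'EXIT' too.
            Reply = gen_server:call(stuck_config, get),
            {reply, Reply, State}.

        handle_cast(_Msg, State) -> {noreply, State}.
        handle_info(_Msg, State) -> {noreply, State}.

        After start_link(), nested_timeout_sketch:outer_call() exits after roughly 5 seconds with a reason of the same nested shape as the rebalance failure, so the infinity setting on the outer call cannot by itself prevent this class of timeout.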

        chiyoung Chiyoung Seo made changes -
        Summary rebalance failed due to memcached connection is closed by force Rebalance failed due to the timeout in bulk_set_vbucket_state operation with continuous view queries
        farshid Farshid Ghods (Inactive) made changes -
        Labels 2.0-beta-release-notes
        farshid Farshid Ghods (Inactive) made changes -
        Fix Version/s 2.0 [ 10114 ]
        Fix Version/s 2.0-beta [ 10113 ]
        peter peter added a comment -

        Alk, the timeout was set to infinity and we still received a timeout. Who should look at this next?

        peter peter made changes -
        Assignee Chiyoung Seo [ chiyoung ] Aleksey Kondratenko [ alkondratenko ]
        karan Karan Kumar (Inactive) made changes -
        Labels 2.0-beta-release-notes 2.0-beta-release-notes system-test
        farshid Farshid Ghods (Inactive) made changes -
        Assignee Aleksey Kondratenko [ alkondratenko ] Thuan Nguyen [ thuan ]
        chiyoung Chiyoung Seo added a comment -

        http://www.couchbase.com/issues/browse/MB-6595
        chiyoung Chiyoung Seo made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Duplicate [ 3 ]
        chiyoung Chiyoung Seo made changes -
        Sprint Status Current Sprint
        farshid Farshid Ghods (Inactive) made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            thuan Thuan Nguyen
            Reporter:
            thuan Thuan Nguyen