Couchbase Server / MB-6573

[longevity] rebalance failed due to error "Resetting rebalance status since it's not really running" when there are major page faults on some of the nodes in the cluster

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Affects Version/s: 2.0-beta
    • Fix Version/s: 2.0
    • Component/s: ns_server
    • Security Level: Public
    • Environment:
      CentOS 6.2 64-bit, build 2.0.0-1697

      Description

      Cluster information:

      • 11 CentOS 6.2 64-bit servers with 4-core CPUs
      • Each server has 10 GB RAM and 150 GB disk.
      • 8 GB RAM allocated to Couchbase Server on each node (80% of total system memory)
      • ext3 filesystem on both the data and root partitions
      • Each server has its own drive; no disk sharing with other servers.
      • Load 7 million items into both buckets
      • Cluster has 2 buckets: default (3 GB) and saslbucket (3 GB)
      • Each bucket has one design doc with 2 views (default: d1, saslbucket: d11)
      • Maintain a load of about 10K ops and query the views on both design docs

      10.3.121.13
      10.3.121.14
      10.3.121.15
      10.3.121.16
      10.3.121.17
      10.3.121.20
      10.3.121.22
      10.3.121.24
      10.3.121.25
      10.3.121.23

      Create a cluster with 10 nodes.
      Do a swap rebalance: add node 26 and remove node 25.
      Before and during the rebalance, the cluster does not go into swap.

      Rebalance failed with error "Resetting rebalance status since it's not really running"

      Link to diags of all nodes https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/11ndoes-1697-reb-failed-reset-reb-20120907.tgz

      Link to atop of all nodes https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-11nodes-1697-reb-failed-reset-reb-20120907.tgz


        Activity

        thuan Thuan Nguyen added a comment -

        [couchdb:info,2012-09-07T18:46:04.076,ns_1@10.3.121.13:<0.13536.5>:couch_log:info:39]Native compactor output: Compacted /data/saslbucket/51.couch.1 -> /data/saslbucket/51.couch.1.compact
        [couchdb:info,2012-09-07T18:46:04.086,ns_1@10.3.121.13:<0.13536.5>:couch_log:info:39]Native initial compact succeeded for "saslbucket/51"
        [couchdb:info,2012-09-07T18:46:04.095,ns_1@10.3.121.13:<0.8901.0>:couch_log:info:39]CouchDB swapping files /data/saslbucket/51.couch.2 and /data/saslbucket/51.couch.1.compact.
        [couchdb:info,2012-09-07T18:46:04.104,ns_1@10.3.121.13:<0.8901.0>:couch_log:info:39]Compaction for db "saslbucket/51" completed.
        [ns_server:info,2012-09-07T18:46:04.105,ns_1@10.3.121.13:<0.13562.5>:compaction_daemon:spawn_vbucket_compactor:644]Compacting <<"saslbucket/52">>
        [couchdb:info,2012-09-07T18:46:04.106,ns_1@10.3.121.13:<0.8894.0>:couch_log:info:39]Starting compaction for db "saslbucket/52"
        [couchdb:info,2012-09-07T18:46:04.522,ns_1@10.3.121.13:<0.7547.5>:couch_log:info:39]10.3.121.15 - - POST /_view_merge/?stale=ok&limit=10 200
        [ns_server:debug,2012-09-07T18:46:04.668,ns_1@10.3.121.13:<0.13524.5>:ns_pubsub:do_subscribe_link:149]Deleting {ns_stats_event,<0.29002.1>} event handler: ok
        [ns_server:debug,2012-09-07T18:46:04.728,ns_1@10.3.121.13:<0.13566.5>:ns_pubsub:do_subscribe_link:120]Started subscription {ns_stats_event,<0.29002.1>}

        [couchdb:info,2012-09-07T18:46:04.855,ns_1@10.3.121.13:<0.5824.5>:couch_log:info:39]10.3.121.25 - - POST /_view_merge/?stale=ok&limit=10 200
        [couchdb:info,2012-09-07T18:46:04.868,ns_1@10.3.121.13:<0.13271.5>:couch_log:info:39]Updater checkpointing set view `saslbucket` update for main group `_design/d11`
        [ns_server:debug,2012-09-07T18:46:04.864,ns_1@10.3.121.13:<0.13566.5>:ns_pubsub:do_subscribe_link:149]Deleting {ns_stats_event,<0.29002.1>} event handler: ok
        [ns_server:debug,2012-09-07T18:46:04.904,ns_1@10.3.121.13:<0.13581.5>:ns_pubsub:do_subscribe_link:120]Started subscription {ns_stats_event,<0.29002.1>}

        [couchdb:info,2012-09-07T18:46:04.965,ns_1@10.3.121.13:<0.5824.5>:couch_log:info:39]10.3.121.25 - - POST /_view_merge/?stale=ok&limit=10 200

        [user:info,2012-09-07T18:46:05.012,ns_1@10.3.121.13:ns_config:ns_janitor:maybe_stop_rebalance_status:139]Resetting rebalance status since it's not really running

        [ns_server:debug,2012-09-07T18:46:05.014,ns_1@10.3.121.13:'capi_set_view_manager-saslbucket':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs
        [ns_server:debug,2012-09-07T18:46:05.014,ns_1@10.3.121.13:'capi_set_view_manager-default':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs
        [ns_server:debug,2012-09-07T18:46:05.018,ns_1@10.3.121.13:ns_config_rep:ns_config_rep:do_push_keys:317]Replicating some config keys ([rebalance_status,rebalancer_pid]..)
        [ns_server:debug,2012-09-07T18:46:05.045,ns_1@10.3.121.13:ns_config_log:ns_config_log:log_common:111]config change:
        rebalancer_pid ->
        undefined
        [ns_server:debug,2012-09-07T18:46:05.045,ns_1@10.3.121.13:'capi_set_view_manager-saslbucket':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs
        [ns_server:debug,2012-09-07T18:46:05.046,ns_1@10.3.121.13:'capi_set_view_manager-default':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs
        [ns_server:debug,2012-09-07T18:46:05.064,ns_1@10.3.121.13:ns_config_log:ns_config_log:log_common:111]config change:
        rebalance_status ->
        {none,<<"Rebalance stopped by janitor.">>}

        [ns_server:debug,2012-09-07T18:46:05.064,ns_1@10.3.121.13:'capi_set_view_manager-default':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs
        [ns_server:debug,2012-09-07T18:46:05.064,ns_1@10.3.121.13:'capi_set_view_manager-saslbucket':capi_set_view_manager:handle_info:330]doing replicate_newnodes_docs
        [couchdb:info,2012-09-07T18:46:05.336,ns_1@10.3.121.13:<0.6542.5>:couch_log:info:39]10.3.121.17 - - POST /_view_merge/?stale=ok&limit=10 200
        [stats:error,2012-09-07T18:46:05.424,ns_1@10.3.121.13:<0.5179.2>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.3.121.26']
        [ns_server:debug,2012-09-07T18:46:05.894,ns_1@10.3.121.13:<0.13614.5>:ns_pubsub:do_subscribe_link:120]Started subscription {ns_stats_event,<0.10014.5>}
        [ns_server:debug,2012-09-07T18:46:05.939,ns_1@10.3.121.13:<0.13581.5>:ns_pubsub:do_subscribe_link:149]Deleting {ns_stats_event,<0.29002.1>} event handler: ok
        [couchdb:info,2012-09-07T18:46:06.629,ns_1@10.3.121.13:<0.13563.5>:couch_log:info:39]Native compactor output: Compacted /data/saslbucket/52.couch.1 -> /data/saslbucket/52.couch.1.compact
        [couchdb:info,2012-09-07T18:46:06.665,ns_1@10.3.121.13:<0.13563.5>:couch_log:info:39]Native initial compact succeeded for "saslbucket/52"
        [couchdb:info,2012-09-07T18:46:06.678,ns_1@10.3.121.13:<0.8894.0>:couch_log:info:39]CouchDB swapping files /data/saslbucket/52.couch.2 and /data/saslbucket/52.couch.1.compact.
        [couchdb:info,2012-09-07T18:46:06.981,ns_1@10.3.121.13:<0.8894.0>:couch_log:info:39]Compaction for db "saslbucket/52" completed.
        [couchdb:info,2012-09-07T18:46:07.039,ns_1@10.3.121.13:<0.6542.5>:couch_log:info:39]10.3.121.17 - - POST /_view_merge/?stale=ok&limit=10 200
        [ns_server:info,2012-09-07T18:46:07.040,ns_1@10.3.121.13:<0.13641.5>:compaction_daemon:spawn_vbucket_compactor:644]Compacting <<"saslbucket/53">>
        [couchdb:info,2012-09-07T18:46:07.040,ns_1@10.3.121.13:<0.8887.0>:couch_log:info:39]Starting compaction for db "saslbucket/53"
        [ns_server:info,2012-09-07T18:46:07.435,ns_1@10.3.121.13:<0.13244.5>:ns_orchestrator:handle_info:258]Skipping janitor in state janitor_running: {janitor_state, ["default"], <0.13595.5>}

        [ns_server:warn,2012-09-07T18:46:09.013,ns_1@10.3.121.13:mb_master:mb_master:send_heartbeat_with_peers:473]send heartbeat timed out

        farshid Farshid Ghods (Inactive) added a comment -

        If I understand correctly, the nodes are not in swap, but there are still timeouts when sending heartbeats.

        We have had this issue before, and that bug was marked as won't-fix because the node was in swap. If there is any doubt that this is happening because of slowness in some internal calls, we should raise that timeout by default.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        This is a very interesting case. Indeed, there is no swap use. That, however, doesn't mean there are no major page faults.

        My guess is we're dealing with one of:

        • CPU contention in Erlang (less likely, but we have a way to monitor that)
        • major page faults
        • the virtual machine host 'robbing' our CPU cycles

        Unfortunately, the atop data I have only starts on Sep 8, while the logs end on Sep 7 around 18:40.

        Need more data.
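
        For reference, a minimal sketch of one way to watch for major page faults from inside the Erlang VM itself (a hypothetical helper, Linux-only, and it only covers beam.smp, not memcached or the view engine): read the cumulative majflt counter from /proc/self/stat and diff it over time.

        %% Hypothetical helper (not part of ns_server): returns the cumulative
        %% major page fault count of the local Erlang VM. majflt is field 12 of
        %% /proc/self/stat; beam.smp's comm field contains no spaces, so a plain
        %% split on spaces is good enough here.
        major_faults() ->
            {ok, Bin} = file:read_file("/proc/self/stat"),
            Fields = string:tokens(binary_to_list(Bin), " "),
            list_to_integer(lists:nth(12, Fields)).

        Sampling this periodically and diffing consecutive values shows whether the VM is taking major faults even when the box reports no swap usage; atop's per-process page-fault counters give the same picture from the OS side.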

        karan Karan Kumar (Inactive) added a comment -

        We definitely can get you more data here.

        Can you tell us how to monitor CPU contention in Erlang?

        Karan
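
        A rough sketch of one way to measure that, assuming the Erlang/OTP release running ns_server supports scheduler_wall_time (OTP R15B01 or later; this is an illustration, not ns_server code): sample scheduler wall time twice and compute per-scheduler utilization over the window.

        %% Sketch only; run from a shell attached to the ns_server node.
        %% Requires OTP R15B01+ for scheduler_wall_time support.
        sched_util(Millis) ->
            erlang:system_flag(scheduler_wall_time, true),
            S0 = lists:sort(erlang:statistics(scheduler_wall_time)),
            timer:sleep(Millis),
            S1 = lists:sort(erlang:statistics(scheduler_wall_time)),
            %% Fraction of wall-clock time each scheduler spent doing work.
            [{Id, (A1 - A0) / (T1 - T0)}
             || {{Id, A0, T0}, {Id, A1, T1}} <- lists:zip(S0, S1)].

        Utilization near 1.0 on all schedulers while operation latencies climb would point at CPU contention inside the VM; low Erlang-side utilization combined with the heartbeat timeouts would point back at the host (page faults or a noisy VM neighbour).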

        farshid Farshid Ghods (Inactive) added a comment -

        Karan,

        please work with engineering and Tony to understand what exactly is required to investigate these issues.

        Tony has spent days and hours running these tests, and it is still unknown to us where and why this happens.

        I would like us to narrow this down to specific calls or interactions between components in the system.

        farshid Farshid Ghods (Inactive) added a comment -

        Thuan,

        please revise the spec according to the recommendations Aliaksey and Karan will pass to you, so that we reduce the chance of hitting this condition for now.

        thuan Thuan Nguyen added a comment -

        Link to diags of all nodes on Sept 8th 2012: https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/11nodes-1697-rebalance-failed-bulk_set_vbucket_state_failed-20120908.tgz
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        The atop data starts at Sep 8 01:31:58, while the last message indicating memcached slowness is at:

        696802:[ns_server:error,2012-09-08T1:24:43.330,ns_1@10.3.121.13:<0.6381.0>:ns_memcached:verify_report_long_call:274]call {stats,<<>>} took too long: 720860 us

        I'd like you guys to keep atop data for longer periods of time.

        thuan Thuan Nguyen added a comment -

        Here is the link to the atop file for node 13 (around 1.4 GB in size): https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-node13
        This atop file is also stored on the local VM in the /tmp directory.

        The atop files of all the other nodes are in their /tmp directories.

        All of these atop captures started at:
        drwx------ 2 root root 4.0K Sep 8 01:21 atop.d

        farshid Farshid Ghods (Inactive) added a comment -

        Karan,

        should this be assigned to Aliaksey now?

        karan Karan Kumar (Inactive) added a comment -

        Unable to reproduce on build 1781.


          People

          • Assignee: karan Karan Kumar (Inactive)
          • Reporter: thuan Thuan Nguyen
          • Votes: 0
          • Watchers: 0


              Gerrit Reviews

              There are no open Gerrit changes