Couchbase Server / MB-44682

[Ephemeral] Rebalance with warming up node failed with reason "pre_rebalance_janitor_run_failed"


Details

    Description

      Build: 7.0.0-4554

      Scenario:

      • 7 node cluster, Ephemeral bucket with replicas=2
      • Load 25K initial docs into the bucket
      • Restart couchbase-server on one of the nodes and start a rebalance while the node is warming up

      Observation:

      Rebalance fails with reason "pre_rebalance_janitor_run_failed" for node 172.23.123.102
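
      For reference, a minimal sketch of driving the same scenario directly against ns_server's REST API (this is not the TAF test below; the host addresses, credentials and SSH access are placeholder assumptions):

      # Sketch: restart the data service on one node, then start a
      # rebalance-out while that node is still warming up.
      import subprocess
      import requests

      ORCH = "http://172.23.106.228:8091"       # orchestrator node (assumed)
      AUTH = ("Administrator", "password")      # placeholder credentials
      WARMING_NODE = "172.23.123.102"           # node whose service is restarted

      # 1. Restart couchbase-server on the target node (assumes SSH access).
      subprocess.run(["ssh", f"root@{WARMING_NODE}",
                      "systemctl restart couchbase-server"], check=True)

      # 2. Fetch otpNode names (ns_1@<ip>) needed by the rebalance endpoint.
      nodes = requests.get(f"{ORCH}/pools/default", auth=AUTH).json()["nodes"]
      otp = {n["hostname"].split(":")[0]: n["otpNode"] for n in nodes}

      # 3. Start a rebalance that ejects five nodes and keeps the orchestrator
      #    plus the still-warming-up node (mirrors nodes_out=5 in the test).
      keep = {"172.23.106.228", WARMING_NODE}
      resp = requests.post(
          f"{ORCH}/controller/rebalance", auth=AUTH,
          data={"knownNodes": ",".join(otp.values()),
                "ejectedNodes": ",".join(v for k, v in otp.items() if k not in keep)})
      # The rebalance runs asynchronously; in this ticket it exits with
      # pre_rebalance_janitor_run_failed.
      print(resp.status_code)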

      Test case:

       

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i /tmp/testexec.29663.ini sdk_retries=10,num_items=250000,GROUP=P1;durability,EXCLUDE_GROUP=not_for_ephemeral,durability=MAJORITY,bucket_type=ephemeral,get-cbcollect-info=True,upgrade_version=7.0.0-4554 -t rebalance_new.rebalance_out.RebalanceOutTests.rebalance_out_with_warming_up,value_size=1024,bucket_type=ephemeral,upgrade_version=7.0.0-4554,sdk_retries=10,GROUP=P1;durability,nodes_out=5,EXCLUDE_GROUP=not_for_ephemeral,max_verify=100000,get-cbcollect-info=False,replicas=2,durability=MAJORITY,log_level=error,nodes_init=7,num_items=250000,infra_log_level=critical'
      

      Jenkins log: http://qa.sc.couchbase.com/job/test_suite_executor-TAF/96698/console

       


        Activity

          jwalker Jim Walker added a comment -

          Can you update with cbcollects?

          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          Cbcollect files are available here: http://cb-logs-qe.s3-website-us-west-2.amazonaws.com/7.0.0-4554/jenkins_logs/test_suite_executor-TAF/96698/archive.zip
          owend Daniel Owen added a comment -

          Raising to critical to avoid being moved to CC.next

          paolo.cocchi Paolo Cocchi added a comment - - edited

          Orchestrator on node .228 detects ".102 down":

          [user:error,2021-03-01T10:42:37.211-08:00,ns_1@172.23.106.228:<0.4604.0>:ns_orchestrator:log_rebalance_completion:1407]Rebalance exited with reason {pre_rebalance_janitor_run_failed,"
          default",
                                           {error,unsafe_nodes,['ns_1@172.23.123.102']}}.
          Rebalance Operation Id = 8a3dc656fb73df01a78d3f9678c0c550
          ..
          [ns_server:debug,2021-03-01T10:42:37.891-08:00,ns_1@172.23.106.228:<0.4606.0>:auto_failover:log_down_nodes_reason:361]Node 'ns_1@172.23.123.102' is considered down. Reason:"The data service did not respond for the duration of the auto-failover threshold. Either none of the buckets have warmed up or there is an issue with the data service. "
          

          On node .102:

          2021-03-01T10:40:26.396127-08:00 INFO 47: Create bucket [default]
          ..
          2021-03-01T10:42:08.708872-08:00 INFO Bucket default deleted
          ..
          2021-03-01T10:42:08.793465-08:00 INFO ---------- Closing logfile
          2021-03-01T10:42:36.247414-08:00 INFO ---------- Opening logfile: /opt/couchbase/var/lib/couchbase/logs/memcached.log.000000.txt
          2021-03-01T10:42:36.249334-08:00 INFO Couchbase version 7.0.0-4554 starting.
          ..
          2021-03-01T10:42:37.102231-08:00 INFO 47: Create bucket [default]
          ..
          2021-03-01T10:42:37.109882-08:00 INFO 47: Bucket [default] created successfully
          ..
          2021-03-01T10:42:37.601677-08:00 INFO 56: HELO [regular] [ {"ip":"127.0.0.1","port":36468} - {"ip":"127.0.0.1","port":11209} (<ud>@ns_server</ud>) ]
          2021-03-01T10:42:41.527482-08:00 INFO 55: [ {"ip":"127.0.0.1","port":40296} - {"ip":"127.0.0.1","port":11209} (<ud>@ns_server</ud>) ] Updated cluster configuration for bucket [default]. New revision: 137
          2021-03-01T10:42:41.527827-08:00 INFO Pushing new cluster config for bucket:[default] revision:[137]
          2021-03-01T10:42:41.608683-08:00 INFO (default) VBucket: created vb:1023 with state:replica initialState:dead lastSeqno:0 persistedRange:{0,0} max_cas:0 uuid:132703071848157 topology:null
          2021-03-01T10:42:41.608724-08:00 INFO (default) VBucket::setState: transitioning vb:1023 with high seqno:0 from:replica to:replica
          ..
          <long sequence of VB create>
          ..
          2021-03-01T10:42:41.709590-08:00 INFO 57: (default) DCP (Consumer) eq_dcpq:replication:ns_1@172.23.105.152->ns_1@172.23.123.102:default - (vb:18) Attempting to add stream: opaque_:1, start_seqno_:0, end_seqno_:18446744073709551615, vb_uuid:25965989700716, snap_start_seqno_:0, snap_end_seqno_:0, last_seqno:0, stream_req_value:{"uid":"0"}
          ..
          <long sequence of AddStream>
          ..
          2021-03-01T10:42:42.267257-08:00 INFO (default) EventuallyPersistentEngine::enableTraffic() result true
          

          So everything seems to proceed fine from the memcached PoV on node .102 after the restart.

          paolo.cocchi Paolo Cocchi added a comment -

          Hi Ashwin Govindarajulu, we should get ns_server's eyes on this one.

          As per the previous comment, the KV logs don't show anything problematic, and ns_server also seems to proceed successfully with bucket re-creation after the memcached restart.

          I'm not sure how exactly ns_server determines "The data service did not respond for the duration of the auto-failover threshold". What kind of connection to memcached does it use? What kind of request/response does it use?

          Just before the "pre_rebalance_janitor_run_failed" I see this in ns_server.debug.log:

          [ns_server:debug,2021-03-01T10:42:37.191-08:00,ns_1@172.23.106.228:<0.18609.0>:ns_janitor:maybe_config_sync:244]Found states mismatch in bucket "default":
          [{18,
            ['ns_1@172.23.105.152','ns_1@172.23.106.228','ns_1@172.23.123.102'],
            [{'ns_1@172.23.105.152',active,[]},{'ns_1@172.23.106.228',replica,[]}]},
           {19,
            ['ns_1@172.23.105.152','ns_1@172.23.106.228','ns_1@172.23.123.102'],
            [{'ns_1@172.23.105.152',active,[]},{'ns_1@172.23.106.228',replica,[]}]},
           {20,
          ..
          

          Could that relate to the failure?

          Could it be possible that auto-failover (with its state checks) is just clashing with a manual/clean shutdown?
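
          A rough sketch of the kind of comparison the "Found states mismatch" message implies: ns_server's intended vbucket map versus the states memcached on each node actually reports. The real check lives in ns_server's janitor (Erlang); the cbstats invocation, credentials and addresses below are assumptions for illustration only:

          # Compare ns_server's vBucketServerMap with memcached's reported states.
          import subprocess
          import requests

          ORCH = "http://172.23.106.228:8091"
          AUTH = ("Administrator", "password")     # placeholder credentials

          bucket = requests.get(f"{ORCH}/pools/default/buckets/default", auth=AUTH).json()
          servers = bucket["vBucketServerMap"]["serverList"]   # "ip:11210" entries
          vb_map = bucket["vBucketServerMap"]["vBucketMap"]    # per-vb chain of server indexes

          def memcached_states(host):
              # 'cbstats <host> vbucket' prints lines like " vb_18: replica"
              out = subprocess.run(
                  ["/opt/couchbase/bin/cbstats", host, "-b", "default",
                   "-u", AUTH[0], "-p", AUTH[1], "vbucket"],
                  capture_output=True, text=True, check=True).stdout
              states = {}
              for line in out.splitlines():
                  key, _, state = line.partition(":")
                  states[key.strip()] = state.strip()
              return states

          # Report vbuckets where the map says "active here" but memcached disagrees.
          for idx, host in enumerate(servers):
              reported = memcached_states(host)
              for vb, chain in enumerate(vb_map):
                  got = reported.get(f"vb_{vb}")
                  if chain and chain[0] == idx and got != "active":
                      print(f"{host}: vb:{vb} expected active, memcached reports {got}")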

          dfinlay Dave Finlay added a comment - - edited

          I think this is completely normal and correct behavior: https://docs.couchbase.com/server/current/cli/cbcli/couchbase-cli-setting-autoreprovision.html.

          "Auto-reprovisioning" can happen for ephemeral buckets when memcached restarts - and it happened here. If memcached restarts it will lose all data for active vbuckets in the ephemeral bucket on that node that restarted. If the janitor onlines the bucket on that node, the replicas will connect to it and immediately drop all their data.

          So the janitor checks whether the memcached vbuckets are in the "missing" state; if they are and the bucket is ephemeral, ns_server won't online the vbuckets on the restarted node, but will instead activate replica vbuckets elsewhere in the cluster. A rebalance is required to get things back to normal.
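
          A hypothetical Python sketch of that decision (the real logic lives in ns_server's Erlang janitor code; the function name and data shapes here are illustrative only):

          # If every vbucket of an ephemeral bucket on a node comes back "missing"
          # (memcached restarted), onlining that node would make the surviving
          # replicas resync from an empty active and drop their data, so the node
          # is treated as unsafe for rebalance.
          def unsafe_nodes(bucket_type, per_node_vb_states):
              if bucket_type != "ephemeral":
                  return []    # persistent buckets warm up from disk instead
              return [node for node, states in per_node_vb_states.items()
                      if states and all(s == "missing" for s in states.values())]

          # Example mirroring this ticket: .102 restarted, so its vbuckets are missing.
          states = {
              "ns_1@172.23.123.102": {"vb_18": "missing", "vb_19": "missing"},
              "ns_1@172.23.105.152": {"vb_18": "active", "vb_19": "active"},
          }
          assert unsafe_nodes("ephemeral", states) == ["ns_1@172.23.123.102"]
          # -> rebalance exits with {error, unsafe_nodes, ['ns_1@172.23.123.102']}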

          In this case, .102 goes down:

          2021-03-01T10:42:08.868-08:00, ns_node_disco:5:warning:node down(ns_1@172.23.105.152) - Node 'ns_1@172.23.105.152' saw that node 'ns_1@172.23.123.102' went down. Details: [{nodedown_reason,
                                                                                               connection_closed}]
          2021-03-01T10:42:08.869-08:00, ns_node_disco:5:warning:node down(ns_1@172.23.98.215) - Node 'ns_1@172.23.98.215' saw that node 'ns_1@172.23.123.102' went down. Details: [{nodedown_reason,
                                                                                              connection_closed}]
          2021-03-01T10:42:08.869-08:00, ns_node_disco:5:warning:node down(ns_1@172.23.107.52) - Node 'ns_1@172.23.107.52' saw that node 'ns_1@172.23.123.102' went down. Details: [{nodedown_reason,
                                                                                              connection_closed}]
          ...
          

          The rebalance starts:

          2021-03-01T10:42:30.361-08:00, ns_orchestrator:0:info:message(ns_1@172.23.106.228) - Starting rebalance, KeepNodes = ['ns_1@172.23.106.228','ns_1@172.23.123.102'], EjectNodes = ['ns_1@172.23.123.100',
                                                                                                       'ns_1@172.23.123.101',
                                                                                                       'ns_1@172.23.107.52',
                                                                                                       'ns_1@172.23.105.152',
                                                                                                       'ns_1@172.23.98.215'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 8a3dc656fb73df01a78d3f9678c0c550
          

          .102 comes back up:

          2021-03-01T10:42:35.883-08:00, ns_node_disco:4:info:node up(ns_1@172.23.105.152) - Node 'ns_1@172.23.105.152' saw that node 'ns_1@172.23.123.102' came up. Tags: []
          2021-03-01T10:42:35.886-08:00, ns_node_disco:4:info:node up(ns_1@172.23.123.102) - Node 'ns_1@172.23.123.102' saw that node 'ns_1@172.23.105.152' came up. Tags: []
          2021-03-01T10:42:35.900-08:00, ns_node_disco:4:info:node up(ns_1@172.23.106.228) - Node 'ns_1@172.23.106.228' saw that node 'ns_1@172.23.123.102' came up. Tags: []
          2021-03-01T10:42:35.906-08:00, ns_node_disco:4:info:node up(ns_1@172.23.123.102) - Node 'ns_1@172.23.123.102' saw that node 'ns_1@172.23.106.228' came up. Tags: []
          ...
          

          Pre-rebalance, ns_server checks whether we'd lose data in ephemeral buckets by onlining a node that had been restarted - and in this case data would be lost. So the rebalance stops, correctly:

          2021-03-01T10:42:37.211-08:00, ns_orchestrator:0:critical:message(ns_1@172.23.106.228) - Rebalance exited with reason {pre_rebalance_janitor_run_failed,"default",
                                           {error,unsafe_nodes,['ns_1@172.23.123.102']}}.
          

          Subsequently the janitor runs and auto-reprovisions the vbuckets in the ephemeral bucket on .102:

          2021-03-01T10:42:41.487-08:00, auto_reprovision:0:info:message(ns_1@172.23.106.228) - Bucket "default" has been reprovisioned on following nodes: ['ns_1@172.23.123.102']. Nodes on which the data service restarted: ['ns_1@172.23.123.102'].
          2021-03-01T10:42:41.521-08:00, auto_reprovision:0:info:message(ns_1@172.23.106.228) - auto-reprovision is disabled as maximum number of nodes (1) that can be auto-reprovisioned has been reached.
          

          I think this should be resolved as: feature is working correctly.
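
          For reference, the "maximum number of nodes (1)" in that last log line is the configurable auto-reprovision limit from the documentation linked above. A small sketch of raising it with couchbase-cli (cluster address and credentials are placeholders):

          # Allow up to 2 restarted nodes to be auto-reprovisioned.
          import subprocess

          subprocess.run([
              "/opt/couchbase/bin/couchbase-cli", "setting-autoreprovision",
              "-c", "http://172.23.106.228:8091",
              "-u", "Administrator", "-p", "password",
              "--enabled", "1",
              "--max-nodes", "2",
          ], check=True)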


          mihir.kamdar Mihir Kamdar (Inactive) added a comment -

          Mass closing all non-fixed bugs

            People

              Assignee: dfinlay Dave Finlay
              Reporter: ashwin.govindarajulu Ashwin Govindarajulu
              Votes: 0
              Watchers: 10
