Couchbase Server / MB-7490

impossible to rebalance cluster when one node was failed over before offline upgrade 2.0.0->2.0.1 (cluster is broken per the UI)

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major
    • Resolution: Won't Fix
    • Affects Version/s: 2.0, 2.0.1
    • Fix Version/s: None
    • Component/s: installer, ns_server
    • Security Level: Public
    • Labels:
      None

      Description

      test: newupgradetests.MultiNodesUpgradeTests.offline_cluster_upgrade,initial_version=2.0.0-1978-rel,nodes_init=2,during-ops=failover,upgrade_version=2.0.1-112-rel,initial_vbuckets=64

      steps:
1. 2.0.0 release cluster with 2 nodes: 10.3.121.112 & 10.3.121.113 (2.0.0-1978-rel)
      2. failover 10.3.121.112
      3. stop both nodes and upgrade them to 2.0.1-112

      result:
      for some reason node 10.3.121.113 does not start
      the Active Servers and Pending Rebalance panels are reversed or show nonsense (see screenshots)

      I tried failover, add back, rebalance, starting 10.3.121.113 manually, etc., but it did not help

      server logs, test logs, and some screenshots are attached

      1. 10.3.121.112-8091-diag.txt.gz
        2.72 MB
        Andrei Baranouski
      2. 10.3.121.113-8091-diag.txt.gz
        254 kB
        Andrei Baranouski
      3. test_logs.txt
        69 kB
        Andrei Baranouski
      1. add_back1.png
        51 kB
      2. add_back2.png
        52 kB
      3. failo0ver_13.png
        58 kB
      4. restart_13_man.png
        65 kB
      5. Screenshot from 2013-01-04 13-21-20.png
        242 kB
      6. Screenshot from 2013-01-04 13-28-54.png
        52 kB
      7. Screenshot from 2013-01-04 13-31-45.png
        81 kB
      8. Screenshot from 2013-01-04 13-39-34.png
        44 kB
      9. step4.png
        163 kB
      10. step5.png
        102 kB

        Activity

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Something prevented .113 from starting up:

        [error_logger:error,2013-01-04T2:09:45.459,nonode@nohost:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
        =========================CRASH REPORT=========================
        crasher:
        initial call: couch_server:init/1
        pid: <0.217.0>
        registered_name: []
        exception exit: {undef,[{file2,ensure_dir,["/tmp/.delete/foo"]},
                                {couch_file,init_delete_dir,1},
                                {couch_server,init,1},
                                {gen_server,init_it,6},
                                {proc_lib,init_p_do_apply,3}]}
        in function gen_server:init_it/6
        ancestors: [couch_primary_services,couch_server_sup,cb_couch_sup,
        ns_server_cluster_sup,<0.59.0>]
        messages: []
        links: [<0.212.0>]
        dictionary: []
        trap_exit: false
        status: running
        heap_size: 377
        stack_size: 24
        reductions: 186
        neighbours:

        Looks like a missing file, because that function actually exists and didn't change between 2.0.0 and 2.0.1.
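
        The `{undef, ...}` exit is Erlang's call-time error for a function whose module cannot be loaded: the call site is fine, but the VM fails to resolve `file2:ensure_dir/1` when it runs, which points at a missing or unshipped .beam file rather than a source-level bug. A rough analogy in Python (module and function names here are purely illustrative, not part of the product):

```python
import importlib

def safe_call(module_name, func_name, *args):
    """Call module.func(*args); report 'undef' when the module or
    function cannot be resolved at call time (like Erlang's undef)."""
    try:
        mod = importlib.import_module(module_name)
        return getattr(mod, func_name)(*args)
    except (ImportError, AttributeError):
        return ("undef", module_name, func_name)

# math.sqrt is resolvable, so the call goes through normally
print(safe_call("math", "sqrt", 9.0))
# "file2" is not importable here, mirroring the crash above
print(safe_call("file2", "ensure_dir", "/tmp/.delete/foo"))
# → ('undef', 'file2', 'ensure_dir')
```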

        I'd need another reproduction with collect_info to see what's going on; collect_info will give me the list of files.

        The second problem is that somehow the UI allowed you to failover .113 even though it was the last remaining active node in the cluster. I'll try to reproduce and will file a separate bug.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        See above

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Indeed there's an issue with incorrectly allowing failover in that case. Filed: MB-7493

        andreibaranouski Andrei Baranouski added a comment -

        can't reproduce it now

        andreibaranouski Andrei Baranouski added a comment -

        steps:
        1. cluster with 2 nodes: 10.3.121.112 and 10.3.121.114
        2. failover node 10.3.121.114
        3. stop both nodes
        4. start only node 10.3.121.114 (step4.png)
        5. on the UI console of 10.3.121.114, add it back (step5.png)

        new screenshots and collect_info are attached

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Not sure exactly what you expect. Rebalancing requires both nodes to be up.

        andreibaranouski Andrei Baranouski added a comment -

        Alk, after these steps I want to add back the node that was failed over in step 2.
        I know that it should be cleaned, but if node 10.3.121.112 disappeared without a trace, we cannot revive the second one and the only solution is to reinstall it

        andreibaranouski Andrei Baranouski added a comment -

        Need confirmation that we will not handle this.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        We will not handle it.

        If a node was reinstalled, its state is lost.


          People

          • Assignee:
            alkondratenko Aleksey Kondratenko (Inactive)
            Reporter:
            andreibaranouski Andrei Baranouski
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes