Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-5258

Rebalance failure when Online upgrading from 180 to 181 (Rebalance exited with reason {wait_for_memcached_failed,"bucket-1",)

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Cannot Reproduce
    • Affects Version/s: 1.8.1-release-candidate
    • Fix Version/s: 1.8.1
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None
    • Environment:
      Ubuntu 64 bit

      Description

      Test case that failed
      upgradetests.MultipleNodeUpgradeTests.test_upgrade,initial_version=1.8.0r-55,create_buckets=True,insert_data=True,start_upgraded_first=True,load_ratio=10,online_upgrade=True

      [ns_server:info] [2012-05-07 22:01:20] [ns_1@10.3.121.92:<0.18886.1>:diag_handler:log_all_tap_and_checkpoint_stats:123] logging tap & checkpoint stats
      [ns_server:debug] [2012-05-07 22:01:20] [ns_1@10.3.121.92:ns_config_log:ns_config_log:log_common:111] config change:
      counters ->
      [

      {rebalance_fail,1}

      ,

      {rebalance_start,5}

      ,

      {rebalance_success,4}

      ]
      [error_logger:error] [2012-05-07 22:01:20] [ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.30188.0> on node 'ns_1@10.3.121.93' with exit value: {undef,[

      {diag_handler,log_all_tap_and_checkpoint_stats,[]}]}


      [error_logger:error] [2012-05-07 22:01:20] [ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.29943.0> on node 'ns_1@10.3.121.95' with exit value: {undef,[{diag_handler,log_all_tap_and_checkpoint_stats,[]}

      ]}

      [error_logger:error] [2012-05-07 22:01:20] [ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.29932.0> on node 'ns_1@10.3.121.98' with exit value: {undef,[

      {diag_handler,log_all_tap_and_checkpoint_stats,[]}]}


      [error_logger:error] [2012-05-07 22:01:20] [ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.30010.0> on node 'ns_1@10.3.121.97' with exit value: {undef,[{diag_handler,log_all_tap_and_checkpoint_stats,[]}

      ]}

      [ns_server:debug] [2012-05-07 22:01:20] [ns_1@10.3.121.92:ns_config_events:ns_node_disco_conf_events:handle_event:56] ns_node_disco_conf_events config all
      [ns_server:debug] [2012-05-07 22:01:20] [ns_1@10.3.121.92:ns_config_rep:ns_config_rep:handle_info:201] Pushing config
      [ns_server:debug] [2012-05-07 22:01:20] [ns_1@10.3.121.92:ns_config_rep:ns_config_rep:handle_info:203] Pushing config done
      [ns_server:debug] [2012-05-07 22:01:20] [ns_1@10.3.121.92:ns_config_log:ns_config_log:log_common:111] config change:
      rebalancer_pid ->
      undefined
      [ns_server:debug] [2012-05-07 22:01:20] [ns_1@10.3.121.92:ns_config_log:ns_config_log:log_common:111] config change:
      rebalance_status ->

      {none,<<"Rebalance failed. See logs for detailed reason. You can try rebalance again.">>}

      [ns_server:debug] [2012-05-07 22:01:20] [ns_1@10.3.121.92:ns_config_events:ns_node_disco_conf_events:handle_event:56] ns_node_disco_conf_events config all

        Activity

        Hide
        karan Karan Kumar (Inactive) added a comment -

        There are other very suspicous errors as well

        Log .94
        [error_logger:error] [2012-05-07 22:01:35] [ns_1@10.3.121.94:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.995.0> on node 'ns_1@10.3.121.94' with exit value: badmatch,{error,shutdown,[

        {single_bucket_sup,'-start_link/1-fun-0-',2}

        ]}

        Log .92
        [

        {rebalance_fail,1}

        ,

        {rebalance_start,5}

        ,

        {rebalance_success,4}

        ]
        [error_logger:error] [2012-05-07 22:01:20] [ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.30188.0> on node 'ns_1@10.3.121.93' with exit value: {undef,[

        {diag_handler,log_all_tap_and_checkpoint_stats,[]}]}

        [error_logger:error] [2012-05-07 22:01:20] [ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.29943.0> on node 'ns_1@10.3.121.95' with exit value: {undef,[{diag_handler,log_all_tap_and_checkpoint_stats,[]}

        ]}

        [error_logger:error] [2012-05-07 22:01:20] [ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.29932.0> on node 'ns_1@10.3.121.98' with exit value: {undef,[

        {diag_handler,log_all_tap_and_checkpoint_stats,[]}]}

        [error_logger:error] [2012-05-07 22:01:20] [ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.30010.0> on node 'ns_1@10.3.121.97' with exit value: {undef,[{diag_handler,log_all_tap_and_checkpoint_stats,[]}

        ]}

        Show
        karan Karan Kumar (Inactive) added a comment - There are other very suspicous errors as well Log .94 [error_logger:error] [2012-05-07 22:01:35] [ns_1@10.3.121.94:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.995.0> on node 'ns_1@10.3.121.94' with exit value: badmatch,{error,shutdown ,[ {single_bucket_sup,'-start_link/1-fun-0-',2} ]} Log .92 [ {rebalance_fail,1} , {rebalance_start,5} , {rebalance_success,4} ] [error_logger:error] [2012-05-07 22:01:20] [ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.30188.0> on node 'ns_1@10.3.121.93' with exit value: {undef,[ {diag_handler,log_all_tap_and_checkpoint_stats,[]}]} [error_logger:error] [2012-05-07 22:01:20] [ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.29943.0> on node 'ns_1@10.3.121.95' with exit value: {undef,[{diag_handler,log_all_tap_and_checkpoint_stats,[]} ]} [error_logger:error] [2012-05-07 22:01:20] [ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.29932.0> on node 'ns_1@10.3.121.98' with exit value: {undef,[ {diag_handler,log_all_tap_and_checkpoint_stats,[]}]} [error_logger:error] [2012-05-07 22:01:20] [ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_msg:76] Error in process <0.30010.0> on node 'ns_1@10.3.121.97' with exit value: {undef,[{diag_handler,log_all_tap_and_checkpoint_stats,[]} ]}
        Hide
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Those failures are harmless. Older nodes do not have new stats logging calls. We don't care.

        Node .94 have problems starting memcached:

        [user:info] [2012-05-07 22:01:40] [ns_1@10.3.121.94:<0.1069.0>:ns_port_server:handle_info:102] Port server memcached on node 'ns_1@10.3.121.94' exited with status 71. Restarting. Messages: failed to listen on TCP port 11211: Success
        [ns_server:info] [2012-05-07 22:01:40] [ns_1@10.3.121.94:<0.1069.0>:ns_port_server:log:161] memcached<0.1069.0>: failed to listen on TCP port 11211: Success

        [ns_server:info] [2012-05-07 22:01:40] [ns_1@10.3.121.94:ns_log:ns_log:handle_cast:115] suppressing duplicate log ns_port_server:0("Port server memcached on node 'ns_1@10.3.121.94' exited with status 71. Restarting. Messages: failed to listen on TCP port 11211: Success") because it's been seen 2 times in the past 10.045148 secs (last seen 5.024509 secs ago
        [ns_server:info] [2012-05-07 22:01:40] [ns_1@10.3.121.94:<0.1068.0>:supervisor_cushion:handle_info:61] Cushion managed supervisor for memcached failed:

        {abnormal,71}

        [user:warn] [2012-05-07 22:01:40] [ns_1@10.3.121.94:<0.1068.0>:supervisor_cushion:die_slowly:81] Service memcached exited on node 'ns_1@10.3.121.94' in 0.02s

        Apparently memcached (or maybe moxi) is already running. I'll check if it's memcached configuration problem.

        Show
        alkondratenko Aleksey Kondratenko (Inactive) added a comment - Those failures are harmless. Older nodes do not have new stats logging calls. We don't care. Node .94 have problems starting memcached: [user:info] [2012-05-07 22:01:40] [ns_1@10.3.121.94:<0.1069.0>:ns_port_server:handle_info:102] Port server memcached on node 'ns_1@10.3.121.94' exited with status 71. Restarting. Messages: failed to listen on TCP port 11211: Success [ns_server:info] [2012-05-07 22:01:40] [ns_1@10.3.121.94:<0.1069.0>:ns_port_server:log:161] memcached<0.1069.0>: failed to listen on TCP port 11211: Success [ns_server:info] [2012-05-07 22:01:40] [ns_1@10.3.121.94:ns_log:ns_log:handle_cast:115] suppressing duplicate log ns_port_server:0("Port server memcached on node 'ns_1@10.3.121.94' exited with status 71. Restarting. Messages: failed to listen on TCP port 11211: Success") because it's been seen 2 times in the past 10.045148 secs (last seen 5.024509 secs ago [ns_server:info] [2012-05-07 22:01:40] [ns_1@10.3.121.94:<0.1068.0>:supervisor_cushion:handle_info:61] Cushion managed supervisor for memcached failed: {abnormal,71} [user:warn] [2012-05-07 22:01:40] [ns_1@10.3.121.94:<0.1068.0>:supervisor_cushion:die_slowly:81] Service memcached exited on node 'ns_1@10.3.121.94' in 0.02s Apparently memcached (or maybe moxi) is already running. I'll check if it's memcached configuration problem.
        Hide
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Apparently memcached is trying to listen on 11211. While it's clearly asked for something else:

        [error_logger:info] [2012-05-07 21:59:59] [ns_1@10.3.121.94:error_logger:ale_error_logger_handler:log_report:72]
        =========================PROGRESS REPORT=========================
        supervisor:

        {local,ns_port_sup}

        started: [

        {pid,<0.284.0>}

        ,
        {name,
        {memcached,"/opt/couchbase/bin/memcached",
        ["-X",
        "/opt/couchbase/lib/memcached/stdin_term_handler.so",
        "-l","0.0.0.0:11210,0.0.0.0:11209:1000","-E",
        "/opt/couchbase/lib/memcached/bucket_engine.so",
        "-B","binary","-r","-c","10000","-e",
        "admin=_admin;default_bucket_name=default;auto_create=false",
        []],
        [{env,
        [

        {"EVENT_NOSELECT","1"},
        {"MEMCACHED_TOP_KEYS","100"},
        {"ISASL_PWFILE", "/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
        {"ISASL_DB_CHECK_TIME","1"}]},
        use_stdio,stderr_to_stdout,exit_status,
        port_server_send_eol,stream]}},
        {mfargs,
        {supervisor_cushion,start_link,
        [memcached,5000,ns_port_server,start_link,
        [memcached,"/opt/couchbase/bin/memcached",
        ["-X",
        "/opt/couchbase/lib/memcached/stdin_term_handler.so",
        "-l","0.0.0.0:11210,0.0.0.0:11209:1000",
        "-E",
        "/opt/couchbase/lib/memcached/bucket_engine.so",
        "-B","binary","-r","-c","10000","-e",
        "admin=_admin;default_bucket_name=default;auto_create=false",
        []],
        [{env,
        [{"EVENT_NOSELECT","1"}

        ,

        {"MEMCACHED_TOP_KEYS","100"}

        ,

        {"ISASL_PWFILE", "/opt/couchbase/var/lib/couchbase/data/isasl.pw"}

        ,

        {"ISASL_DB_CHECK_TIME","1"}

        ]},
        use_stdio,stderr_to_stdout,exit_status,
        port_server_send_eol,stream]]]}},

        {restart_type,permanent}

        ,

        {shutdown,10000}

        ,

        {child_type,worker}

        ]

        is it possible that memcached binary is somehow old (i.e. 1.8.0 one)

        Show
        alkondratenko Aleksey Kondratenko (Inactive) added a comment - Apparently memcached is trying to listen on 11211. While it's clearly asked for something else: [error_logger:info] [2012-05-07 21:59:59] [ns_1@10.3.121.94:error_logger:ale_error_logger_handler:log_report:72] =========================PROGRESS REPORT========================= supervisor: {local,ns_port_sup} started: [ {pid,<0.284.0>} , {name, {memcached,"/opt/couchbase/bin/memcached", ["-X", "/opt/couchbase/lib/memcached/stdin_term_handler.so", "-l","0.0.0.0:11210,0.0.0.0:11209:1000","-E", "/opt/couchbase/lib/memcached/bucket_engine.so", "-B","binary","-r","-c","10000","-e", "admin=_admin;default_bucket_name=default;auto_create=false", []], [{env, [ {"EVENT_NOSELECT","1"}, {"MEMCACHED_TOP_KEYS","100"}, {"ISASL_PWFILE", "/opt/couchbase/var/lib/couchbase/data/isasl.pw"}, {"ISASL_DB_CHECK_TIME","1"}]}, use_stdio,stderr_to_stdout,exit_status, port_server_send_eol,stream]}}, {mfargs, {supervisor_cushion,start_link, [memcached,5000,ns_port_server,start_link, [memcached,"/opt/couchbase/bin/memcached", ["-X", "/opt/couchbase/lib/memcached/stdin_term_handler.so", "-l","0.0.0.0:11210,0.0.0.0:11209:1000", "-E", "/opt/couchbase/lib/memcached/bucket_engine.so", "-B","binary","-r","-c","10000","-e", "admin=_admin;default_bucket_name=default;auto_create=false", []], [{env, [{"EVENT_NOSELECT","1"} , {"MEMCACHED_TOP_KEYS","100"} , {"ISASL_PWFILE", "/opt/couchbase/var/lib/couchbase/data/isasl.pw"} , {"ISASL_DB_CHECK_TIME","1"} ]}, use_stdio,stderr_to_stdout,exit_status, port_server_send_eol,stream]]]}}, {restart_type,permanent} , {shutdown,10000} , {child_type,worker} ] is it possible that memcached binary is somehow old (i.e. 1.8.0 one)
        Hide
        karan Karan Kumar (Inactive) added a comment -

        Just to be doubly sure.. this is alright.. I am running the test again.

        When online upgrading.. you would be in a state where some nodes are on latest 181 and some are on 180

        Show
        karan Karan Kumar (Inactive) added a comment - Just to be doubly sure.. this is alright.. I am running the test again. When online upgrading.. you would be in a state where some nodes are on latest 181 and some are on 180
        Hide
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Double check you don't have memcached process from 1.8.0 node

        Show
        alkondratenko Aleksey Kondratenko (Inactive) added a comment - Double check you don't have memcached process from 1.8.0 node
        Hide
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        [ns_server:info] [2012-06-12 1:36:45] [ns_1@10.1.3.116:ns_port_memcached:ns_port_server:log:166] memcached<0.295.0>: failed to listen on TCP port 11210: Success

        Show
        alkondratenko Aleksey Kondratenko (Inactive) added a comment - [ns_server:info] [2012-06-12 1:36:45] [ns_1@10.1.3.116:ns_port_memcached:ns_port_server:log:166] memcached<0.295.0>: failed to listen on TCP port 11210: Success

          People

          • Assignee:
            karan Karan Kumar (Inactive)
            Reporter:
            karan Karan Kumar (Inactive)
          • Votes:
            0 Vote for this issue
            Watchers:
            2 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes