Couchbase Server: MB-6493

Rebalance exits with reason "timeout, ns_memcached-bucket','ns_1@10.1.3.238'}, {get_vbucket,". Successive attempts to rebalance also fail.

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0-beta
    • Fix Version/s: 2.0-beta
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None
    • Environment:
      vBuckets: 1024
      Env: Centos 64bit

      Description

      • Create a 2:2 cluster.
      • Setup unidirectional replication of 2M items
      • Rebalance out one of the destination nodes

      Example scenario:
      Source: 10.1.3.236, 10.3.2.54
      Destination: 10.1.3.238, 10.3.2.55
      Rebalance out 10.3.2.55

      Towards the end of rebalancing on the destination, rebalance fails with the following reason:
Rebalance exited with reason {timeout,
    {gen_server,call,
     [{'ns_memcached-bucket','ns_1@10.1.3.238'},
      {get_vbucket,749},
      60000]}}
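The failing call above is a synchronous gen_server:call with a 60,000 ms timeout: the rebalancer asks the per-bucket 'ns_memcached-bucket' process for the state of vbucket 749 and gives up after 60 s. A minimal Python sketch of that call-with-deadline shape (hypothetical code, not Couchbase's implementation; all names here are illustrative):

```python
import queue
import threading

def call(server, request, timeout_s):
    """Synchronous request/reply with a deadline, like gen_server:call/3."""
    reply = queue.Queue(maxsize=1)
    server((request, reply))
    try:
        return reply.get(timeout=timeout_s)
    except queue.Empty:
        raise TimeoutError(f"timeout in call {request!r} after {timeout_s}s")

def swapping_memcached(msg):
    """Server stand-in that replies far too late, as if stalled in swap."""
    request, reply = msg
    t = threading.Timer(5.0, reply.put, args=[("ok", "active")])
    t.daemon = True
    t.start()

try:
    # Mirrors {get_vbucket,749} with the 60000 ms timeout, scaled down.
    call(swapping_memcached, ("get_vbucket", 749), timeout_s=0.1)
except TimeoutError as e:
    print("Rebalance exited with reason:", e)
```

Whatever the server is doing, the caller only sees the deadline expire, which is why a swap-stalled memcached surfaces as a rebalance timeout.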

      1. 10.1.3.236-8091-diag.txt.gz
        1.62 MB
        Abhinav Dangeti
      2. 10.1.3.238-8091-diag.txt.gz
        5.45 MB
        Abhinav Dangeti
      3. 10.3.2.54-8091-diag.txt.gz
        10.30 MB
        Abhinav Dangeti
      4. 10.3.2.55-8091-diag.txt.gz
        5.45 MB
        Abhinav Dangeti

        Activity

        abhinav Abhinav Dangeti created issue -
        abhinav Abhinav Dangeti added a comment -

        Also seen in bug MB-6385.

        junyi Junyi Xie (Inactive) made changes -
        Field Original Value New Value
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        junyi Junyi Xie (Inactive) made changes -
        Resolution Fixed [ 1 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        junyi Junyi Xie (Inactive) made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Duplicate [ 3 ]
        junyi Junyi Xie (Inactive) made changes -
        Resolution Duplicate [ 3 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        junyi Junyi Xie (Inactive) made changes -
        Comment [ Same issue as MB-6468
        ]
        farshid Farshid Ghods (Inactive) made changes -
        Fix Version/s 2.0-beta [ 10113 ]
        Affects Version/s 2.0-beta [ 10113 ]
        Affects Version/s 2.0 [ 10114 ]
        junyi Junyi Xie (Inactive) added a comment -

Looks like a bunch of ns_server crashes in 10.1.3.238-8091-diag.txt.

[ns_server:debug,2012-08-30T14:17:41.431,ns_1@10.1.3.238:<0.7445.3>:supervisor_cushion:init:43]starting ns_port_server with delay of 5000
        [error_logger:error,2012-08-30T14:17:41.431,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
        =========================SUPERVISOR REPORT=========================
Supervisor: {local,ns_port_sup}
        Context: child_terminated
        Reason: {abnormal,71}
        Offender: [{pid,<0.7380.3>},
        {name,
        {memcached,"/opt/couchbase/bin/memcached",
        ["-X",
        "/opt/couchbase/lib/memcached/stdin_term_handler.so",
        "-X",
        "/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
        "-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
        "11210","-E",
        "/opt/couchbase/lib/memcached/bucket_engine.so",
        "-B","binary","-r","-c","10000","-e",
        "admin=_admin;default_bucket_name=default;auto_create=false",
        []],
        [{env,
        [{"EVENT_NOSELECT","1"},
        {"MEMCACHED_TOP_KEYS","100"},
        {"ISASL_PWFILE", "/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
        {"ISASL_DB_CHECK_TIME","1"}]},
        use_stdio,stderr_to_stdout,exit_status,
        port_server_send_eol,stream]}},
        {mfargs,
        {erlang,apply,
        [#Fun<ns_port_sup.3.119727222>,
        [memcached,"/opt/couchbase/bin/memcached",
        ["-X",
        "/opt/couchbase/lib/memcached/stdin_term_handler.so",
        "-X",
        "/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
        "-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
        "11210","-E",
        "/opt/couchbase/lib/memcached/bucket_engine.so",
        "-B","binary","-r","-c","10000","-e",
        "admin=_admin;default_bucket_name=default;auto_create=false",
        []],
        [{env,
        [{"EVENT_NOSELECT","1"},
        {"MEMCACHED_TOP_KEYS","100"},
        {"ISASL_PWFILE", "/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
        {"ISASL_DB_CHECK_TIME","1"}]},
        use_stdio,stderr_to_stdout,exit_status,
        port_server_send_eol,stream]]]}},
        {restart_type,permanent},
        {shutdown,86400000},
        {child_type,worker}]


        [error_logger:info,2012-08-30T14:17:41.433,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
        =========================PROGRESS REPORT=========================
        supervisor: {local,ns_port_sup}

started: [{pid,<0.7445.3>},
        {name,
        {memcached,"/opt/couchbase/bin/memcached",
        ["-X",
        "/opt/couchbase/lib/memcached/stdin_term_handler.so",
        "-X",
        "/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
        "-l","0.0.0.0:11210,0.0.0.0:11209:1000","-p",
        "11210","-E",
        "/opt/couchbase/lib/memcached/bucket_engine.so",
        "-B","binary","-r","-c","10000","-e",
        "admin=_admin;default_bucket_name=default;auto_create=false",
        []],
        [{env,
[{"EVENT_NOSELECT","1"},
        {"MEMCACHED_TOP_KEYS","100"},
        {"ISASL_PWFILE", "/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
        {"ISASL_DB_CHECK_TIME","1"}]},
        use_stdio,stderr_to_stdout,exit_status,
        port_server_send_eol,stream]}},
        {mfargs,
        {erlang,apply,
        [#Fun<ns_port_sup.3.119727222>,
        [memcached,"/opt/couchbase/bin/memcached",
        ["-X",
        "/opt/couchbase/lib/memcached/stdin_term_handler.so",
        "-X",
        "/opt/couchbase/lib/memcached/file_logger.so,cyclesize=104857600;sleeptime=19;filename=/opt/couchbase/var/lib/couchbase/logs/memcached.log",
        "-l","0.0.0.0:11210,0.0.0.0:11209:1000",
        "-p","11210","-E",
        "/opt/couchbase/lib/memcached/bucket_engine.so",
        "-B","binary","-r","-c","10000","-e",
        "admin=_admin;default_bucket_name=default;auto_create=false",
        []],
        [{env,
[{"EVENT_NOSELECT","1"},
{"MEMCACHED_TOP_KEYS","100"},
{"ISASL_PWFILE", "/opt/couchbase/var/lib/couchbase/data/isasl.pw"},
{"ISASL_DB_CHECK_TIME","1"}]},
        use_stdio,stderr_to_stdout,exit_status,
        port_server_send_eol,stream]]]}},

        {restart_type,permanent},
        {shutdown,86400000},
        {child_type,worker}]

        [user:info,2012-08-30T14:17:41.518,ns_1@10.1.3.238:ns_port_memcached:ns_port_server:handle_info:107]Port server memcached on node 'ns_1@10.1.3.238' exited with status 71. Restarting. Messages: Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use
        [ns_server:info,2012-08-30T14:17:41.518,ns_1@10.1.3.238:ns_port_memcached:ns_port_server:log:169]memcached<0.7446.3>: Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use

        [ns_server:info,2012-08-30T14:17:41.519,ns_1@10.1.3.238:<0.7445.3>:supervisor_cushion:handle_info:62]Cushion managed supervisor for memcached failed: {abnormal,71}
        [user:warn,2012-08-30T14:17:41.519,ns_1@10.1.3.238:<0.7445.3>:supervisor_cushion:die_slowly:81]Service memcached exited on node 'ns_1@10.1.3.238' in 0.09s

        [error_logger:error,2012-08-30T14:17:41.519,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_msg:76]** Generic server <0.7446.3> terminating
        ** Last message in was {#Port<0.127099>,{exit_status,71}}
        ** When Server state == {state,#Port<0.127099>,memcached,
        {["Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use", empty,empty,empty,empty,empty,empty,empty, empty,empty,empty,empty,empty,empty,empty, empty,empty,empty,empty,empty,empty,empty, empty,empty,empty,empty,empty,empty,empty, empty,empty,empty,empty,empty,empty,empty, empty,empty,empty,empty,empty,empty,empty, empty,empty,empty,empty,empty,empty,empty, empty,empty,empty,empty,empty,empty,empty, empty,empty,empty,empty,empty,empty,empty, empty,empty,empty,empty,empty,empty,empty, empty,empty,empty,empty,empty,empty,empty, empty,empty,empty,empty,empty,empty,empty, empty,empty,empty,empty,empty,empty,empty, empty,empty,empty,empty,empty,empty,empty], [empty]},
        {ok,{1346361461712443,#Ref<0.0.35.116743>}},
        ["Thu Aug 30 21:17:41.512134 3: failed to listen on TCP port 11210: Address already in use"],
        0,true}
        ** Reason for termination ==
        ** {abnormal,71}

        [ns_server:info,2012-08-30T14:17:41.519,ns_1@10.1.3.238:ns_log:ns_log:handle_cast:125]suppressing duplicate log supervisor_cushion:1("Service memcached exited on node 'ns_1@10.1.3.238' in 0.09s\n") because it's been seen 1 times in the past 36.083474 secs (last seen 36.083474 secs ago
        [error_logger:error,2012-08-30T14:17:41.521,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
        =========================CRASH REPORT=========================
        crasher:
        initial call: ns_port_server:init/1
        pid: <0.7446.3>
        registered_name: ns_port_memcached
        exception exit: {abnormal,71}
        in function gen_server:terminate/6
        ancestors: [<0.7445.3>,ns_port_sup,ns_server_sup,ns_server_cluster_sup,
        <0.60.0>]
        messages: [{'EXIT',#Port<0.127099>,normal}]
        links: [<0.7445.3>]
        dictionary: []
        trap_exit: true
        status: running
        heap_size: 1597
        stack_size: 24
        reductions: 2399
        neighbours:

        [stats:error,2012-08-30T14:17:41.626,ns_1@10.1.3.238:<0.5564.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.1.3.238']
        [stats:error,2012-08-30T14:17:41.628,ns_1@10.1.3.238:<0.5533.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.1.3.238']
        [stats:error,2012-08-30T14:17:41.630,ns_1@10.1.3.238:<0.5557.3>:stats_reader:log_bad_responses:191]Some nodes didn't respond: ['ns_1@10.1.3.238']
        [error_logger:error,2012-08-30T14:17:42.564,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
        =========================SUPERVISOR REPORT=========================
        Supervisor: {local,menelaus_sup}
        Context: child_terminated
        Reason: {noproc,
        {gen_server,call,
        [{'stats_reader-bucket','ns_1@10.1.3.238'},
        {latest,minute,1}]}}
        Offender: [{pid,<0.7432.3>},
        {name,menelaus_web_alerts_srv},
        {mfargs,{menelaus_web_alerts_srv,start_link,[]}},
{restart_type,permanent},
{shutdown,5000},
        {child_type,worker}]


        [error_logger:info,2012-08-30T14:17:42.564,ns_1@10.1.3.238:error_logger:ale_error_logger_handler:log_report:72]
        =========================PROGRESS REPORT=========================
        supervisor: {local,menelaus_sup}
        started: [{pid,<0.7458.3>},
        {name,menelaus_web_alerts_srv},
        {mfargs,{menelaus_web_alerts_srv,start_link,[]}},
        {restart_type,permanent},
{shutdown,5000},
{child_type,worker}]

        [ns_server:debug,2012-08-30T14:17:43.182,ns_1@10.1.3.238:ns_heart:ns_heart:current_status:155]Ignoring failure to get stats for bucket: "bucket":
        {'EXIT',{noproc,{gen_server,call,
[{'stats_reader-bucket','ns_1@10.1.3.238'},
{latest,minute}]}}}
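The trigger in the supervisor report above is memcached exiting with status 71 because its restarted instance could not bind TCP port 11210 while the previous process still held it ("Address already in use"). A minimal sketch reproducing that bind failure (nothing Couchbase-specific, just standard EADDRINUSE behavior; the OS picks a free port here rather than 11210 so the example is self-contained):

```python
import errno
import socket

# "Old" process still listening; port 0 lets the OS pick a free port.
old = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
old.bind(("127.0.0.1", 0))
old.listen(1)
port = old.getsockname()[1]

# "Restarted" process tries to bind the same port and fails, which is what
# made memcached exit with status 71 in the log above.
restarted = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    restarted.bind(("127.0.0.1", port))
except OSError as e:
    assert e.errno == errno.EADDRINUSE
    print("failed to listen:", e.strerror)
finally:
    restarted.close()
    old.close()
```

This is consistent with the crash loop in the log: the supervisor restarts memcached every few seconds, but each restart fails until the old process finally releases the port.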

        junyi Junyi Xie (Inactive) added a comment -

        Abhinav, please assign to ns_server team.

        junyi Junyi Xie (Inactive) made changes -
        Assignee Junyi Xie [ junyi ] Abhinav Dangeti [ abhinav ]
        abhinav Abhinav Dangeti made changes -
        Comment [ Both ns_server and memcached seem to have crashed, aliaksey can you please take a look at this? ]
        abhinav Abhinav Dangeti made changes -
        Assignee Abhinav Dangeti [ abhinav ] Aleksey Kondratenko [ alkondratenko ]
        abhinav Abhinav Dangeti added a comment -

Both ns_server and memcached seem to have crashed, Aleksey can you please take a look at this?

        ketaki Ketaki Gangal added a comment -

Repeated rebalance attempts on this cluster also fail with other errors. Abhinav, can you post the other rebalance failures as well?

        ketaki Ketaki Gangal made changes -
        Summary Rebalance fails when rebalancing out a destination node with ongoing unidirectional replication Rebalance exits with reason "timeout, ns_memcached-bucket','ns_1@10.1.3.238'}, {get_vbucket,". Successive attempts to rebalance also fail.
        Affects Version/s 2.0-beta [ 10113 ]
        Priority Major [ 3 ] Blocker [ 1 ]
        Component/s ns_server [ 10019 ]
        Component/s cross-datacenter-replication [ 10136 ]
        abhinav Abhinav Dangeti added a comment -

Also seen: once rebalance has failed, trying to rebalance again fails with reason "not_all_nodes_are_ready_yet".

        farshid Farshid Ghods (Inactive) made changes -
        Assignee Aleksey Kondratenko [ alkondratenko ] Ketaki Gangal [ ketaki ]
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

From the logs I'm seeing there's some swap usage.

That's causing long memcached request execution times, and we're doing quite a lot of memcached activity that understandably times out under such conditions.

        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Assignee Ketaki Gangal [ ketaki ] Junyi Xie [ junyi ]
        junyi Junyi Xie (Inactive) added a comment - - edited

        Hi Peter,

I am not sure what I can fix. Per Alk, it "understandably times out in such condition". I am not sure why rebalance should fail even if memcached takes a long time to execute requests when using swap.

But it sounds like a generic rebalancing problem: rebalancing may fail if there is swap usage.

Assuming it is expected that rebalance fails in this case, I am not sure what XDCR can do from the source node. Please advise.

        Thanks,

        Junyi

        junyi Junyi Xie (Inactive) made changes -
        Assignee Junyi Xie [ junyi ] Peter Wansch [ peter ]
        junyi Junyi Xie (Inactive) added a comment -

It also sounds similar to MB-6284: both are caused by swap usage, which caused timeouts in ns_server and eventually made rebalance fail. Marking it as a duplicate.

        junyi Junyi Xie (Inactive) made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Duplicate [ 3 ]
        farshid Farshid Ghods (Inactive) added a comment -

I see ns_server tried to restart memcached, then memcached crashed, and I also see the bug being closed as a timeout issue.

I am going to assign this bug back to QE to dig deeper.

        farshid Farshid Ghods (Inactive) made changes -
        Assignee Peter Wansch [ peter ] Ketaki Gangal [ ketaki ]
        farshid Farshid Ghods (Inactive) made changes -
        Resolution Duplicate [ 3 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        farshid Farshid Ghods (Inactive) added a comment -

Also, I find this very hard to believe, when memcached mem_used is reported as only 400 MB and the box has 2 GB of RAM:

        [stats:debug,2012-08-30T13:39:15.175,ns_1@10.1.3.238:<0.1206.0>:stats_collector:maybe_log_stats:126](at 1970,1,16},{5,59,19) Stats for bucket "bucket":
        curr_items 287939
        curr_items_tot 635438
        mem_used 429145136

        [couchdb:error,2012-08-30T14:16:53.148,ns_1@10.1.3.238:<0.3227.3>:couch_log:error:42]Uncaught error in HTTP request: {exit,
        {timeout,
        {gen_server,call,
['ns_memcached-bucket',
{stats,<<>>},
60000]}}}

As Alk mentioned, there is some swap usage:

730582: {swap_used,629829632}]},
743695: {swap_used,629829632}]},
743872: {swap_used,629829632}]},

But again, as you can see, there is ~500 MB of free memory available, and we don't have indexing going on here, so I'm not sure why we are using swap:

1549173: {free_memory,5756739584},
1550434: {free_memory,5756342272},
1550518: {free_memory,5757095936},
1550601: {free_memory,579661824},
1550682: {free_memory,5757095936},
1550765: {free_memory,579661824},
1551091: {free_memory,5143658496},
1761876: {free_memory,321335296},
1761946: {free_memory,5732896768},

Yes, this machine has only 2 GB of RAM, but memcached is only using 400 MB and we have 500 MB of free memory.

The math does not add up for me, and we cannot close this bug until QE digs deeper into the logs and finds out more about what went wrong.
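For reference, the raw numbers quoted in this comment are byte counts; converting them to MiB (a quick check, not part of the original ticket) makes the inconsistency visible:

```python
# Byte counts copied from the stats quoted above, converted to MiB.
MiB = 1024 * 1024

samples = [
    ("mem_used",  429145136),   # memcached bucket mem_used  -> ~409 MiB ("400 MB")
    ("swap_used", 629829632),   # swap_used                  -> ~601 MiB
    ("free_low",  579661824),   # one free_memory reading    -> ~553 MiB ("500 MB")
    ("free_high", 5756739584),  # another free_memory reading -> ~5490 MiB (~5.4 GiB)
]

for name, b in samples:
    print(f"{name:9s} {b / MiB:8.1f} MiB")
```

Note that the free_memory samples alternate between ~553 MiB and ~5.4 GiB; a 2 GB box cannot report the latter, which may mean the readings mix nodes and supports the "math does not add up" point.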

The box also has 4 CPUs:

        processor : 3
        vendor_id : GenuineIntel
        cpu family : 6
        model : 44
        model name : Intel(R) Xeon(R) CPU X5650 @ 2.67GHz
        stepping : 2
        cpu MHz : 2660.000
        cache size : 12288 KB
        fpu : yes
        fpu_exception : yes
        cpuid level : 11
        wp : yes
        flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss syscall nx rdtscp lm constant_tsc ida nonstop_tsc arat pni ssse3 cx16 sse4_1 sse4_2 popcnt lahf_lm
        bogomips : 5320.00
        clflush size : 64
        cache_alignment : 64
        address sizes : 40 bits physical, 48 bits virtual
        power management: [8]

        farshid Farshid Ghods (Inactive) added a comment -

        Junyi/Alk,

Feel free to declare this as won't-fix, but please look at my latest comment regarding the environment and the stats gathered here.

        farshid Farshid Ghods (Inactive) made changes -
        Assignee Ketaki Gangal [ ketaki ] Aleksey Kondratenko [ alkondratenko ]
        junyi Junyi Xie (Inactive) added a comment -

Based on Farshid's comment.

        junyi Junyi Xie (Inactive) made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Won't Fix [ 2 ]
        peter peter added a comment -

        Damien is changing the XDCR replication manager initialization code which may address this problem.

        peter peter made changes -
        Resolution Won't Fix [ 2 ]
        Status Resolved [ 5 ] Reopened [ 4 ]
        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Apparently Damien is working on this, so I am taking it off my list.

        alkondratenko Aleksey Kondratenko (Inactive) made changes -
        Assignee Aleksey Kondratenko [ alkondratenko ] Damien Katz [ damien ]
        thuan Thuan Nguyen added a comment -

        Integrated in github-ns-server-2-0 #461 (See http://qa.hq.northscale.net/job/github-ns-server-2-0/461/)
        MB-6493: Throttle xdc_vbucket_rep initialization (Revision ff26c4fda40322ba2c0eb5d039edf5fedbc282f5)
        MB-6493: Add supervisor2 and use for restart of xdc_vbucker_rep (Revision 3f5ee80e654f7b83621e25b4f0f5c52c3a84b6b3)
        MB-6493: Fix vb map cache regression when xdcr init fails (Revision 807184854ac3c67740830dc83b5b17170282fe60)

        Result = SUCCESS
        pwansch :
        Files :

        • src/xdc_replication.erl
        • src/xdc_vbucket_rep.erl

        pwansch :
        Files :

        • src/xdc_replication.erl
        • src/xdc_vbucket_rep.erl
        • src/ns_config_default.erl
        • src/supervisor2.erl
        • include/xdc_replicator.hrl
        • src/xdc_vbucket_rep_sup.erl

        pwansch :
        Files :

        • src/xdc_vbucket_rep.erl
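        For illustration only: the first commit above throttles xdc_vbucket_rep initialization so that a burst of per-vbucket replicator startups cannot monopolize the node (the symptom behind the ns_memcached call timeouts). A minimal sketch of that throttling idea, in Python rather than the Erlang used by ns_server, with all names hypothetical — the actual change lives in the xdc_* modules listed above:

        ```python
        import threading
        from concurrent.futures import ThreadPoolExecutor

        # Hypothetical sketch: allow at most MAX_CONCURRENT_INITS vbucket
        # replicators to run their expensive initialization at the same time,
        # even though many more replicators are being started concurrently.
        MAX_CONCURRENT_INITS = 4
        init_gate = threading.Semaphore(MAX_CONCURRENT_INITS)

        def start_vbucket_replicator(vb):
            # Acquire the gate before the heavy init work; release when done.
            with init_gate:
                # ... open checkpoints, connect to the remote cluster, etc. ...
                return f"vb {vb} initialized"

        # All 16 replicators are requested at once, but only 4 initialize
        # concurrently; the rest wait at the semaphore.
        with ThreadPoolExecutor(max_workers=32) as pool:
            results = list(pool.map(start_vbucket_replicator, range(16)))

        print(len(results))
        ```

        The design point is that every replicator still starts eventually; the gate only smooths the startup burst so other services on the node stay responsive.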
        damien damien added a comment -

        Abhinav, can you verify if this is still a problem with my latest changes?

        damien damien made changes -
        Assignee Damien Katz [ damien ] Abhinav Dangeti [ abhinav ]
        abhinav Abhinav Dangeti added a comment -

        Re-ran the scenario on build 1695: the problem no longer occurs.

        farshid Farshid Ghods (Inactive) added a comment -

        Ketaki,

        Are you still running more tests to verify the fix?

        ketaki Ketaki Gangal added a comment -

        On build 2.0-1700, rebalance works as expected; no timeout errors seen.

        ketaki Ketaki Gangal made changes -
        Status Reopened [ 4 ] Resolved [ 5 ]
        Resolution Fixed [ 1 ]
        farshid Farshid Ghods (Inactive) made changes -
        Status Resolved [ 5 ] Closed [ 6 ]
        farshid Farshid Ghods (Inactive) made changes -
        Resolution Fixed [ 1 ]
        Status Closed [ 6 ] Reopened [ 4 ]
        Sprint Status Current Sprint
        farshid Farshid Ghods (Inactive) made changes -
        Status Reopened [ 4 ] Closed [ 6 ]
        Resolution Fixed [ 1 ]

          People

          • Assignee:
            abhinav Abhinav Dangeti
            Reporter:
            abhinav Abhinav Dangeti
          • Votes:
            0
            Watchers:
            1

