Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-7572

test_add_back_failed_node: Rebalance exited with reason {badmatch, [{<0.25911.52>, {{badmatch,[{<20234.16808.17>,killed}]}, [{misc,sync_shutdown_many_i_am_trapping_exits,1

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0.1
    • Fix Version/s: 2.0.1
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Labels:
      None

      Description

      build 139
      http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/170/consoleFull
      ./testrunner -i /tmp/rebalance_regression.ini get-logs=True,wait_timeout=100 -t swaprebalance.SwapRebalanceFailedTests.test_add_back_failed_node,replica=2,num-buckets=3,num-swap=1,keys-count=1000000,swap-orchestrator=False

      steps:
      [2013-01-21 08:18:48,775] - [swaprebalance:472] INFO - failover node ns_1@10.3.121.94 and rebalance afterwards
      [2013-01-21 08:19:19,849] - [rest_client:767] INFO - fail_over successful
      [2013-01-21 08:19:20,071] - [rest_client:795] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.3.121.94&user=Administrator&knownNodes=ns_1%4010.3.121.94%2Cns_1%4010.3.121.92%2Cns_1%4010.3.121.98%2Cns_1%4010.3.121.96%2Cns_1%4010.3.121.93%2Cns_1%4010.3.121.97%2Cns_1%4010.3.121.95
      [2013-01-21 08:19:20,161] - [rest_client:799] INFO - rebalance operation started
      [2013-01-21 08:19:20,337] - [rest_client:894] INFO - rebalance percentage : 0 %
      ...
      [2013-01-21 09:11:14,394] - [rest_client:894] INFO - rebalance percentage : 68.451909784 %
      [2013-01-21 09:11:16,519] - [rest_client:879] ERROR -

      {u'status': u'none', u'errorMessage': u'Rebalance failed. See logs for detailed reason. You can try rebalance again.'}

      - rebalance failed

      2013-01-21 09:38:12.596 ns_vbucket_mover:0:critical:message(ns_1@10.3.121.93) - <0.25863.52> exited with {badmatch,
      [{<0.25911.52>,
      {{badmatch,[

      {<20234.16808.17>,killed}]},
      [{misc, sync_shutdown_many_i_am_trapping_exits,1},
      {misc,try_with_maybe_ignorant_after,2},
      {gen_server,terminate,6},
      {proc_lib,init_p_do_apply,3}]}}]}
      2013-01-21 09:38:12.774 ns_orchestrator:2:info:message(ns_1@10.3.121.93) - Rebalance exited with reason {badmatch,
      [{<0.25911.52>,
      {{badmatch,[{<20234.16808.17>,killed}

      ]},
      [

      {misc, sync_shutdown_many_i_am_trapping_exits,1}

      ,

      {misc,try_with_maybe_ignorant_after,2}

      ,

      {gen_server,terminate,6}

      ,

      {proc_lib,init_p_do_apply,3}

      ]}}]}

      No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

        andreibaranouski Andrei Baranouski created issue -
        Show
        andreibaranouski Andrei Baranouski added a comment - https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.92-8091-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.93-8091-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.94-8091-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.95-8091-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.96-8091-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.97-8091-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/10.3.121.98-8091-diag.txt.gz
        Hide
        farshid Farshid Ghods (Inactive) added a comment -

        andrei,

        does this exact same test pass against 2.0 ?

        Show
        farshid Farshid Ghods (Inactive) added a comment - andrei, does this exact same test pass against 2.0 ?
        farshid Farshid Ghods (Inactive) made changes -
        Field Original Value New Value
        Fix Version/s 2.0.1 [ 10399 ]
        Hide
        andreibaranouski Andrei Baranouski added a comment -

        yes, sure

        Show
        andreibaranouski Andrei Baranouski added a comment - yes, sure
        Aliaksey Artamonau Aliaksey Artamonau made changes -
        Assignee Aleksey Kondratenko [ alkondratenko ] Aliaksey Artamonau [ aliaksey artamonau ]
        Hide
        Aliaksey Artamonau Aliaksey Artamonau added a comment -

        Log file was rotated long past the rebalance failure. Please reproduce and upload diags grabbed shortly after the failure.

        Show
        Aliaksey Artamonau Aliaksey Artamonau added a comment - Log file was rotated long past the rebalance failure. Please reproduce and upload diags grabbed shortly after the failure.
        Aliaksey Artamonau Aliaksey Artamonau made changes -
        Assignee Aliaksey Artamonau [ aliaksey artamonau ] Andrei Baranouski [ andreibaranouski ]
        Show
        andreibaranouski Andrei Baranouski added a comment - sorry, Aliaksey following logs should be ok https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121.92-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-93-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-94-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-95-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-96-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-97-diag.txt.gz https://s3.amazonaws.com/bugdb/jira/MB-7572/b17838ff-4d37-4768-b8fa-cd6a0bcfa8fe-10.3.121-98-diag.txt.gz
        andreibaranouski Andrei Baranouski made changes -
        Assignee Andrei Baranouski [ andreibaranouski ] Aliaksey Artamonau [ aliaksey artamonau ]
        Hide
        Aliaksey Artamonau Aliaksey Artamonau added a comment -

        So here what happened.

        Node .93 builds a replica for vbucket 635 on node .95:

        [ns_server:debug,2013-01-21T9:37:00.815,ns_1@10.3.121.93:<0.25911.52>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 635 into 'ns_1@10.3.121.95' is <20234.16808.17>

        Relevant messages from the .95:

        [ns_server:info,2013-01-21T9:37:00.723,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:init:509]Setting

        {"10.3.121.95",11209} vbucket 635 to state replica
        [ns_server:info,2013-01-21T9:37:00.815,ns_1@10.3.121.95:ns_port_memcached<0.15884.7>:ns_port_server:log:171]memcached<0.15884.7>: Mon Jan 21 09:37:00.613175 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.93 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0

        [ns_server:debug,2013-01-21T9:37:00.869,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:kill_tapname:1005]killing tap named: replication_building_635_'ns_1@10.3.121.95'
        [rebalance:info,2013-01-21T9:37:00.876,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:init:568]Starting tap stream:
        [{vbuckets,[635]},
        {checkpoints,[{635,0}]},
        {name,<<"replication_building_635_'ns_1@10.3.121.95'">>},
        {takeover,false}]
        {{"10.3.121.96",11209},
        {"10.3.121.95",11209}

        ,
        [

        {vbuckets,[635]}

        ,

        {takeover,false}

        ,

        {suffix,"building_635_'ns_1@10.3.121.95'"}

        ,

        {username,"bucket-0"}

        ,

        {password,"password"}

        ]}
        [rebalance:debug,2013-01-21T9:37:00.887,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:init:605]upstream_sender pid: <0.16809.17>
        [rebalance:debug,2013-01-21T9:37:00.888,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:handle_call:336]Suspended had_backfill waiter
        {had_backfill,undefined,undefined,
        [

        {<18136.25927.52>,#Ref<18136.0.225.131688>}]}
        [rebalance:info,2013-01-21T9:37:00.889,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:process_upstream:932]Initial stream for vbucket 635
        [rebalance:debug,2013-01-21T9:37:00.890,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:process_downstream:896]Replied had_backfill: true to [{<18136.25927.52>,#Ref<18136.0.225.131688>}

        ]
        [ns_server:info,2013-01-21T9:37:00.925,ns_1@10.3.121.95:<0.16810.17>:ebucketmigrator_srv:init:509]Setting

        {"10.3.121.95",11209}

        vbucket 811 to state replica
        [ns_server:info,2013-01-21T9:37:01.094,ns_1@10.3.121.95:ns_port_memcached<0.15884.7>:ns_port_server:log:171]memcached<0.15884.7>: Mon Jan 21 09:37:00.890055 PST 3: TAP (Consumer) eq_tapq:anon_3574 - Reset vbucket 635 was completed succecssfully.

        Then we see a TAP_OPAQUE_CLOSE_BACKFILL:

        [ns_server:debug,2013-01-21T9:37:11.320,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:process_upstream:946]seen backfill-close message

        Then we attempt to shutdown the ebucketmigrator. As part of this, to confirm all the messages that we've sent to downstream we send an opaque message with (last seen sequence number + 1) and wait until downstream replies. We set timeout to 30 seconds for the entire sequence of actions.

        [ns_server:debug,2013-01-21T9:37:22.270,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:confirm_sent_messages:727]Going to wait for reception of opaque message ack
        [ns_server:debug,2013-01-21T9:37:22.270,ns_1@10.3.121.95:<0.17215.17>:ebucketmigrator_srv:confirm_sent_messages:722]Sending opaque message to confirm downstream reception
        [ns_server:debug,2013-01-21T9:37:22.270,ns_1@10.3.121.95:<0.17215.17>:ebucketmigrator_srv:confirm_sent_messages:724]Sent fine

        And exactly 30 seconds later we see that the ebucketmigrator got killed since timeout expired:

        [ns_server:error,2013-01-21T9:37:52.271,ns_1@10.3.121.93:<0.25911.52>:misc:sync_shutdown_many_i_am_trapping_exits:1416]Shutdown of the following failed: [

        {<20234.16808.17>,killed}

        ]

        So the problem seems to be that memcached on .95 didn't manage to reply to us in 30 seconds.

        Show
        Aliaksey Artamonau Aliaksey Artamonau added a comment - So here what happened. Node .93 builds a replica for vbucket 635 on node .95: [ns_server:debug,2013-01-21T9:37:00.815,ns_1@10.3.121.93:<0.25911.52>:ns_replicas_builder_utils:spawn_replica_builder:86] Replica building ebucketmigrator for vbucket 635 into 'ns_1@10.3.121.95' is <20234.16808.17> Relevant messages from the .95: [ns_server:info,2013-01-21T9:37:00.723,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:init:509] Setting {"10.3.121.95",11209} vbucket 635 to state replica [ns_server:info,2013-01-21T9:37:00.815,ns_1@10.3.121.95:ns_port_memcached<0.15884.7>:ns_port_server:log:171] memcached<0.15884.7>: Mon Jan 21 09:37:00.613175 PST 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.93 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0 [ns_server:debug,2013-01-21T9:37:00.869,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:kill_tapname:1005] killing tap named: replication_building_635_'ns_1@10.3.121.95' [rebalance:info,2013-01-21T9:37:00.876,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:init:568] Starting tap stream: [{vbuckets,[635]}, {checkpoints, [{635,0}] }, {name,<<"replication_building_635_'ns_1@10.3.121.95'">>}, {takeover,false}] {{"10.3.121.96",11209}, {"10.3.121.95",11209} , [ {vbuckets,[635]} , {takeover,false} , {suffix,"building_635_'ns_1@10.3.121.95'"} , {username,"bucket-0"} , {password,"password"} ]} [rebalance:debug,2013-01-21T9:37:00.887,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:init:605] upstream_sender pid: <0.16809.17> [rebalance:debug,2013-01-21T9:37:00.888,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:handle_call:336] Suspended had_backfill waiter {had_backfill,undefined,undefined, [ {<18136.25927.52>,#Ref<18136.0.225.131688>}]} [rebalance:info,2013-01-21T9:37:00.889,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:process_upstream:932] Initial stream for vbucket 635 [rebalance:debug,2013-01-21T9:37:00.890,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:process_downstream:896] Replied had_backfill: true to [{<18136.25927.52>,#Ref<18136.0.225.131688>} ] [ns_server:info,2013-01-21T9:37:00.925,ns_1@10.3.121.95:<0.16810.17>:ebucketmigrator_srv:init:509] Setting {"10.3.121.95",11209} vbucket 811 to state replica [ns_server:info,2013-01-21T9:37:01.094,ns_1@10.3.121.95:ns_port_memcached<0.15884.7>:ns_port_server:log:171] memcached<0.15884.7>: Mon Jan 21 09:37:00.890055 PST 3: TAP (Consumer) eq_tapq:anon_3574 - Reset vbucket 635 was completed succecssfully. Then we see a TAP_OPAQUE_CLOSE_BACKFILL: [ns_server:debug,2013-01-21T9:37:11.320,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:process_upstream:946] seen backfill-close message Then we attempt to shutdown the ebucketmigrator. As part of this, to confirm all the messages that we've sent to downstream we send an opaque message with (last seen sequence number + 1) and wait until downstream replies. We set timeout to 30 seconds for the entire sequence of actions. [ns_server:debug,2013-01-21T9:37:22.270,ns_1@10.3.121.95:<0.16808.17>:ebucketmigrator_srv:confirm_sent_messages:727] Going to wait for reception of opaque message ack [ns_server:debug,2013-01-21T9:37:22.270,ns_1@10.3.121.95:<0.17215.17>:ebucketmigrator_srv:confirm_sent_messages:722] Sending opaque message to confirm downstream reception [ns_server:debug,2013-01-21T9:37:22.270,ns_1@10.3.121.95:<0.17215.17>:ebucketmigrator_srv:confirm_sent_messages:724] Sent fine And exactly 30 seconds later we see that the ebucketmigrator got killed since timeout expired: [ns_server:error,2013-01-21T9:37:52.271,ns_1@10.3.121.93:<0.25911.52>:misc:sync_shutdown_many_i_am_trapping_exits:1416] Shutdown of the following failed: [ {<20234.16808.17>,killed} ] So the problem seems to be that memcached on .95 didn't manage to reply to us in 30 seconds.
        Hide
        Aliaksey Artamonau Aliaksey Artamonau added a comment -

        Please see my comment above Chiyoung.

        Show
        Aliaksey Artamonau Aliaksey Artamonau added a comment - Please see my comment above Chiyoung.
        Aliaksey Artamonau Aliaksey Artamonau made changes -
        Assignee Aliaksey Artamonau [ aliaksey artamonau ] Chiyoung Seo [ chiyoung ]
        mikew Mike Wiederhold made changes -
        Component/s couchbase-bucket [ 10173 ]
        Hide
        dipti Dipti Borkar added a comment -

        per bug-scrub: is this a regression?

        Show
        dipti Dipti Borkar added a comment - per bug-scrub: is this a regression?
        Hide
        chiyoung Chiyoung Seo added a comment -

        memcached log was truncated unfortunately. Can you upload the separated memcached log file from the node .95?

        Show
        chiyoung Chiyoung Seo added a comment - memcached log was truncated unfortunately. Can you upload the separated memcached log file from the node .95?
        chiyoung Chiyoung Seo made changes -
        Assignee Chiyoung Seo [ chiyoung ] Andrei Baranouski [ andreibaranouski ]
        Hide
        andreibaranouski Andrei Baranouski added a comment -

        don't have logs anymore. Cluster has been cleaned

        Show
        andreibaranouski Andrei Baranouski added a comment - don't have logs anymore. Cluster has been cleaned
        Hide
        farshid Farshid Ghods (Inactive) added a comment -

        Andrei,

        did test pass in the next iteration ?

        Show
        farshid Farshid Ghods (Inactive) added a comment - Andrei, did test pass in the next iteration ?
        Hide
        andreibaranouski Andrei Baranouski added a comment -

        this test passed against 144

        Show
        andreibaranouski Andrei Baranouski added a comment - this test passed against 144
        Hide
        farshid Farshid Ghods (Inactive) added a comment -

        please reopen if test fails again

        Show
        farshid Farshid Ghods (Inactive) added a comment - please reopen if test fails again
        farshid Farshid Ghods (Inactive) made changes -
        Status Open [ 1 ] Resolved [ 5 ]
        Resolution Cannot Reproduce [ 5 ]
        mikew Mike Wiederhold made changes -
        Sprint Status Current Sprint
        andreibaranouski Andrei Baranouski made changes -
        Status Resolved [ 5 ] Closed [ 6 ]

          People

          • Assignee:
            andreibaranouski Andrei Baranouski
            Reporter:
            andreibaranouski Andrei Baranouski
          • Votes:
            0 Vote for this issue
            Watchers:
            1 Start watching this issue

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes