Couchbase Server / MB-30074

[System Test] KV node still in pending state after memcached crashes


    Details

      Description

      Build : 5.5.0-2884
      Test : GSI component test : -test tests/2i/test_idx_rebalance_replica_vulcan_kv_opt.yml -scope tests/2i/scope_idx_rebalance_replica_vulcan.yml
      Scale : 3
      Iteration : 3rd iteration (~20 hrs of run)
      In the 3rd iteration, the test step that rebalances out a KV node fails when memcached crashes on the master node. The cluster becomes essentially unusable after this: the subsequent rebalance gets stuck (see MB-30073) and the buckets remain in the warmup state indefinitely.

      The following error is shown in the diag log (exit status 137 indicates memcached was terminated with SIGKILL):

      Service 'memcached' exited with status 137. Restarting. Messages:
      2018-06-11T18:33:16.430335Z WARNING 106: Slow operation. {"cid":"172.23.106.161:40160/81905100","duration":"4193 ms","trace":"request=10723899535396330:4193337","command":"GET","peer":"172.23.106.161:40160"}
      2018-06-11T18:33:16.433047Z WARNING 110: Slow operation. {"cid":"172.23.106.161:40210/31c25100","duration":"4205 ms","trace":"request=10723899526306866:4205138","command":"GET","peer":"172.23.106.161:40210"}
      2018-06-11T18:33:16.433697Z WARNING (other-3) Slow runtime for 'Running a flusher loop: shard 2' on thread writer_worker_0: 4198 ms
      2018-06-11T18:33:16.521086Z WARNING (other-1) Slow runtime for 'Checkpoint Remover on vb 213' on thread nonIO_worker_1: 20 ms
      2018-06-11T18:33:16.581773Z WARNING (other-2) Slow runtime for 'Checkpoint Remover on vb 498' on thread nonIO_worker_0: 33 ms
      2018-06-11T18:33:17.253240Z WARNING (other-2) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_0: 346 ms
      2018-06-11T18:33:18.229913Z WARNING (default) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_0: 623 ms
      2018-06-11T18:33:19.469135Z WARNING (other-1) Slow runtime for 'Checkpoint Remover on vb 339' on thread nonIO_worker_0: 20 ms
      

      The following is seen in the debug log:

      [error_logger:error,2018-06-11T18:33:24.935-07:00,ns_1@172.23.104.16:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
      =========================CRASH REPORT=========================
        crasher:
          initial call: erlang:apply/2
          pid: <0.12839.0>
          registered_name: []
          exception error: no match of right hand side value {error,closed}
            in function  mc_client_binary:stats_recv/4 (src/mc_client_binary.erl, line 164)
            in call from mc_client_binary:stats/4 (src/mc_client_binary.erl, line 406)
            in call from ns_memcached:do_handle_call/3 (src/ns_memcached.erl, line 460)
            in call from ns_memcached:worker_loop/3 (src/ns_memcached.erl, line 228)
          ancestors: ['ns_memcached-default',<0.12824.0>,
                        'single_bucket_kv_sup-default',ns_bucket_sup,
                        ns_bucket_worker_sup,ns_server_sup,ns_server_nodes_sup,
                        <0.170.0>,ns_server_cluster_sup,<0.89.0>]
          messages: []
          links: [<0.12825.0>,#Port<0.9577>]
          dictionary: [{last_call,verify_warmup},{sockname,{{127,0,0,1},51713}}]
          trap_exit: false
          status: running
          heap_size: 6772
          stack_size: 27
          reductions: 241838516
        neighbours:
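
      For context, the badmatch above is what the ns_memcached worker hits when the memcached socket disappears mid-call: the code pattern matches only the successful receive, so {error,closed} has no matching clause and the process crashes. A minimal sketch of that failure mode (hypothetical code, not the actual mc_client_binary implementation):

      %% Hypothetical sketch, not the actual mc_client_binary code: shows how a
      %% closed memcached socket surfaces as a badmatch crash like the one above.
      -module(mc_stats_sketch).
      -export([stats_recv/1]).

      stats_recv(Sock) ->
          %% Only the success case is matched; if memcached has exited and the TCP
          %% connection is gone, gen_tcp:recv/2 returns {error,closed} and this
          %% process dies with {badmatch,{error,closed}}.
          {ok, Header} = gen_tcp:recv(Sock, 24),  %% 24-byte binary protocol header
          Header.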
      
      

        Attachments


          Activity

          ajit.yagaty Ajit Yagaty added a comment -

          The final rebalance, which didn't finish, seems to have started at 2018-06-11T15:45:37.395:

          [user:info,2018-06-11T15:45:37.395-07:00,ns_1@172.23.104.16:<0.9134.0>:ns_orchestrator:idle:663]Starting rebalance, KeepNodes = ['ns_1@172.23.104.16','ns_1@172.23.104.18',
                                           'ns_1@172.23.104.19','ns_1@172.23.104.21',
                                           'ns_1@172.23.104.23','ns_1@172.23.104.25',
                                           'ns_1@172.23.104.93','ns_1@172.23.104.94',
                                           'ns_1@172.23.96.96'], EjectNodes = ['ns_1@172.23.104.17'], Failed over and being ejected nodes = []; no delta recovery nodes

          The rebalance operation seems to have gone past the bucket rebalance and moved well into indexer service rebalance:

          [ns_server:debug,2018-06-11T16:38:30.852-07:00,ns_1@172.23.104.16:service_rebalancer-index<0.29211.175>:service_agent:wait_for_agents:73]Waiting for the service agents for service index to come up on nodes:
['ns_1@172.23.104.23','ns_1@172.23.104.25','ns_1@172.23.104.93',
 'ns_1@172.23.104.94','ns_1@172.23.96.96']
          [rebalance:info,2018-06-11T16:38:30.874-07:00,ns_1@172.23.104.16:service_rebalancer-index-worker<0.29420.175>:service_rebalancer:rebalance:110]Rebalancing service index with id <<"592ef9c2f2d7c0b9f566356c4dc0386e">>.
KeepNodes: ['ns_1@172.23.104.23','ns_1@172.23.104.25','ns_1@172.23.104.93',
            'ns_1@172.23.104.94','ns_1@172.23.96.96']
          [ns_server:debug,2018-06-11T16:38:31.382-07:00,ns_1@172.23.104.16:service_rebalancer-index-worker<0.29420.175>:service_rebalancer:rebalance:115]Got node infos:
[{'ns_1@172.23.104.25',[{node_id,<<"2acd86ceedcc9b4aeaed4dd0d7765e68">>},
                        {priority,3},
                        {opaque,null}]},
 {'ns_1@172.23.96.96',[{node_id,<<"3f2a71ee76448d15e6e7c622a25fabca">>},
                       {priority,3},
                       {opaque,null}]},
 {'ns_1@172.23.104.23',[{node_id,<<"6821f3a765d69b20a69e0b116f3c9def">>},
                        {priority,3},
                        {opaque,null}]},
 {'ns_1@172.23.104.93',[{node_id,<<"dcff1d721e143e8db37989faec6ed768">>},
                        {priority,3},
                        {opaque,null}]},
 {'ns_1@172.23.104.94',[{node_id,<<"7beee02120f69f9d4ef766f31eda04f0">>},
                        {priority,3},
                        {opaque,null}]}]
          

          Node 172.23.104.25 has been chosen as the leader to carry out the indexer rebalance:


          [ns_server:debug,2018-06-11T16:38:33.135-07:00,ns_1@172.23.104.16:service_rebalancer-index-worker<0.29420.175>:service_rebalancer:rebalance:124]Using node 'ns_1@172.23.104.25' as a leader

          On node 172.23.104.25, the ns_server.json_rpc.log indicates that the StartTopologyChange call has been issued, but it seems the rebalance never completes:

          [json_rpc:debug,2018-06-11T16:38:33.139-07:00,ns_1@172.23.104.25:json_rpc_connection-index-service_api<0.2525.34>:json_rpc_connection:handle_call:158]sending jsonrpc call:{[{jsonrpc,<<"2.0">>},
                                 {id,7695},
                                 {method,<<"ServiceAPI.StartTopologyChange">>},
                                 {params,
                                  [{[{id,<<"592ef9c2f2d7c0b9f566356c4dc0386e">>},
                                     {currentTopologyRev,null},
                                     {type,<<"topology-change-rebalance">>},
                                     {keepNodes,
                                      [{[{nodeInfo,
                                          {[{nodeId,
                                             <<"6821f3a765d69b20a69e0b116f3c9def">>},
                                            {priority,3},
                                            {opaque,null}]}},
                                         {recoveryType,<<"recovery-full">>}]},
                                       {[{nodeInfo,
                                          {[{nodeId,
                                             <<"2acd86ceedcc9b4aeaed4dd0d7765e68">>},
                                            {priority,3},
                                            {opaque,null}]}},
                                         {recoveryType,<<"recovery-full">>}]},
                                       {[{nodeInfo,
                                          {[{nodeId,
                                             <<"dcff1d721e143e8db37989faec6ed768">>},
                                            {priority,3},
                                            {opaque,null}]}},
                                         {recoveryType,<<"recovery-full">>}]},
                                       {[{nodeInfo,
                                          {[{nodeId,
                                             <<"7beee02120f69f9d4ef766f31eda04f0">>},
                                            {priority,3},
                                            {opaque,null}]}},
                                         {recoveryType,<<"recovery-full">>}]},
                                       {[{nodeInfo,
                                          {[{nodeId,
                                             <<"3f2a71ee76448d15e6e7c622a25fabca">>},
                                            {priority,3},
                                            {opaque,null}]}},
                                         {recoveryType,<<"recovery-full">>}]}]},
                                     {ejectNodes,[]}]}]}]}

          It can be seen that even after several hours the progress made is very small:

          [json_rpc:debug,2018-06-12T00:06:00.837-07:00,ns_1@172.23.104.25:json_rpc_connection-index-service_api<0.2525.34>:json_rpc_connection:handle_info:94]got response: [{<<"id">>,19138},
                         {<<"result">>,
                          {[{<<"rev">>,<<"AAAAAAAAIjg=">>},
                            {<<"tasks">>,
                             [{[{<<"rev">>,<<"AAAAAAAAAAA=">>},
                                {<<"id">>,
                                 <<"prepare/592ef9c2f2d7c0b9f566356c4dc0386e">>},
                                {<<"type">>,<<"task-prepared">>},
                                {<<"status">>,<<"task-running">>},
                                {<<"isCancelable">>,true},
                                {<<"progress">>,0},
                                {<<"extra">>,
                                 {[{<<"rebalanceId">>,
                                    <<"592ef9c2f2d7c0b9f566356c4dc0386e">>}]}}]},
                              {[{<<"rev">>,<<"AAAAAAAAFlc=">>},
                                {<<"id">>,
                                 <<"rebalance/592ef9c2f2d7c0b9f566356c4dc0386e">>},
                                {<<"type">>,<<"task-rebalance">>},
                                {<<"status">>,<<"task-running">>},
                                {<<"isCancelable">>,true},
                                {<<"progress">>,0.9579263819875723},
                                {<<"extra">>,
                                 {[{<<"rebalanceId">>,
                                    <<"592ef9c2f2d7c0b9f566356c4dc0386e">>}]}}]}]}]}},
                         {<<"error">>,null}]

          As far as "why the buckets didn’t come back online after the memcached crash" is concerned, it's because traffic was not enabled on them. The janitor process is responsible for enabling traffic, but the janitor doesn't run while a rebalance is in progress, and here the rebalance operation is stuck in the indexer service rebalance.
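
          For illustration, a minimal sketch of that mutual exclusion (hypothetical names and code, not the actual ns_orchestrator implementation): while the orchestrator is in the rebalancing state, janitor ticks are simply dropped, so warmed-up buckets never get traffic re-enabled until the rebalance finishes.

          %% Hypothetical gen_fsm-style callback fragment; not actual ns_server code.
          handle_info(janitor_tick, rebalancing, State) ->
              %% Janitor and rebalance are mutually exclusive: ignore the tick.
              {next_state, rebalancing, State};
          handle_info(janitor_tick, idle, State) ->
              %% run_bucket_janitor/0 is a made-up helper standing in for the janitor
              %% pass that re-enables traffic on buckets that have finished warmup.
              ok = run_bucket_janitor(),
              {next_state, idle, State}.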

          As an aside, in the memory utilization graph that Jim has attached, the bump in the projector's memory consumption correlates with the start of the indexer service rebalance.

          Assigning it to the indexer folks to determine the cause of the hang.

           

          deepkaran.salooja Deepkaran Salooja added a comment -

          As far as "why the buckets didn’t come back online after the memcached crash" is concerned, it's because the traffic was not enabled on them. The janitor process is responsible for enabling the traffic and janitor doesn’t run while rebalancing is on-going and here the rebalance operation is stuck in indexer service rebalance.

          This doesn't seem like the right thing to do. Index rebalance can legitimately take hours to complete. If memcached goes down during that time, the data service shouldn't have to wait for the index rebalance to finish before taking traffic again.

          Index rebalance is waiting because the index build doesn't finish. Since one of the nodes went into the pending state, the projector on that node is returning an error and the indexer is waiting for that to resolve. Once the node goes back to the normal state, this will resolve.

          2018-06-12T00:08:28.651-07:00 [Error] KVSender::restartVbuckets MAINT_STREAM default Error in fetching cluster info Node is not a member of bucket: 172.23.104.17:8091
          2018-06-12T00:08:28.657-07:00 [Error] Timekeeper::sendRestartMsg Error Response from KV Node is not a member of bucket: 172.23.104.17:8091 For Request
          

          ajit.yagaty Ajit Yagaty added a comment -

          Spoke to Deep about this. As I understand it, the indexer rebalance may take longer to indicate that it’s complete when there are in-flight queries. If the memcached process has restarted during that time, the availability of the data service will be impacted because we don’t bring the buckets back online. This will be even more pronounced if there are more services to be rebalanced after the current one, as the janitor will not run until all services have been rebalanced.

          Currently, janitor cleanup and rebalance operations are mutually exclusive. The janitor cleans up both buckets and services. It seems like it would be beneficial to run the bucket cleanup after all the buckets have been rebalanced and the service cleanup after all the services have been rebalanced. This would improve the availability in cases like this. But this is a very involved change in ns_server. Since this is not a regression and this behavior also exists in Spock, I am moving this over to mad-hatter.
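
          A rough sketch of that proposed ordering (all function names hypothetical, not actual ns_server code): run the bucket cleanup as soon as the bucket phase of the rebalance completes, and the service cleanup once the service phases complete, instead of deferring all cleanup until the entire rebalance is done.

          %% Hypothetical sketch of the proposal; every function name here is made up.
          rebalance(KeepNodes, EjectNodes) ->
              ok = rebalance_buckets(KeepNodes, EjectNodes),
              ok = janitor_cleanup_buckets(),    %% re-enable bucket traffic right here
              ok = rebalance_services(KeepNodes, EjectNodes),
              ok = janitor_cleanup_services().   %% service cleanup after service rebalance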

          Dave Finlay - could you please let me know if this is fine by you?

          dfinlay Dave Finlay added a comment -

          I think so - especially with so little time left in Vulcan and the fact that this behavior is unchanged from 5.0 / 5.1.

          Poonam Dhavale, Abhijeeth Nuthan: I imagine that once we have the fix for MB-24242 this situation would in general be handled (if autofailover is enabled, etc).

          Abhijeeth.Nuthan Abhijeeth Nuthan added a comment -

          Dave Finlay: Provided auto-failover is enabled and all conditions are met, MB-24242 should by design interrupt the rebalance after the KV rebalance and auto-failover the nodes with not-ready buckets, thereby ensuring availability.
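
          For reference, a minimal sketch of that behaviour as described (hypothetical names, not the actual MB-24242 implementation): once the KV phase has completed, a node whose buckets are not ready can be auto-failed-over, interrupting the remaining service rebalance and restoring bucket availability.

          %% Hypothetical sketch; names are made up and the conditions are simplified.
          maybe_autofailover(Node, KvRebalanceDone) ->
              case auto_failover_enabled() andalso
                   KvRebalanceDone andalso           %% only after the KV phase has finished
                   not buckets_ready(Node) of
                  true ->
                      ok = interrupt_rebalance(),
                      ok = failover(Node);           %% brings the node's buckets back online
                  false ->
                      ok
              end.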


            People

            • Assignee: ajit.yagaty Ajit Yagaty
            • Reporter: mihir.kamdar Mihir Kamdar
            • Votes: 0
            • Watchers: 10


                Gerrit Reviews

                There are no open Gerrit changes
