Couchbase Server / MB-30074

[System Test] KV node still in pending state after memcached crashes

    Details

      Description

      Build : 5.5.0-2884
      Test : GSI component test : -test tests/2i/test_idx_rebalance_replica_vulcan_kv_opt.yml -scope tests/2i/scope_idx_rebalance_replica_vulcan.yml
      Scale : 3
      Iteration : 3rd iteration (~20 hrs of run)
      In the 3rd iteration, the test step that rebalances out a KV node fails when memcached crashes on the master node. The cluster is largely unusable after this: the subsequent rebalance gets stuck (see MB-30073) and the buckets stay in the warmup state indefinitely.

      The following error appears in the diag log:

      Service 'memcached' exited with status 137. Restarting. Messages:
      2018-06-11T18:33:16.430335Z WARNING 106: Slow operation. {"cid":"172.23.106.161:40160/81905100","duration":"4193 ms","trace":"request=10723899535396330:4193337","command":"GET","peer":"172.23.106.161:40160"}
      2018-06-11T18:33:16.433047Z WARNING 110: Slow operation. {"cid":"172.23.106.161:40210/31c25100","duration":"4205 ms","trace":"request=10723899526306866:4205138","command":"GET","peer":"172.23.106.161:40210"}
      2018-06-11T18:33:16.433697Z WARNING (other-3) Slow runtime for 'Running a flusher loop: shard 2' on thread writer_worker_0: 4198 ms
      2018-06-11T18:33:16.521086Z WARNING (other-1) Slow runtime for 'Checkpoint Remover on vb 213' on thread nonIO_worker_1: 20 ms
      2018-06-11T18:33:16.581773Z WARNING (other-2) Slow runtime for 'Checkpoint Remover on vb 498' on thread nonIO_worker_0: 33 ms
      2018-06-11T18:33:17.253240Z WARNING (other-2) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_0: 346 ms
      2018-06-11T18:33:18.229913Z WARNING (default) Slow runtime for 'Backfilling items for a DCP Connection' on thread auxIO_worker_0: 623 ms
      2018-06-11T18:33:19.469135Z WARNING (other-1) Slow runtime for 'Checkpoint Remover on vb 339' on thread nonIO_worker_0: 20 ms
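
A minimal sketch for triaging the excerpt above, assuming the reported exit status follows the common "128 + signal number" convention (under which 137 would correspond to signal 9, SIGKILL). The log does not say what sent the signal. The helper names `decode_exit_status` and `slow_operations` are hypothetical and not part of any Couchbase tooling; the regular expression is written only against the "Slow operation" lines quoted here.

```python
import json
import re
import signal

# Matches memcached "Slow operation" warnings in the format quoted in this report.
SLOW_OP = re.compile(r"^(?P<ts>\S+) WARNING \d+: Slow operation\. (?P<body>\{.*\})$")


def decode_exit_status(status):
    """Return the signal name if status looks like 128 + signal, else None."""
    if status > 128:
        try:
            return signal.Signals(status - 128).name
        except ValueError:
            return None  # not a known signal number on this platform
    return None


def slow_operations(lines):
    """Yield (timestamp, command, duration_ms) for each slow-operation line."""
    for line in lines:
        match = SLOW_OP.match(line.strip())
        if match:
            body = json.loads(match.group("body"))
            duration_ms = int(body["duration"].split()[0])  # "4193 ms" -> 4193
            yield match.group("ts"), body["command"], duration_ms
```

Feeding the diag log lines through `slow_operations` gives a quick view of which commands stalled and for how long just before the exit.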
      

      The following is seen in the debug log:

      [error_logger:error,2018-06-11T18:33:24.935-07:00,ns_1@172.23.104.16:error_logger<0.6.0>:ale_error_logger_handler:do_log:203]
      =========================CRASH REPORT=========================
        crasher:
          initial call: erlang:apply/2
          pid: <0.12839.0>
          registered_name: []
          exception error: no match of right hand side value {error,closed}
            in function  mc_client_binary:stats_recv/4 (src/mc_client_binary.erl, line 164)
            in call from mc_client_binary:stats/4 (src/mc_client_binary.erl, line 406)
            in call from ns_memcached:do_handle_call/3 (src/ns_memcached.erl, line 460)
            in call from ns_memcached:worker_loop/3 (src/ns_memcached.erl, line 228)
          ancestors: ['ns_memcached-default',<0.12824.0>,
                        'single_bucket_kv_sup-default',ns_bucket_sup,
                        ns_bucket_worker_sup,ns_server_sup,ns_server_nodes_sup,
                        <0.170.0>,ns_server_cluster_sup,<0.89.0>]
          messages: []
          links: [<0.12825.0>,#Port<0.9577>]
          dictionary: [{last_call,verify_warmup},{sockname,{{127,0,0,1},51713}}]
          trap_exit: false
          status: running
          heap_size: 6772
          stack_size: 27
          reductions: 241838516
        neighbours:
      
      


          Activity

          ajit.yagaty Ajit Yagaty [X] (Inactive) added a comment -

          Spoke to Deep about this. As I understand it, the indexer rebalance may take longer to indicate that it’s complete when there are in-flight queries. If the memcached process restarts during that time, the availability of the data service is impacted because we don’t bring the buckets online. This is even more pronounced if there are more services to be rebalanced after the current one is done, as the janitor does not run until all services are rebalanced.

          Currently, janitor cleanup and rebalance operations are mutually exclusive. The janitor cleans up both buckets and services. It seems like it would be beneficial to run the bucket cleanup after all the buckets have been rebalanced and the service cleanup after all the services have been rebalanced. This would improve the availability in cases like this. But this is a very involved change in ns_server. Since this is not a regression and this behavior also exists in Spock, I am moving this over to mad-hatter.

          Dave Finlay - could you please let me know if this is fine by you?
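
The ordering change proposed in the comment above can be illustrated with a toy model. This is only a sketch under stated assumptions: the function `rebalance_plan`, the step labels, and the service names are hypothetical, and nothing here reflects how ns_server is actually implemented. It compares the current behavior (janitor runs only after every service has been rebalanced) with the proposed one (bucket cleanup runs as soon as the KV rebalance is done).

```python
def rebalance_plan(services, split_cleanup):
    """Return the ordered steps of a rebalance in this toy model.

    services: service names in rebalance order; "kv" stands for the buckets.
    split_cleanup: False models the current behavior (all cleanup at the end),
                   True models the proposal (bucket cleanup right after KV).
    """
    steps = []
    for service in services:
        steps.append(f"rebalance:{service}")
        if split_cleanup and service == "kv":
            steps.append("janitor:buckets")
    if not split_cleanup:
        steps.append("janitor:buckets")
    steps.append("janitor:services")
    return steps


def steps_before_bucket_cleanup(plan):
    """Count how many steps run before buckets can be brought back online."""
    return plan.index("janitor:buckets")
```

In this model, with services `["kv", "index", "fts"]`, bucket cleanup waits for three steps under the current ordering and for one under the proposed ordering, which is the availability gain the comment describes.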

          dfinlay Dave Finlay added a comment -

          I think so - especially with so little time left in Vulcan and the fact that this behavior is unchanged from 5.0 / 5.1.

          Poonam Dhavale, Abhijeeth Nuthan: I imagine that once we have the fix for MB-24242 this situation would in general be handled (if autofailover is enabled, etc).

          Abhijeeth.Nuthan Abhijeeth Nuthan added a comment -

          Dave Finlay: Provided auto-failover is enabled and all conditions are met, MB-24242 should by design interrupt the rebalance after the KV rebalance and auto-failover the nodes whose buckets are not ready, thereby ensuring availability.

          dfinlay Dave Finlay added a comment -

          Resolving, since with the resolution of MB-24242 rebalance should be interrupted by a memcached failure.

          Mihir Kamdar: it would be good to create some tests that do the following to verify that this is indeed resolved:

          1. Run a rebalance in which the indexing phase takes some time (e.g. add a new indexer node so that moving indexing partitions onto the new node takes a while).
          2. Restart memcached during the indexing rebalance.

          Can we do this?

          (Separately, it should really be the case that if indexing is trying to open connections against memcached and it can't it should give up after a number of tries and fail its rebalance.)

          CC: Deepkaran Salooja
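
The parenthetical suggestion in the comment above (give up after a number of connection attempts and fail the rebalance) amounts to a bounded retry. The following is a minimal sketch of that pattern, not the indexer's code: `RebalanceFailed`, `connect_with_retries`, and the default attempt count, delay, and timeout are all assumptions chosen for illustration.

```python
import socket
import time


class RebalanceFailed(Exception):
    """Hypothetical error raised once the retry budget is exhausted."""


def connect_with_retries(host, port, attempts=5, delay_s=1.0,
                         connect=socket.create_connection):
    """Try to open a TCP connection a bounded number of times.

    Raises RebalanceFailed instead of retrying forever, so the caller can
    fail its rebalance rather than hang. `connect` is injectable for testing.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return connect((host, port), timeout=2.0)
        except OSError as exc:  # covers refused, reset, and timed-out connections
            last_error = exc
            if attempt < attempts:
                time.sleep(delay_s)  # fixed pause between attempts
    raise RebalanceFailed(
        f"could not connect to {host}:{port} after {attempts} attempts"
    ) from last_error
```

A fixed delay keeps the sketch short; a real implementation might prefer exponential backoff, but the key property is the same: the number of attempts is finite and the failure is surfaced to the rebalance.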

          deepkaran.salooja Deepkaran Salooja added a comment -

          Filed MB-34796 for indexer improvement.


            People

            Assignee:
            ajit.yagaty Ajit Yagaty [X] (Inactive)
            Reporter:
            mihir.kamdar Mihir Kamdar