Couchbase Server / MB-7272

memcached/ep-engine crashes in flusher or other paths when it receives a shutdown message from ns-server

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 1.8.1, 2.0
    • Fix Version/s: 2.0
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Labels:
      None

      Description

      This crash can occur in several situations:

      1- when a node is warming up and ns-server sends a shutdown command to delete the bucket during warmup
      2- when a failed-over node is warming up and ns-server sends a shutdown command to delete the bucket
      3- when a node has been rebalanced out but memcached is for some reason still doing work, and ns-server sends a shutdown command

      Scenario #2 is very common. In large environments where warmup takes 8 hours or so, the user will keep retrying the rebalance button, and it won't succeed unless the user manually kills the memcached process with the kill command.

      In general, ep-engine needs to abort instead of crashing.
      On the other hand, during a normal shutdown, when ns-server sends ep-engine a command to shut down, ep-engine should wait until all items are flushed and then shut down.

      It seems we need to differentiate between a command that says shut down gracefully and one that says shut down with force.
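
      The graceful/forced distinction proposed above could look roughly like the following. This is only a minimal sketch to illustrate the idea: `ShutdownMode`, `BucketSketch`, and all member names are hypothetical and do not come from ep-engine or ns-server, and the "flush" is a stand-in for real persistence.

```cpp
#include <cstddef>
#include <deque>
#include <string>

// Hypothetical names throughout; none of these are ep-engine APIs.
enum class ShutdownMode { Graceful, Forced };

class BucketSketch {
public:
    // Queue an item that still has to be written to disk.
    void queueDirtyItem(const std::string& key) { dirty_.push_back(key); }

    // Stops the bucket and returns how many items were flushed first.
    std::size_t shutdown(ShutdownMode mode) {
        std::size_t flushed = 0;
        if (mode == ShutdownMode::Graceful) {
            // Normal shutdown: drain the write queue before stopping.
            while (!dirty_.empty()) {
                dirty_.pop_front();  // stand-in for persisting one item
                ++flushed;
            }
        } else {
            // Forced shutdown (e.g. bucket deletion): drop pending work
            // and stop right away instead of waiting for the flusher.
            dirty_.clear();
        }
        stopped_ = true;
        return flushed;
    }

    bool stopped() const { return stopped_; }
    std::size_t pending() const { return dirty_.size(); }

private:
    std::deque<std::string> dirty_;
    bool stopped_ = false;
};
```

      In this sketch a caller deleting a bucket would pass `ShutdownMode::Forced`, while a regular server stop would pass `ShutdownMode::Graceful` and get back the number of items flushed before the bucket stopped.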

      Some of the related bugs:

      http://www.couchbase.com/issues/browse/MB-7110
      http://www.couchbase.com/issues/browse/MB-7263

        Activity

        jin Jin Lim (Inactive) added a comment -

        The toy build for a fix candidate has been uploaded for testing. QE and the development team will be verifying the fix for the next few days. Thanks!

        http://builds.hq.northscale.net/latestbuilds/couchbase-server-community_toy-couchstore-x86_64_2.0.0-11302012A-toy.rpm

        andreibaranouski Andrei Baranouski added a comment -

        Tried to test the toy build against the case in MB-7110 ([system test] rebalance failed due to "Failed to wait deletion of some buckets on some nodes"),
        with these steps:
        1. cluster of 4 nodes, 1 default and 1 sasl bucket with 1500MB of RAM allocated
        10.3.121.112, 10.3.121.113, 10.3.121.114, 10.3.121.115
        2. load ~1.6M items into each bucket
        3. add node 10.3.121.116 to the cluster

        Result: received exactly the same errors as in
        MB-7263 (Service memcached constantly exited on dest master node after certain steps in XDCR + rebalance scenarios): Port server memcached on node 'ns_1@10.3.121.63' exited with status 71. failed to listen on TCP port 11210: Address already in use

        Port server memcached on node 'ns_1@10.3.121.112' exited with status 71. Restarting. Messages: Mon Dec 3 03:11:51.120720 PST 3: failed to listen on TCP port 11210: Address already in use

        Leaving the cluster alive for investigation.

        jin Jin Lim (Inactive) added a comment - edited

        Thanks Andrei. Please leave the cluster up while the development team investigates the issue.

        In the meantime, please note that:

        1) This bug tracks the ep-engine crash when it receives the shutdown (delete) while warming up. The toy build must have addressed the issue, and your last test didn't see the crash from ep-engine threads.
        2) As you stated, the latest error you encountered (OSERR = 71, port already in use) sounds much like the original issue of MB-7263, which I will continue to investigate from this point on.

        Thanks,
        Jin

        steve Steve Yen added a comment -

        From the bug-scrub meeting:

        It looks like there is a fix from Jin and a fix from the ns-server team (the infinity fix), and they both need to go in.

        farshid Farshid Ghods (Inactive) added a comment -

        build 1974 has this fix

        kzeller kzeller added a comment -

        Added to the release notes as:

        During Couchbase Server warmup or rebalance, if you delete a data bucket,
        it will cause the node to crash.

        thuan Thuan Nguyen added a comment -

        Integrated in github-ep-engine-2-0 #461 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/461/)
        MB-7272 stop warmup task immediately if shutdown is being requested (Revision 6b89027ba3b2b461d978d593b14918040c819e2c)

        Result = SUCCESS
        Jin :
        Files :

        • src/warmup.cc
        • src/warmup.hh
        • src/ep.cc
        • src/ep.hh
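
        The commit message above ("stop warmup task immediately if shutdown is being requested") describes a common pattern: a long-running task checks a shared flag between units of work. The following is a hedged illustration of that pattern only, not the code from src/warmup.cc; `WarmupSketch`, `requestShutdown`, and `step` are made-up names.

```cpp
#include <atomic>
#include <cstddef>

// Hypothetical sketch; not taken from src/warmup.cc or src/ep.cc.
class WarmupSketch {
public:
    explicit WarmupSketch(std::size_t totalBatches) : total_(totalBatches) {}

    // Called from the shutdown path, possibly on another thread.
    void requestShutdown() { shutdownRequested_.store(true); }

    // Loads one batch per call. Returns false once warmup is complete
    // or a shutdown has been requested, so the caller stops scheduling it.
    bool step() {
        if (shutdownRequested_.load() || loaded_ == total_) {
            return false;
        }
        ++loaded_;  // stand-in for loading one batch of items from disk
        return true;
    }

    std::size_t loaded() const { return loaded_; }
    bool complete() const { return loaded_ == total_; }

private:
    const std::size_t total_;
    std::size_t loaded_ = 0;
    std::atomic<bool> shutdownRequested_{false};
};
```

        Checking the flag at the start of every step bounds the delay between a shutdown request and the task stopping to the duration of a single batch, rather than the full warmup time.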

          People

          • Assignee:
            jin Jin Lim (Inactive)
            Reporter:
            farshid Farshid Ghods (Inactive)