Couchbase Server / MB-4805

ep-engine says warmup is done too early

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0-developer-preview-4
    • Fix Version/s: 2.0-developer-preview-4
    • Component/s: couchbase-bucket
    • Security Level: Public
    • Labels: None

      Description

      • Load a large amount of data into a single-node cluster.
      • Kill memcached.
      • Observe that ns_server quite quickly marks this node as green, and double check with stats that warmup is reported as done.
      • Then check whether the loaded data is there and observe that it is not. The item count can also be seen to be still growing.

      Note this is really bad, because the janitor will activate vbuckets that are still being loaded and start replication from them. That's very scary. It is quite likely the path that caused the data loss I've observed in 2.0.
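
      The symptom in the steps above (warmup reported as done while the item count keeps growing) can be checked mechanically. The following is only a sketch: it works on stats snapshots that have already been collected as plain dicts, the function name is hypothetical, and the `ep_warmup_thread` stat name with the value `complete` is an assumption about how the engine reports warmup status. It also assumes no client is writing to the bucket while the samples are taken, since new writes would also raise `curr_items`.

```python
from typing import Iterable, Mapping


def warmup_reported_too_early(snapshots: Iterable[Mapping[str, str]]) -> bool:
    """Return True if curr_items grows after warmup is reported complete.

    Each snapshot maps stat name -> string value, in the order sampled.
    """
    previous_count = None
    for stats in snapshots:
        # Assumed stat name/value; adjust to whatever the engine exposes.
        done = stats.get("ep_warmup_thread") == "complete"
        count = int(stats["curr_items"])
        if done and previous_count is not None and count > previous_count:
            return True
        # Only compare counts that were both sampled after "complete".
        previous_count = count if done else None
    return False


if __name__ == "__main__":
    samples = [
        {"ep_warmup_thread": "running", "curr_items": "0"},
        {"ep_warmup_thread": "complete", "curr_items": "100"},
        {"ep_warmup_thread": "complete", "curr_items": "250"},
    ]
    print(warmup_reported_too_early(samples))
```

      A growing count on its own does not prove data loss; it only shows that loading continued after the node was reported ready.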


        Activity

        trond Trond Norbye added a comment -

        ep-engine reports that warmup is complete (to allow external traffic) once the metadata for all items has been loaded into memory. At that point it is safe to run all operations (get / set / add / remove etc.), because we know whether the data is there or not. The "item count" represented by curr_items is incremented by the second phase, when we are actually loading the bodies of the items (we can always discuss whether that is the wrong thing to do, or whether it should be counted as we load the metadata).

        There is an ongoing task to rewrite the warmup phase to use a smarter way of determining which items we actually want to have loaded.

        Do you have a concrete bug I can try to find here, or is this just a suspicion that this may lead to data loss?
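
        To make the two phases described in this comment concrete, here is a toy model. It is not ep-engine code: the class, its methods, and the dict standing in for the on-disk store are all hypothetical, and counting resident bodies in `curr_items` is a simplification of the behavior described above. It only illustrates why a get can be answered correctly once metadata is loaded, even though the item count is still low.

```python
from enum import Enum


class Status(Enum):
    SUCCESS = "success"
    NOT_FOUND = "not_found"


class ToyWarmup:
    """Toy two-phase warmup; `disk` is a dict standing in for persisted data."""

    def __init__(self, disk):
        self.disk = dict(disk)
        self.metadata = set()   # keys known to exist (phase 1)
        self.values = {}        # bodies resident in memory (phase 2)
        self.curr_items = 0     # counts loaded bodies in this model
        self.warmup_complete = False

    def load_metadata(self):
        """Phase 1: learn every key, then report warmup as complete."""
        self.metadata = set(self.disk)
        self.warmup_complete = True

    def _load_body(self, key):
        if key not in self.values:
            self.values[key] = self.disk[key]
            self.curr_items += 1

    def load_bodies(self):
        """Phase 2: bring the item bodies into memory."""
        for key in self.metadata:
            self._load_body(key)

    def get(self, key):
        if not self.warmup_complete:
            raise RuntimeError("no traffic before metadata is loaded")
        if key not in self.metadata:
            return Status.NOT_FOUND, None  # definitive: key was never stored
        self._load_body(key)               # body fetched on demand if needed
        return Status.SUCCESS, self.values[key]
```

        In this model a persisted key can never yield NOT_FOUND after phase 1, which is the property the question below is probing.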

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        The observed behavior is that items that should be there are not there, according to GET results. So there is some obvious bug.

        trond Trond Norbye added a comment -

        So you have issued a get request and gotten a NOT_FOUND response (and not a tmpfail/busy) back for an item that you are 100% sure was there (and was persisted before you killed the system)?
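
        The distinction matters because only NOT_FOUND is a definitive answer; tmpfail and busy mean "ask again later". A minimal client-side sketch of that distinction follows. The status values are the memcached binary protocol codes for "key not found", "busy" and "temporary failure"; the `fetch` callable and the returned labels are hypothetical, not part of any client library.

```python
import time

# Memcached binary protocol response status codes.
SUCCESS = 0x0000
KEY_ENOENT = 0x0001   # NOT_FOUND
EBUSY = 0x0085
ETMPFAIL = 0x0086


def get_with_retry(fetch, key, attempts=5, delay=0.0):
    """Call fetch(key) -> (status, value), retrying only transient failures.

    `fetch` is a hypothetical callable wrapping whatever client is in use.
    """
    for _ in range(attempts):
        status, value = fetch(key)
        if status == SUCCESS:
            return "found", value
        if status == KEY_ENOENT:
            # Definitive miss: the server claims the key does not exist.
            return "not_found", None
        if status in (EBUSY, ETMPFAIL):
            # Transient: the server cannot answer yet; wait and retry.
            time.sleep(delay)
            continue
        raise RuntimeError(f"unexpected status 0x{status:04x}")
    return "still_busy", None
```

        A "not_found" result for an item known to be persisted would point at a real bug, whereas "still_busy" would only mean the node was not ready.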

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Hm. I don't know if it was tmpfail or not_found. Will double check. Item was definitely persisted.

        trond Trond Norbye added a comment -

        I'm going to write a test case to see if I can reproduce it

        trond Trond Norbye added a comment - http://review.couchbase.org/#change,13331
        thuan Thuan Nguyen added a comment -

        Integrated in github-ep-engine-2-0 #196 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/196/)
        MB-4805 Fix items lost during startup (Revision 3dc3329a3a97c71a592a45b983c00e80520c3309)

        Result = SUCCESS
        Trond Norbye :
        Files :

        • ep.cc
        thuan Thuan Nguyen added a comment -

        Integrated in github-ep-engine-2-0 #201 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/201/)
        Backport: MB-4805 Fix items lost during startup (Revision 81f4e978a023b8b1c3c0e14d6e100db8679ebfe2)

        Result = SUCCESS
        Trond Norbye :
        Files :

        • ep.cc

          People

          • Assignee: trond Trond Norbye
          • Reporter: alkondratenko Aleksey Kondratenko (Inactive)
