Couchbase Server / MB-7160

if flush times out in the final stage (the janitor creating vbuckets again) it returns success, causing clients to see TMPFAIL after flush appears to succeed (WAS: there are reports that even after invoking FLUSH nodes return TMPFAIL...)

    Details

    • Type: Bug
    • Status: Reopened
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 2.0-beta-2, 2.0.1, 2.1.0
    • Fix Version/s: bug-backlog
    • Component/s: ns_server
    • Security Level: Public
    • Triage: Triaged
    • Is this a Regression?: Yes

      Description

      "reported by SDK team"
      After a flush (through REST), for a time we still get TMPFAIL returned by server nodes. This is not expected; it is annoying from an application's point of view and/or causes problems with automated tests.

      update:

      I've updated the subject. Indeed this is possible. But IMHO, given that clients should always be prepared to handle TMPFAIL from any operation, I've lowered this to Minor.

      further update:

      I disagree with "always be able to handle TMPFAIL", especially in the case of running tests on the SDK side. At the moment we ask our users to handle TMPFAIL directly. That is intentional and makes sense as a pressure-relief valve in steady state, but around these flushes, and especially in unit tests, it would be best if the cluster could simply block either the operation request or the flush response until the flush is complete. A client-side workaround is sketched below.

      Raising this to major owing to end-user reports of trouble here.
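
      Until the server can block as described above, test suites can work around the window by treating flush as asynchronous and polling until the bucket accepts writes again. A minimal sketch using the PHP extension from the attached test script; the probe key, the set()-based probe and the 60-second deadline are illustrative assumptions, not part of any fix:

      <?php
      // Sketch only: flush, then poll with a throwaway write until the bucket
      // stops returning temporary failures.
      $cb = new Couchbase("localhost", "Administrator", "password");
      $cb->flush();

      $deadline = time() + 60;                 // give the janitor up to a minute
      while (time() < $deadline) {
          try {
              $cb->set("flush-probe", "x");    // any cheap write will do
              break;                           // write accepted: vbuckets are active again
          } catch (CouchbaseException $e) {
              usleep(250000);                  // TMPFAIL or similar: back off and retry
          }
      }
      ?>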

      1. test.rb
        0.3 kB
        Sergey Avseyev
      2. test.txt
        0.4 kB
        Sergey Avseyev


          Activity

          wayne Wayne Siu added a comment -

          Triaged in Maintenance Meeting on Dec 16, 2014.
          Cihan,
          Please review and see if we should keep this ticket. It's been over a year without any activity.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Matt, let's be specific here. This ticket is only about returning 204 if flush takes longer than 30 seconds.

          There's another ticket on my team to investigate reported slowness for cases where we know it's not fsyncs. Once we investigate we might or might not be able to improve it.
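
          For context, the timeout behaviour described above can be observed from outside by timing the REST flush call and checking the HTTP status code it returns. A rough sketch (bucket name and credentials are placeholders; the URI is the standard per-bucket doFlush endpoint):

          <?php
          // Rough sketch: time the REST flush and report the HTTP status.
          // A success status returned long before the janitor has re-created
          // the vbuckets is the symptom this ticket describes.
          $ch = curl_init("http://localhost:8091/pools/default/buckets/default/controller/doFlush");
          curl_setopt($ch, CURLOPT_POST, true);
          curl_setopt($ch, CURLOPT_USERPWD, "Administrator:password");
          curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);

          $start = microtime(true);
          curl_exec($ch);
          $status  = curl_getinfo($ch, CURLINFO_HTTP_CODE);
          $elapsed = microtime(true) - $start;
          curl_close($ch);

          printf("flush returned HTTP %d after %.1f s\n", $status, $elapsed);
          ?>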

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Don't get "or" above. This is not fixed yet. This is fixable relatively easily (assuming 204 is ok). But I won't be crying if it's not for 3.0.

          ingenthr Matt Ingenthron added a comment -

          Given that this was deferred, we should at least release note it.

          ingenthr Matt Ingenthron added a comment -

          This may or may not be directly related to the specific issue reported here, but it does seem to be a contributing factor.

          Trond recently carried out a series of experiments to test whether the flush-time problems were due to the underlying MB-6232, as previously suspected. MB-6232 says that fsyncs, and doing a lot of per-vbucket work serially, cause much of the vbucket creation/deletion slowness we see.

          To isolate this and find a solution for a particular deployment, a number of tests were run. The last experiment covered two configurations:
          1) Reduce the number of vbuckets to 8, but otherwise use a real filesystem, disks, etc. (~2 sec)
          2) Use ramfs (~9.5 sec)

          1) Reduce the number of vbuckets:

          I uninstalled 2.0.1-build 170 from my box and removed /opt/couchbase before reinstalling it and stopping it immediately. Then I edited /opt/couchbase/bin/couchbase-server and added (right under the license header):

          # Reduce the vbucket count from the default 1024 to 8 (must be a power of two)
          COUCHBASE_NUM_VBUCKETS=8
          export COUCHBASE_NUM_VBUCKETS

          This makes the server use only 8 vbuckets instead of the default 1024. The number HAS to be a power of two, and should not be less than the number of nodes in your cluster (otherwise some nodes would end up without any vbuckets).

          Running the same program as I used earlier (which flushes an empty bucket) now takes ~2 seconds on a filesystem mounted with "barrier=1" and less than a second (0.7 s) on a filesystem mounted with "barrier=0".

          2) Using ramfs

          I created the ramfs by running:

          mount -t ramfs -o size=100m ramfs /tmp/couchbase_data
          chown couchbase:couchbase /tmp/couchbase_data

          Then installed couchbase and had it use /tmp/couchbase_data for storage.

          Running flush on an empty bucket takes ~9.5 sec. Reducing the number of vbuckets to 8, as described in (1), brought it down to 0.36 s.

          — — —

          In an earlier set of experiments there was more detailed measurement of where the time was going. I've edited this slightly from Trond's notes:

          Test rig: Intel(R) Core(TM)2 Quad CPU Q6600 @ 2.40GHz with 4 GB of memory.
          The system uses ext4 as the filesystem and gives 2 GB of memory to Couchbase (the two sample databases were installed during installation).

          The test was run with the following PHP script, which is intended to simulate the series of flushes a unit-test suite would perform:

          <?php
          // Run 10 flushes in a row against the default bucket, counting how many
          // throw an exception (e.g. a temporary failure or an already-running flush).
          $counter = 1;
          $errors = 0;

          $cb = new Couchbase("localhost", "Administrator", "password");

          while ($counter < 11) {
              try {
                  echo "\rFlushing: " . $counter . " Errors: " . $errors;
                  $cb->flush();
              } catch (CouchbaseException $e) {
                  $errors++;
              }
              $counter++;
          }
          ?>

          When I initially ran the script it reported:
          trond@desktop:~/compile/sdk/php$ time php -c php.ini example/buckets.php
          Flushing: 10 Errors: 8
          real 5m36.613s
          user 0m0.012s
          sys 0m0.020s

          Please note that we can't really "trust" the numbers above when there were errors, because I don't check why a flush failed (the next one could, for instance, fail because a flush is already running).

          When I remounted the filesystem with barrier=0 I got:
          trond@desktop:~/compile/sdk/php$ time php -c php.ini example/buckets.php
          Flushing: 10 Errors: 0
          real 3m4.609s
          user 0m0.028s
          sys 0m0.008s

          At least all of the flush calls succeed, but look at the time: it takes more than 3 minutes to run 10 flush commands. For a user running 500 unit tests, each separated by a flush, just the flush calls would add roughly 2.5 hours of waiting to their 500-test cycle (and depending on how much data they add it may take longer).
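
          A quick back-of-the-envelope check of that estimate, using the barrier=0 numbers above (illustrative only):

          <?php
          // Back-of-the-envelope check of the "roughly 2.5 hours" estimate above.
          $ten_flushes = 3 * 60 + 4.609;      // 3m4.609s for 10 flushes (barrier=0 run)
          $per_flush   = $ten_flushes / 10;   // ~18.5 s per flush
          $total       = 500 * $per_flush;    // 500 unit tests, one flush each
          printf("%.1f s per flush, %.0f s total (~%.1f hours)\n",
                 $per_flush, $total, $total / 3600);
          // Prints roughly: 18.5 s per flush, 9230 s total (~2.6 hours)
          ?>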

          What I do find interesting is that when I run top while doing this, I see beam.smp (the Erlang VM) using 100-170% CPU (almost two cores), whereas memcached is relatively idle. From what I can see, flushing a bucket sets all of the vbuckets to "dead" before activating them again.

          I know we don't have stats for everything we do, but running the "timings" cbstats command shows (omitting the distribution printout):

          set_vb_cmd (2048 total)
          Avg : ( 27us)
          del_vb_cmd (2048 total)
          Avg : ( 44us)
          disk_vbstate_snapshot (1622 total)
          Avg : ( 6ms)
          disk_vb_del (1024 total)
          Avg : ( 3ms)

          If I add all of these up (assuming they're all done in sequence) we end up at roughly 13 seconds, but the "time" report of the php process doing a single flush shows:
          real 0m23.613s
          user 0m0.016s
          sys 0m0.012s

          So there are about 10 seconds not accounted for, and this seems to vary (starting php and doing a single set takes 0m0.033s, so it's not PHP runtime overhead).

          I'm not entirely sure what "disk_vbstate_snapshot" is used for, but it accounts for roughly 9 of the 13 seconds.
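
          Summing the per-operation averages above (assuming they run serially) reproduces both the ~13 s total and the share attributed to disk_vbstate_snapshot:

          <?php
          // Reproduce the "roughly 13 secs" total from the cbstats timings above.
          $ops = array(
              // name => array(count, average seconds)
              "set_vb_cmd"            => array(2048, 27e-6),
              "del_vb_cmd"            => array(2048, 44e-6),
              "disk_vbstate_snapshot" => array(1622, 6e-3),
              "disk_vb_del"           => array(1024, 3e-3),
          );
          $total = 0;
          foreach ($ops as $name => $op) {
              $t = $op[0] * $op[1];
              $total += $t;
              printf("%-22s %6.2f s\n", $name, $t);
          }
          printf("%-22s %6.2f s\n", "total (if serial)", $total);
          // disk_vbstate_snapshot alone is ~9.7 s of the ~12.9 s total.
          ?>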

          When I disabled that function in ep-engine, the example program above ran those 10 flushes in:

          Flushing: 10 Errors: 0
          real 0m22.411s
          user 0m0.020s
          sys 0m0.012s

          Interestingly enough beam.smp is now down to ~100% CPU. For fun I tried running the script with 500 flushes (to see if it changed over time), and we're down to less than 18 minutes of flushing (which isn't that bad).

          I know we added that call to disk_vbstate_snapshot for a reason, so just removing it isn't an option. I am, however, pretty sure we don't need to snapshot the state 1600+ times while running a flush; I would have guessed two snapshots would do the trick (one when all vbuckets are disabled, one when they are all back up again). It is ns_server, not ep-engine, that needs to ensure the cluster map is consistent after a potential crash during the process. To be honest, ns_server could just persist a flag saying it is doing a flush before it starts, then clear the flag when it receives the response that all of the vbuckets are dead, if it wants to protect itself from coming back up in a "hosed" configuration after a crash mid-flush.

          Personally I think we should change our code so that ns_server sends a SINGLE message to ep-engine listing ALL of the vbuckets it wants to shut down, and gets a single reply when that is done (we could then take a SINGLE vbucket snapshot once all of them have been disabled "atomically"), followed by a SINGLE enable message (again resulting in a single snapshot_vbucket_state).

          Right now we're also adding FLUSH markers to our TAP connections. Given that ns_server orchestrates the entire process, it could just shut down the TAP connections before running the flush command and restart the replication chains when it's done (the database should be empty anyway, and we wouldn't have to add special flush logic to the TAP streams).

          It would be interesting to know whether all of this message passing and constant vbucket state changing accounts for the heavy CPU usage during the flush.


            People

            • Assignee:
              cihan Cihan Biyikoglu (Inactive)
              Reporter:
              farshid Farshid Ghods (Inactive)
            • Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

              Dates

              • Created:
                Updated:

                Gerrit Reviews

                There are no open Gerrit changes