  Couchbase Server / MB-6232

ep-engine needs 1.5 minutes to create 1k vbuckets. Seems too slow (faster with barrier=0)

    Details

    • Triage:
      Triaged
    • Is this a Regression?:
      Yes

      Description

      See the subject line above.

      To reproduce, just start a single-node dev cluster and try to load the beer-sample bucket (it'll fail, though perhaps for a different reason).

      You can observe this in the logs:

      We're doing set_vbucket_state requests from the janitor (the ns_memcached messages), and these are the first messages from mccouch about the just-created vbuckets (the capi_set_view_manager and mc_connection messages). Note: I've added extra log messages at the beginning and end of sync_notify to see whether it's couchdb's or ns_server's fault. From the timestamps it's clear that it's not.

      [ns_server:debug,2012-08-15T9:47:40.309,n_0@127.0.0.1:<0.868.0>:mc_connection:do_notify_vbucket_update:106]sending out sync_notify
      [ns_server:info,2012-08-15T9:47:40.309,n_0@127.0.0.1:<0.870.0>:ns_memcached:do_handle_call:485]Changed vbucket 740 state to active
      [views:debug,2012-08-15T9:47:40.309,n_0@127.0.0.1:mc_couch_events:capi_set_view_manager:handle_mc_couch_event:418]Got set_vbucket event for beer-sample/1023. Updated state: active
      [ns_server:info,2012-08-15T9:47:40.309,n_0@127.0.0.1:<0.870.0>:ns_memcached:do_handle_call:485]Changed vbucket 739 state to active
      [ns_server:debug,2012-08-15T9:47:40.309,n_0@127.0.0.1:<0.868.0>:mc_connection:do_notify_vbucket_update:113]done
      [ns_server:info,2012-08-15T9:47:40.310,n_0@127.0.0.1:<0.870.0>:ns_memcached:do_handle_call:485]Changed vbucket 738 state to active
      [ns_server:info,2012-08-15T9:47:40.310,n_0@127.0.0.1:<0.870.0>:ns_memcached:do_handle_call:485]Changed vbucket 737 state to active
      

      And the last message is at:

      [ns_server:debug,2012-08-15T9:49:14.410,n_0@127.0.0.1:<0.868.0>:mc_connection:do_notify_vbucket_update:106]sending out sync_notify
      [views:debug,2012-08-15T9:49:14.410,n_0@127.0.0.1:mc_couch_events:capi_set_view_manager:handle_mc_couch_event:418]Got set_vbucket event for beer-sample/1. Updated state: active
      [ns_server:debug,2012-08-15T9:49:14.410,n_0@127.0.0.1:<0.868.0>:mc_connection:do_notify_vbucket_update:113]done
      [ns_server:debug,2012-08-15T9:49:14.501,n_0@127.0.0.1:<0.868.0>:mc_connection:do_notify_vbucket_update:106]sending out sync_notify
      [views:debug,2012-08-15T9:49:14.502,n_0@127.0.0.1:mc_couch_events:capi_set_view_manager:handle_mc_couch_event:418]Got set_vbucket event for beer-sample/0. Updated state: active
      [ns_server:debug,2012-08-15T9:49:14.502,n_0@127.0.0.1:<0.868.0>:mc_connection:do_notify_vbucket_update:113]done
      

      I've just retried after remounting with barrier=0 and it was blazing fast. What this mount option changes is that short fsyncs become nearly instant, because they merely send data to the disk's buffer without waiting until the data actually hits the platter.

      So that's evidence of excessive use of fsync somewhere during vbucket db file creation, and that's clearly on the ep-engine/couchstore side.
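
      As a back-of-the-envelope check: a spinning disk that honors barriers completes on the order of 100 fsyncs per second at best, so 1024 vbucket files fsynced one after another lands right around the 1.5 minutes seen above (the log timestamps span roughly 94 seconds, i.e. about 90 ms per vbucket). The suspected access pattern can be modeled with a standalone sketch like the following (illustration only, not ep-engine code; file names and payload are made up):

          /* Models creating 1024 vbucket files with one sequential fsync each. */
          #include <fcntl.h>
          #include <stdio.h>
          #include <time.h>
          #include <unistd.h>

          int main(void) {
              const int nvbuckets = 1024;
              const char payload[] = "{\"state\":\"active\"}"; /* stand-in for the vbucket state doc */
              struct timespec start, end;
              clock_gettime(CLOCK_MONOTONIC, &start);

              for (int vb = 0; vb < nvbuckets; vb++) {
                  char path[64];
                  snprintf(path, sizeof(path), "vb_%d.couch.1", vb);
                  int fd = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
                  if (fd < 0) { perror("open"); return 1; }
                  if (write(fd, payload, sizeof(payload) - 1) < 0) { perror("write"); return 1; }
                  if (fsync(fd) < 0) { perror("fsync"); return 1; } /* the expensive part with barriers on */
                  close(fd);
              }

              clock_gettime(CLOCK_MONOTONIC, &end);
              double secs = (end.tv_sec - start.tv_sec) + (end.tv_nsec - start.tv_nsec) / 1e9;
              printf("%d create+fsync cycles: %.1f s (%.1f ms each)\n",
                     nvbuckets, secs, secs * 1000 / nvbuckets);
              return 0;
          }

      With barrier=0 (or a write cache that ignores flushes) each fsync returns almost immediately and the same loop finishes in a few seconds, which matches the behavior described above.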

      Setting this to blocker, as it'll affect people trying to create buckets on real hardware: hard disks and any modern Linux distro (where barriers are on by default).

      Attachments

      1. debug.1.gz (269 kB, attached by Tommie McAfee)

        Issue Links


          Activity

          ingenthr Matt Ingenthron added a comment -

          Note that on MB-6305 this caused quite a bit of trouble. Currently, one cannot go through a delete/create bucket cycle in any reliable fashion using public interfaces. Sometime between build 1495 and build 2.0.0c 709, the delete/recreate cycle started taking multiple minutes.

          In automated testing some SDKs used to use flush_all(). At the recommendation of engineering, we moved to delete/create. Now using delete/create is intractable. This has broken our testing again, unfortunately.

          The larger concern is that our users will probably see this pretty quickly too. What we were doing here isn't all that unusual.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Matt, I think you can try the workaround of:

          mount / -o remount,nobarrier

          And if it doesn't help in your case, then there's something else making it slower than needed.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          BTW, I recall now having a similar problem pre-DP4. I fixed it by creating all vbuckets in parallel. But that was of course when mccouch was at its full power.

          ingenthr Matt Ingenthron added a comment -

          @alk, I presume that parallel creation should still be in here, so it would seem the issue is further down.

          thanks for the additional info

          tommie Tommie McAfee added a comment -

          Just tried to reproduce this with 2.0 build 1689, and loading the beer sample times out:

          [ns_server:error,2012-09-06T10:49:42.436,ns_1@127.0.0.1:<0.752.0>:menelaus_web:handle_post_sample_buckets:506]Loading sample buckets timed out

          The same error is shown in the UI. If I click to retry, it says the data is already loaded; however, if I inspect the db there are no documents in the bucket. Not sure if this is a similar issue... attaching a diag.

          farshid Farshid Ghods (Inactive) added a comment -

          more info here: http://www.couchbase.com/issues/browse/MB-6305

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Matt, when I referenced mccouch, that was the old mccouch that was in our persistence path. That is not the case anymore. So the same issue is still hitting us: we're seemingly doing one or potentially a few fsyncs per vbucket, and we're doing all of that sequentially.

          Tommie, the docloader thing is most likely a distinct issue. I think we already have a bug for that.

          mikew Mike Wiederhold added a comment -

          Alk,

          I investigated this issue further and found that the latency to send a vbucket state transition to mccouch was 100ms. Interestingly, a state transition to dead was an order of magnitude quicker, so I think there might be something going on on the mccouch side here.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Mike, looking at the timestamps in the logs, I'm pretty sure there are no delays on the mccouch side.

          peter peter added a comment -

          Mike, can you continue driving to analyze this?

          mikew Mike Wiederhold added a comment -

          The fix for this bug will require changes on both the ep-engine and mccouch sides, and requires modifying how we store vbucket state data and how we notify mccouch about vbucket state changes. The changes will not be small and will be risky for the 2.0 release. I recommend this change be moved to 2.0.1.

          chiyoung Chiyoung Seo added a comment -

          For more details: we currently store each vbucket's state in its vbucket database as a local doc, which means it requires one fsync for every state change. I was trying to replace those individual local docs with a global vbucket state doc stored in the "master" meta database, but realized that it requires changes to the interactions between ep-engine and erlang, and also affects our 2.0 backup and restore tools at this time.

          This is a design defect and limitation, and fixing it would be a risky change for 2.0 at this time.
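
          For reference, the per-vbucket pattern being described looks roughly like the following sketch against the couchstore public C API (illustration only; the local doc id "_local/vbstate" and the JSON shape are my assumptions, not necessarily what ep-engine writes, and exact signatures may differ by version):

              /* Sketch: each vbucket database keeps its own state local doc, and the
               * commit at the end is where the per-vbucket fsync comes from. */
              #include <libcouchstore/couch_db.h>
              #include <string.h>

              static couchstore_error_t write_vbucket_state(const char *dbfile, const char *state_json)
              {
                  Db *db = NULL;
                  couchstore_error_t err = couchstore_open_db(dbfile, COUCHSTORE_OPEN_FLAG_CREATE, &db);
                  if (err != COUCHSTORE_SUCCESS) {
                      return err;
                  }

                  LocalDoc ldoc;
                  memset(&ldoc, 0, sizeof(ldoc));
                  ldoc.id.buf = (char *)"_local/vbstate";   /* assumed doc id */
                  ldoc.id.size = strlen("_local/vbstate");
                  ldoc.json.buf = (char *)state_json;
                  ldoc.json.size = strlen(state_json);
                  ldoc.deleted = 0;

                  err = couchstore_save_local_document(db, &ldoc);
                  if (err == COUCHSTORE_SUCCESS) {
                      err = couchstore_commit(db);           /* one fsync per state change */
                  }
                  couchstore_close_db(db);
                  return err;
              }

          Doing that once per vbucket file, sequentially, is the 1024-fsync pattern discussed above.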

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Can we consider not using fsync for the initial vbucket creation?

          mikew Mike Wiederhold added a comment -

          Also see MB-6234 which is a duplicate of this issue.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          MB-6234 is not necessarily a dup. This bug is about metadata ops.

          In particular, why does ep-engine even need tons of fsyncs to create brand new vbuckets?

          And why can't ep-engine just drop those ops on the floor when a flush arrives during that 1.5 minutes?

          I think at least those actions are possible, and perhaps more, and SDK folks will finally have a working flush.

          dipti Dipti Borkar added a comment -

          We need to better understand this issue in ep_engine and couchstore and come up with options to fix it in the 2.0.2 / 2.1 timeframe.

          ingenthr Matt Ingenthron added a comment -

          Yay!

          mikew Mike Wiederhold added a comment -

          There are some other issues still targeted at 2.0.2 whose fixes will likely help improve this one, but I am moving this task out to 2.1.

          maria Maria McDuff (Inactive) added a comment -

          Created a CBQE ticket for QE to add this to the functional automation suite: http://www.couchbase.com/issues/browse/CBQE-1192

          ingenthr Matt Ingenthron added a comment -

          I removed the note in the summary about the barrier/disk sync because we have some evidence that in some environments even a ramdisk does not make the problem go away.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Matt, unfortunately you're not quite correct. 1.5 minutes for bucket creation is only possible on a spinning disk with barriers on. Customer SM's issues are an order of magnitude smaller.

          ingenthr Matt Ingenthron added a comment -

          That's a fair criticism, but it still wasn't correct to say it "gets fast". Well, it may be if we are talking strictly about ep-engine. I've changed it back.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          "fast" is always not specific enough. For my human feelings fews seconds to create 1k vbuckets without barriers is fast. For automated tests that do flush (and that requires more than just creating 1k vbuckets) it's not fast.

          IMHO a bit too many of our bugs including admittedly this are not specific enough. Most importantly a bunch of tickets like "XXX is too slow".

          ingenthr Matt Ingenthron added a comment -

          Much agreed. The request from users here would be sub-second, mostly due to the flush-testing use case. I can't speak to the practicality of that if 1024 individual IOs are needed and we have a disk that does only 120 IOPS, but that is what users want. I'm certain it can be refactored to not be 1024 synchronous IOs.

          chiyoung Chiyoung Seo added a comment -

          Points: 3

          chiyoung Chiyoung Seo added a comment -

          The ep-engine team will discuss the possible approaches with the other teams to resolve this issue.

          chiyoung Chiyoung Seo added a comment -

          Let's discuss this issue offline to see how we can tackle it.

          chiyoung Chiyoung Seo added a comment -

          The engine team discussed this issue. As we mentioned earlier in this ticket, maintaining a vbucket_state local doc per vbucket database file is one of the major reasons for this bucket-creation slowness with barriers on.

          We plan to consolidate all the vbucket state local docs into a single local doc and manage it in the "master.couch" database file. Then it won't require 1024 sync IOs. However, this fix touches not only the engine side, but also backup/restore and the compactor. Note that the DB compactor doesn't compact the "master.couch" file.

          We plan to address it in the next major or minor release.
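
          A rough sketch of what the consolidated layout could look like, again against the couchstore C API (illustrative only; the consolidated doc id "_local/vbstates" and the JSON format are hypothetical, and the target file is still a design decision):

              /* Sketch: all vbucket states serialized into one local doc in a single
               * database file, so a bucket creation or flush needs one commit (one fsync)
               * instead of 1024. */
              #include <libcouchstore/couch_db.h>
              #include <stdio.h>
              #include <string.h>

              static couchstore_error_t write_all_vbucket_states(const char *master_file,
                                                                 const char **states, int nvbuckets)
              {
                  Db *db = NULL;
                  couchstore_error_t err = couchstore_open_db(master_file, COUCHSTORE_OPEN_FLAG_CREATE, &db);
                  if (err != COUCHSTORE_SUCCESS) {
                      return err;
                  }

                  /* Build one JSON object: {"0":"active","1":"active",...} (hypothetical
                   * format; 64 KB is ample for 1024 short entries). */
                  static char json[65536];
                  size_t off = 0;
                  off += snprintf(json + off, sizeof(json) - off, "{");
                  for (int vb = 0; vb < nvbuckets && off < sizeof(json); vb++) {
                      off += snprintf(json + off, sizeof(json) - off, "%s\"%d\":\"%s\"",
                                      vb ? "," : "", vb, states[vb]);
                  }
                  if (off < sizeof(json)) {
                      off += snprintf(json + off, sizeof(json) - off, "}");
                  }

                  LocalDoc ldoc;
                  memset(&ldoc, 0, sizeof(ldoc));
                  ldoc.id.buf = (char *)"_local/vbstates";   /* hypothetical consolidated doc id */
                  ldoc.id.size = strlen("_local/vbstates");
                  ldoc.json.buf = json;
                  ldoc.json.size = off;

                  err = couchstore_save_local_document(db, &ldoc);
                  if (err == COUCHSTORE_SUCCESS) {
                      err = couchstore_commit(db);            /* a single fsync covers all states */
                  }
                  couchstore_close_db(db);
                  return err;
              }

          Either way the durability cost becomes a single fsync regardless of the number of vbuckets.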

          FilipeManana Filipe Manana (Inactive) added a comment -

          I would suggest a dedicated database file for keeping those states. Using the master database will require some coordination between ep-engine and ns_server/couchdb (to prevent concurrent btree updates), leading to all sorts of issues and complexity. A dedicated file shouldn't add any significant overhead.

          sundar Sundar Sridharan added a comment -

          Just a quick note/update that we don't seem to be seeing this issue anymore with the current master branch. Bucket creation on a single node does not take more than a minute.
          I think Chiyoung and Filipe's suggestion of a separate file for storing the vbucket snapshots would be very useful.
          Reducing the priority from Blocker to Minor.
          thanks

          dipti Dipti Borkar added a comment -

          Would be good for someone else to validate this.

          ingenthr Matt Ingenthron added a comment -

          Sundar: why reduce the priority? The priority should remain the same until it's fixed.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          I'm willing to bet a few hundred that it's not fixed: that on a durable spinning-disk setup it still takes at least tens of seconds to create a bucket.

          Whether that's a blocker or minor or whatever is hard to say, given that UPR will de-couple views and XDCR from disk persistence, so the folks mainly affected by this slowness today (i.e. those running tests involving lots of bucket creations) will not hit it as hard as before.

          ingenthr Matt Ingenthron added a comment -

          Whether rubles or dollars, I'm not taking that bet.

          Regarding the UPR/XDCR decoupling, if that's the resolution that delivers what users need, that's fine. I don't expect it would fix things, though, so the issues where this came up in the first place would still be affected.

          sundar Sundar Sridharan added a comment -

          I just created a bucket on a single-node Couchbase server without any replicas on 10.5.2.36 - the bucket became ready in 8 seconds.
          It seems to be a spindle disk:
          cat /sys/block/sda/queue/rotational
          1
          I repeated the test on the 10.5.2.34 and 10.5.2.33 machines, and the bucket creation times were all well under 10 seconds.
          Please let me know if you still see this issue, along with the machine name and login credentials, and I'll take it from there.
          thanks

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Just tried on my box. Took about 50 seconds with barriers enabled.

          In order to see that slowness you need a box where fsync actually waits for data to hit the disk. Our VMs are not "honest" about fsync. And we also have a few boxes with battery-backed cache where fsync can legitimately be quick.

          Get yourself a Linux or Windows laptop with an HDD and out of the box you'll observe the slowness.

          ingenthr Matt Ingenthron added a comment -

          You can figure out if something is buffering for you pretty easily. A real disk will do somewhere from 80-150 IOPS. With dd and the oflag argument, you can ask it to sync between each write. If it's in that range, then it's proper spinning rust. If it's not, something is buffering for you.

          For example:
          [ingenthr@centosa ~]$ time dd if=/dev/zero of=testfile bs=4096 count=1200 oflag=dsync
          1200+0 records in
          1200+0 records out
          4915200 bytes (4.9 MB) copied, 11.8134 seconds, 416 kB/s

          real 0m11.833s
          user 0m0.004s
          sys 0m0.057s
          [ingenthr@centosa ~]$ time dd if=/dev/zero of=testfile bs=4096 count=1200
          1200+0 records in
          1200+0 records out
          4915200 bytes (4.9 MB) copied, 0.01124 seconds, 437 MB/s

          real 0m0.014s
          user 0m0.000s
          sys 0m0.013s

          Note the first one did about 100 IOPS with dsync on, and the latter one did roughly 1000x better, simulating the buffering you might get from a VM hypervisor that lies, etc.

          (this is a CentOS instance on KVM with two 5900 RPM "green" disks under it, on ZFS on SmartOS)

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Saw this marked as a regression, which is probably a mistake. This has always been the case since the switch to couch*.

          pavelpaulau Pavel Paulau added a comment -

          I guess every old ticket gets marked as "Regression" when you update it.
          Somebody added this new field to the MB project in JIRA recently... and it's probably not configured properly.

          bterkuile Benjamin ter Kuile added a comment -

          Leaving out a flush command in a test suite changes the time from 2 minutes and 43 seconds to just 9 seconds. Flushing is important for isolated unit tests.

          sundar Sundar Sridharan added a comment -

          As per Pavel's performance testing, this issue is only seen on spindle disks without any RAID setup. That setup is for single-developer use only, and production environments typically have some form of RAID or SSD setup, where we do not see this issue. Hence, after consultation with PM, the decision is not to support the single-HDD case, since optimizing for that case can lower performance of the more typical RAID/SSD case.
          thanks

          ingenthr Matt Ingenthron added a comment -

          Sundar: Could you re-assign this to the PM who made this decision? I'd like to discuss it. This is a significant issue for developer experience. It's caused countless support requests and confusion. It affects multi-spindle and SSD deployments as well, so that's a false comparison.

          chiyoung Chiyoung Seo added a comment -

          There was a miscommunication among the teams. We identified the root cause and still need to fix this issue, but not for 3.0, as it requires changing the way vbucket_state docs are stored and the way individual vbucket database files are created during bucket instantiation.

          ingenthr Matt Ingenthron added a comment -

          Chiyoung: who is the appropriate owner for this?

          chiyoung Chiyoung Seo added a comment -

          Sundar still owns this ticket. I already communicated with him regarding how we can resolve this issue.

          wayne Wayne Siu added a comment -

          Triaged in Maintenance Meeting on Dec 16, 2014.
          Updating the affected and fixed versions.

          chiyoung Chiyoung Seo added a comment -

          I don't think this can be addressed in the Sherlock release. When any long-lasting issue needs to be moved to the next major release, please communicate and discuss it with the corresponding component engineers.

          drigby Dave Rigby added a comment -

          Sundar Sridharan Could you add a comment describing what the planned solution to this issue was? You can then assign the MB to me - thanks.

          sundar Sundar Sridharan added a comment -

          This issue, I believe, was seen when a large number of vbucket files were being instantiated during bucket creation. Each vbucket file instantiation required several fsync calls: 1. to set the vbucket state, 2. to set the local couchstore doc that persists the doc count. The repeated fsyncs ended up serializing all the calls, and on some slow spinning disks this ended up taking longer than 1.5 minutes. As noted earlier on the ticket, it was hardly seen on SSDs or disks with RAID 10.
          The reason it was assigned to me is that, after forestdb integration, the idea was to reduce the number of vbucket files in the system by grouping several vbuckets into single shard files, so that this creation stall can be avoided.
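
          For context, the grouping idea is essentially a many-to-one mapping of vbuckets onto shard files, along the lines of the sketch below (the shard count and file naming are made up for illustration):

              #include <stdio.h>

              #define NUM_VBUCKETS 1024
              #define NUM_SHARDS   8   /* assumed value for illustration */

              /* Map a vbucket id onto the shard file that stores it. */
              static int shard_for_vbucket(int vbid) {
                  return vbid % NUM_SHARDS;
              }

              int main(void) {
                  /* With this layout, initializing all vbucket states touches NUM_SHARDS
                   * files, so bucket creation costs NUM_SHARDS fsyncs instead of NUM_VBUCKETS. */
                  for (int vb = 0; vb < NUM_VBUCKETS; vb++) {
                      printf("vbucket %4d -> shard_%d.data\n", vb, shard_for_vbucket(vb));
                  }
                  return 0;
              }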

          drigby Dave Rigby added a comment -

          Sundar Sridharan Perfect - thanks for the background info.

          ingenthr Matt Ingenthron added a comment -

          +1 on coalescing to fewer files. The mapping of vbuckets to files has always felt a bit wrong to me, as you lose a lot of the benefits of AOFs (append-only files) when you have lots of them.

          Regarding the fsync() however, I would still argue that there may be another fix. The use case this was raised for (flush_all from REST), and the other case (a new bucket), both involve a situation where the bucket isn't writeable at all until completion. If the logic were refactored from:

              block frontend traffic
              for each vbucket to create {
                  create vbucket
                  fsync(vbucket's fd)
              }
              unblock frontend traffic

          to:

              block frontend traffic
              for each vbucket to create {
                  create vbucket
              }
              for each vbucket created {
                  fsync(vbucket's fd)
              }
              unblock frontend traffic

          … you'd be doing what's intended with fsync(). You'd be scheduling lots of IO and then waiting for completion on a bunch of things that are already complete, rather than schedule/wait/schedule/wait/schedule/wait, etc. It's not like there's any data protection in the interim anyway.

          I know that this is easy to say and probably harder to do, but it'd be a much more complete fix to the issue.

          While here, I should also mention briefly that if you're using Mac OS X, they chose to make fsync(2) a bit less guaranteed than on other Unixes. See the man page for details, but any changes here should be verified/tested on other platforms, and/or you should use the right additional calls on Mac OS (fcntl F_FULLFSYNC) to get real behavior. The man page and the internet have more detail.
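
          In plain C the proposed two-phase refactoring looks roughly like this (a sketch of the batching pattern only, not the actual ep-engine code paths; it assumes the process fd limit allows ~1024 open files):

              #include <fcntl.h>
              #include <stdio.h>
              #include <unistd.h>

              #define NUM_VBUCKETS 1024

              int main(void) {
                  int fds[NUM_VBUCKETS];
                  const char state[] = "{\"state\":\"active\"}";   /* stand-in payload */

                  /* Phase 1: create all vbucket files and queue the writes (no fsync yet),
                   * so the kernel can start writeback while we keep creating. */
                  for (int vb = 0; vb < NUM_VBUCKETS; vb++) {
                      char path[64];
                      snprintf(path, sizeof(path), "vb_%d.couch.1", vb);
                      fds[vb] = open(path, O_CREAT | O_WRONLY | O_TRUNC, 0644);
                      if (fds[vb] < 0 || write(fds[vb], state, sizeof(state) - 1) < 0) {
                          perror("create vbucket");
                          return 1;
                      }
                  }

                  /* Phase 2: make everything durable at the end; much of the IO has
                   * already completed, so each fsync has far less to wait for. */
                  for (int vb = 0; vb < NUM_VBUCKETS; vb++) {
                      if (fsync(fds[vb]) < 0) {
                          perror("fsync vbucket");
                          return 1;
                      }
                      close(fds[vb]);
                  }
                  return 0;
              }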


            People

            • Assignee:
              drigby Dave Rigby
            • Reporter:
              alkondratenko Aleksey Kondratenko (Inactive)
            • Votes:
              0
            • Watchers:
              13


                Gerrit Reviews

                There are no open Gerrit changes