  1. Couchbase Server
  2. MB-41331

fsync called too often while vbucket file grows during normal persistence


Details

    • Type: Improvement
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Affects Version/s: 5.0.1
    • Fix Version/s: feature-backlog
    • Component/s: storage-engine
    • 1

    Description

      Currently fsync (among other places) is called here:

      [root@cust1cntsdfdb1 ~]# grep -3 sync /tmp/paf-1.txt
      Thread 25 (Thread 0x2b34d2fd8940 (LWP 30693)):
      #0 0x000000301bece4d7 in fdatasync () from /lib64/libc.so.6
      #1 0x00002b34bb63f79f in couch_sync () from /opt/couchbase/lib/libcouchstore.so.1
      #2 0x00002b34bb3dce9f in cfs_sync () from /opt/couchbase/lib/memcached/ep.so
      #3 0x00002b34bb639b1f in couchstore_commit () from /opt/couchbase/lib/libcouchstore.so.1
      #4 0x00002b34bb3d3e35 in CouchKVStore::saveDocs(unsigned short, unsigned long, _doc**, _docinfo**, int) () from /opt/couchbase/lib/memcached/ep.so
      #5 0x00002b34bb3d45ab in CouchKVStore::commit2couchstore() () from /opt/couchbase/lib/memcached/ep.so
      [root@cust1cntsdfdb1 ~]#
      

      This causes far too many fsync calls per second on our system; the system cannot handle that.

      Please consider making this fsync configurable, rather than calling it with every saveDocs.

      We absolutely need some way to reduce the number of fsyncs under our load, or we cannot use Couchbase out of the box.

      Attachments


        Activity

          paf Alexander Petrossian (PAF) added a comment -

          Friends, thank you very much for your attention and input!

          Srinath Duvuru Dave Rigby

          Clients are just using Couchbase at runtime, not manually loading any data. These are normal updates, so it is not my code that controls how many docs go into each saveDocs call; that is controlled by Couchbase.

          Is there some way we can tune this point?

          Somehow make Couchbase write more docs in one saveDocs() call, thus reducing the number of fsync()s?

           

           

          Regretfully, our storage is not battery-backed, so a power loss causes significant data loss.

          Our system is under a high load of updates, and out of the box Couchbase + OS + filesystem cannot cope with those updates.

          We need some basic control of speed versus reliability.

          In MySQL there are two configuration variables, a maximum time and a maximum number of pending bytes; once either threshold is reached, it calls fsync().

          While we were using MySQL (less than 7 years ago) that worked well; "no more often than every 1 s or 1 MB" was enough.
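          Roughly, the policy we have in mind looks like the following minimal sketch (the class and names are hypothetical, purely for illustration; this is not couchstore or ep-engine code):

          // Hypothetical MySQL-style deferred-fsync policy: sync only once
          // enough time has passed or enough bytes are pending.
          #include <chrono>
          #include <cstddef>

          class FsyncPolicy {
          public:
              FsyncPolicy(std::chrono::milliseconds maxDelay, std::size_t maxBytes)
                  : maxDelay(maxDelay), maxBytes(maxBytes) {}

              // Called after each commit batch is written; returns true if we
              // should fsync now, false if (by this policy) it may be deferred.
              bool shouldSync(std::size_t bytesJustWritten) {
                  pendingBytes += bytesJustWritten;
                  const auto now = std::chrono::steady_clock::now();
                  if (pendingBytes >= maxBytes || (now - lastSync) >= maxDelay) {
                      pendingBytes = 0;
                      lastSync = now;
                      return true;
                  }
                  return false;
              }

          private:
              std::chrono::milliseconds maxDelay;
              std::size_t maxBytes;
              std::size_t pendingBytes{0};
              std::chrono::steady_clock::time_point lastSync{std::chrono::steady_clock::now()};
          };

          With maxDelay = 1000 ms and maxBytes = 1 MB this would reproduce the behaviour that was enough for us in MySQL.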

           

          Currently we have no control here. Or do we?

          drigby Dave Rigby added a comment -

          Is there some way we can tune this point?
          Somehow make Couchbase write more docs in one saveDocs() call, thus reducing the number of fsync()s?

          Regretfully, our storage is not battery-backed, so a power loss causes significant data loss.
          Our system is under a high load of updates, and out of the box Couchbase + OS + filesystem cannot cope with those updates.
          We need some basic control of speed versus reliability.

          From 6.5.0 upwards you can specify at runtime the number of background Writer Threads. With fewer threads performing writes, you'll implicitly end up with larger commit batches (more mutations included in a single commit with the associated fsync). That would result in longer per-item persistence times (given you would be writing fewer, larger batches), but would amortise the fsync cost out over more updates.

          Having said that, I'm not sure I understand why "Couchbase + OS + filesystem cannot cope with those updates" - given we use sync IO, if the disk/OS is slow to fsync then that will simply cause that Writer thread to be blocked waiting for the write()s / fsync()s to complete. Hence more mutations will accumulate on the front-end, so the next commit batch will have more mutations waiting to be written and hence be larger (and need fewer fsyncs per mutation). I suspect that even if you reduced (or experimentally removed) the fsyncs, the disk subsystem would still have the same number of bytes to write (given couchstore is append-only we never reuse any blocks, so fewer fsyncs doesn't materially reduce the amount of data which needs to be written to disk).


          paf Alexander Petrossian (PAF) added a comment - edited

          Dave Rigby David, I appreciate your help and your attempt to understand. We have been at this for years now.

          You are probably referring to this mechanism:

          curl -X POST -d hostname=<host>:<port> \
            -d num_reader_threads=<int> \
            -d num_writer_threads=<int> \
            -d password=<password> \
            -u <administrator>:<password> \
            http://<host>:<port>/pools/default/settings/memcached/global

          Years ago we figured out how to tune that via the ever-so-important diag interface:

          curl -u "$USER:$PASS" -XPOST -d "ns_bucket:update_bucket_props(\"$bucket\", [
            {extra_config_string, \"max_num_writers=1;max_num_nonio=1\"}
          ])." http://localhost:8091/diag/eval

          Later on we found and used the official approach (but still way before 6.5):

          cbepctl $HOST:11210 -b $BUCKET -p $PASS set max_num_writers 1
          

          We thought that fewer writers would cause some magic queue-optimising mechanism (mentioned only briefly in the docs; we cannot find the reference now) to come into play:

          • key1=value1 [pending write]
          • key1=value2 [while value1 is still pending write] – theory: key1=value1 is kicked out from the disk-write queue.

          PAF-1: Is there such magic in code (or is it just in our imagination / wishful thinking)?


          The history of our project shows that reducing the number of writers did help for a year or so.
          But after that (with ever-growing load) the problems returned: the disk subsystem started failing to write things in time, and disk queues grew to unreasonable lengths.

          That was when we started hacking around with reducing the fsync() frequency.
          And it did help.
          We are not sure exactly how; we do not know the details of the fsync implementation.

          But one can guess it does something OTHER than just writing raw payload data to disk: some tables get updated, some other things get done. We do not know.

          But from practical experience we know that reducing fsyncs did help in our project.
          So in this issue I am asking you to consider making that fsync configurable... like the MySQL team did, or some such.
          PAF-2: What do you think?

          drigby Dave Rigby added a comment -

          We thought that fewer writers would cause some magic queue-optimising mechanism (mentioned only briefly in the docs; we cannot find the reference now) to come into play:
          key1=value1 [pending write]
          key1=value2 [while value1 is still pending write] – theory: key1=value1 is kicked out from the disk-write queue.

          Yes, there is de-duplication of mutations to the same key, assuming the second write occurs before the first has been flushed. Certainly if you are re-writing the same key repeatedly in a small time window then slowing down the flusher (i.e. with fewer Writer threads) can result in fewer bytes being written to disk.
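          For illustration only, here is a minimal sketch of what that de-duplication amounts to conceptually (the class and names are hypothetical; this is not the actual ep-engine checkpoint/flusher code):

          // Hypothetical de-duplicating disk-write queue: only the latest
          // pending value per key ever reaches a commit batch.
          #include <string>
          #include <unordered_map>
          #include <utility>
          #include <vector>

          class DedupWriteQueue {
          public:
              // A later set() for the same key replaces the still-pending value,
              // so the earlier version is never written to disk.
              void set(const std::string& key, const std::string& value) {
                  pending[key] = value;
              }

              // The flusher drains whatever is pending as one commit batch.
              std::vector<std::pair<std::string, std::string>> drain() {
                  std::vector<std::pair<std::string, std::string>> batch(pending.begin(), pending.end());
                  pending.clear();
                  return batch;
              }

          private:
              std::unordered_map<std::string, std::string> pending;
          };

          If key1 is set twice before drain() runs, only the second value reaches the batch - which is why a slower flusher can mean fewer bytes written to disk.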

          In terms of making fsync() configurable, that's not something I think we should consider - it could result in loss of data if we claim to have "committed" something before it's actually persisted to media. Correctness > Performance.


          paf Alexander Petrossian (PAF) added a comment -

          Dave Rigby

          David, thank you very much for confirming the "urban legend"!

          We have pondered Correctness. I feel I understand some of the background now: it probably has to do with slaves knowing about checkpoints, and it would not be good for a slave to have a checkpoint that the master's disk has not persisted?

          PAF-3: Maybe it is worth pushing de-duplication to its limit and slowing the flusher down even further, with a configurable delay between write attempts?
          That is, write not whenever a writer thread is ready, but only every:

          • N seconds;
          • N items;
          • N bytes.

          This way we are not jeopardizing checkpoints (hurray for Correctness), while reducing the number of B+tree creations (saving CPU), the bytes written to disk (saving disk throughput), and the fsync at the end (hurray for Performance; hurray for inexpensive hardware; a few tears for data safety and some bargaining with our conscience).

          All at a known, configurable, acceptable risk of losing data.
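          To make the proposal concrete, here is a minimal sketch of the trigger condition we imagine (hypothetical names, purely illustrative; not an existing Couchbase setting or ep-engine code):

          // Hypothetical flusher trigger for the PAF-3 proposal: run the flusher
          // only when enough time, items, or bytes have accumulated, so that
          // de-duplication gets the largest possible window.
          #include <chrono>
          #include <cstddef>

          struct FlushTrigger {
              std::chrono::seconds maxDelay;   // "N seconds"
              std::size_t maxItems;            // "N items"
              std::size_t maxBytes;            // "N bytes"

              bool shouldFlush(std::chrono::seconds sinceLastFlush,
                               std::size_t pendingItems,
                               std::size_t pendingBytes) const {
                  return sinceLastFlush >= maxDelay ||
                         pendingItems >= maxItems ||
                         pendingBytes >= maxBytes;
              }
          };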


          People

            Assignee: srinath.duvuru Srinath Duvuru
            Reporter: paf Alexander Petrossian (PAF)
            Votes: 0
            Watchers: 3

            Dates

              Created:
              Updated:

              Gerrit Reviews

                There are no open Gerrit changes
