Details

    • Technical task
    • Resolution: Fixed
    • Critical
    • 6.5.0
    • 6.5.0
    • couchbase-bucket
    • None
    • KV-Engine Mad-Hatter GA

    Description

      platform.so has an implementation of the 64-bit byteswap functions ntohll() and htonll(), for platforms which don't provide those symbols natively.

      Profiling of ep-engine Writer threads highlighted that a large amount of time (~5%) was being spent in platform's ntohll() / htonll() functions. This was surprising, as:

      1. I had assumed that modern Linux (CentOS 7) provided the 64-bit byteswap functions, and
      2. Even if the OS didn't have those functions, I had assumed our implementation wouldn't be that slow.

      For context, the hottest functions in the profile are listed below; ntohll is the 3rd hottest:

          
              Overhead  Command      Shared Object            Symbol
                 4.82%  mc:writer_2  libsnappy.so.1.2.0       [.] snappy::internal::CompressFragment
                 4.33%  mc:writer_2  [kernel.kallsyms]        [k] _raw_spin_lock_irq
                 4.30%  mc:writer_2  libplatform_so.so.0.1.0  [.] ntohll
                 2.82%  mc:writer_2  libc-2.17.so             [.] __memcpy_ssse3
                 2.49%  mc:writer_2  libsnappy.so.1.2.0       [.] snappy::RawUncompress
                 2.36%  mc:writer_2  [kernel.kallsyms]        [k] _raw_spin_lock
                 1.99%  mc:writer_2  libjemalloc.so.2         [.] je_malloc_usable_size
                 1.74%  mc:writer_2  [kernel.kallsyms]        [k] __radix_tree_lookup
                 1.43%  mc:writer_2  libjemalloc.so.2         [.] je_malloc
      

      Both of my assumptions turned out to be incorrect:

      • CentOS 7 (and other recent distros, including Ubuntu 18.04) don't have ntohll / htonll symbols. They do have the functionally equivalent function htobe64(), available since glibc 2.9 (2008), but that's a different, Linux-specific symbol.
      • Our implementation is slow: it does an old-style manual byteswap, which is roughly 10x slower than the builtin swap on mancouch:

          
              Run on (24 X 2400 MHz CPU s)
              2019-11-06 12:20:30
              -----------------------------------------------------
              Benchmark              Time           CPU Iterations
              -----------------------------------------------------
              Swap64                57 ns         57 ns   12216131
              BuiltinSwap64          5 ns          5 ns  141279127
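
To make the comparison above concrete, here is a minimal sketch of the two styles of swap. The function names are hypothetical and the loop is only an assumption about what an "old-style manual byteswap" looks like; it is not the actual platform.so code, and it is not the benchmark source. `__builtin_bswap64` is the GCC/Clang intrinsic.

```cpp
#include <cstdint>

// Hypothetical "old-style" swap: peel off the low byte of the input and
// push it onto the low end of the result, eight times.
inline uint64_t manual_swap64(uint64_t value) {
    uint64_t result = 0;
    for (int i = 0; i < 8; ++i) {
        result = (result << 8) | (value & 0xff);
        value >>= 8;
    }
    return result;
}

// Compiler intrinsic (GCC/Clang). On x86-64 this usually becomes a single
// bswap instruction once it is inlined into the caller.
inline uint64_t builtin_swap64(uint64_t value) {
    return __builtin_bswap64(value);
}
```

Both functions should return the same value for any input; only the generated code differs.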
      

      Given we already have an optimized byteswap implementation available from Folly, use that instead. Also inline the functions to reduce the call overhead.
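
A rough sketch of what inlined replacements could look like is below. This is an illustration only: the names are hypothetical, and it uses the GCC/Clang builtin as a stand-in for Folly's helper so that the example has no external dependency. It also assumes the compiler defines `__BYTE_ORDER__` (GCC and Clang do).

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical inline host-to-network conversion for 64-bit values.
// Defined inline in a header so callers avoid a cross-library call.
inline uint64_t sketch_htonll(uint64_t host) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return __builtin_bswap64(host);  // little-endian host: swap needed
#else
    return host;                     // big-endian host: already network order
#endif
}

// Network-to-host is the same operation, since the swap is its own inverse.
inline uint64_t sketch_ntohll(uint64_t net) {
    return sketch_htonll(net);
}

// Helper for checking the result: returns the first byte in memory of the
// network-order value, which should be the most significant byte of `host`.
inline unsigned char first_wire_byte(uint64_t host) {
    const uint64_t net = sketch_htonll(host);
    unsigned char bytes[sizeof(net)];
    std::memcpy(bytes, &net, sizeof(net));
    return bytes[0];
}
```

Because the functions are defined inline, the compiler can fold the swap into the calling code instead of making a call into the shared library.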

        For Gerrit Dashboard: MB-36776

          People

            drigby Dave Rigby (Inactive)