Details

    • Technical task
    • Resolution: Fixed
    • Critical
    • 6.5.0
    • 6.5.0
    • couchbase-bucket
    • None
    • KV-Engine Mad-Hatter GA

    Description

      platform.so has an implementation of the 64-bit byteswap functions ntohll() and htonll(), for platforms which don't provide those symbols natively.

      Profiling of ep-engine Writer threads highlighted that a large amount of time (~5%) was being spent in platform's ntohll() / htonll() functions. This was surprising, as:

      1. I had assumed that modern Linux (CentOS 7) provided the 64-bit byteswap functions, and
      2. Even if the OS doesn't have those functions, I assumed our implementation shouldn't be that slow.

      (For context, the top 10 functions in the profile are below; ntohll is the 3rd hottest):

          
              Overhead  Command      Shared Object            Symbol
                 4.82%  mc:writer_2  libsnappy.so.1.2.0       [.] snappy::internal::CompressFragment
                 4.33%  mc:writer_2  [kernel.kallsyms]        [k] _raw_spin_lock_irq
                 4.30%  mc:writer_2  libplatform_so.so.0.1.0  [.] ntohll
                 2.82%  mc:writer_2  libc-2.17.so             [.] __memcpy_ssse3
                 2.49%  mc:writer_2  libsnappy.so.1.2.0       [.] snappy::RawUncompress
                 2.36%  mc:writer_2  [kernel.kallsyms]        [k] _raw_spin_lock
                 1.99%  mc:writer_2  libjemalloc.so.2         [.] je_malloc_usable_size
                 1.74%  mc:writer_2  [kernel.kallsyms]        [k] __radix_tree_lookup
                 1.43%  mc:writer_2  libjemalloc.so.2         [.] je_malloc
      

      Both my assumptions are actually incorrect:

      • CentOS 7 (and other recent distros, including Ubuntu 18.04) don't have ntohll / htonll symbols. They have had the functionally equivalent htobe64() since glibc 2.9 (2008), but that's a different, Linux-specific symbol.
      • Our implementation is slow: it does an old-style manual byteswap, which is 10x slower on mancouch:

          
              Run on (24 X 2400 MHz CPU s)
              2019-11-06 12:20:30
              -----------------------------------------------------
              Benchmark              Time           CPU Iterations
              -----------------------------------------------------
              Swap64                57 ns         57 ns   12216131
              BuiltinSwap64          5 ns          5 ns  141279127
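
To make the two benchmark rows concrete, here is a minimal sketch of the kind of code being compared. The ticket does not show platform's actual implementation, so `manual_swap64` is a hypothetical stand-in for an "old-style manual byteswap" and is not the code that was measured. `builtin_swap64` uses the GCC/Clang intrinsic `__builtin_bswap64`, which those compilers typically lower to a single `bswap` instruction on x86-64.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical "old-style" swap: moves one byte per loop iteration.
// This is an illustration only, not the implementation from platform.so.
inline uint64_t manual_swap64(uint64_t value) {
    uint64_t result = 0;
    for (int i = 0; i < 8; ++i) {
        result = (result << 8) | (value & 0xff); // append lowest byte
        value >>= 8;                             // move on to the next byte
    }
    return result;
}

// Compiler intrinsic (GCC/Clang only).
inline uint64_t builtin_swap64(uint64_t value) {
    return __builtin_bswap64(value);
}

// Both versions must produce the same result; only their cost differs.
inline void check_swaps_agree() {
    const uint64_t samples[] = {0, 1, 0x0102030405060708ULL, ~0ULL};
    for (uint64_t s : samples) {
        assert(manual_swap64(s) == builtin_swap64(s));
    }
}
```

Swapping twice returns the original value, so either function can serve as both the host-to-network and network-to-host conversion on a little-endian machine.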
      

      Given we already have an optimized byteswap implementation available from Folly, use that instead. Also inline the functions to reduce the call overhead.
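
A rough sketch of what inlined conversion functions could look like follows. The ticket does not name the Folly function used, so this sketch substitutes the GCC/Clang intrinsic `__builtin_bswap64` to stay self-contained; it is not the actual change. The namespace `sketch` and the names `host_to_net64`, `net_to_host64` and `encodes_big_endian` are hypothetical, chosen to avoid clashing with platforms where `htonll` / `ntohll` already exist. The `__BYTE_ORDER__` macros are GCC/Clang-specific.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

namespace sketch {

// Header-defined and inline, so the compiler can avoid a call into a
// shared library for every conversion.
inline uint64_t host_to_net64(uint64_t host) {
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    return __builtin_bswap64(host); // little-endian host: reverse the bytes
#else
    return host;                    // big-endian host: already network order
#endif
}

// The conversion is its own inverse, so the same operation is reused.
inline uint64_t net_to_host64(uint64_t net) {
    return host_to_net64(net);
}

// Returns true if the converted value is laid out in memory with the
// most significant byte first, which is what network byte order means.
inline bool encodes_big_endian(uint64_t host) {
    const uint64_t wire = host_to_net64(host);
    unsigned char bytes[sizeof(wire)];
    std::memcpy(bytes, &wire, sizeof(wire));
    for (unsigned i = 0; i < 8; ++i) {
        const auto expected = static_cast<unsigned char>(host >> (56 - 8 * i));
        if (bytes[i] != expected) {
            return false;
        }
    }
    return true;
}

inline void self_check() {
    const uint64_t v = 0x0102030405060708ULL;
    assert(net_to_host64(host_to_net64(v)) == v);
    assert(encodes_big_endian(v));
}

} // namespace sketch
```

Defining the functions in a header is what lets the compiler inline them; a definition that lives only inside the shared library still costs a function call per conversion.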

          People

            drigby Dave Rigby (Inactive)
            drigby Dave Rigby (Inactive)