Details

    • Type: Bug
    • Status: Resolved
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0.1
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None
    • Environment:
      Supposedly any

      Description

Even in 1.8.1 we're using some 3-5 percent of CPU with a single bucket when idle, for stats gathering, heartbeats, etc.

We've spotted that current 2.0 is eating 15 percent or more.

Early evidence suggests that it's related to the large number of Erlang scheduler threads we're using now.
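
(Not part of the original report: a minimal sketch of how to confirm the scheduler thread count on a running node, using the standard erlang:system_info/1 calls from an Erlang shell attached to that node.)

    %% Minimal sketch: check how many scheduler threads the VM was started
    %% with and how many of them are currently online.
    Schedulers = erlang:system_info(schedulers),
    Online = erlang:system_info(schedulers_online),
    io:format("schedulers: ~p, online: ~p~n", [Schedulers, Online]).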

        Activity

        Aliaksey Artamonau added a comment -

        I've done some small-scale testing on my machine to measure the impact of different numbers of schedulers. I measured CPU usage averaged over two minutes while the cluster was completely idle. The cluster had 4 nodes run by cluster_run, one bucket with 250k binary items generated by memcachetest, and a simple view counting the number of items. My machine has 8 logical processors (4 cores + hyperthreading). Here are the results:

        128 schedulers:

        CPU CMD
        42% beam.smp
        32% beam.smp
        32% beam.smp
        31% beam.smp

        64 schedulers:

        CPU CMD
        23% beam.smp
        18% beam.smp
        18% beam.smp
        18% beam.smp

        32 schedulers:

        CPU CMD
        17% beam.smp
        13% beam.smp
        13% beam.smp
        13% beam.smp

        16 schedulers:

        CPU CMD
        11% beam.smp
        9% beam.smp
        9% beam.smp
        8% beam.smp

        8 schedulers:

        CPU CMD
        10% beam.smp
        6% beam.smp
        6% beam.smp
        6% beam.smp

        4 schedulers:

        CPU CMD
        8% beam.smp
        6% beam.smp
        5% beam.smp
        5% beam.smp
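
        (Not part of the original comment: per-scheduler utilization can also be sampled from inside the VM via the scheduler_wall_time statistics, which exist only in newer Erlang releases (R15B01 era), not in R14. A minimal sketch follows; the two-minute window is chosen only to match the measurement above.)

        %% Minimal sketch: sample per-scheduler utilization over two minutes.
        erlang:system_flag(scheduler_wall_time, true),
        Before = lists:sort(erlang:statistics(scheduler_wall_time)),
        timer:sleep(120000),
        After = lists:sort(erlang:statistics(scheduler_wall_time)),
        Util = [{Id, (A1 - A0) / (T1 - T0)} ||
                   {{Id, A0, T0}, {Id, A1, T1}} <- lists:zip(Before, After)],
        io:format("per-scheduler utilization: ~p~n", [Util]).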

        Aliaksey Artamonau added a comment -

        These were the results with Erlang R15B01. Here are the results for the same setup on R14B04:

        128 schedulers:

        CPU CMD
        18% beam.smp
        16% beam.smp
        16% beam.smp
        14% beam.smp

        64 schedulers:

        CPU CMD
        11% beam.smp
        10% beam.smp
        10% beam.smp
        10% beam.smp

        32 schedulers:

        CPU CMD
        9% beam.smp
        8% beam.smp
        8% beam.smp
        8% beam.smp

        16 schedulers:

        CPU CMD
        6% beam.smp
        6% beam.smp
        5% beam.smp
        5% beam.smp

        8 schedulers:

        CPU CMD
        5% beam.smp
        4% beam.smp
        4% beam.smp
        4% beam.smp

        4 schedulers:

        CPU CMD
        5% beam.smp
        5% beam.smp
        4% beam.smp
        4% beam.smp

        Steve Yen added a comment -

        Moved to 2.0.1 per bug scrub meeting.

        Steve Yen added a comment -

        Adding Alk's informative email on various options for 2.0.0...

        ---------------

        Folks, we have observed that the idle CPU consumption of our current product is 15-20 percent of a quite fast CPU, and we've found that the 128 Erlang scheduler threads are causing it.

        I'm also aware that the perf folks found up to a 50% performance drop when we enabled 128 Erlang schedulers.

        It looks like we should consider either an alternative number of schedulers or some alternative setup.

        I think the main difficulty lies in the somewhat long cycle of verifying whether a particular setting works, so I cannot easily propose anything.

        Our options are:

        • delay any reaction until post-2.0
        • decrease to 64 or 32 scheduler threads for a more acceptable level of overhead
        • try the single scheduler queue option of the Erlang VM with 12-16 scheduler threads (more on that below)
        • find a way to make async threads work, e.g. by kidnapping Erlang VM developers and threatening/torturing them
        • split the Erlang VM, allowing the more latency-insensitive couchdb side of our project to run with async IO threads off. Yes, it's non-trivial work and it's late.
        • something else that's smarter and perhaps crazier

        Apparently any option requires extensive testing.
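
        (Not part of the original email: a rough, hedged sketch of how the second option above, fewer scheduler threads, could be applied; the value 16 is only an example.)

        %% At VM start-up the scheduler thread count is set with the +S
        %% emulator flag, e.g.:  erl +S 16:16 ...
        %% At runtime the number of *online* schedulers can only be lowered,
        %% never raised above the count the VM was started with:
        case erlang:system_info(schedulers) of
            N when N >= 16 ->
                _Previous = erlang:system_flag(schedulers_online, 16);
            _ ->
                ok
        end.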

        Regarding option three.

        I've found that normally Erlang has a run queue per scheduler thread. That's supposedly the more scalable setup: given that by default each scheduler thread runs on its own dedicated CPU, the CPUs don't have to touch a shared data structure.

        But R14 still has a way to request a single run queue. I have no idea if that works. Notably, R15 no longer has this option; it was removed by the following commit:

        commit 8781932b3b8769b6f208ac7c00471122ec7dd055
        Author: Rickard Green <rickard@erlang.org>
        Date: Fri Nov 18 15:19:46 2011 +0100

        Remove common run-queue in SMP case

        The common run-queue implementation is removed since it is unused,
        untested, undocumented, unsupported, and only complicates the code.

        A spinlock used by the run-queue management sometimes got heavily
        contended. This code has now been rewritten, and the spinlock
        has been removed.

        But if it works, it would solve the potential delays caused by schedulers doing blocking IO.

        I.e. assume we have a bunch of runnable processes. In the default mode they will be assigned to scheduler threads, potentially many per single scheduler. We've seen that when some scheduler is blocked in IO (which happens when async IO is disabled), its run queue is not served by any other scheduler, which causes some processes to be starved and delayed. There is most likely some sort of work stealing between scheduler threads' run queues, but apparently it's not working in this particular use case; it can be seen that the inherently unfair runnable-process queuing is causing that. So in order to prevent this from happening we decided to be very generous with the Erlang scheduler thread count, assuming there's little overhead, which is clearly not true.

        If we have a single shared run queue, then non-IO processes could be starved only if all scheduler threads are busy doing IO. It's inherently more fair and thus allows for a much lower scheduler thread count.
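
        (Not part of the original email: one hedged way to observe the starvation described above is to watch the aggregate run-queue length while schedulers are blocked in IO; a persistently non-zero value on an otherwise idle node would be a symptom. Minimal sketch:)

        %% Sample the total number of runnable processes waiting in
        %% scheduler run queues, once per second for ten seconds.
        lists:foreach(
          fun(_) ->
                  io:format("run queue length: ~p~n",
                            [erlang:statistics(run_queue)]),
                  timer:sleep(1000)
          end,
          lists:seq(1, 10)).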

        Aleksey Kondratenko (Inactive) added a comment -

        By lowering the number of scheduler threads we're now at a reasonable level.

        Matt Ingenthron added a comment -

        What build do we expect to see this in?

        Farshid Ghods (Inactive) added a comment -

        Build 1965 has 16 schedulers.


          People

          • Assignee:
            Aliaksey Artamonau
          • Reporter:
            Aleksey Kondratenko (Inactive)
          • Votes:
            2
          • Watchers:
            3
