  Couchbase Server / MB-7199

Couchbase Server can't handle hundreds of view queries with an unlimited number of results at the same time

    Details

    • Triage:
      Untriaged

      Description

      Cluster: 6 nodes
      10.6.2.37
      10.6.2.38
      10.6.2.39
      10.6.2.40
      10.6.2.42
      10.6.2.43

      Build # 2.0.0-1952 with 16 Erlang schedulers
      Each node has a 390GB SSD drive and 32GB RAM.

      Two buckets were created: one SASL-authenticated and one default. Start loading items at 8K creates per second into each bucket, then insert a design document with 2 views into each bucket, then have 4 clients query the views at 120 reads per second.
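
      An illustrative sketch of the kind of design document inserted into each bucket; the actual map functions are not part of this report, and the host, bucket, design-document, and view names below are placeholders (assuming the default view/CAPI port 8092):

      import json
      import urllib2

      # Hypothetical design document with two map-only views.
      ddoc = {
          "views": {
              "view1": {"map": "function (doc, meta) { emit(meta.id, null); }"},
              "view2": {"map": "function (doc, meta) { emit(doc.type, 1); }"}
          }
      }
      req = urllib2.Request("http://10.6.2.37:8092/default/_design/ddoc1",
                            data=json.dumps(ddoc),
                            headers={"Content-Type": "application/json"})
      req.get_method = lambda: "PUT"  # urllib2 has no dedicated PUT helper
      urllib2.urlopen(req)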

      No limit is set on the query results, and the queries are issued without waiting for previous ones to finish returning their results:
      import urllib2

      capiUrl = "http://%s:%s/couchBase/" % (cfg.COUCHBASE_IP, cfg.COUCHBASE_PORT)
      url = capiUrl + '%s/design/%s/%s/%s' % (bucket,
                                              design_doc_name, type_,
                                              view_name)
      # authorization is the base64-encoded "user:password" string
      headers = {'Content-Type': 'application/json',
                 'Authorization': 'Basic %s' % authorization,
                 'Accept': '*/*'}
      req = urllib2.Request(url, headers=headers)
      resp = urllib2.urlopen(req)  # no limit parameter: the full result set is returned

      Then the UI becomes unresponsive.
      Pay attention to the following stats:

      1st is the Erlang scheduler run-queue output on one of the nodes while the queries are running:

      (ns_1@10.6.2.37)5> F = fun (R) -> io:format("~p ~p~n", [latency:ts(now()), erlang:statistics(run_queues)]), timer:sleep(100), R(R) end.
      #Fun<erl_eval.6.80247286>
      1353032384137 {11,104,2,0,8,11,0,0,0,0,0,0,0,0,0,0}
      1353032384293 {4,65,103,7,2,20,0,0,0,0,0,0,0,0,0,0}
      1353032384425 {3,7,4,25,21,3,0,0,0,0,0,0,0,0,0,0}
      1353032384553 {23,17,50,6,6,0,0,0,0,0,0,0,0,0,0,0}
      1353032384672 {16,28,92,15,65,42,0,0,0,0,0,0,0,0,0,0}
      1353032384795 {6,4,47,15,1,0,0,0,0,0,0,0,0,0,0,0}
      1353032384919 {1,11,86,59,56,55,0,0,0,0,0,0,0,0,0,0}
      1353032385081 {54,49,30,44,33,11,0,0,0,0,0,0,0,0,0,0}
      1353032385221 {15,47,10,45,9,31,0,0,0,0,0,0,0,0,0,0}
      1353032385355 {46,2,72,89,28,4,0,0,0,0,0,0,0,0,0,0}
      1353032385468 {11,1,8,26,0,2,0,0,0,0,0,0,0,0,0,0}
      1353032385610 {7,23,7,14,20,13,0,0,0,0,0,0,0,0,0,0}
      1353032385765 {7,85,11,16,0,12,0,0,0,0,0,0,0,0,0,0}
      1353032385905 {9,29,28,2,3,26,0,0,0,0,0,0,0,0,0,0}
      1353032386068 {48,112,142,31,12,25,0,0,0,0,0,0,0,0,0,0}
      1353032386222 {11,40,28,36,5,9,0,0,0,0,0,0,0,0,0,0}
      1353032386356 {64,53,4,5,7,34,0,0,0,0,0,0,0,0,0,0}
      1353032386560 {0,2,45,2,0,89,0,0,0,0,0,0,0,0,0,0}
      1353032386700 {50,18,83,4,0,35,0,0,0,0,0,0,0,0,0,0}
      1353032386837 {0,18,3,2,17,4,0,0,0,0,0,0,0,0,0,0}
      1353032386984 {2,10,11,6,0,4,0,0,0,0,0,0,0,0,0,0}
      1353032387105 {1,5,12,2,0,64,0,0,0,0,0,0,0,0,0,0}
      1353032387231 {22,67,58,5,19,7,0,0,0,0,0,0,0,0,0,0}
      1353032387337 {17,1,38,33,7,1,0,0,0,0,0,0,0,0,0,0}
      1353032387469 {5,5,48,27,2,18,0,0,0,0,0,0,0,0,0,0}
      1353032387598 {2,50,47,88,41,8,0,0,0,0,0,0,0,0,0,0}
      1353032387746 {2,55,16,35,1,12,0,0,0,0,0,0,0,0,0,0}
      1353032387897 {3,29,98,0,5,19,0,0,0,0,0,0,0,0,0,0}
      1353032388021 {29,50,147,0,5,3,0,0,0,0,0,0,0,0,0,0}
      1353032388146 {15,3,30,3,46,2,0,0,0,0,0,0,0,0,0,0}
      1353032388277 {53,8,50,1,10,14,0,0,0,0,0,0,0,0,0,0}
      1353032388402 {2,19,45,0,6,2,0,0,0,0,0,0,0,0,0,0}
      1353032388594 {17,123,2,0,29,4,0,0,0,0,0,0,0,0,0,0}
      1353032388734 {35,92,0,3,40,70,0,0,0,0,0,0,0,0,0,0}
      1353032388873 {2,10,22,5,18,17,0,0,0,0,0,0,0,0,0,0}
      1353032389008 {112,84,15,0,1,0,0,0,0,0,0,0,0,0,0,0}
      1353032389133 {102,57,0,25,3,23,0,0,0,0,0,0,0,0,0,0}
      1353032389257 {44,55,28,5,36,49,0,0,0,0,0,0,0,0,0,0}
      1353032389379 {4,40,3,48,2,48,0,0,0,0,0,0,0,0,0,0}
      1353032389549 {24,161,24,38,16,21,0,0,0,0,0,0,0,0,0,0}
      1353032389686 {54,25,12,23,7,98,0,0,0,0,0,0,0,0,0,0}
      1353032389804 {79,33,20,2,3,46,0,0,0,0,0,0,0,0,0,0}
      1353032389950 {90,0,25,13,45,56,0,0,0,0,0,0,0,0,0,0}
      1353032390101 {59,10,17,1,37,54,0,0,0,0,0,0,0,0,0,0}

      2nd is the top stats about beam.smp:
      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      676 couchbas 20 0 26.0g 24g 5128 S 663.9 77.7 369:01.85 beam.smp

      24GB of resident memory, and CPU usage stays above 350%.

      1. queries2.png
        75 kB
        Tommie McAfee
      2. logs.tgz
        3.23 MB
        FilipeManana
      3. erl_crash.dump.tgz
        157 kB
        FilipeManana

        Activity

        Filipe Manana (Inactive) added a comment -

        Agreed, the main cause is the ns_config call timeouts in the ns_server view query HTTP handler.
        With so many timeouts, the logger processes and mb_master are the processes with the biggest
        message queue lengths (a shell snippet for ranking live processes this way is sketched after the listing):

        =proc:<0.167.0>
        State: Scheduled
        Name: 'sink-stderr'
        Spawned as: proc_lib:init_p/5
        Spawned by: <0.33.0>
        Started: Thu Nov 29 17:07:15 2012
        Message queue length: 3431

        =proc:<0.78.0>
        State: Scheduled
        Name: 'sink-disk_debug'
        Spawned as: proc_lib:init_p/5
        Spawned by: <0.33.0>
        Started: Thu Nov 29 17:07:15 2012
        Message queue length: 405

        =proc:<0.66.0>
        State: Waiting
        Name: 'sink-disk_error'
        Spawned as: proc_lib:init_p/5
        Spawned by: <0.33.0>
        Started: Thu Nov 29 17:07:15 2012
        Message queue length: 186

        =proc:<0.787.0>
        State: Waiting
        Name: mb_master
        Spawned as: proc_lib:init_p/5
        Spawned by: <0.371.0>
        Started: Thu Nov 29 17:07:31 2012
        Message queue length: 149

        =proc:<0.627.0>
        State: Waiting
        Spawned as: proc_lib:init_p/5
        Spawned by: <0.580.0>
        Started: Thu Nov 29 17:07:30 2012
        Message queue length: 84
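
        A quick way to produce this kind of ranking on a live node (rather than reading the crash dump) is to sort processes by message queue length from an Erlang shell attached to the node. This is only a sketch using standard BIFs, not a Couchbase-specific tool; the cutoff of 5 is arbitrary:

        %% Rank live processes by message queue length and show the top N.
        TopQueues = fun(N) ->
            Info = [{Len, P, erlang:process_info(P, registered_name)}
                    || P <- erlang:processes(),
                       {message_queue_len, Len} <- [erlang:process_info(P, message_queue_len)]],
            lists:sublist(lists:reverse(lists:keysort(1, Info)), N)
        end,
        TopQueues(5).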

        kzeller added a comment -

        Added to RN as: Be aware that if you attempt hundreds of simultaneous queries with an unlimited
        number of results, Couchbase Server may fail. For instance, 10 million results queried
        simultaneously will cause the server to fail. Instead, specify a reasonable limit on results
        when you query; otherwise the server will stall and crash due to excessive memory usage.
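
        For reference, the same kind of view query with an explicit limit keeps the response bounded. This is only a sketch: the host, bucket, design-document, and view names are placeholders (assuming the default view/CAPI port 8092), and the limit value is arbitrary:

        import json
        import urllib2

        # Bounded query: the server returns at most 100 rows instead of the
        # full (potentially multi-million row) result set.
        url = "http://10.6.2.37:8092/default/_design/ddoc1/_view/view1?limit=100"
        rows = json.load(urllib2.urlopen(url))["rows"]
        print len(rows)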

        Maria McDuff (Inactive) added a comment -

        moving to 2.1

        Aleksey Kondratenko (Inactive) added a comment -

        Not sure what to do about this one, but I've created MB-8501 to make ns_server defend itself against this condition and other similar conditions.

        Volker Mische added a comment -

        I'm closing this one as "Fixed". This is a very old bug (pre-2.0 GA), and now that we have performance testing in place, we'll catch this kind of bug if it occurs again.


          People

          • Assignee:
            Sriram Melkote
            Reporter:
            Chisheng Hong (Inactive)
          • Votes:
            0
            Watchers:
            5

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes