Couchbase Server / MB-7199

Couchbase Server can't handle hundreds of simultaneous view queries with an unlimited number of results


    Details

    • Triage:
      Untriaged

      Description

      Cluster: 6 nodes
      10.6.2.37
      10.6.2.38
      10.6.2.39
      10.6.2.40
      10.6.2.42
      10.6.2.43

      Build # 2.0.0-1952 with 16 Erlang schedulers
      each node has a 390GB SSD drive and 32GB RAM

      2 buckets created: sasl and default. Start loading items at 8K creates per second into each bucket, then insert a ddoc with 2 views into each bucket. Then have 4 clients query the views at 120 reads per second.
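      The design documents themselves are not attached; a minimal sketch of a ddoc with 2 views, using hypothetical view names and map functions, and assuming the default CAPI port 8092 on the "default" bucket, would be inserted like this:

      import json
      import urllib2

      # Hypothetical design document with two views; the actual map functions
      # used in this test are not recorded in the ticket.
      ddoc = {
          "views": {
              "view1": {"map": "function (doc, meta) { emit(meta.id, null); }"},
              "view2": {"map": "function (doc, meta) { emit(meta.id, 1); }"}
          }
      }

      req = urllib2.Request("http://10.6.2.37:8092/default/_design/ddoc1",
                            data=json.dumps(ddoc),
                            headers={"Content-Type": "application/json"})
      req.get_method = lambda: "PUT"   # urllib2 defaults to POST when data is set
      urllib2.urlopen(req)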

      I don't put any limit on the query results, and the queries are issued without waiting for the previous ones to finish returning their results:
      import urllib2

      # Build the view query URL; no limit/skip parameters are applied.
      capiUrl = "http://%s:%s/couchBase/" % (cfg.COUCHBASE_IP, cfg.COUCHBASE_PORT)
      url = capiUrl + '%s/design/%s/%s/%s' % (bucket, design_doc_name, type_, view_name)
      headers = {'Content-Type': 'application/json',
                 'Authorization': 'Basic %s' % authorization,
                 'Accept': '*/*'}
      req = urllib2.Request(url, headers=headers)
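
      The snippet above only builds the request; each client presumably then opens it and reads back the full, unbounded response, roughly like the sketch below (the urlopen/read portion is a reconstruction, not the actual test code):

      import json

      # Reconstruction: issue the request and read the entire response body.
      # With no limit parameter, a single query can stream millions of rows
      # through ns_server's /couchBase/ HTTP proxy before read() returns.
      response = urllib2.urlopen(req)
      body = response.read()
      rows = json.loads(body).get("rows", [])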

      Then the UI becomes unresponsive.
      Pay attention to the following stats:

      The first is the Erlang scheduler run queues on one of the nodes while the queries are running:

      (ns_1@10.6.2.37)5> F = fun (R) -> io:format("~p ~p~n", [latency:ts(now()), erlang:statistics(run_queues)]), timer:sleep(100), R(R) end.
      #Fun<erl_eval.6.80247286>
      1353032384137 {11,104,2,0,8,11,0,0,0,0,0,0,0,0,0,0}
      1353032384293 {4,65,103,7,2,20,0,0,0,0,0,0,0,0,0,0}
      1353032384425 {3,7,4,25,21,3,0,0,0,0,0,0,0,0,0,0}
      1353032384553 {23,17,50,6,6,0,0,0,0,0,0,0,0,0,0,0}
      1353032384672 {16,28,92,15,65,42,0,0,0,0,0,0,0,0,0,0}
      1353032384795 {6,4,47,15,1,0,0,0,0,0,0,0,0,0,0,0}
      1353032384919 {1,11,86,59,56,55,0,0,0,0,0,0,0,0,0,0}
      1353032385081 {54,49,30,44,33,11,0,0,0,0,0,0,0,0,0,0}
      1353032385221 {15,47,10,45,9,31,0,0,0,0,0,0,0,0,0,0}
      1353032385355 {46,2,72,89,28,4,0,0,0,0,0,0,0,0,0,0}
      1353032385468 {11,1,8,26,0,2,0,0,0,0,0,0,0,0,0,0}
      1353032385610 {7,23,7,14,20,13,0,0,0,0,0,0,0,0,0,0}
      1353032385765 {7,85,11,16,0,12,0,0,0,0,0,0,0,0,0,0}
      1353032385905 {9,29,28,2,3,26,0,0,0,0,0,0,0,0,0,0}
      1353032386068 {48,112,142,31,12,25,0,0,0,0,0,0,0,0,0,0}
      1353032386222 {11,40,28,36,5,9,0,0,0,0,0,0,0,0,0,0}
      1353032386356 {64,53,4,5,7,34,0,0,0,0,0,0,0,0,0,0}
      1353032386560 {0,2,45,2,0,89,0,0,0,0,0,0,0,0,0,0}
      1353032386700 {50,18,83,4,0,35,0,0,0,0,0,0,0,0,0,0}
      1353032386837 {0,18,3,2,17,4,0,0,0,0,0,0,0,0,0,0}
      1353032386984 {2,10,11,6,0,4,0,0,0,0,0,0,0,0,0,0}
      1353032387105 {1,5,12,2,0,64,0,0,0,0,0,0,0,0,0,0}
      1353032387231 {22,67,58,5,19,7,0,0,0,0,0,0,0,0,0,0}
      1353032387337 {17,1,38,33,7,1,0,0,0,0,0,0,0,0,0,0}
      1353032387469 {5,5,48,27,2,18,0,0,0,0,0,0,0,0,0,0}
      1353032387598 {2,50,47,88,41,8,0,0,0,0,0,0,0,0,0,0}
      1353032387746 {2,55,16,35,1,12,0,0,0,0,0,0,0,0,0,0}
      1353032387897 {3,29,98,0,5,19,0,0,0,0,0,0,0,0,0,0}
      1353032388021 {29,50,147,0,5,3,0,0,0,0,0,0,0,0,0,0}
      1353032388146 {15,3,30,3,46,2,0,0,0,0,0,0,0,0,0,0}
      1353032388277 {53,8,50,1,10,14,0,0,0,0,0,0,0,0,0,0}
      1353032388402 {2,19,45,0,6,2,0,0,0,0,0,0,0,0,0,0}
      1353032388594 {17,123,2,0,29,4,0,0,0,0,0,0,0,0,0,0}
      1353032388734 {35,92,0,3,40,70,0,0,0,0,0,0,0,0,0,0}
      1353032388873 {2,10,22,5,18,17,0,0,0,0,0,0,0,0,0,0}
      1353032389008 {112,84,15,0,1,0,0,0,0,0,0,0,0,0,0,0}
      1353032389133 {102,57,0,25,3,23,0,0,0,0,0,0,0,0,0,0}
      1353032389257 {44,55,28,5,36,49,0,0,0,0,0,0,0,0,0,0}
      1353032389379 {4,40,3,48,2,48,0,0,0,0,0,0,0,0,0,0}
      1353032389549 {24,161,24,38,16,21,0,0,0,0,0,0,0,0,0,0}
      1353032389686 {54,25,12,23,7,98,0,0,0,0,0,0,0,0,0,0}
      1353032389804 {79,33,20,2,3,46,0,0,0,0,0,0,0,0,0,0}
      1353032389950 {90,0,25,13,45,56,0,0,0,0,0,0,0,0,0,0}
      1353032390101 {59,10,17,1,37,54,0,0,0,0,0,0,0,0,0,0}

      The second is the top output for beam.smp:
      PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
      676 couchbas 20 0 26.0g 24g 5128 S 663.9 77.7 369:01.85 beam.smp

      That is 24GB of resident memory usage, and the CPU% stays above 350%.

        Attachments

        1. erl_crash.dump.tgz
          157 kB
        2. logs.tgz
          3.23 MB
        3. queries2.png
          75 kB

          Activity

          FilipeManana Filipe Manana (Inactive) added a comment -

          Agreed, the main cause is the ns_config call timeouts in the ns_server view query HTTP handler.
          With so many timeouts, the logger processes and mb_master are the processes with the biggest
          message queue lengths:

          =proc:<0.167.0>
          State: Scheduled
          Name: 'sink-stderr'
          Spawned as: proc_lib:init_p/5
          Spawned by: <0.33.0>
          Started: Thu Nov 29 17:07:15 2012
          Message queue length: 3431

          =proc:<0.78.0>
          State: Scheduled
          Name: 'sink-disk_debug'
          Spawned as: proc_lib:init_p/5
          Spawned by: <0.33.0>
          Started: Thu Nov 29 17:07:15 2012
          Message queue length: 405

          =proc:<0.66.0>
          State: Waiting
          Name: 'sink-disk_error'
          Spawned as: proc_lib:init_p/5
          Spawned by: <0.33.0>
          Started: Thu Nov 29 17:07:15 2012
          Message queue length: 186

          =proc:<0.787.0>
          State: Waiting
          Name: mb_master
          Spawned as: proc_lib:init_p/5
          Spawned by: <0.371.0>
          Started: Thu Nov 29 17:07:31 2012
          Message queue length: 149

          =proc:<0.627.0>
          State: Waiting
          Spawned as: proc_lib:init_p/5
          Spawned by: <0.580.0>
          Started: Thu Nov 29 17:07:30 2012
          Message queue length: 84

          kzeller kzeller added a comment -

          Added to RN as: Be aware that if you attempt hundreds of simultaneous queries with an unlimited
          number of results, Couchbase Server may fail. For instance, 10 million
          results queried simultaneously will cause the server to fail. Instead, you should
          specify a reasonable limit of results when you query; otherwise the
          server will stall and crash due to excessive memory usage.
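
          For example, the query from the description can be bounded with the standard limit view query parameter (the value 1000 below is only an illustration, and the variables are the ones from the description's snippet):

          # Same view query as in the description, but capped with a row limit so
          # the server never has to stream an unbounded result set.
          url = capiUrl + '%s/design/%s/%s/%s?limit=1000' % (bucket, design_doc_name,
                                                             type_, view_name)
          req = urllib2.Request(url, headers=headers)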

          maria Maria McDuff (Inactive) added a comment -

          moving to 2.1

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Not sure what to do about this one, but I've created MB-8501 to do something in ns_server so that it defends itself from this condition and other similar conditions.

          vmx Volker Mische added a comment -

          I'm closing this one as "Fixed"; this is a very old bug (pre-2.0 GA). Now that we have performance testing in place, we'll discover this kind of bug if it occurs again.


            People

            • Assignee:
              siri Sriram Melkote
              Reporter:
              Chisheng Chisheng Hong (Inactive)
            • Votes:
              0
              Watchers:
              5

              Dates

              • Created:
                Updated:
                Resolved:

                Gerrit Reviews

                There are no open Gerrit changes
