Couchbase Server / MB-8199

[Doc'd] many concurrent view requests cause excessive resource consumption and even crash

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0.1, 2.1.0
    • Fix Version/s: 2.2.0
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
    • Environment:
      4-core CPU, 16GB RAM, Linux
    • Operating System:
      CentOS 64-bit

      Description

      In response to many view requests against the scatter/gather view merger, a node can allocate so many resources that it will fail to recover.

      In one case, this caused many timeouts in the log and eventually led to reached_max_restart_intensity:
      [error_logger:error,2013-04-25T15:23:26.047,ns_1@10.128.16.171:error_logger<0.6.0>:ale_error_logger_handler:log_report:72]
      =========================SUPERVISOR REPORT=========================
           Supervisor: {local,ns_node_disco_sup}
           Context:    shutdown
           Reason:     reached_max_restart_intensity
           Offender:   [{pid,<0.17237.774>},
                        {name,ns_config_rep},
                        {mfargs,{ns_config_rep,start_link,[]}},
                        {restart_type,permanent},
                        {shutdown,1000},
                        {child_type,worker}]


        Activity

        ingenthr Matt Ingenthron added a comment -

        Note, I put this on 2.0.2 since I know it shouldn't be 2.1 and there does not appear to be a 2.0.3. I feared it would be lost if it didn't have a fixfor version. Please move as appropriate.

        maria Maria McDuff (Inactive) added a comment -

        Per bug scrub: Alk, can you check whether Aleksey A. can take a look at this?

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        We know about this problem, so I don't believe we need to investigate it again.

        Fixing it for 2.0.2 feels a bit late, but it is possible if really needed.

        dipti Dipti Borkar added a comment -

        When you say "we know this problem", can you elaborate a bit more? With more customers using views, they are likely to hit this as well.
        Can you help us understand the scenario? When can this problem happen? What is the probability of hitting it?

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        If you send too many view requests to any node, they will swamp it and kill it. I recall seeing that during pre-2.0 testing, and there must be an MB- ticket for it somewhere.

        maria Maria McDuff (Inactive) added a comment - - edited

        per bug triage, upgrading to blocker.
        the fix is to throttle the requests and not to crash/terminate.
        it's fine to be slow but not crash.
        alk k to take a look for 2.0.2
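
The "throttle, don't crash" idea above can be illustrated with a minimal sketch. This is not the ns_server implementation (which is written in Erlang); the class and function names are hypothetical, and the point is only that a request beyond the limit is answered with a cheap rejection instead of consuming more resources.

```python
import threading


class RequestLimiter:
    """Admit at most `limit` concurrent requests; reject the rest."""

    def __init__(self, limit=None):
        # limit=None means no limit is enforced.
        self.limit = limit
        self.active = 0
        self._lock = threading.Lock()

    def try_acquire(self):
        # Claim a slot if one is free; never block the caller.
        with self._lock:
            if self.limit is not None and self.active >= self.limit:
                return False
            self.active += 1
            return True

    def release(self):
        with self._lock:
            self.active -= 1


def handle(limiter, work):
    """Run work() if a slot is free, otherwise answer 503 immediately."""
    if not limiter.try_acquire():
        return 503, "too many concurrent requests"
    try:
        return 200, work()
    finally:
        limiter.release()
```

A rejected caller sees a slow or refused request, but the node's memory use stays bounded by the number of admitted requests.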

        Aliaksey Artamonau Aliaksey Artamonau added a comment -

        We merged a simple request throttler that can be configured via internal settings: http://review.couchbase.org/26334.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        It should also be noted that, since we don't yet have experience with how well this approach works in production, we decided to make "unlimited" the default for all limits.

        We can experiment with it in-house and gather experience with customers after 2.0.2 is out. Then we'll have enough data to enable it by default and set the right limits.

        alkondratenko Aleksey Kondratenko (Inactive) added a comment - CHANGES text is here: http://review.couchbase.org/#/c/26361/2/CHANGES,unified
        ingenthr Matt Ingenthron added a comment -

        Alk: we should request QE to develop a test for this. See it cause the problem in 2.0.1 and see it not cause the problem in 2.0.2, right? Assigning it to Maria for that purpose, then it should be closed perhaps when verified? Not sure what QE's process is here now.

        ingenthr Matt Ingenthron added a comment -

        Maria: Can you work with the team on the appropriate way to test that this is fixed and won't cause other problems?

        maria Maria McDuff (Inactive) added a comment -

        Abhinav,

        Please verify by:
        - instrumenting a test that sends many view requests. Do it manually first, then automate (if you already have a test that covers a similar scenario, just tweak that and use it for this verification).
        - verifying that no crashes happen. If you observe slowness, note it here. Slowness is OK.
        - noting Alk K's "unlimited" default limits. Verify all his changes on the review link.
        - using a stable build of 2.0.2, which should be built tonight or tomorrow.
        Thanks.
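
A rough sketch of the kind of load test requested above, assuming the caller supplies a `fetch` function that issues one view query and returns its HTTP status code. `hammer` and `fake_fetch` are hypothetical names, and the stand-in fetch never touches the network.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def hammer(fetch, total=200, concurrency=50):
    """Call fetch() `total` times from `concurrency` threads and tally outcomes."""
    def one(_):
        try:
            return fetch()
        except Exception as exc:
            # A refused connection or timeout is tallied by exception name.
            return type(exc).__name__

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        return Counter(pool.map(one, range(total)))


def fake_fetch():
    # Stand-in for a real view query; returns an HTTP status code.
    return 200
```

Replacing `fake_fetch` with a real query against a test node gives a tally of 200s, 503s, and failures, which can be compared between builds with and without the fix.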

        dipti Dipti Borkar added a comment -

        We also need to document this.

        * (MB-8199) REST and CAPI request throttler implemented.

          Its behavior is controlled by three parameters, which can be set via
          the /internalSettings REST endpoint:

          - restRequestLimit

            Maximum number of simultaneous connections each node should
            accept on the REST port. Diagnostics-related endpoints and
            /internalSettings are not counted.

          - capiRequestLimit

            Maximum number of simultaneous connections each node should
            accept on the CAPI port. Note that this includes XDCR
            connections.

          - dropRequestMemoryThresholdMiB

            The amount of memory used by the Erlang VM that should not be
            exceeded. If it is exceeded, the server starts dropping
            incoming connections.

          When the server decides to reject an incoming connection because some
          limit was exceeded, it responds with status code 503 and a
          Retry-After header set appropriately (more or less). On the REST
          port, a textual description of why the request was rejected is
          returned in the body. On the CAPI port, in CouchDB tradition, a JSON
          object is returned with "error" and "reason" fields.

          By default, all the thresholds are set to be unlimited.
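
From a client's point of view, the rejection behavior described in the CHANGES text could be handled roughly as follows. This is a sketch under assumptions: `parse_rejection` is a hypothetical helper, headers are passed as a plain dict, the body is a string, and Retry-After is assumed to carry a number of seconds.

```python
import json


def parse_rejection(status, headers, body, default_delay=1.0):
    """Return (retry_delay_seconds, reason) for a 503, or None otherwise."""
    if status != 503:
        return None
    try:
        delay = float(headers.get("Retry-After", default_delay))
    except ValueError:
        # Retry-After may also be an HTTP date; fall back to the default.
        delay = default_delay
    try:
        # CAPI port: JSON object with "error" and "reason" fields.
        doc = json.loads(body)
        reason = doc.get("reason", body) if isinstance(doc, dict) else body
    except ValueError:
        # REST port: plain-text description in the body.
        reason = body
    return delay, reason
```

A caller would sleep for the returned delay and retry, rather than immediately resending the request to an already overloaded node.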

        kzeller kzeller added a comment - https://github.com/couchbase/ns_server/blob/master/CHANGES#L273
        perry Perry Krug added a comment -

        Has QE verified that this does in fact solve the problem?

        perry Perry Krug added a comment - - edited

        Karen, just one thing:
        -[FIXED] The release notes link on page 352 points to "Adjusting Rebalance during Compaction" but should point to "8.8.1. Limiting Simultaneous Node Requests", right?

        alkondratenko Aleksey Kondratenko (Inactive) added a comment -

        Renamed the ticket's subject to more accurately reflect its nature, i.e. this is not, strictly speaking, a leak.

        kzeller kzeller added a comment - - edited

        Fixed link: In the past, too many simultaneous view requests could overwhelm a node.
        You can now limit the number of simultaneous requests a node can receive. For
        more information, see the REST API section, <xref linkend="couchbase-restapi-request-limits" />.
        DOC'D FOR 2.1

        removing labeling until relevant for 2.2
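
As an illustration of setting these limits, the sketch below builds (but does not send) a request to /internalSettings using the three parameter names from the CHANGES text. The admin port 8091, HTTP basic auth, form-encoded POST body, and the helper name `build_limits_request` are assumptions for this example and should be checked against the REST API documentation.

```python
import base64
from urllib.parse import urlencode
from urllib.request import Request


def build_limits_request(host, user, password, rest_limit=None,
                         capi_limit=None, memory_mib=None, port=8091):
    """Build (but do not send) a POST to /internalSettings."""
    params = {
        "restRequestLimit": rest_limit,
        "capiRequestLimit": capi_limit,
        "dropRequestMemoryThresholdMiB": memory_mib,
    }
    # Only include the limits the caller actually set.
    body = urlencode({k: v for k, v in params.items() if v is not None})
    # Assumed: HTTP basic auth with administrator credentials.
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return Request(
        f"http://{host}:{port}/internalSettings",
        data=body.encode(),
        headers={"Authorization": f"Basic {token}"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would send it; doing that is left out here so the sketch has no side effects.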

        chiyoung Chiyoung Seo added a comment -

        Karen,

        Please close it if it is already resolved.


          People

          • Assignee: kzeller
          • Reporter: Matt Ingenthron