Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-40518

[BP MB-40498] - Eventing is not retrying bucket ops failures like ETMPFAIL that can be retried

    XMLWordPrintable

Details

    • Untriaged
    • 1
    • Unknown

    Description

      I am seeing failures with many Eventing workers and 25M+ docs

      I create an Eventing function "test_update_2" (attached) with an alias of "bdp_vardata" to a bucket called "crondata" (Memory Quota 7.9GB) and have 64 workers with the following source code:

      function OnUpdate(doc, meta) {
       var maxattempt = 2;
       for (var tries=1; tries<=maxattempt; tries++) {
         try {
           var doc = bdp_vardata[meta.id];
           doc.random = Math.random();
           bdp_vardata[meta.id] = doc;
           break;
         } catch (e) {
           if (tries === maxattempt) 
             log("attempt "+ tries + " error occured during deletion :: ", 
                 e, " for id ", meta.id); 
         }
       }
      }

      The source bucket and the bucket that is updated is "crondata" in addition there is a 100MB Eventing meta data bucket "metadata"

       

      I load 25,528,448 document into crondata with a KEYs like todelete01::100006 and data like

      {
       "type": "vbs_seed",
       "id": 100006,
      }

      once the eventing function runs all documents in bucket "crondata" will be enriched with a new field called "random"

      {
       "type": "vbs_seed",
       "id": 100006,
       "random": 0.22187920300189878
      }

      The single node server

      When I run Eventing on my 12 core 2.1Ghz 64 MB Xeon

      uname -a
      Linux couch01 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux

      /opt/couchbase/bin/couchbase-server -v
      Couchbase Server 6.6.0-7883 (EE)

      Configured with Eventing 256 RAM, Data 7900 MB RAM no other services

      The Issue 

      The system will process about 7.6 Million doc (mutations) and then I will get LCB_ETMPFAIL errors.

      2020-07-15T18:52:46.795-07:00 [INFO] "attempt 2 error occured during deletion :: " {"message":{"code":392,"desc":"Temporary failure received from server. Try again later","name":"LCB_ETMPFAIL"},"stack":"Error\n at OnUpdate (test_update_2.js:10:35)"} " for id " "todelete22::63364"

      A work around

      In the UI simply pause then resume every 7 million rows.  This proves that Eventing can process the data with 64 workers but something odd is happening where we don't honor some sort of resource constraint.

      I believe I also have no issues if I set the workers down form sixty-four (64) to just three (3) workers

      I have prepared a video showing exactly how it fails hopefully the video and the uploaded Eventing function will help track down the root cause.

       

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              vikas.chaudhary Vikas Chaudhary
              jeelan.poola Jeelan Poola
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty