  1. Couchbase Server
  2. MB-40518

[BP MB-40498] - Eventing is not retrying bucket ops failures like ETMPFAIL that can be retried



    Description

      I am seeing failures when running Eventing with many workers against 25M+ documents.

      I created an Eventing function "test_update_2" (attached) with a bucket alias "bdp_vardata" bound to a bucket called "crondata" (memory quota 7.9 GB). The function is configured with 64 workers and has the following source code:

      function OnUpdate(doc, meta) {
       var maxattempt = 2;
       for (var tries=1; tries<=maxattempt; tries++) {
         try {
           var doc = bdp_vardata[meta.id];
           doc.random = Math.random();
           bdp_vardata[meta.id] = doc;
           break;
         } catch (e) {
           if (tries === maxattempt) 
             log("attempt "+ tries + " error occured during deletion :: ", 
                 e, " for id ", meta.id); 
         }
       }
      }

      The source bucket and the bucket being updated are both "crondata". In addition, there is a 100 MB Eventing metadata bucket named "metadata".

       

      I load 25,528,448 documents into crondata, with keys like todelete01::100006 and data like:

      {
       "type": "vbs_seed",
       "id": 100006
      }

      Once the Eventing function has run, every document in bucket "crondata" should be enriched with a new field called "random":

      {
       "type": "vbs_seed",
       "id": 100006,
       "random": 0.22187920300189878
      }

      The single-node server

      I run Eventing on my 12-core 2.1 GHz 64 MB Xeon:

      uname -a
      Linux couch01 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux

      /opt/couchbase/bin/couchbase-server -v
      Couchbase Server 6.6.0-7883 (EE)

      Configured with Eventing 256 MB RAM, Data 7900 MB RAM, and no other services.

      The Issue 

      The system processes about 7.6 million documents (mutations), and then I start getting LCB_ETMPFAIL errors:

      2020-07-15T18:52:46.795-07:00 [INFO] "attempt 2 error occured during deletion :: " {"message":{"code":392,"desc":"Temporary failure received from server. Try again later","name":"LCB_ETMPFAIL"},"stack":"Error\n at OnUpdate (test_update_2.js:10:35)"} " for id " "todelete22::63364"
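
      Until such failures are retried internally, a handler could classify the error itself and retry only temporary failures. The following is a minimal sketch, assuming the caught error has the shape shown in the log line above (e.message.name). The helper names isTemporaryFailure and withRetries are hypothetical and not part of the Eventing API, and the retry limit of 5 is arbitrary. Immediate retries may still fail if the server stays under memory pressure.

```javascript
// Hypothetical helper: true when the caught error looks like a temporary
// failure. The shape (e.message.name) is taken from the logged error above.
function isTemporaryFailure(e) {
  return Boolean(e && e.message && e.message.name === "LCB_ETMPFAIL");
}

// Hypothetical helper: run op(), retrying only temporary failures,
// up to maxAttempts times in total.
function withRetries(op, maxAttempts) {
  var attempts = 0;
  var lastError = null;
  while (attempts < maxAttempts) {
    attempts++;
    try {
      return { ok: true, value: op(), attempts: attempts };
    } catch (e) {
      lastError = e;
      if (!isTemporaryFailure(e)) {
        break; // not retryable, give up immediately
      }
    }
  }
  return { ok: false, error: lastError, attempts: attempts };
}

// Sketch of the handler from this report rewritten on top of the helper.
// bdp_vardata (bucket alias) and log() are provided by the Eventing runtime.
function OnUpdate(doc, meta) {
  var result = withRetries(function () {
    var current = bdp_vardata[meta.id];
    current.random = Math.random();
    bdp_vardata[meta.id] = current;
  }, 5); // assumed retry limit
  if (!result.ok) {
    log("update failed after " + result.attempts + " attempt(s) for id ",
        meta.id, result.error);
  }
}
```

      Returning a result object instead of rethrowing keeps the decision about logging in the handler, so a non-retryable error is reported after a single attempt rather than being retried blindly.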

      A workaround

      In the UI, simply pause and then resume the function every 7 million rows. This shows that Eventing can process the data with 64 workers, but something odd is happening: we do not appear to honor some sort of resource constraint.

      I believe I also have no issues if I reduce the worker count from sixty-four (64) to just three (3).

      I have prepared a video showing exactly how it fails. Hopefully the video and the uploaded Eventing function will help track down the root cause.

       

            People

              vikas.chaudhary Vikas Chaudhary
              jeelan.poola Jeelan Poola