Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-40518

[BP MB-40498] - Eventing is not retrying bucket ops failures like ETMPFAIL that can be retried

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 6.6.0
    • Fix Version/s: 6.6.1
    • Component/s: eventing
    • Triage:
      Untriaged
    • Story Points:
      1
    • Is this a Regression?:
      Unknown

      Description

      I am seeing failures with many Eventing workers and 25M+ docs

      I create an Eventing function "test_update_2" (attached) with an alias of "bdp_vardata" to a bucket called "crondata" (Memory Quota 7.9GB) and have 64 workers with the following source code:

      function OnUpdate(doc, meta) {
       var maxattempt = 2;
       for (var tries=1; tries<=maxattempt; tries++) {
         try {
           var doc = bdp_vardata[meta.id];
           doc.random = Math.random();
           bdp_vardata[meta.id] = doc;
           break;
         } catch (e) {
           if (tries === maxattempt) 
             log("attempt "+ tries + " error occured during deletion :: ", 
                 e, " for id ", meta.id); 
         }
       }
      }

      The source bucket and the bucket that is updated is "crondata" in addition there is a 100MB Eventing meta data bucket "metadata"

       

      I load 25,528,448 document into crondata with a KEYs like todelete01::100006 and data like

      {
       "type": "vbs_seed",
       "id": 100006,
      }

      once the eventing function runs all documents in bucket "crondata" will be enriched with a new field called "random"

      {
       "type": "vbs_seed",
       "id": 100006,
       "random": 0.22187920300189878
      }

      The single node server

      When I run Eventing on my 12 core 2.1Ghz 64 MB Xeon

      uname -a
      Linux couch01 4.19.0-6-amd64 #1 SMP Debian 4.19.67-2+deb10u2 (2019-11-11) x86_64 GNU/Linux

      /opt/couchbase/bin/couchbase-server -v
      Couchbase Server 6.6.0-7883 (EE)

      Configured with Eventing 256 RAM, Data 7900 MB RAM no other services

      The Issue 

      The system will process about 7.6 Million doc (mutations) and then I will get LCB_ETMPFAIL errors.

      2020-07-15T18:52:46.795-07:00 [INFO] "attempt 2 error occured during deletion :: " {"message":{"code":392,"desc":"Temporary failure received from server. Try again later","name":"LCB_ETMPFAIL"},"stack":"Error\n at OnUpdate (test_update_2.js:10:35)"} " for id " "todelete22::63364"

      A work around

      In the UI simply pause then resume every 7 million rows.  This proves that Eventing can process the data with 64 workers but something odd is happening where we don't honor some sort of resource constraint.

      I believe I also have no issues if I set the workers down form sixty-four (64) to just three (3) workers

      I have prepared a video showing exactly how it fails hopefully the video and the uploaded Eventing function will help track down the root cause.

       

        Attachments

          Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

            Activity

            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.6.0-7911 contains eventing commit a5f85af with commit message:
            MB-40518 : check retriability of LCB errors properly

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.0-7911 contains eventing commit a5f85af with commit message: MB-40518 : check retriability of LCB errors properly
            Hide
            build-team Couchbase Build Team added a comment -

            Build couchbase-server-6.6.1-9001 contains eventing commit a5f85af with commit message:
            MB-40518 : check retriability of LCB errors properly

            Show
            build-team Couchbase Build Team added a comment - Build couchbase-server-6.6.1-9001 contains eventing commit a5f85af with commit message: MB-40518 : check retriability of LCB errors properly
            Hide
            vikas.chaudhary Vikas Chaudhary added a comment -

            Verified on 6.6.1-1901

            Seeing retry message as

            root@node3-cb602-ubuntu16:/opt/couchbase/var/lib/couchbase/logs# tailf eventing.log | grep retry_count
            2020-08-17T05:43:21.043-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_2:/tmp/127.0.0.1:8091_2_3842793501.sock:31067] RetryLcbCommand retry_count: 74
            2020-08-17T05:43:21.149-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_1:/tmp/127.0.0.1:8091_1_3842793501.sock:31075] RetryLcbCommand retry_count: 75
            2020-08-17T05:43:21.153-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_0:/tmp/127.0.0.1:8091_0_3842793501.sock:31059] RetryLcbCommand retry_count: 63
            2020-08-17T05:43:39.061-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_2:/tmp/127.0.0.1:8091_2_3842793501.sock:31067] RetryLcbCommand retry_count: 77
            2020-08-17T05:43:39.078-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_0:/tmp/127.0.0.1:8091_0_3842793501.sock:31059] RetryLcbCommand retry_count: 72
            2020-08-17T05:43:39.175-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_1:/tmp/127.0.0.1:8091_1_3842793501.sock:31075] RetryLcbCommand retry_count: 77
            2020-08-17T05:43:57.159-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_1:/tmp/127.0.0.1:8091_1_3842793501.sock:31075] RetryLcbCommand retry_count: 75
            2020-08-17T05:43:57.234-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_2:/tmp/127.0.0.1:8091_2_3842793501.sock:31067] RetryLcbCommand retry_count: 75
            2020-08-17T05:43:57.255-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_0:/tmp/127.0.0.1:8091_0_3842793501.sock:31059] RetryLcbCommand retry_count: 77
            2020-08-17T05:44:16.038-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_2:/tmp/127.0.0.1:8091_2_3842793501.sock:31067] RetryLcbCommand retry_count: 57
            2020-08-17T05:44:16.100-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_1:/tmp/127.0.0.1:8091_1_3842793501.sock:31075] RetryLcbCommand retry_count: 58
            2020-08-17T05:44:16.129-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_0:/tmp/127.0.0.1:8091_0_3842793501.sock:31059] RetryLcbCommand retry_count: 48 

            Show
            vikas.chaudhary Vikas Chaudhary added a comment - Verified on 6.6.1-1901 Seeing retry message as root@node3-cb602-ubuntu16:/opt/couchbase/var/lib/couchbase/logs# tailf eventing.log | grep retry_count 2020-08-17T05:43:21.043-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_2:/tmp/127.0.0.1:8091_2_3842793501.sock:31067] RetryLcbCommand retry_count: 74 2020-08-17T05:43:21.149-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_1:/tmp/127.0.0.1:8091_1_3842793501.sock:31075] RetryLcbCommand retry_count: 75 2020-08-17T05:43:21.153-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_0:/tmp/127.0.0.1:8091_0_3842793501.sock:31059] RetryLcbCommand retry_count: 63 2020-08-17T05:43:39.061-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_2:/tmp/127.0.0.1:8091_2_3842793501.sock:31067] RetryLcbCommand retry_count: 77 2020-08-17T05:43:39.078-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_0:/tmp/127.0.0.1:8091_0_3842793501.sock:31059] RetryLcbCommand retry_count: 72 2020-08-17T05:43:39.175-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_1:/tmp/127.0.0.1:8091_1_3842793501.sock:31075] RetryLcbCommand retry_count: 77 2020-08-17T05:43:57.159-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_1:/tmp/127.0.0.1:8091_1_3842793501.sock:31075] RetryLcbCommand retry_count: 75 2020-08-17T05:43:57.234-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_2:/tmp/127.0.0.1:8091_2_3842793501.sock:31067] RetryLcbCommand retry_count: 75 2020-08-17T05:43:57.255-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_0:/tmp/127.0.0.1:8091_0_3842793501.sock:31059] RetryLcbCommand retry_count: 77 2020-08-17T05:44:16.038-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_2:/tmp/127.0.0.1:8091_2_3842793501.sock:31067] RetryLcbCommand retry_count: 57 2020-08-17T05:44:16.100-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_1:/tmp/127.0.0.1:8091_1_3842793501.sock:31075] RetryLcbCommand retry_count: 58 2020-08-17T05:44:16.129-07:00 [Info] eventing-consumer [worker_Function_505542201_test_vb_shuffle_during_failover_add_back_0:/tmp/127.0.0.1:8091_0_3842793501.sock:31059] RetryLcbCommand retry_count: 48

              People

              Assignee:
              vikas.chaudhary Vikas Chaudhary
              Reporter:
              jeelan.poola Jeelan Poola
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved:

                  Gerrit Reviews

                  There are no open Gerrit changes

                    PagerDuty