Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-51563

Rebalance stuck for more than ~20 minutes with eventing service reporting: "Failed to start dcp feed".

    XMLWordPrintable

Details

    • Bug
    • Resolution: Duplicate
    • Major
    • None
    • 6.6.5
    • eventing
    • Couchbase Server Version 6.6.5
    • Untriaged
    • 1
    • Unknown

    Description

      The following scenario is executed in Operator setup on Azure kubernetes cluster. Although, the issue is intermittently seen on EKS cluster as well. So, more likely it is cloud agnostic.

      Coming to the sequence of events:

      • Create a 1 node CB cluster running Server v6.6.5 (GA build) with kv,index,query,eventing services.
      • Create 3 buckets on the cluster: source,destination,metadata. Each bucket is allocated 100MB with Replicas=1 and high i/o priority.
      • Deploy the sample eventing function with buckets mentioned above taking similar role as their names.  Eventing function is deployed using POST request, specifically using https://pkg.go.dev/net/http#NewRequest with request type as POST, hostURL as
        `"http://test-couchbase-j97b4.test-hn7fr.svc:8096/api/v1/functions/test"` and body defined as strings.NewReader(eventingFunc), eventingFunc: 

        eventingJSONFunc := `[{` +        `"appname": "` + eventingFuncName + `",` +        `"id": 0,` +        `"depcfg":{"buckets":[{"alias":"dst_bucket","bucket_name":"` + dstBucketName + `"}],"metadata_bucket":"` + metaBucketName + `","source_bucket":"` + srcBucketName + `"},` +        `"version":"", "handleruuid":0,` +        `"settings": {"dcp_stream_boundary":"everything","deadline_timeout":62,"deployment_status":true,"description":"","execution_timeout":60,"log_level":"INFO","processing_status":true,"user_prefix":"eventing","worker_count":3},` +        `"using_doc_timer": false,` +        `"appcode": "` + jsFunc + `"` +        `}]` 

        `function OnUpdate(doc, meta) {\n    var doc_id = meta.id;\n    dst_bucket[doc_id] = \"test value\";\n}\nfunction OnDelete(meta) {\n  delete dst_bucket[meta.id];\n}` 

      • Start loading source bucket using inbuilt pillowfight binary of CB v6.6.5 with max items as 100k and batch size of 1 continuously.  
      • While data load is in parallel, start scaling operation on the cluster:
        • First scale the cluster to 2 nodes with same set of services, wait for Rebalance to complete. Rebalance completes successfully.
        • Then Scale the cluster to 3 nodes with same set of services, wait for Rebalance to complete. Rebalance completes successfully.
        • Finally, Scale down the cluster to 2 nodes, wait for Rebalance to complete. Rebalance is stuck for ~600 seconds at which point it fails. Another attempt at Rebalance is made but that is stuck for more than ~10 minutes at which point the test exists due to 20 minute timeout period.

      Since, Rebalance is stuck with 3 nodes in the Cluster, all 3 cbopinfo attached(cbopinfo-test-hn7fr*). I've also attached operator logs if required: [\{_}cbopinfo-20220323T151915+0000.tar.gz]{_}

      However, for the sake of transparency, we do not see Rebalance stuck in this scenario everytime we run our test. I'd say it's about 60-70% of the time we observe this in our regression suite and the scenario passes in another couple of retries if run in isolation. We do not see this level of instability with 6.6.4 or 7.0.3 where the fail % in the first try is about 5-10%.

      Looking at logs from a rudimentary perspective, it seems like we are hitting a condition where vbuckets are not being shuffled by worker threads of the eventing nodes successfully during rebalance specifically when there are 3 eventing nodes in the cluster in v6.6.5.

      Probable reason:  

      • Started to look from ns_server.info.
      • First Rebalance starts, test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc added to the cluster. 

         [user:info,2022-03-23T14:49:56.024Z,ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:<0.847.0>:ns_orchestrator:idle:804]Starting rebalance, KeepNodes = ['ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc',                                  'ns_1@test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 09ae8b51e22e6a2dea5f4fc1c1bfc6d5 [rebalance 

      • Rebalance completes. 

        [user:info,2022-03-23T14:52:50.972Z,ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:<0.847.0>:ns_orchestrator:log_rebalance_completion:1460]Rebalance completed successfully.
        Rebalance Operation Id = 09ae8b51e22e6a2dea5f4fc1c1bfc6d5 

      • Second Rebalance starts, test-couchbase-j97b4-0002.test-couchbase-j97b4.test-hn7fr.svc added to the cluster.  

        [user:info,2022-03-23T14:53:48.016Z,ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:<0.847.0>:ns_orchestrator:idle:804]Starting rebalance, KeepNodes = ['ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc',
                                         'ns_1@test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc',
                                         'ns_1@test-couchbase-j97b4-0002.test-couchbase-j97b4.test-hn7fr.svc'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 2abced7839bb114f021496087bc56c60
         

      • Rebalance completes. 

        [user:info,2022-03-23T14:58:53.962Z,ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:<0.847.0>:ns_orchestrator:log_rebalance_completion:1460]Rebalance completed successfully.
        Rebalance Operation Id = 2abced7839bb114f021496087bc56c60 

      • Third Rebalance starts, test-couchbase-j97b4-0002.test-couchbase-j97b4.test-hn7fr.svc  being removed from the cluster.  

        [user:info,2022-03-23T14:59:43.386Z,ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:<0.847.0>:ns_orchestrator:idle:804]Starting rebalance, KeepNodes = ['ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc',
                                         'ns_1@test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc'], EjectNodes = ['ns_1@test-couchbase-j97b4-0002.test-couchbase-j97b4.test-hn7fr.svc'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 7a8bbb90eeba632e2760f61ddafc14e9
         

      • Rebalance exited due to no progress. 

        [user:error,2022-03-23T15:13:37.573Z,ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:<0.847.0>:ns_orchestrator:log_rebalance_completion:1460]Rebalance exited with reason {service_rebalance_failed,eventing,
                                      {worker_died,
                                       {'EXIT',<0.7168.15>,
                                        {rebalance_failed,
                                         {service_error,
                                          <<"eventing rebalance hasn't made progress for past 600 secs">>}}}}}.
        Rebalance Operation Id = 7a8bbb90eeba632e2760f61ddafc14e9 

      • Another attempt to remove -0002 node after ~14 minutes. 

        [user:info,2022-03-23T15:13:57.471Z,ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:<0.847.0>:ns_orchestrator:idle:804]Starting rebalance, KeepNodes = ['ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc',
                                         'ns_1@test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc'], EjectNodes = ['ns_1@test-couchbase-j97b4-0002.test-couchbase-j97b4.test-hn7fr.svc'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = a0f31259a260b87cdd182e5257fee4e0 

      • Unfortunately,we do not see any relevant info about the last rebalance in the ns_server.info logs. Same applies for ns_server.error logs which just spews out a bunch of econn errors.
      • Additionally, the last lines of memcached logs are filled with "DCP name limit is 200 chars" warnings although none of the node names exceed 200 chars. 

        2022-03-23T15:20:53.682572+00:00 WARNING 4: Invalid format specified for "DCP_OPEN" - Status: "Invalid arguments" - Closing connection. Packet:[{"bodylen":216,"cas":0,"datatype":"raw","extlen":8,"keylen":208,"magic":"ClientRequest","opaque":16822206,"opcode":"DCP_OPEN","vbucket":0}] Reason:"Dcp name limit is 200 characters" 

        so it may seem like we are providing erroneous DCP client name but in that case why would the test pass in remaining 30-40% of regression runs. 

      • The babysitter logs also has the same message in their logs: 

        [ns_server:info,2022-03-23T15:21:44.338Z,babysitter_of_ns_1@cb.local:<0.169.0>:ns_port_server:log:224]memcached<0.169.0>: 2022-03-23T15:21:44.137461+00:00 WARNING 81: Invalid format specified for "DCP_OPEN" - Status: "Invalid arguments" - Closing connection. Packet:[{"bodylen":216,"cas":0,"datatype":"raw","extlen":8,"keylen":208,"magic":"ClientRequest","opaque":16822206,"opcode":"DCP_OPEN","vbucket":0}] Reason:"Dcp name limit is 200 characters"  

      • Finally coming to eventing logs, we see a lot of repeated err message: dcp.invalidBucket 

         2022-03-23T15:22:02.745+00:00 [Info] Consumer::vbTakeoverCallback [worker_test_1:/tmp/127.0.0.1:8091_1_4294559010.sock:547] vb: 230 vbTakeover request, msg: vbucket is owned by another worker on same node
        2022-03-23T15:22:02.756+00:00 [Info] Connecting to KV on non-TLS port. 
        2022-03-23T15:22:02.761+00:00 [Info] DCPT[eventing:tbahHeMq-5140:{eventing:tbahHeMq-5139:test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:8096_test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc:11210_worker_test_2_GetSeqNos}/0] ##abcd feed started ...
        2022-03-23T15:22:02.761+00:00 [Info] DCPT[eventing:tbahHeMq-5140:{eventing:tbahHeMq-5139:test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:8096_test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc:11210_worker_test_2_GetSeqNos}/0] EOF received
        2022-03-23T15:22:02.761+00:00 [Error] DCPT[eventing:tbahHeMq-5140:{eventing:tbahHeMq-5139:test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:8096_test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc:11210_worker_test_2_GetSeqNos}/0] ##abcd doDcpOpen response status EINVAL
        2022-03-23T15:22:02.761+00:00 [Info] DCPT[eventing:tbahHeMq-5140:{eventing:tbahHeMq-5139:test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:8096_test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc:11210_worker_test_2_GetSeqNos}/0] ##abcd ... stopped
        2022-03-23T15:22:02.761+00:00 [Error] DCP[{eventing:tbahHeMq-5139:test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:8096_test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc:11210_worker_test_2_GetSeqNos}] ##abcd Bucket::StartDcpFeedOver : error dcp.invalidFeed in connectToNodes
        2022-03-23T15:22:02.761+00:00 [Error] Consumer::populateDcpFeedVbEntriesCallback [worker_test_2:/tmp/127.0.0.1:8091_2_4294559010.sock:568] Failed to start dcp feed, err: dcp.invalidBucket

      • But this seems like a logging response to DCP client rather than the actual cause itself.
      • Another log message which seemed interesting: 

        2022-03-23T15:22:02.290+00:00 [Info] ServiceMgr::getRebalanceProgress Function: test rebalance progress from node with rest port: 8091 progress: &{114 284 398 341 <nil>} err: <nil> 

      Hence, I would like eventing-devs to take a look as well. Please let me know if you need anything from me.

      Also, did anything change specifically in v6.6.5 when compared to v6.6.4 or v7.0.3?

       

       

      P.S.: If you are interested, we see similar Rebalance stuck situation when we try to scale down a 3 node cluster by removing 1 node at a time which gets replaced by the like node in Operator and Rebalance kicks in. Again, same situation as the scenario described above: passes if retried for 2-3 times in isolation. (Different Operator run, cluster logs: cbopinfo-test-skvwt*)

       

       

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              prateek.kumar Prateek Kumar (Inactive)
              prateek.kumar Prateek Kumar (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty