
Details

Type: Bug
Resolution: Duplicate
Priority: Major
Fix Version/s: None
Affects Version/s: 6.6.5
Component/s: eventing
Labels: operator
Environment: Couchbase Server Version 6.6.5
Triage: Untriaged
Story Points: 1
Is this a Regression?: Unknown

Description

The following scenario is executed in an Operator setup on an Azure Kubernetes cluster. The issue is also seen intermittently on an EKS cluster, so it is most likely cloud agnostic.

Sequence of events:

Create a 1-node CB cluster running Server v6.6.5 (GA build) with the kv, index, query, and eventing services.
Create 3 buckets on the cluster: source, destination, and metadata. Each bucket is allocated 100MB with Replicas=1 and high I/O priority.

Deploy the sample eventing function, with the buckets mentioned above taking the roles their names suggest. The eventing function is deployed with a POST request, specifically using https://pkg.go.dev/net/http#NewRequest with the request type set to POST, hostURL set to
`"http://test-couchbase-j97b4.test-hn7fr.svc:8096/api/v1/functions/test"` and the body defined as strings.NewReader(eventingFunc), where eventingFunc is:

eventingJSONFunc := `[{` +
        `"appname": "` + eventingFuncName + `",` +
        `"id": 0,` +
        `"depcfg":{"buckets":[{"alias":"dst_bucket","bucket_name":"` + dstBucketName + `"}],"metadata_bucket":"` + metaBucketName + `","source_bucket":"` + srcBucketName + `"},` +
        `"version":"", "handleruuid":0,` +
        `"settings": {"dcp_stream_boundary":"everything","deadline_timeout":62,"deployment_status":true,"description":"","execution_timeout":60,"log_level":"INFO","processing_status":true,"user_prefix":"eventing","worker_count":3},` +
        `"using_doc_timer": false,` +
        `"appcode": "` + jsFunc + `"` +
        `}]`

`function OnUpdate(doc, meta) {\n    var doc_id = meta.id;\n    dst_bucket[doc_id] = \"test value\";\n}\nfunction OnDelete(meta) {\n  delete dst_bucket[meta.id];\n}`
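As a possible alternative to concatenating the payload by hand, the same JSON can be produced with encoding/json, which handles the escaping of the JavaScript handler automatically. The sketch below is only an illustration: the Go type and function names (bucketAlias, depCfg, eventingFunction, buildEventingPayload, decodePayload, newDeployRequest) are hypothetical, the JSON keys and setting values are copied from the payload above, and authentication is deliberately left out because the ticket does not describe it.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// The JSON keys mirror the payload quoted in the ticket; the Go type
// names are hypothetical and exist only for this sketch.
type bucketAlias struct {
	Alias      string `json:"alias"`
	BucketName string `json:"bucket_name"`
}

type depCfg struct {
	Buckets        []bucketAlias `json:"buckets"`
	MetadataBucket string        `json:"metadata_bucket"`
	SourceBucket   string        `json:"source_bucket"`
}

type eventingFunction struct {
	AppName       string                 `json:"appname"`
	ID            int                    `json:"id"`
	DepCfg        depCfg                 `json:"depcfg"`
	Version       string                 `json:"version"`
	HandlerUUID   int                    `json:"handleruuid"`
	Settings      map[string]interface{} `json:"settings"`
	UsingDocTimer bool                   `json:"using_doc_timer"`
	AppCode       string                 `json:"appcode"`
}

// buildEventingPayload returns the one-element JSON array used as the POST body.
func buildEventingPayload(name, src, dst, meta, jsCode string) (string, error) {
	fn := eventingFunction{
		AppName: name,
		DepCfg: depCfg{
			Buckets:        []bucketAlias{{Alias: "dst_bucket", BucketName: dst}},
			MetadataBucket: meta,
			SourceBucket:   src,
		},
		Settings: map[string]interface{}{
			"dcp_stream_boundary": "everything",
			"deadline_timeout":    62,
			"deployment_status":   true,
			"description":         "",
			"execution_timeout":   60,
			"log_level":           "INFO",
			"processing_status":   true,
			"user_prefix":         "eventing",
			"worker_count":        3,
		},
		AppCode: jsCode, // json.Marshal escapes quotes and newlines
	}
	b, err := json.Marshal([]eventingFunction{fn})
	if err != nil {
		return "", err
	}
	return string(b), nil
}

// decodePayload parses a payload back, which is handy for sanity checks.
func decodePayload(payload string) ([]eventingFunction, error) {
	var fns []eventingFunction
	err := json.Unmarshal([]byte(payload), &fns)
	return fns, err
}

// newDeployRequest builds (but does not send) the POST request.
func newDeployRequest(hostURL, payload string) (*http.Request, error) {
	return http.NewRequest(http.MethodPost, hostURL, strings.NewReader(payload))
}

func main() {
	js := "function OnUpdate(doc, meta) {\n    var doc_id = meta.id;\n    dst_bucket[doc_id] = \"test value\";\n}\nfunction OnDelete(meta) {\n  delete dst_bucket[meta.id];\n}"
	payload, err := buildEventingPayload("test", "source", "destination", "metadata", js)
	if err != nil {
		panic(err)
	}
	req, err := newDeployRequest("http://test-couchbase-j97b4.test-hn7fr.svc:8096/api/v1/functions/test", payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```

The request is only constructed here, not sent, so the sketch can be run without a cluster.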

Start loading the source bucket continuously using the built-in pillowfight binary of CB v6.6.5, with max items set to 100k and a batch size of 1.
While the data load runs in parallel, start the scaling operations on the cluster:
- First, scale the cluster to 2 nodes with the same set of services and wait for Rebalance to complete. Rebalance completes successfully.
- Then scale the cluster to 3 nodes with the same set of services and wait for Rebalance to complete. Rebalance completes successfully.
- Finally, scale the cluster down to 2 nodes and wait for Rebalance to complete. Rebalance is stuck for ~600 seconds, at which point it fails. Another Rebalance attempt is made, but it is stuck for more than ~10 minutes, at which point the test exits due to its 20-minute timeout.

Since Rebalance is stuck with 3 nodes in the cluster, the cbinfo logs for all 3 nodes are attached (cbinfo-test-hn7fr*). I've also attached the operator logs in case they are needed: cbopinfo-20220323T151915+0000.tar.gz

However, for the sake of transparency, we do not see Rebalance stuck every time we run this scenario. I'd say we observe it about 60-70% of the time in our regression suite, and the scenario passes within another couple of retries if run in isolation. We do not see this level of instability with 6.6.4 or 7.0.3, where the failure rate on the first try is about 5-10%.

From a rudimentary look at the logs, it seems we are hitting a condition where vbuckets are not being shuffled successfully by the worker threads of the eventing nodes during rebalance, specifically when there are 3 eventing nodes in the cluster in v6.6.5.

Probable reason:

I started by looking at ns_server.info.

First Rebalance starts, test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc added to the cluster.

 [user:info,2022-03-23T14:49:56.024Z,ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:<0.847.0>:ns_orchestrator:idle:804]Starting rebalance, KeepNodes = ['ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc',                                  'ns_1@test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 09ae8b51e22e6a2dea5f4fc1c1bfc6d5 [rebalance

Rebalance completes.

[user:info,2022-03-23T14:52:50.972Z,ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:<0.847.0>:ns_orchestrator:log_rebalance_completion:1460]Rebalance completed successfully.

Rebalance Operation Id = 09ae8b51e22e6a2dea5f4fc1c1bfc6d5

Second Rebalance starts, test-couchbase-j97b4-0002.test-couchbase-j97b4.test-hn7fr.svc added to the cluster.

[user:info,2022-03-23T14:53:48.016Z,ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:<0.847.0>:ns_orchestrator:idle:804]Starting rebalance, KeepNodes = ['ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc',

                                 'ns_1@test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc',

                                 'ns_1@test-couchbase-j97b4-0002.test-couchbase-j97b4.test-hn7fr.svc'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 2abced7839bb114f021496087bc56c60

Rebalance completes.

[user:info,2022-03-23T14:58:53.962Z,ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:<0.847.0>:ns_orchestrator:log_rebalance_completion:1460]Rebalance completed successfully.

Rebalance Operation Id = 2abced7839bb114f021496087bc56c60

Third Rebalance starts, test-couchbase-j97b4-0002.test-couchbase-j97b4.test-hn7fr.svc being removed from the cluster.

[user:info,2022-03-23T14:59:43.386Z,ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:<0.847.0>:ns_orchestrator:idle:804]Starting rebalance, KeepNodes = ['ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc',

                                 'ns_1@test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc'], EjectNodes = ['ns_1@test-couchbase-j97b4-0002.test-couchbase-j97b4.test-hn7fr.svc'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = 7a8bbb90eeba632e2760f61ddafc14e9

Rebalance exited due to no progress.

[user:error,2022-03-23T15:13:37.573Z,ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:<0.847.0>:ns_orchestrator:log_rebalance_completion:1460]Rebalance exited with reason {service_rebalance_failed,eventing,

                              {worker_died,

                               {'EXIT',<0.7168.15>,

                                {rebalance_failed,

                                 {service_error,

                                  <<"eventing rebalance hasn't made progress for past 600 secs">>}}}}}.

Rebalance Operation Id = 7a8bbb90eeba632e2760f61ddafc14e9

Another attempt to remove the -0002 node starts ~14 minutes after the first one.

[user:info,2022-03-23T15:13:57.471Z,ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:<0.847.0>:ns_orchestrator:idle:804]Starting rebalance, KeepNodes = ['ns_1@test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc',

                                 'ns_1@test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc'], EjectNodes = ['ns_1@test-couchbase-j97b4-0002.test-couchbase-j97b4.test-hn7fr.svc'], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = a0f31259a260b87cdd182e5257fee4e0
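To make the timeline easier to compare across runs, the start and end timestamps quoted from ns_server.info above can be turned into durations. The following is a minimal sketch that uses only the timestamps already shown; the helper name elapsed is hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// elapsed returns the time between two ns_server log timestamps.
// Parsing with RFC3339Nano accepts the millisecond precision used in the logs.
func elapsed(start, end string) (time.Duration, error) {
	s, err := time.Parse(time.RFC3339Nano, start)
	if err != nil {
		return 0, err
	}
	e, err := time.Parse(time.RFC3339Nano, end)
	if err != nil {
		return 0, err
	}
	return e.Sub(s), nil
}

func main() {
	// Timestamps copied from the ns_server.info excerpts in this ticket.
	steps := []struct{ label, start, end string }{
		{"scale 1 -> 2 (completed)", "2022-03-23T14:49:56.024Z", "2022-03-23T14:52:50.972Z"},
		{"scale 2 -> 3 (completed)", "2022-03-23T14:53:48.016Z", "2022-03-23T14:58:53.962Z"},
		{"scale 3 -> 2 (failed)", "2022-03-23T14:59:43.386Z", "2022-03-23T15:13:37.573Z"},
	}
	for _, st := range steps {
		d, err := elapsed(st.start, st.end)
		if err != nil {
			panic(err)
		}
		fmt.Printf("%-26s %s\n", st.label, d)
	}
}
```

The failed scale-down ran for roughly 14 minutes in total before ns_server reported the 600-second no-progress error, compared with about 3 and 5 minutes for the two successful scale-ups.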

Unfortunately, we do not see any relevant info about the last rebalance in the ns_server.info logs. The same applies to the ns_server.error logs, which just spew out a bunch of econn errors.

Additionally, the last lines of the memcached logs are filled with "Dcp name limit is 200 characters" warnings, although none of the node names exceed 200 characters.

2022-03-23T15:20:53.682572+00:00 WARNING 4: Invalid format specified for "DCP_OPEN" - Status: "Invalid arguments" - Closing connection. Packet:[{"bodylen":216,"cas":0,"datatype":"raw","extlen":8,"keylen":208,"magic":"ClientRequest","opaque":16822206,"opcode":"DCP_OPEN","vbucket":0}] Reason:"Dcp name limit is 200 characters"

So it may seem that we are providing an erroneous DCP client name, but in that case why would the test pass in the remaining 30-40% of regression runs?

The babysitter logs contain the same message:

[ns_server:info,2022-03-23T15:21:44.338Z,babysitter_of_ns_1@cb.local:<0.169.0>:ns_port_server:log:224]memcached<0.169.0>: 2022-03-23T15:21:44.137461+00:00 WARNING 81: Invalid format specified for "DCP_OPEN" - Status: "Invalid arguments" - Closing connection. Packet:[{"bodylen":216,"cas":0,"datatype":"raw","extlen":8,"keylen":208,"magic":"ClientRequest","opaque":16822206,"opcode":"DCP_OPEN","vbucket":0}] Reason:"Dcp name limit is 200 characters"
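As a quick sanity check on these warnings, the packet header embedded in the log line can be parsed and its keylen compared with the 200-character limit named in the Reason text. This is a sketch only: parsePacket and packetHeader are hypothetical names, and it assumes that the key of a DCP_OPEN request carries the connection name, which is what the Reason text suggests.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// dcpNameLimit is taken from the Reason text in the warning.
const dcpNameLimit = 200

// sampleWarning is the memcached warning quoted in this ticket.
const sampleWarning = `2022-03-23T15:20:53.682572+00:00 WARNING 4: Invalid format specified for "DCP_OPEN" - Status: "Invalid arguments" - Closing connection. Packet:[{"bodylen":216,"cas":0,"datatype":"raw","extlen":8,"keylen":208,"magic":"ClientRequest","opaque":16822206,"opcode":"DCP_OPEN","vbucket":0}] Reason:"Dcp name limit is 200 characters"`

// packetHeader holds only the fields of interest; other keys are ignored.
type packetHeader struct {
	BodyLen int    `json:"bodylen"`
	ExtLen  int    `json:"extlen"`
	KeyLen  int    `json:"keylen"`
	Opcode  string `json:"opcode"`
}

// parsePacket extracts the JSON object that follows "Packet:[" in a log line.
func parsePacket(line string) (packetHeader, error) {
	var h packetHeader
	const marker = "Packet:["
	start := strings.Index(line, marker)
	if start < 0 {
		return h, fmt.Errorf("no packet found in line")
	}
	rest := line[start+len(marker):]
	end := strings.Index(rest, "]")
	if end < 0 {
		return h, fmt.Errorf("unterminated packet in line")
	}
	err := json.Unmarshal([]byte(rest[:end]), &h)
	return h, err
}

func main() {
	h, err := parsePacket(sampleWarning)
	if err != nil {
		panic(err)
	}
	fmt.Printf("opcode=%s keylen=%d limit=%d over=%v\n",
		h.Opcode, h.KeyLen, dcpNameLimit, h.KeyLen > dcpNameLimit)
}
```

For the quoted packet, keylen is 208 and bodylen (216) equals extlen (8) plus keylen, so under the assumption above the rejected name is 8 characters over the limit even though each individual node name is far shorter.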

Finally, in the eventing logs we see the error dcp.invalidBucket repeated many times:

 2022-03-23T15:22:02.745+00:00 [Info] Consumer::vbTakeoverCallback [worker_test_1:/tmp/127.0.0.1:8091_1_4294559010.sock:547] vb: 230 vbTakeover request, msg: vbucket is owned by another worker on same node

2022-03-23T15:22:02.756+00:00 [Info] Connecting to KV on non-TLS port.

2022-03-23T15:22:02.761+00:00 [Info] DCPT[eventing:tbahHeMq-5140:{eventing:tbahHeMq-5139:test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:8096_test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc:11210_worker_test_2_GetSeqNos}/0] ##abcd feed started ...

2022-03-23T15:22:02.761+00:00 [Info] DCPT[eventing:tbahHeMq-5140:{eventing:tbahHeMq-5139:test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:8096_test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc:11210_worker_test_2_GetSeqNos}/0] EOF received

2022-03-23T15:22:02.761+00:00 [Error] DCPT[eventing:tbahHeMq-5140:{eventing:tbahHeMq-5139:test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:8096_test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc:11210_worker_test_2_GetSeqNos}/0] ##abcd doDcpOpen response status EINVAL

2022-03-23T15:22:02.761+00:00 [Info] DCPT[eventing:tbahHeMq-5140:{eventing:tbahHeMq-5139:test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:8096_test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc:11210_worker_test_2_GetSeqNos}/0] ##abcd ... stopped

2022-03-23T15:22:02.761+00:00 [Error] DCP[{eventing:tbahHeMq-5139:test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:8096_test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc:11210_worker_test_2_GetSeqNos}] ##abcd Bucket::StartDcpFeedOver : error dcp.invalidFeed in connectToNodes

2022-03-23T15:22:02.761+00:00 [Error] Consumer::populateDcpFeedVbEntriesCallback [worker_test_2:/tmp/127.0.0.1:8091_2_4294559010.sock:568] Failed to start dcp feed, err: dcp.invalidBucket

But this looks like the DCP client logging the rejected request rather than the actual cause itself.
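One thing that can be checked directly from these lines is the length of the feed name printed between "DCPT[" and "]". The sketch below measures it; feedName is a hypothetical helper, and whether this logged string is byte-for-byte the name sent in DCP_OPEN is an assumption, although its length coincides with the keylen of 208 in the rejected packet.

```go
package main

import (
	"fmt"
	"strings"
)

// dcpNameLimit is the limit quoted in the memcached warning.
const dcpNameLimit = 200

// sampleLine is one of the eventing log lines quoted in this ticket.
const sampleLine = `2022-03-23T15:22:02.761+00:00 [Error] DCPT[eventing:tbahHeMq-5140:{eventing:tbahHeMq-5139:test-couchbase-j97b4-0000.test-couchbase-j97b4.test-hn7fr.svc:8096_test-couchbase-j97b4-0001.test-couchbase-j97b4.test-hn7fr.svc:11210_worker_test_2_GetSeqNos}/0] ##abcd doDcpOpen response status EINVAL`

// feedName returns the text between "DCPT[" and the next "]".
func feedName(line string) (string, bool) {
	const open = "DCPT["
	start := strings.Index(line, open)
	if start < 0 {
		return "", false
	}
	rest := line[start+len(open):]
	end := strings.Index(rest, "]")
	if end < 0 {
		return "", false
	}
	return rest[:end], true
}

func main() {
	name, ok := feedName(sampleLine)
	if !ok {
		panic("no feed name found")
	}
	fmt.Printf("feed name length: %d (limit %d)\n", len(name), dcpNameLimit)
}
```

The name embeds two fully qualified Kubernetes service host names of 61 characters each, which accounts for most of its 208 characters.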

Another log message which seemed interesting:

2022-03-23T15:22:02.290+00:00 [Info] ServiceMgr::getRebalanceProgress Function: test rebalance progress from node with rest port: 8091 progress: &{114 284 398 341 <nil>} err: <nil>

Hence, I would like the eventing devs to take a look as well. Please let me know if you need anything from me.

Also, did anything change specifically in v6.6.5 compared to v6.6.4 or v7.0.3?

P.S.: If you are interested, we see a similar stuck Rebalance when we scale down a 3-node cluster by removing 1 node at a time; the Operator replaces each removed node with a like node and Rebalance kicks in. It is the same situation as in the scenario described above: it passes if retried 2-3 times in isolation. (Different Operator run, cluster logs: cbinfo-test-skvwt*)

Attachments


cbinfo-test-hn7fr-test-couchbase-j97b4-0000-20220323T151915+0000.zip
39.77 MB
23/Mar/22 10:01 AM
cbinfo-test-hn7fr-test-couchbase-j97b4-0001-20220323T151915+0000.zip
19.77 MB
23/Mar/22 10:01 AM
cbinfo-test-hn7fr-test-couchbase-j97b4-0002-20220323T151915+0000.zip
11.83 MB
23/Mar/22 10:01 AM
cbinfo-test-skvwt-test-couchbase-9z77x-0003-20220323T170801+0000.zip
24.25 MB
23/Mar/22 10:22 AM
cbinfo-test-skvwt-test-couchbase-9z77x-0004-20220323T170801+0000.zip
16.33 MB
23/Mar/22 10:22 AM
cbinfo-test-skvwt-test-couchbase-9z77x-0005-20220323T170801+0000.zip
14.37 MB
23/Mar/22 10:22 AM
cbopinfo-20220323T151915+0000.tar.gz
237 kB
23/Mar/22 10:01 AM

Issue Links

is duplicated by

MB-51435 [BP 6.6.6 MB-42968] - Eventing Enabled Cluster Fails to Recover - long DCP connection names

Closed


Rebalance stuck for more than ~20 minutes with eventing service reporting: "Failed to start dcp feed".
