Details
- Bug
- Status: Closed
- Critical
- Resolution: Fixed
- Cheshire-Cat
- Enterprise Edition 7.0.0 build 5095
- Untriaged
- Centos 64-bit
- Snapshot: https://supportal.couchbase.com/snapshot/ff8cc1ad0a5b1fdc2c01fdd52cec4165::0
- cbcollect logs:
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-04T181638-ns_1%40172.23.105.175.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-04T181638-ns_1%40172.23.106.233.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-04T181638-ns_1%40172.23.106.236.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-04T181638-ns_1%40172.23.106.238.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-04T181638-ns_1%40172.23.106.250.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-04T181638-ns_1%40172.23.106.251.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-04T181638-ns_1%40172.23.107.43.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-04T181638-ns_1%40172.23.107.44.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-04T181638-ns_1%40172.23.107.45.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-04T181638-ns_1%40172.23.107.58.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-04T181638-ns_1%40172.23.121.74.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-04T181638-ns_1%40172.23.121.78.zip
- 1
- Unknown
Description
Build: 7.0.0-5095
Scenario:
- Cluster with index, eventing, fts, and backup services actively running
- 3 FTS indexes created (2 with indexPartition=1 and 1 index with indexPartition=6)
- Rebalance an extra FTS node into the cluster, along with a few other nodes running other services
Observation:
FTS rebalance failed because the fts worker died:
Rebalance exited with reason {service_rebalance_failed,fts,
                              {worker_died,
                               {'EXIT',<0.12316.37>,
                                {rebalance_failed,inactivity_timeout}}}}.
Rebalance Operation Id = b5ce1e972169e0377441008a937cb498
..
..
Starting rebalance, KeepNodes = ['ns_1@172.23.105.175','ns_1@172.23.106.233',
                                 'ns_1@172.23.106.236','ns_1@172.23.106.238',
                                 'ns_1@172.23.106.250','ns_1@172.23.106.251',
                                 'ns_1@172.23.107.43','ns_1@172.23.107.44',
                                 'ns_1@172.23.107.45','ns_1@172.23.107.58',
                                 'ns_1@172.23.121.74','ns_1@172.23.121.78'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = b5ce1e972169e0377441008a937cb49
Snapshot: https://supportal.couchbase.com/snapshot/ff8cc1ad0a5b1fdc2c01fdd52cec4165::0
Attachments
- cluster_overview.png
- 481 kB
- fts_indexes.png
- 464 kB
Issue Links
Activity
Per the screenshot, 172.23.106.251 and 172.23.107.58 appear to be the nodes hosting FTS in this cluster.
The rebalancer seems to be stuck waiting on a stats sample grab for one of the pindexes, `hotels_b48a9b6f1e7e824d_54820232`, for about 10 minutes (see the sketch after the log excerpt below):
2021-05-04T11:15:25.323-07:00 [INFO] rebalance: waitAssignPIndexDone, awaiting a stats sample grab for pindex hotels_b48a9b6f1e7e824d_54820232, formerPrimaryNode f2a3da64a4d6793b6ca03605e6f601fe
2021-05-04T11:15:26.321-07:00 [INFO] rebalance: waitAssignPIndexDone, awaiting a stats sample grab for pindex hotels_b48a9b6f1e7e824d_54820232, formerPrimaryNode f2a3da64a4d6793b6ca03605e6f601fe
2021-05-04T11:15:26.323-07:00 [INFO] rebalance: waitAssignPIndexDone, awaiting a stats sample grab for pindex hotels_b48a9b6f1e7e824d_54820232, formerPrimaryNode f2a3da64a4d6793b6ca03605e6f601fe
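As a rough illustration of the failure mode, here is a hypothetical Go sketch (not the actual cbgt rebalance code; grabStatsSample and the one-second retry interval are assumptions) of the poll-with-deadline loop the messages above imply: the orchestrator keeps asking for a stats sample, and if none arrives within the inactivity window, the worker gives up and ns_server reports {rebalance_failed,inactivity_timeout}.

```go
// Hypothetical sketch, not the actual cbgt rebalance code.
package sketch

import (
	"errors"
	"log"
	"time"
)

var errInactivityTimeout = errors.New("rebalance_failed: inactivity_timeout")

// grabStatsSample stands in for the call that fetches a current
// stats sample for a pindex from the node hosting it (assumption).
type grabStatsSample func(pindex string) bool

func waitForStatsSample(pindex string, grab grabStatsSample,
	inactivityTimeout time.Duration) error {
	deadline := time.Now().Add(inactivityTimeout)
	for time.Now().Before(deadline) {
		if grab(pindex) {
			return nil // a sample arrived; the pindex is making progress
		}
		log.Printf("rebalance: awaiting a stats sample grab for pindex %s", pindex)
		time.Sleep(time.Second) // retry roughly once a second, as in the logs
	}
	// No sample for the whole window: the rebalance worker gives up,
	// which ns_server surfaces as {rebalance_failed,inactivity_timeout}.
	return errInactivityTimeout
}
```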
This ticket looks quite similar to MB-44725 (which has been marked as "cannot reproduce").
The last commit for MB-44725 was to increase logging in the area ..
commit 82a3dd83e5182a6dc682c17ad661342c68b779d5
Author: Sreekanth Sivasankaran <sreekanth.sivasankaran@couchbase.com>
Date:   Wed Apr 7 13:37:03 2021 +0530

    MB-44725 - FTS rebalance_failed,inactivity_timeout

    Adding more logs and err handling in the
    pindex restart paths.

    Change-Id: Iaff4419685623cd258ccca71d973b92263bce8d7
    Reviewed-on: http://review.couchbase.org/c/cbgt/+/150572
    Well-Formed: Build Bot <build@couchbase.com>
    Reviewed-by: Abhinav Dangeti <abhinav@couchbase.com>
    Tested-by: Sreekanth Sivasankaran <sreekanth.sivasankaran@couchbase.com>
The log messages now do confirm that rollback is indeed restarting the pindexes - so that's good.
2021-05-04T11:05:13.444-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_f4e0a48a
2021-05-04T11:05:13.448-07:00 [INFO] pindex: restartPIndex starts for pindex: routes_46d6c63966e9ce0b_4c1c5584
2021-05-04T11:05:13.448-07:00 [INFO] pindex: restartPIndex starts for pindex: flights_24af63d4d95b77ef_4c1c5584
2021-05-04T11:05:13.483-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_6ddbfb54
2021-05-04T11:05:13.483-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
2021-05-04T11:05:13.486-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_18572d87
2021-05-04T11:05:17.421-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
2021-05-04T11:05:17.476-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_6ddbfb54
2021-05-04T11:05:17.853-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
2021-05-04T11:05:18.209-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_6ddbfb54
2021-05-04T11:05:18.250-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
2021-05-04T11:05:18.721-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_6ddbfb54
2021-05-04T11:05:18.863-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
2021-05-04T11:05:19.110-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_6ddbfb54
2021-05-04T11:05:19.246-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
2021-05-04T11:05:19.367-07:00 [INFO] pindex: restartPIndex starts for pindex: flights_24af63d4d95b77ef_4c1c5584
2021-05-04T11:05:19.370-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_18572d87
2021-05-04T11:05:19.413-07:00 [INFO] pindex: restartPIndex starts for pindex: routes_46d6c63966e9ce0b_4c1c5584
2021-05-04T11:05:19.414-07:00 [INFO] pindex: restartPIndex starts for pindex: routes_46d6c63966e9ce0b_4c1c5584
2021-05-04T11:05:19.610-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_6ddbfb54
2021-05-04T11:05:20.650-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_f4e0a48a
2021-05-04T11:05:20.680-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
2021-05-04T11:05:21.201-07:00 [INFO] pindex: restartPIndex starts for pindex: flights_24af63d4d95b77ef_4c1c5584
2021-05-04T11:05:21.753-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_6ddbfb54
2021-05-04T11:05:22.277-07:00 [INFO] pindex: restartPIndex starts for pindex: routes_46d6c63966e9ce0b_4c1c5584
As for the rebalancer that was stuck between 2021-05-04T11:05:25.400-07:00 and 2021-05-04T11:15:26.323-07:00 while obtaining the stats sample for pindex `hotels_b48a9b6f1e7e824d_54820232`: Sreekanth Sivasankaran, would you take a look at the logs from node 172.23.106.251?
Taking a closer look at the erring partition hotels_b48a9b6f1e7e824d_54820232 at timestamp 2021-05-04T11:05:20:
the partition was rolled back and closed, and all subsequent rollback requests were failing with the "bleve: BleveDest already closed" error, as seen in the logs below (a sketch of this behavior follows the log excerpt).
2021-05-04T11:05:20.679-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_54820232] Received rollback, for vb: 442, seqno requested: 583
2021-05-04T11:05:20.679-07:00 [INFO] bleve: dest rollback, partition: 442, key: "<ud></ud>", seq: 0, err: <nil>
2021-05-04T11:05:20.679-07:00 [INFO] pindex_bleve: batchWorker stopped for `hotels_b48a9b6f1e7e824d_54820232`
2021-05-04T11:05:20.679-07:00 [INFO] pindex_bleve: batchWorker stopped for `hotels_b48a9b6f1e7e824d_54820232`
2021-05-04T11:05:20.679-07:00 [INFO] pindex_bleve: batchWorker stopped for `hotels_b48a9b6f1e7e824d_54820232`
2021-05-04T11:05:20.679-07:00 [INFO] pindex_bleve: batchWorker stopped for `hotels_b48a9b6f1e7e824d_54820232`

2021-05-04T11:05:20.679-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_f4e0a48a] Received rollback, for vb: 233, seqno requested: 597
2021-05-04T11:05:20.679-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_54820232] Received rollback, for vb: 443, seqno requested: 658
2021-05-04T11:05:20.679-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_f4e0a48a] Received rollback, for vb: 230, seqno requested: 607

2021-05-04T11:05:20.679-07:00 [INFO] pindex_bleve_scorch_rollback: path: /opt/couchbase/var/lib/couchbase/data/@fts/hotels_b48a9b6f1e7e824d_54820232.pindex, totSnapshotExamined: 0
2021-05-04T11:05:20.679-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_f4e0a48a] Received rollback, for vb: 232, seqno requested: 571
2021-05-04T11:05:20.679-07:00 [INFO] pindex_bleve_rollback: path: /opt/couchbase/var/lib/couchbase/data/@fts/hotels_b48a9b6f1e7e824d_54820232.pindex, wasClosed: true, wasPartial: false, err: <nil>

2021-05-04T11:05:20.679-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_f4e0a48a] Received rollback, for vb: 225, seqno requested: 535
2021-05-04T11:05:20.680-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_f4e0a48a] Received rollback, for vb: 229, seqno requested: 256
2021-05-04T11:05:20.680-07:00 [INFO] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_54820232] Received rollback, for vb: 441, seqno requested: 317
2021-05-04T11:05:20.680-07:00 [WARN] feed_dcp_gocbcore: [hotels_b48a9b6f1e7e824d_f4e0a48a] Rollback to seqno: 0, vbuuid: 0 for vb: 233, failed with err: bleve: BleveDest already closed -- cbgt.(*GocbcoreDCPFeed).rollback() at feed_dcp_gocbcore.go:1524
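For clarity, here is a simplified, hypothetical sketch (not the actual cbft BleveDest implementation) of a destination guarded by a closed flag: once Close() has run, every later Rollback() returns the "already closed" error seen in the warnings above, until the pindex is restarted with a fresh destination.

```go
// Hypothetical, simplified sketch; not the actual cbft BleveDest code.
package sketch

import (
	"errors"
	"sync"
)

var errAlreadyClosed = errors.New("bleve: BleveDest already closed")

type dest struct {
	m      sync.Mutex
	closed bool
}

// Close marks the destination unusable; in the scenario above this
// happens when the partition is rolled back and torn down.
func (d *dest) Close() {
	d.m.Lock()
	defer d.m.Unlock()
	d.closed = true
}

// Rollback refuses to do any work once the destination is closed;
// only a successful pindex restart creates a usable destination again.
func (d *dest) Rollback(vb uint16, toSeq uint64) error {
	d.m.Lock()
	defer d.m.Unlock()
	if d.closed {
		return errAlreadyClosed
	}
	// ... roll the partition's index data back to toSeq ...
	return nil
}
```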
And it subsequently tries to restart the partition.
2021-05-04T11:05:20.680-07:00 [INFO] pindex: restartPIndex starts for pindex: hotels_b48a9b6f1e7e824d_54820232
But the restart request for this partition never made it into the janitor work loop, as the janitor was busy processing other ongoing requests. This is confirmed by the logs as well.
The last restart attempt that successfully reached the janitor was at:
2021-05-04T11:05:20.633-07:00 [INFO] janitor: awakes, op: janitor_close_pindex, msg: api-ClosePIndex:hotels_b48a9b6f1e7e824d_6ddbfb54
2021-05-04T11:05:20.678-07:00 [INFO] janitor: awakes, op: kick, msg: restart-pindex
2021-05-04T11:05:21.500-07:00 [INFO] janitor: awakes, op: kick, msg: restart-pindex
So the bleve partition is closed, but the pindex is still registered with the manager due to the unsuccessful restart attempt.
And the situation persists until the rebalance, which starts at:
2021-05-04T11:05:25.284-07:00 [INFO] ctl/manager: StartTopologyChange, started
This causes the rebalance stats monitoring endpoint to fail later, and eventually the rebalance failure as well.
Checking further ..
These sorts of issues ought to self-heal once the janitor advances through its task list. But the janitor work loop is deadlocked here, as we don't see any further kick wake-up messages in the logs.
The last janitor kick processed was at:
2021-05-04T11:05:21.500-07:00 [INFO] janitor: awakes, op: kick, msg: restart-pindex
2021-05-04T11:05:21.583-07:00 [INFO] janitor: pindexes to remove: 0
2021-05-04T11:05:21.583-07:00 [INFO] janitor: pindexes to add: 0
2021-05-04T11:05:21.583-07:00 [INFO] janitor: pindexes to restart: 0
2021-05-04T11:05:21.583-07:00 [INFO] janitor: feeds to remove: 0
2021-05-04T11:05:21.583-07:00 [INFO] janitor: feeds to add: 1
2021-05-04T11:05:21.583-07:00 [INFO] flights_24af63d4d95b77ef_4c1c5584
A deadlocked janitor was blocking all further progress here.
My analysis is that the deadlock recently introduced with the fix for MB-45989 is causing this; see the sketch below for the general pattern.
Let me merge the latest fix and get this retested before more tickets pile up.
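For reference, a minimal hypothetical sketch of the self-kick deadlock pattern in question (the channel and method names here are assumptions, not cbgt's actual ones):

```go
// Hypothetical illustration, not the actual cbgt janitor: a single
// work-loop goroutine that services kicks from an unbuffered channel
// must never send ("kick") into that same channel from inside the
// loop. The send can only complete when the loop reads again, but the
// loop is the goroutine doing the send, so it blocks forever and
// every later kick queues up behind it.
package sketch

type janitor struct {
	kickCh chan string // unbuffered in this sketch
}

func (j *janitor) loop() {
	for msg := range j.kickCh {
		j.handle(msg)
	}
}

func (j *janitor) handle(msg string) {
	// ... compare planned vs. current pindexes/feeds, add/remove/restart ...

	// BUG (self-deadlock): kicking ourselves from within the loop.
	j.kickCh <- "restart-pindex"
}
```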
Thanks Sreekanth..
Just an update: able to hit the same issue while rebalancing out the FTS node.
+----------------+-------------+-----------------------+---------------+--------------+
| Nodes          | Services    | Version               | CPU           | Status       |
+----------------+-------------+-----------------------+---------------+--------------+
| 172.23.107.58  | eventing    | 7.0.0-5095-enterprise | 68.0274805126 | Cluster node |
| 172.23.107.44  | fts         | 7.0.0-5095-enterprise | 2.35664776307 | Cluster node |
| 172.23.105.175 | kv          | 7.0.0-5095-enterprise | 83.7956961369 | Cluster node |
| 172.23.107.45  | fts         | 7.0.0-5095-enterprise | 2.54600453743 | --- OUT ---> |
| 172.23.107.47  | kv          | 7.0.0-5095-enterprise | 84.332764729  | Cluster node |
| 172.23.107.84  | eventing    | 7.0.0-5095-enterprise | 59.6821976838 | Cluster node |
| 172.23.121.78  | cbas        | 7.0.0-5095-enterprise | 35.1969676459 | Cluster node |
| 172.23.106.250 | index, n1ql | 7.0.0-5095-enterprise | 18.1203515889 | Cluster node |
| 172.23.107.88  | index, n1ql | 7.0.0-5095-enterprise | 3.36781029263 | Cluster node |
| 172.23.107.43  | eventing    | 7.0.0-5095-enterprise | 63.9989329065 | Cluster node |
| 172.23.106.236 | kv          | 7.0.0-5095-enterprise | 86.4183158712 | Cluster node |
| 172.23.107.85  | index, n1ql | 7.0.0-5095-enterprise | 26.646389844  | Cluster node |
| 172.23.106.251 | backup      | 7.0.0-5095-enterprise | 2.22697533971 | Cluster node |
| 172.23.106.233 | kv          | 7.0.0-5095-enterprise | 87.605560608  | Cluster node |
| 172.23.106.238 | index, n1ql | 7.0.0-5095-enterprise | 3.05016385178 | Cluster node |
| 172.23.107.54  | kv          | 7.0.0-5095-enterprise | 85.4177545692 | Cluster node |
| 172.23.107.78  | cbas        | 7.0.0-5095-enterprise | 33.6209220814 | Cluster node |
| 172.23.121.74  | cbas        | 7.0.0-5095-enterprise | 38.1596571122 | Cluster node |
| 172.23.107.107 | fts         | 7.0.0-5095-enterprise | 2.31650509883 | Cluster node |
| 172.23.107.91  | kv          | 7.0.0-5095-enterprise | 73.7971478075 | Cluster node |
+----------------+-------------+-----------------------+---------------+--------------+
Snapshot: http://supportal.couchbase.com/snapshot/73a3f4bbca65dfdfd545c1c60852951b::0
cbcollect logs:
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.105.175.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.106.233.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.106.236.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.106.238.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.106.250.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.106.251.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.107.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.43.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.44.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.45.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.47.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.54.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.58.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.78.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.84.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.85.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.88.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.107.91.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.121.74.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_failed/collectinfo-2021-05-05T073833-ns_1%40172.23.121.78.zip
Builds 7.0.0-5103+ have the fix for MB-45989 ..
Build couchbase-server-7.0.0-5103 contains cbgt commit a8cb699 with commit message:
MB-45989: Kicking the janitor from within the JanitorLoop is prohibited
Still seeing the issue on 7.0.0-5103 during rebalance-out of FTS node (172.23.107.91)
+----------------+-------------+-----------------------+---------------+--------------+
| Nodes          | Services    | Version               | CPU           | Status       |
+----------------+-------------+-----------------------+---------------+--------------+
| 172.23.107.58  | eventing    | 7.0.0-5103-enterprise | 66.2251655629 | Cluster node |
| 172.23.107.44  | eventing    | 7.0.0-5103-enterprise | 67.2144182348 | Cluster node |
| 172.23.105.175 | kv          | 7.0.0-5103-enterprise | 85.5087627518 | Cluster node |
| 172.23.107.45  | index, n1ql | 7.0.0-5103-enterprise | 2.42401406682 | Cluster node |
| 172.23.107.47  | index, n1ql | 7.0.0-5103-enterprise | 23.6120312707 | Cluster node |
| 172.23.107.84  | kv          | 7.0.0-5103-enterprise | 83.7305699482 | Cluster node |
| 172.23.121.78  | index, n1ql | 7.0.0-5103-enterprise | 20.5553413035 | Cluster node |
| 172.23.106.250 | cbas        | 7.0.0-5103-enterprise | 33.8083075682 | Cluster node |
| 172.23.107.88  | kv          | 7.0.0-5103-enterprise | 82.7855008445 | Cluster node |
| 172.23.107.43  | cbas        | 7.0.0-5103-enterprise | 33.1234088168 | Cluster node |
| 172.23.106.236 | kv          | 7.0.0-5103-enterprise | 86.6442345024 | Cluster node |
| 172.23.107.85  | fts         | 7.0.0-5103-enterprise | 2.82627810577 | Cluster node |
| 172.23.106.251 | backup      | 7.0.0-5103-enterprise | 2.52006018054 | Cluster node |
| 172.23.106.233 | kv          | 7.0.0-5103-enterprise | 85.7810673938 | Cluster node |
| 172.23.106.238 | cbas        | 7.0.0-5103-enterprise | 40.3115348462 | Cluster node |
| 172.23.107.54  | index, n1ql | 7.0.0-5103-enterprise | 2.32499685811 | Cluster node |
| 172.23.107.78  | kv          | 7.0.0-5103-enterprise | 85.9544749095 | Cluster node |
| 172.23.121.74  | eventing    | 7.0.0-5103-enterprise | 70.3625576796 | Cluster node |
| 172.23.107.107 | fts         | 7.0.0-5103-enterprise | 3.03916581892 | Cluster node |
| 172.23.107.91  | fts         | 7.0.0-5103-enterprise | 16.6752114842 | --- OUT ---> |
+----------------+-------------+-----------------------+---------------+--------------+
Rebalance exited with reason {service_rebalance_failed,fts,
                              {worker_died, {'EXIT',<0.25658.83>, {rebalance_failed,inactivity_timeout}}}}.
Rebalance Operation Id = 36814e67c68dbb5851868791a2c40c4f
Snapshot: http://supportal.couchbase.com/snapshot/b8190367e0e24aa99eff3e294d23a2be::0
Cbcollect logs:
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.106.233.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.106.236.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.106.238.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.106.250.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.106.251.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.107.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.43.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.44.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.45.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.47.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.54.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.58.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.78.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.84.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.85.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.88.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.107.91.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.121.74.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.121.78.zip
https://cb-engineering.s3.amazonaws.com/fts_rebalance_stuck_5103/collectinfo-2021-05-05T140951-ns_1%40172.23.105.175.zip
Abhinav Dangeti, the notify-manager call is making the feed establishment fall into a recursive infinite loop here. We need to revisit the manager-kick fix (reverting would also be an option, else it may trigger a lot of other issues).
The FTS nodes are 85, 91 and 107, and the error is seen on 91 (it's the same rollback followed by the feed-erring loop); see the sketch after the log excerpt below.
2021-05-05T05:54:06.285-07:00 [WARN] feed_dcp_gocbcore: [a1_users_377bff4da3dcab77_4c1c5584] Rollback to seqno: 0, vbuuid: 0 for vb: 1023, failed with err: bleve: BleveDest already closed -- cbgt.(*GocbcoreDCPFeed).rollback() at feed_dcp_gocbcore.go:1533
2021-05-05T05:54:06.286-07:00 [INFO] feed_dcp_gocbcore: releaseAgent, ref count decremented for DCPagent (key: travel-sample:969c10b537adcc0a20800cf3d619a7c0, agent: 0xc008b7f500, ref count: 3, number of agents for key: 1)
2021-05-05T05:54:06.286-07:00 [INFO] feed_dcp_gocbcore: Close, name: a4_users_562107192a6809c6_4c1c5584, notify manager
2021-05-05T05:54:06.289-07:00 [INFO] feed_dcp_gocbcore: releaseAgent, ref count decremented for DCPagent (key: travel-sample:969c10b537adcc0a20800cf3d619a7c0, agent: 0xc008b7f500, ref count: 2, number of agents for key: 1)
2021-05-05T05:54:06.289-07:00 [INFO] feed_dcp_gocbcore: Close, name: a1_users_377bff4da3dcab77_4c1c5584, notify manager
2021-05-05T05:54:06.320-07:00 [INFO] janitor: pindexes to remove: 0
2021-05-05T05:54:06.320-07:00 [INFO] janitor: pindexes to add: 0
2021-05-05T05:54:06.320-07:00 [INFO] janitor: pindexes to restart: 0
2021-05-05T05:54:06.320-07:00 [INFO] janitor: feeds to remove: 0
2021-05-05T05:54:06.320-07:00 [INFO] janitor: feeds to add: 2
2021-05-05T05:54:06.320-07:00 [INFO] a1_users_377bff4da3dcab77_4c1c5584
2021-05-05T05:54:06.320-07:00 [INFO] a4_users_562107192a6809c6_4c1c5584
2021-05-05T05:54:06.321-07:00 [WARN] janitor: JanitorOnce, err: janitor: JanitorOnce errors: 2, []string{"#0: janitor: adding feed, err: feed_dcp_gocbcore: StartGocbcoreDCPFeed, could not prepare DCP feed, server: http://127.0.0.1:8091, bucketName: travel-sample, indexName: a1_users, err: bleve: BleveDest already closed", "#1: janitor: adding feed, err: feed_dcp_gocbcore: StartGocbcoreDCPFeed, could not prepare DCP feed, server: http://127.0.0.1:8091, bucketName: travel-sample, indexName: a4_users, err: bleve: BleveDest already closed"} -- cbgt.(*Manager).JanitorLoop() at manager_janitor.go:97
2021-05-05T05:54:06.321-07:00 [INFO] janitor: awakes, op: kick, msg: gocbcore-feed
2021-05-05T05:54:06.332-07:00 [INFO] pindex: restartPIndex starts for pindex: a2_users_4389e9ba8dc93303_4c1c5584
2021-05-05T05:54:06.332-07:00 [WARN] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] Rollback to seqno: 0, vbuuid: 0 for vb: 1023, failed with err: bleve: BleveDest already closed -- cbgt.(*GocbcoreDCPFeed).rollback() at feed_dcp_gocbcore.go:1533
2021-05-05T05:54:06.402-07:00 [INFO] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] DCP stream [118] for vb: 257, closed due to `state changed`, reconnecting ...
2021-05-05T05:54:06.402-07:00 [INFO] feed_dcp_gocbcore: [routes_6fc715a37b5980d2_4c1c5584] DCP stream [116] for vb: 257, closed due to `state changed`, reconnecting ...
2021-05-05T05:54:06.410-07:00 [INFO] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] DCP stream [118] for vb: 339, closed due to `state changed`, reconnecting ...
2021-05-05T05:54:06.410-07:00 [INFO] feed_dcp_gocbcore: [routes_6fc715a37b5980d2_4c1c5584] DCP stream [116] for vb: 339, closed due to `state changed`, reconnecting ...
2021-05-05T05:54:06.410-07:00 [INFO] feed_dcp_gocbcore: [routes_6fc715a37b5980d2_4c1c5584] Received rollback, for vb: 257, seqno requested: 222457
2021-05-05T05:54:06.410-07:00 [WARN] feed_dcp_gocbcore: [routes_6fc715a37b5980d2_4c1c5584] Rollback to seqno: 0, vbuuid: 0 for vb: 257, failed with err: bleve: BleveDest already closed -- cbgt.(*GocbcoreDCPFeed).rollback() at feed_dcp_gocbcore.go:1533
2021-05-05T05:54:06.414-07:00 [INFO] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] Received rollback, for vb: 257, seqno requested: 234028
2021-05-05T05:54:06.414-07:00 [WARN] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] Rollback to seqno: 0, vbuuid: 0 for vb: 257, failed with err: bleve: BleveDest already closed -- cbgt.(*GocbcoreDCPFeed).rollback() at feed_dcp_gocbcore.go:1533
2021-05-05T05:54:06.438-07:00 [INFO] janitor: pindexes to remove: 0
2021-05-05T05:54:06.438-07:00 [INFO] janitor: pindexes to add: 0
2021-05-05T05:54:06.438-07:00 [INFO] janitor: pindexes to restart: 0
2021-05-05T05:54:06.438-07:00 [INFO] janitor: feeds to remove: 0
2021-05-05T05:54:06.438-07:00 [INFO] janitor: feeds to add: 2
2021-05-05T05:54:06.438-07:00 [INFO] a4_users_562107192a6809c6_4c1c5584
2021-05-05T05:54:06.438-07:00 [INFO] a1_users_377bff4da3dcab77_4c1c5584
2021-05-05T05:54:06.439-07:00 [WARN] janitor: JanitorOnce, err: janitor: JanitorOnce errors: 2, []string{"#0: janitor: adding feed, err: feed_dcp_gocbcore: StartGocbcoreDCPFeed, could not prepare DCP feed, server: http://127.0.0.1:8091, bucketName: travel-sample, indexName: a4_users, err: bleve: BleveDest already closed", "#1: janitor: adding feed, err: feed_dcp_gocbcore: StartGocbcoreDCPFeed, could not prepare DCP feed, server: http://127.0.0.1:8091, bucketName: travel-sample, indexName: a1_users, err: bleve: BleveDest already closed"} -- cbgt.(*Manager).JanitorLoop() at manager_janitor.go:97
2021-05-05T05:54:06.439-07:00 [INFO] janitor: awakes, op: kick, msg: gocbcore-feed
2021-05-05T05:54:06.486-07:00 [INFO] feed_dcp_gocbcore: [routes_6fc715a37b5980d2_4c1c5584] Received rollback, for vb: 339, seqno requested: 201006
2021-05-05T05:54:06.486-07:00 [WARN] feed_dcp_gocbcore: [routes_6fc715a37b5980d2_4c1c5584] Rollback to seqno: 0, vbuuid: 0 for vb: 339, failed with err: bleve: BleveDest already closed -- cbgt.(*GocbcoreDCPFeed).rollback() at feed_dcp_gocbcore.go:1533
2021-05-05T05:54:06.487-07:00 [INFO] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] Received rollback, for vb: 339, seqno requested: 211545
2021-05-05T05:54:06.487-07:00 [WARN] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] Rollback to seqno: 0, vbuuid: 0 for vb: 339, failed with err: bleve: BleveDest already closed -- cbgt.(*GocbcoreDCPFeed).rollback() at feed_dcp_gocbcore.go:1533
2021-05-05T05:54:06.544-07:00 [INFO] feed_dcp_gocbcore: [a2_users_4389e9ba8dc93303_4c1c5584] DCP stream [118] for vb: 1021, closed due to `state changed`, reconnecting ...
2021-05-05T05:54:06.544-07:00 [INFO] feed_dcp_gocbcore: [routes_6fc715a37b5980d2_4c1c5584] DCP stream [116] for vb: 1021, closed due to `state changed`, reconnecting ...
2021-05-05T05:54:06.546-07:00 [INFO] janitor: pindexes to remove: 0
2021-05-05T05:54:06.546-07:00 [INFO] janitor: pindexes to add: 0
2021-05-05T05:54:06.546-07:00 [INFO] janitor: pindexes to restart: 0
2021-05-05T05:54:06.547-07:00 [INFO] janitor: feeds to remove: 0
2021-05-05T05:54:06.547-07:00 [INFO] janitor: feeds to add: 2
2021-05-05T05:54:06.547-07:00 [INFO] a4_users_562107192a6809c6_4c1c5584
2021-05-05T05:54:06.547-07:00 [INFO] a1_users_377bff4da3dcab77_4c1c5584
2021-05-05T05:54:06.548-07:00 [WARN] janitor: JanitorOnce, err: janitor: Janit
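To make the loop easier to see, here is a hypothetical, self-contained sketch of the feedback cycle described above (the function and channel names are assumptions, not the actual cbgt code): every feed-setup failure notifies the manager, the resulting kick retries the same setup against a destination that stays closed until the pindex restart completes, and the janitor keeps cycling through add-feed, fail, notify, kick.

```go
// Hypothetical sketch of the feedback loop; not the actual cbgt code.
package main

import (
	"errors"
	"fmt"
)

var errDestClosed = errors.New("bleve: BleveDest already closed")

// startFeed stands in for the feed-setup step; it keeps failing
// while the pindex's destination is closed awaiting a restart.
func startFeed(indexName string) error {
	return errDestClosed
}

func main() {
	kick := make(chan string, 1)
	kick <- "gocbcore-feed"
	for i := 0; i < 5; i++ { // capped only so this sketch terminates
		msg := <-kick
		fmt.Println("janitor: awakes, op: kick, msg:", msg)
		if err := startFeed("a1_users"); err != nil {
			fmt.Println("janitor: adding feed, err:", err)
			kick <- "gocbcore-feed" // unconditional notify re-queues the same work
		}
	}
}
```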
Sreekanth Sivasankaran, accurate findings on this issue.
I've uploaded a couple of changes for this:
- http://review.couchbase.org/c/cbgt/+/152939 - improves logging in the area
- http://review.couchbase.org/c/cbgt/+/152942 - conditionally notify the manager when feed initiation fails
The idea is to not notify the manager always, but only on an agent setup failure, which is essentially the only situation where feed initiation should be re-attempted. This fix will ensure that MB-45989 doesn't occur as well.
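Roughly, the gist of the change (a hedged sketch with hypothetical names, not the actual cbgt API):

```go
// Hedged sketch; names are hypothetical, not the actual cbgt API.
package sketch

import "errors"

type agentSetupError struct{ err error }

func (e *agentSetupError) Error() string { return "agent setup: " + e.err.Error() }

type manager interface {
	Kick(reason string)
}

func startDCPFeed(mgr manager, setup func() error) error {
	if err := setup(); err != nil {
		var ae *agentSetupError
		if errors.As(err, &ae) {
			// Transient agent problem: worth a retry via the manager.
			mgr.Kick("gocbcore-feed")
		}
		// Anything else (e.g. a destination closed while awaiting a
		// pindex restart): report it without re-kicking, and let the
		// restart path recover.
		return err
	}
	return nil
}
```
With this shape, a "BleveDest already closed" failure no longer re-kicks the janitor, so the restart path can catch up instead of racing an endless stream of feed-setup retries.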
Hoping the above fixes will address this issue. A build should become available soon with them.
Build couchbase-server-7.0.0-5109 contains cbgt commit 52db00a with commit message:
MB-46112: Notify manager only on agent error on feed initiation
Build couchbase-server-7.0.0-5109 contains cbgt commit fa82fd7 with commit message:
MB-46112: Log feed name when StartGocbcoreDCPFeed fails
Not seeing this issue on the latest run using 7.0.0-5112-enterprise.
The test has run for ~8 hrs so far (and will continue to run for some more time).
Closing this ticket.
Ashwin Govindarajulu, in tests conducted on multiple nodes, it's necessary that you point out which nodes host the service of interest!