Details
Bug
Resolution: Fixed
Critical
7.6.0
7.6.0-1861 on AWS
Untriaged
0
Unknown
Description
Rebalance failed because of an indexer crash on node 011.
The cluster initially had 9 nodes, and 3 new nodes were added: svc-d-node-010, svc-i-node-011, and svc-q-node-012. The index rebalance failed because the indexer on svc-i-node-011 crashed:
2023-11-29T14:03:26.273+00:00 [Info] rpcServer(v1:0): rpc request:url:/rpc/SyncAndCloseFile args:fh:{id:"/plasma_storage_v1/2eecc1a7a97188a1224a4bdca9b43b75_ShardTokenb3_73_79_f6_26_e4_55_9e/9444600327820328330/shards/shard9444600327820328330/data/recovery/log.00000000000000.data" gen:1701266606} rsp:
2023-11-29T14:03:26.273+00:00 [Info] ClustMgr:handleInstAsyncRecoveryDone mType: CLUST_MGR_ASYNC_RECOVERY_DONE indexList: [
InstId: 16047153310603626124
Defn: DefnId: 14842410186205359249 Name: idx11_IWCJ Using: plasma Bucket: default9 Scope/Id: scope_0/9 Collection/Id: coll_1/b IsPrimary: false NumReplica: 1 InstVersion: 0
SecExprs: <ud>([(all (array flatten_keys(((`r`.`ratings`).`Rooms`), ((`r`.`ratings`).`Cleanliness`)) for `r` in `reviews` end)) `email` `free_parking`])</ud>
Desc: [false false false false]
IndexMissingLeadingKey: false
IsPartnKeyDocId: true
PartitionScheme: KEY
HashScheme: CRC32 PartitionKeys: [(meta().`id`)] WhereExpr: <ud>()</ud> RetainDeletedXATTR: false
AlternateShardIds: map[2:[10038805698820661113-1-0 10038805698820661113-1-1]]
State: INDEX_STATE_RECOVERED
RState: RebalPending
Stream: NIL_STREAM
Version: 1
ReplicaId: 1
RealInstId: 14391717656047278189
PartitionContainer: <nil>
] bucket: scope: collection: streamId: NIL_STREAM syncUpdate: false respCh: <nil>
2023-11-29T14:03:26.274+00:00 [Info] StorageMgr::updateIndexSnapMapForIndex IndexInst 17742441636678151956 Partitions [6]
2023-11-29T14:03:26.276+00:00 [Info] ShardRebalancer::waitForIndexState: Indexes: map[10207862105539645177:INDEX_STATE_RECOVERED] reached state: INDEX_STATE_RECOVERED
2023-11-29T14:03:26.279+00:00 [Info] rpcServer(v1:0): rpc request:url:/rpc/SyncAndCloseFile args:fh:{id:"/plasma_storage_v1/2eecc1a7a97188a1224a4bdca9b43b75_ShardTokenb3_73_79_f6_26_e4_55_9e/9444600327820328330/shards/shard9444600327820328330/data/recovery/log.00000000000000.data" gen:1701266606} rsp:err:{errCode:"rpc remote close in progress"}
2023-11-29T14:03:26.282+00:00 [Info] ShardRebalancer::waitForIndexState: Indexes: map[173221550760487635:INDEX_STATE_RECOVERED] reached state: INDEX_STATE_RECOVERED
2023-11-29T14:03:26.282+00:00 [Info] StorageMgr::openSnapshot IndexInst:17742441636678151956 Partition:6 Attempting to open snapshot (SnapshotInfo: count:35699 committed:false)
2023-11-29T14:03:26.282+00:00 [Info] Indexer::handleRecoverIndex
InstId: 1712459921039109313
Defn: DefnId: 4714142993699911685 Name: idx10_t3NuqB Using: plasma Bucket: default7 Scope/Id: _default/0 Collection/Id: _default/0 IsPrimary: false NumReplica: 1 InstVersion: 1
SecExprs: <ud>([(all (array (all (array flatten_keys(`n`, `v`) for `n` : `v` in (`r`.`ratings`) end)) for `r` in `reviews` end))])</ud>
Desc: [false false]
IndexMissingLeadingKey: false
IsPartnKeyDocId: true
PartitionScheme: KEY
HashScheme: CRC32 PartitionKeys: [(meta().`id`)] WhereExpr: <ud>()</ud> RetainDeletedXATTR: false
AlternateShardIds: map[6:[5013103090801137013-1-0 5013103090801137013-1-1]]
State: INDEX_STATE_CREATED
RState: RebalPending
Stream: NIL_STREAM
Version: 1
ReplicaId: 1
PartitionContainer: &{map[6:{6 1 [:9105] []}] 7 KEY 0}
2023-11-29T14:03:26.283+00:00 [Info] Indexer::run:msg_loop: CLUST_MGR_RECOVER_INDEX message from internalAdminRecvCh channel processing took 956.837µs
2023-11-29T14:03:26.283+00:00 [Info] Indexer::handleMergePartition Source 2587575381279254604 Target 9849878935785170232
2023-11-29T14:03:26.283+00:00 [Info] MergePartitions: keyspaceId default8 streamId NIL_STREAM
2023-11-29T14:03:26.283+00:00 [Info] MergePartition: Merge instance 2587575381279254604 to instance 9849878935785170232
2023-11-29T14:03:26.283+00:00 [Warn] KeyPartitionContainer: Invalid Partition Id 5
2023-11-29T14:03:26.283+00:00 [Info] Indexer::listenAdminMsgs:msg_loop: CLUST_MGR_RECOVER_INDEX message from adminRecvCh channel processing took 30.499486ms
2023-11-29T14:03:26.283+00:00 [Info] Indexer::initPartnInstance Initialized Partition:
Index: 1712459921039109313 Partition: PartitionId: 6 Endpoints: [:9105] , shardIds: map[6:[2506683526578874307 9652920938307205094]], alternateShardIds: map[6:[5013103090801137013-1-0 5013103090801137013-1-1]]
2023-11-29T14:03:26.283+00:00 [Info] skip validation in merge partitions [5] between inst 2587575381279254604 and 9849878935785170232
fatal error: concurrent map read and map write
2023-11-29T14:03:26.283+00:00 [Info] ClustMgr:handleMergePartition&{4763118600742059358 2587575381279254604 3 9849878935785170232 [5] [1] 1 0xc0bef7f2c0}

goroutine 3583387 [running]:
github.com/couchbase/indexing/secondary/indexer.(*IndexerStats).GetPartitionStats(...)
	/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/stats_manager.go:1152
github.com/couchbase/indexing/secondary/indexer.NewSlice(0x71?, 0xc0deb9bc68, 0xc0deb9bbb8, 0x4?, 0xc0035d8a80, 0x0, 0x1?, 0x15e?, 0x162?, {0xc05a779210, ...})
	/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/indexer.go:10496 +0x373
github.com/couchbase/indexing/secondary/indexer.(*indexer).initPartnInstance(_, {0x17c3e139a5a12cc1, {0x416c00137fe13805, {0xc05a7791b0, 0xc}, {0xc05a779198, 0x6}, {0xc05a7791c0, 0x8}, {0xc00ae85a20, ...}, ...}, ...}, ...)
	/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/indexer.go:6302 +0x354
github.com/couchbase/indexing/secondary/indexer.(*indexer).handleRecoverIndex.func2()
	/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/indexer.go:2381 +0xa5
created by github.com/couchbase/indexing/secondary/indexer.(*indexer).handleRecoverIndex in goroutine 1
	/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/indexer.go:2379 +0x10b6
Rebalance report ->
Rebalance exited with reason {service_rebalance_failed,index,
{agent_died,<37208.4395.0>,
{lost_connection,
{'ns_1@svc-i-node-011.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com',
shutdown}}}}.
Rebalance Operation Id = 7d199ac78bdf6f84617a21b7af42db0d
cbcollect ->
https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-d-node-001.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-d-node-002.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-d-node-003.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-d-node-010.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-i-node-004.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-i-node-005.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-i-node-006.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-i-node-007.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-i-node-011.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-q-node-008.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-q-node-009.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-q-node-012.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip