  1. Couchbase Server
  2. MB-59886

[System Test on cloud] Rebalance failure - concurrent map read and map write - GetPartitionStats


Details

    • Untriaged
    • 0
    • Unknown

    Description

      Rebalance failed because the indexer crashed on node 011.

      The cluster initially had 9 nodes, and 3 new nodes were added: svc-d-node-010, svc-i-node-011, and svc-q-node-012. The index rebalance failed because of the indexer crash on svc-i-node-011 ("fatal error: concurrent map read and map write" in the log below).

      2023-11-29T14:03:26.273+00:00 [Info] rpcServer(v1:0): rpc request:url:/rpc/SyncAndCloseFile args:fh:{id:"/plasma_storage_v1/2eecc1a7a97188a1224a4bdca9b43b75_ShardTokenb3_73_79_f6_26_e4_55_9e/9444600327820328330/shards/shard9444600327820328330/data/recovery/log.00000000000000.data" gen:1701266606} rsp: 
      2023-11-29T14:03:26.273+00:00 [Info] ClustMgr:handleInstAsyncRecoveryDone mType: CLUST_MGR_ASYNC_RECOVERY_DONE indexList: [
      	InstId: 16047153310603626124
      	Defn: DefnId: 14842410186205359249 Name: idx11_IWCJ Using: plasma Bucket: default9 Scope/Id: scope_0/9 Collection/Id: coll_1/b IsPrimary: false NumReplica: 1 InstVersion: 0 
      		SecExprs: <ud>([(all (array flatten_keys(((`r`.`ratings`).`Rooms`), ((`r`.`ratings`).`Cleanliness`)) for `r` in `reviews` end)) `email` `free_parking`])</ud> 
      		Desc: [false false false false]
      		IndexMissingLeadingKey: false
      		IsPartnKeyDocId: true
      		PartitionScheme: KEY 
      		HashScheme: CRC32 PartitionKeys: [(meta().`id`)] WhereExpr: <ud>()</ud> RetainDeletedXATTR: false 
      		AlternateShardIds: map[2:[10038805698820661113-1-0 10038805698820661113-1-1]] 
      	State: INDEX_STATE_RECOVERED
      	RState: RebalPending
      	Stream: NIL_STREAM
      	Version: 1
      	ReplicaId: 1
      	RealInstId: 14391717656047278189
      	PartitionContainer: <nil>
      ] bucket:  scope:  collection:  streamId: NIL_STREAM syncUpdate: false respCh: <nil> 
      2023-11-29T14:03:26.274+00:00 [Info] StorageMgr::updateIndexSnapMapForIndex IndexInst 17742441636678151956 Partitions [6]
      2023-11-29T14:03:26.276+00:00 [Info] ShardRebalancer::waitForIndexState: Indexes: map[10207862105539645177:INDEX_STATE_RECOVERED] reached state: INDEX_STATE_RECOVERED
      2023-11-29T14:03:26.279+00:00 [Info] rpcServer(v1:0): rpc request:url:/rpc/SyncAndCloseFile args:fh:{id:"/plasma_storage_v1/2eecc1a7a97188a1224a4bdca9b43b75_ShardTokenb3_73_79_f6_26_e4_55_9e/9444600327820328330/shards/shard9444600327820328330/data/recovery/log.00000000000000.data" gen:1701266606} rsp:err:{errCode:"rpc remote close in progress"} 
      2023-11-29T14:03:26.282+00:00 [Info] ShardRebalancer::waitForIndexState: Indexes: map[173221550760487635:INDEX_STATE_RECOVERED] reached state: INDEX_STATE_RECOVERED
      2023-11-29T14:03:26.282+00:00 [Info] StorageMgr::openSnapshot IndexInst:17742441636678151956 Partition:6 Attempting to open snapshot (SnapshotInfo: count:35699 committed:false)
      2023-11-29T14:03:26.282+00:00 [Info] Indexer::handleRecoverIndex 
      	InstId: 1712459921039109313
      	Defn: DefnId: 4714142993699911685 Name: idx10_t3NuqB Using: plasma Bucket: default7 Scope/Id: _default/0 Collection/Id: _default/0 IsPrimary: false NumReplica: 1 InstVersion: 1 
      		SecExprs: <ud>([(all (array (all (array flatten_keys(`n`, `v`) for `n` : `v` in (`r`.`ratings`) end)) for `r` in `reviews` end))])</ud> 
      		Desc: [false false]
      		IndexMissingLeadingKey: false
      		IsPartnKeyDocId: true
      		PartitionScheme: KEY 
      		HashScheme: CRC32 PartitionKeys: [(meta().`id`)] WhereExpr: <ud>()</ud> RetainDeletedXATTR: false 
      		AlternateShardIds: map[6:[5013103090801137013-1-0 5013103090801137013-1-1]] 
      	State: INDEX_STATE_CREATED
      	RState: RebalPending
      	Stream: NIL_STREAM
      	Version: 1
      	ReplicaId: 1
      	PartitionContainer: &{map[6:{6 1 [:9105] []}] 7 KEY 0}
      2023-11-29T14:03:26.283+00:00 [Info] Indexer::run:msg_loop: CLUST_MGR_RECOVER_INDEX message from internalAdminRecvCh channel processing took 956.837µs
      2023-11-29T14:03:26.283+00:00 [Info] Indexer::handleMergePartition Source 2587575381279254604 Target 9849878935785170232
      2023-11-29T14:03:26.283+00:00 [Info] MergePartitions: keyspaceId default8 streamId NIL_STREAM
      2023-11-29T14:03:26.283+00:00 [Info] MergePartition: Merge instance 2587575381279254604 to instance 9849878935785170232
      2023-11-29T14:03:26.283+00:00 [Warn] KeyPartitionContainer: Invalid Partition Id 5
      2023-11-29T14:03:26.283+00:00 [Info] Indexer::listenAdminMsgs:msg_loop: CLUST_MGR_RECOVER_INDEX message from adminRecvCh channel processing took 30.499486ms
      2023-11-29T14:03:26.283+00:00 [Info] Indexer::initPartnInstance Initialized Partition: 
      	 Index: 1712459921039109313 Partition: PartitionId: 6 Endpoints: [:9105] , shardIds: map[6:[2506683526578874307 9652920938307205094]], alternateShardIds: map[6:[5013103090801137013-1-0 5013103090801137013-1-1]]
      2023-11-29T14:03:26.283+00:00 [Info] skip validation in merge partitions [5] between inst 2587575381279254604 and 9849878935785170232
      fatal error: concurrent map read and map write
      2023-11-29T14:03:26.283+00:00 [Info] ClustMgr:handleMergePartition&{4763118600742059358 2587575381279254604 3 9849878935785170232 [5] [1] 1 0xc0bef7f2c0}
       
      goroutine 3583387 [running]:
      github.com/couchbase/indexing/secondary/indexer.(*IndexerStats).GetPartitionStats(...)
      	/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/stats_manager.go:1152
      github.com/couchbase/indexing/secondary/indexer.NewSlice(0x71?, 0xc0deb9bc68, 0xc0deb9bbb8, 0x4?, 0xc0035d8a80, 0x0, 0x1?, 0x15e?, 0x162?, {0xc05a779210, ...})
      	/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/indexer.go:10496 +0x373
      github.com/couchbase/indexing/secondary/indexer.(*indexer).initPartnInstance(_, {0x17c3e139a5a12cc1, {0x416c00137fe13805, {0xc05a7791b0, 0xc}, {0xc05a779198, 0x6}, {0xc05a7791c0, 0x8}, {0xc00ae85a20, ...}, ...}, ...}, ...)
      	/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/indexer.go:6302 +0x354
      github.com/couchbase/indexing/secondary/indexer.(*indexer).handleRecoverIndex.func2()
      	/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/indexer.go:2381 +0xa5
      created by github.com/couchbase/indexing/secondary/indexer.(*indexer).handleRecoverIndex in goroutine 1
      	/home/couchbase/jenkins/workspace/couchbase-server-unix/goproj/src/github.com/couchbase/indexing/secondary/indexer/indexer.go:2379 +0x10b6
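
For context on the failure class: the Go runtime aborts the process with "fatal error: concurrent map read and map write" when it detects one goroutine reading a built-in map while another writes to it without synchronization. The trace above shows the reading side (GetPartitionStats, reached from a goroutine started by handleRecoverIndex); the writing goroutine is not identified in this excerpt. The sketch below is only an illustration of one common way to guard such a map with sync.RWMutex. The names statsRegistry, partitionStats, add, get, and size are hypothetical and are not taken from the indexer code, and this is not a proposed patch.

```go
package main

import (
	"fmt"
	"sync"
)

// partitionStats is a hypothetical stand-in for per-partition counters.
type partitionStats struct {
	itemsCount int64
}

// statsRegistry is a hypothetical registry used for illustration only;
// it is not the indexer's IndexerStats type.
type statsRegistry struct {
	mu     sync.RWMutex
	partns map[uint64]*partitionStats
}

func newStatsRegistry() *statsRegistry {
	return &statsRegistry{partns: make(map[uint64]*partitionStats)}
}

// get takes the read lock, so lookups may run in parallel with each other
// but never overlap with a write.
func (r *statsRegistry) get(id uint64) (*partitionStats, bool) {
	r.mu.RLock()
	defer r.mu.RUnlock()
	s, ok := r.partns[id]
	return s, ok
}

// add takes the write lock, excluding all readers while the map is mutated.
func (r *statsRegistry) add(id uint64) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if _, ok := r.partns[id]; !ok {
		r.partns[id] = &partitionStats{}
	}
}

// size reports the number of registered partitions under the read lock.
func (r *statsRegistry) size() int {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return len(r.partns)
}

func main() {
	reg := newStatsRegistry()
	var wg sync.WaitGroup
	// Readers and writers run concurrently; without the lock this pattern
	// can trigger the runtime's concurrent map access check.
	for i := uint64(0); i < 8; i++ {
		wg.Add(2)
		go func(id uint64) { defer wg.Done(); reg.add(id) }(i)
		go func(id uint64) { defer wg.Done(); reg.get(id) }(i)
	}
	wg.Wait()
	fmt.Println(reg.size()) // prints 8
}
```

Running a reproduction under the Go race detector (go run -race or go test -race) typically reports both the reading and the writing goroutine, which can help locate the writer that this trace does not show.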
      

      Rebalance report ->

      Rebalance exited with reason {service_rebalance_failed,index,
      {agent_died,<37208.4395.0>,
      {lost_connection,
      {'ns_1@svc-i-node-011.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com',
      shutdown}}}}.
      Rebalance Operation Id = 7d199ac78bdf6f84617a21b7af42db0d
      

      cbcollect ->

      https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-d-node-001.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-d-node-002.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-d-node-003.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-d-node-010.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-i-node-004.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-i-node-005.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-i-node-006.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-i-node-007.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-i-node-011.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-q-node-008.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-q-node-009.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestNov29RebalFail/collectinfo-2023-11-29T140541-ns_1%40svc-q-node-012.rmckhdwxbz6i1dqp.sandbox.nonprod-project-avengers.com.zip

          People

            pavan.pb Pavan PB