Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-30220

Lost items in metadata bucket because of KV hard failover, lead to Eventing rebalance hang

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 5.5.0
    • 5.5.0
    • eventing
    • Untriaged
    • Unknown

    Description

      centos longevity - 5.5.0-2923 - following rebalance failure noticed:

      Rebalance exited with reason {service_rebalance_failed,eventing,
      {rebalance_failed,

      {service_error, <<"eventing rebalance hasn't made progress for past 3600 secs">>}

      }}

      At the same time in eventing logs, following is observed:

      2018-06-21T12:24:49.074-07:00 [Info] Consumer::RebalanceTaskProgress [worker_bucket_op_complex_function_1:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_1.sock:27968] vbsRemainingToGiveUp len: 0 dump: [] vbsRemainingToOwn len: 115 dump: [171, 186-199, 232, 237-249, 251-285, 287-288, 290-313, 315-320, 322-335, 337-341]
      2018-06-21T12:24:49.077-07:00 [Info] Consumer::RebalanceTaskProgress [worker_bucket_op_complex_function_2:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_2.sock:27973] vbsRemainingToGiveUp len: 0 dump: [] vbsRemainingToOwn len: 2 dump: [395, 409]
      2018-06-21T12:24:49.077-07:00 [Info] ServiceMgr::getRebalanceProgress rebalance progress from node with rest port: 8091 progress: &{118 512}
      2018-06-21T12:24:49.079-07:00 [Info] Consumer::RebalanceTaskProgress [worker_bucket_op_function_0:/tmp/127.0.0.1:8091_worker_bucket_op_function_0.sock:27843] vbsRemainingToGiveUp len: 0 dump: [] vbsRemainingToOwn len: 158 dump: [0, 10, 15-170]
      2018-06-21T12:24:49.081-07:00 [Info] Consumer::RebalanceTaskProgress [worker_bucket_op_function_1:/tmp/127.0.0.1:8091_worker_bucket_op_function_1.sock:27838] vbsRemainingToGiveUp len: 0 dump: [] vbsRemainingToOwn len: 156 dump: [171, 186-252, 254-341]
      2018-06-21T12:24:49.084-07:00 [Info] Consumer::RebalanceTaskProgress [worker_bucket_op_function_2:/tmp/127.0.0.1:8091_worker_bucket_op_function_2.sock:27837] vbsRemainingToGiveUp len: 0 dump: [] vbsRemainingToOwn len: 4 dump: [404-406, 413]
      2018-06-21T12:24:49.085-07:00 [Info] ServiceMgr::getRebalanceProgress rebalance progress from node with rest port: 8091 progress: &{318 512}
      2018-06-21T12:24:49.086-07:00 [Info] Consumer::vbsStateUpdate [worker_bucket_op_complex_function_2:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_2.sock:27973] vb takeover routine id: 2, vbs assigned len: 0 dump: []
      2018-06-21T12:24:49.086-07:00 [Info] Consumer::vbsStateUpdate [worker_bucket_op_complex_function_2:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_2.sock:27973] vb takeover routine id: 0, vbs assigned len: 1 dump: [395
      2018-06-21T12:24:49.086-07:00 [Info] Consumer::vbsStateUpdate [worker_bucket_op_complex_function_2:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_2.sock:27973] vb takeover routine id: 1, vbs assigned len: 1 dump: [409
      2018-06-21T12:24:49.086-07:00 [Info] Consumer::vbsStateUpdate [worker_bucket_op_complex_function_2:takeover_r_1:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_2.sock:27973] vb: 409 skipping vbTakeover as dcp request stream already in flight
      2018-06-21T12:24:49.087-07:00 [Info] Consumer::doVbTakeover [worker_bucket_op_complex_function_2:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_2.sock:27973] vb: 395 vbblob stream status: stopped, starting dcp stream
      2018-06-21T12:24:49.087-07:00 [Info] Consumer::updateVbOwnerAndStartDCPStream [worker_bucket_op_complex_function_2:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_2.sock:27973] vb: 395 Sending streamRequestInfo size: 3589
      2018-06-21T12:24:49.123-07:00 [Info] ServiceMgr::GetTaskList rev: service.Revision(nil)
      2018-06-21T12:24:49.124-07:00 [Info] ServiceMgr::GetCurrentTopology rev: service.Revision(nil)
      2018-06-21T12:24:49.124-07:00 [Info] ServiceMgr::GetTaskList rev: service.Revision{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x75}
      2018-06-21T12:24:49.124-07:00 [Info] ServiceMgr::GetCurrentTopology rev: service.Revision{0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x2, 0x75}
      2018-06-21T12:24:49.141-07:00 [Info] SuperSupervisor::TopologyChangeNotifCallback [2] Path => /eventing/rebalanceToken/9fec36d2f59678aeb140ab782aab00e1 value => stopRebalance
      2018-06-21T12:24:49.154-07:00 [Info] SuperSupervisor::TopologyChangeNotifCallback [2] Apps in primary store: [bucket_op_complex_function bucket_op_function], running apps: map[bucket_op_function:Producer => app: bucket_op_function tcpPort: /tmp/127.0.0.1:8091_worker_bucket_op_function_2.sock bucket_op_complex_function:Producer => app: bucket_op_complex_function tcpPort: /tmp/127.0.0.1:8091_worker_bucket_op_complex_function_2.sock]
      2018-06-21T12:24:49.157-07:00 [Info] SuperSupervisor::TopologyChangeNotifCallback [2] App: bucket_op_complex_function deployment_status: true processing_status: true runningProducer: Producer => app: bucket_op_complex_function tcpPort: /tmp/127.0.0.1:8091_worker_bucket_op_complex_function_2.sock
      2018-06-21T12:24:49.160-07:00 [Info] SuperSupervisor::TopologyChangeNotifCallback [2] App: bucket_op_function deployment_status: true processing_status: true runningProducer: Producer => app: bucket_op_function tcpPort: /tmp/127.0.0.1:8091_worker_bucket_op_function_2.sock
      2018-06-21T12:24:49.160-07:00 [Info] Producer::Serve [bucket_op_complex_function:3] Got topology change msg: &{stop-rebalance} from super_supervisor
      2018-06-21T12:24:49.160-07:00 [Info] Producer::Serve [bucket_op_complex_function:3] Consumer: worker_bucket_op_complex_function_0 sent stop rebalance message from producer
      2018-06-21T12:24:49.160-07:00 [Info] Consumer::NotifyRebalanceStop [worker_bucket_op_complex_function_0:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_0.sock:27978] Got notification about rebalance stop
      2018-06-21T12:24:49.160-07:00 [Info] Producer::Serve [bucket_op_function:3] Got topology change msg: &{stop-rebalance} from super_supervisor
      2018-06-21T12:24:49.160-07:00 [Info] Consumer::NotifyRebalanceStop [worker_bucket_op_complex_function_0:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_0.sock:27978] Updated isRebalanceOngoing to false
      2018-06-21T12:24:49.160-07:00 [Info] Producer::Serve [bucket_op_function:3] Consumer: worker_bucket_op_function_0 sent stop rebalance message from producer
      2018-06-21T12:24:49.160-07:00 [Info] Consumer::NotifyRebalanceStop [worker_bucket_op_function_0:/tmp/127.0.0.1:8091_worker_bucket_op_function_0.sock:27843] Got notification about rebalance stop
      2018-06-21T12:24:49.160-07:00 [Info] Consumer::NotifyRebalanceStop [worker_bucket_op_function_0:/tmp/127.0.0.1:8091_worker_bucket_op_function_0.sock:27843] Updated isRebalanceOngoing to false
      2018-06-21T12:24:49.160-07:00 [Info] Producer::Serve [bucket_op_function:3] Consumer: worker_bucket_op_function_1 sent stop rebalance message from producer
      2018-06-21T12:24:49.160-07:00 [Info] Consumer::NotifyRebalanceStop [worker_bucket_op_function_1:/tmp/127.0.0.1:8091_worker_bucket_op_function_1.sock:27838] Got notification about rebalance stop
      2018-06-21T12:24:49.160-07:00 [Info] Consumer::NotifyRebalanceStop [worker_bucket_op_function_1:/tmp/127.0.0.1:8091_worker_bucket_op_function_1.sock:27838] Updated isRebalanceOngoing to false
      2018-06-21T12:24:49.160-07:00 [Info] Producer::Serve [bucket_op_function:3] Consumer: worker_bucket_op_function_2 sent stop rebalance message from producer
      2018-06-21T12:24:49.160-07:00 [Info] Consumer::NotifyRebalanceStop [worker_bucket_op_function_2:/tmp/127.0.0.1:8091_worker_bucket_op_function_2.sock:27837] Got notification about rebalance stop
      2018-06-21T12:24:49.160-07:00 [Info] Consumer::NotifyRebalanceStop [worker_bucket_op_function_2:/tmp/127.0.0.1:8091_worker_bucket_op_function_2.sock:27837] Updated isRebalanceOngoing to false
      2018-06-21T12:24:49.160-07:00 [Info] Producer::Serve [bucket_op_complex_function:3] Consumer: worker_bucket_op_complex_function_1 sent stop rebalance message from producer
      2018-06-21T12:24:49.160-07:00 [Info] Consumer::NotifyRebalanceStop [worker_bucket_op_complex_function_1:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_1.sock:27968] Got notification about rebalance stop
      2018-06-21T12:24:49.160-07:00 [Info] Consumer::NotifyRebalanceStop [worker_bucket_op_complex_function_1:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_1.sock:27968] Updated isRebalanceOngoing to false
      2018-06-21T12:24:49.160-07:00 [Info] Producer::Serve [bucket_op_complex_function:3] Consumer: worker_bucket_op_complex_function_2 sent stop rebalance message from producer
      2018-06-21T12:24:49.160-07:00 [Info] Consumer::NotifyRebalanceStop [worker_bucket_op_complex_function_2:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_2.sock:27973] Got notification about rebalance stop
      2018-06-21T12:24:49.160-07:00 [Info] Consumer::NotifyRebalanceStop [worker_bucket_op_complex_function_2:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_2.sock:27973] Updated isRebalanceOngoing to false
      2018-06-21T12:24:49.271-07:00 [Error] Consumer::addOwnershipHistorySRRCallback [worker_bucket_op_function_2:/tmp/127.0.0.1:8091_worker_bucket_op_function_2.sock:27837] Key: bucket_op_function::vb::404, subdoc operation failed post STREAMREQ from Consumer, err: key not found
      2018-06-21T12:24:49.291-07:00 [Info] Consumer::processEvents [worker_bucket_op_function_0:/tmp/127.0.0.1:8091_worker_bucket_op_function_0.sock:27843] vb: 15 STREAMREQ metadataUpdated: false
      2018-06-21T12:24:49.304-07:00 [Info] Consumer::checkAndUpdateMetadataLoop [worker_bucket_op_function_1:/tmp/127.0.0.1:8091_worker_bucket_op_function_1.sock:27838] Exiting the routine
      2018-06-21T12:24:49.306-07:00 [Info] Consumer::checkAndUpdateMetadataLoop [worker_bucket_op_complex_function_2:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_2.sock:27973] Exiting the routine
      2018-06-21T12:24:49.324-07:00 [Info] Consumer::checkAndUpdateMetadataLoop [worker_bucket_op_complex_function_0:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_0.sock:27978] Exiting the routine
      2018-06-21T12:24:49.329-07:00 [Info] Consumer::checkAndUpdateMetadataLoop [worker_bucket_op_complex_function_1:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_1.sock:27968] Exiting the routine
      2018-06-21T12:24:49.337-07:00 [Info] Consumer::checkAndUpdateMetadataLoop [worker_bucket_op_function_0:/tmp/127.0.0.1:8091_worker_bucket_op_function_0.sock:27843] Exiting the routine
      2018-06-21T12:24:49.339-07:00 [Info] Consumer::checkAndUpdateMetadataLoop [worker_bucket_op_function_2:/tmp/127.0.0.1:8091_worker_bucket_op_function_2.sock:27837] Exiting the routine
      2018-06-21T12:24:49.354-07:00 [Info] Consumer::processEvents [worker_bucket_op_complex_function_1:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_1.sock:27968] vb: 237 STREAMREQ metadataUpdated: false
      2018-06-21T12:24:49.354-07:00 [Info] Consumer::processEvents [worker_bucket_op_function_1:/tmp/127.0.0.1:8091_worker_bucket_op_function_1.sock:27838] vb: 309 STREAMREQ metadataUpdated: false
      2018-06-21T12:24:49.461-07:00 [Info] Consumer::processEvents [worker_bucket_op_complex_function_0:/tmp/127.0.0.1:8091_worker_bucket_op_complex_function_0.sock:27978] vb: 157 STREAMREQ metadataUpdated: false
      2018-06-21T12:24:49.530-07:00 [Error] Consumer::addOwnershipHistorySRRCallback [worker_bucket_op_function_1:/tmp/127.0.0.1:8091_worker_bucket_op_function_1.s
      

      supportal: http://supportal.couchbase.com/snapshot/e6ed4ed14c86079378d112de54f99884::0
      cluster: http://172.23.108.103:8091/

      Issue seems to be same as MB-30012

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            arunkumar Arunkumar Senthilnathan (Inactive)
            arunkumar Arunkumar Senthilnathan (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            7 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              PagerDuty