Couchbase Server / MB-49604

Performance: Rebalance out 4->3 hung


Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Not a Bug
    • Affects Version: 7.0.0
    • Fix Version: Neo
    • Component: secondary-index
    • Triage: Untriaged
    • 1
    • Unknown

    Description

      Build: 7.0.0 GA; the same test passed on 7.1.0-1695.

      Test: This is a new test to benchmark smart batching against 7.0.0.

      1. Create a 6-node cluster (2 kv, 4 index).
      2. Load 50M docs.
      3. Create 1k indexes (1 partition, mix of replica 0, 1, 2); a sketch follows this list.
      4. Wait for the index builds to complete.
      5. Rebalance out one of the nodes; it hung for 3 hours and I had to force-stop it.
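
      For context, a rough sketch of how step 3 might be driven, assuming the standard query service REST endpoint on port 8093 and the N1QL WITH {"num_replica": ...} option. The bucket, index, and field names are made up for illustration and are not the ones used in the perf test.

          import requests

          def create_index(query_host, auth, name, bucket, field, num_replica):
              # CREATE INDEX with a replica count and deferred build; the real test
              # creates 1k of these with num_replica cycling through 0, 1, 2.
              stmt = (f'CREATE INDEX `{name}` ON `{bucket}`(`{field}`) '
                      f'WITH {{"num_replica": {num_replica}, "defer_build": true}}')
              resp = requests.post(f"http://{query_host}:8093/query/service",
                                   data={"statement": stmt}, auth=auth, timeout=120)
              resp.raise_for_status()
              return resp.json()

          # Hypothetical usage, cycling replicas 0/1/2 as in "mix replica 0,1,2":
          # for i in range(1000):
          #     create_index("<query-node>", ("Administrator", "<password>"),
          #                  f"idx_{i}", "bucket-1", "field_1", i % 3)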

      Run: 

      http://perf.jenkins.couchbase.com/job/hemera-temp/701/consoleFull

      19:33:25 2021-11-16T06:03:25 [INFO] Rebalance progress: 59.46666666666666 %
      19:33:27 2021-11-16T06:03:27 [INFO] Rebalance progress: 59.46666666666666 %
      19:33:29 2021-11-16T06:03:29 [INFO] Rebalance progress: 59.46666666666666 %
      19:33:31 2021-11-16T06:03:31 [INFO] Rebalance progress: 59.46666666666666 %
      19:33:33 2021-11-16T06:03:33 [INFO] Rebalance progress: 59.46666666666666 %
      19:33:35 2021-11-16T06:03:35 [INFO] Rebalance progress: 59.46666666666666 %
      19:33:37 2021-11-16T06:03:37 [INFO] Rebalance progress: 59.46666666666666 %
      19:33:39 2021-11-16T06:03:39 [INFO] Rebalance progress: 59.46666666666666 %
      .
      .
      .
      23:28:52 2021-11-16T09:58:52 [INFO] Rebalance progress: 59.46666666666666 %
      23:28:54 2021-11-16T09:58:54 [INFO] Rebalance progress: 59.46666666666666 %
      23:28:56 2021-11-16T09:58:56 [INFO] Rebalance progress: 59.46666666666666 %
      23:28:58 2021-11-16T09:58:58 [INFO] Rebalance progress: 59.46666666666666 %
      23:29:00 2021-11-16T09:59:00 [INFO] Rebalance progress: 59.46666666666666 %
      23:29:02 2021-11-16T09:59:02 [INFO] Rebalance progress: 59.46666666666666 %
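
      For reference, a minimal sketch of the kind of polling loop that produces the progress lines above and could flag a stall, assuming the standard ns_server REST endpoint /pools/default/tasks on port 8091; this is illustrative only, not the actual perfrunner code.

          import time
          import requests

          def rebalance_progress(host, auth):
              # /pools/default/tasks lists running cluster tasks; the rebalance
              # task carries a "progress" field while it is running.
              tasks = requests.get(f"http://{host}:8091/pools/default/tasks",
                                   auth=auth, timeout=30).json()
              for task in tasks:
                  if task.get("type") == "rebalance" and task.get("status") == "running":
                      return task.get("progress")
              return None  # no rebalance task running

          def wait_for_rebalance(host, auth, poll_s=2, stall_limit_s=3 * 3600):
              last, unchanged_since = None, time.monotonic()
              while True:
                  progress = rebalance_progress(host, auth)
                  if progress is None:
                      return  # rebalance finished (or was stopped)
                  print(f"Rebalance progress: {progress} %")
                  if progress != last:
                      last, unchanged_since = progress, time.monotonic()
                  elif time.monotonic() - unchanged_since > stall_limit_s:
                      raise TimeoutError(f"Rebalance stuck at {progress} % for "
                                         f"more than {stall_limit_s} s")
                  time.sleep(poll_s)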

      Attachments


        Activity

          kevin.cherkauer Kevin Cherkauer added a comment - - edited

          Vikas Chaudhary Did you see the following section of "fatal errors" near the beginning of the long period during which rebalance progress was stuck at 59.46666666666666%? It seems like something may have gone wrong with the test environment. I don't know yet whether this is related to the rebalance hang.

          06:15:39 2021-11-16T06:15:39 [INFO] Rebalance progress: 59.46666666666666 %
          06:15:41 2021-11-16T06:15:41 [INFO] Rebalance progress: 59.46666666666666 %
          06:15:42 
          06:15:42 Fatal error: Needed to prompt for a connection or sudo password (host: cen-s710.perf.couchbase.com), but input would be ambiguous in parallel mode
          06:15:42 
          06:15:42 Aborting.
          06:15:42 
          06:15:42 Fatal error: Needed to prompt for a connection or sudo password (host: cen-s710.perf.couchbase.com), but input would be ambiguous in parallel mode
          06:15:42 
          06:15:42 Aborting.
          06:15:42 
          06:15:42 Fatal error: Needed to prompt for a connection or sudo password (host: cen-s710.perf.couchbase.com), but input would be ambiguous in parallel mode
          06:15:42 
          06:15:42 Aborting.
          06:15:42 
          06:15:42 Fatal error: Needed to prompt for a connection or sudo password (host: cen-s710.perf.couchbase.com), but input would be ambiguous in parallel mode
          06:15:42 
          06:15:42 Aborting.
          06:15:42 
          06:15:42 Fatal error: Needed to prompt for a connection or sudo password (host: cen-s710.perf.couchbase.com), but input would be ambiguous in parallel mode
          06:15:42 
          06:15:42 Aborting.
          06:15:42 
          06:15:42 Fatal error: Needed to prompt for a connection or sudo password (host: cen-s710.perf.couchbase.com), but input would be ambiguous in parallel mode
          06:15:42 
          06:15:42 Aborting.
          06:15:42 
          06:15:42 Warning: One or more hosts failed while executing task 'get_server_samples'
          06:15:42 
          06:15:42 
          06:15:42 Warning: One or more hosts failed while executing task 'get_server_samples'
          06:15:42 
          06:15:42 
          06:15:42 Warning: One or more hosts failed while executing task 'get_samples'
          06:15:42 
          06:15:42 
          06:15:42 Warning: One or more hosts failed while executing task 'get_samples'
          06:15:42 
          06:15:42 
          06:15:42 Warning: One or more hosts failed while executing task 'get_samples'
          06:15:42 
          06:15:42 
          06:15:42 Warning: One or more hosts failed while executing task 'get_samples'
          06:15:42 
          06:15:43 2021-11-16T06:15:43 [INFO] Rebalance progress: 59.46666666666666 %
          06:15:45 2021-11-16T06:15:45 [INFO] Rebalance progress: 59.46666666666666 %
          


          kevin.cherkauer Kevin Cherkauer added a comment -

          723 - ns_server orchestrator
          709 - GSI master
          b908d71c08a7dac115a194390a842ba2 - Rebalance ID in ns_server

          2000 indexes
          250 transfer tokens generated

          At the end of the logs the rebalance was still running; there is no indication of it ending in either ns_server or GSI. Presumably this means some kind of "kill server" script was run.


          vikas.chaudhary Vikas Chaudhary added a comment -

          Kevin Cherkauer Re-ran on a different cluster and it worked fine. Seeing some connection issues with the above run; re-running on the same cluster to debug it.


          kevin.cherkauer Kevin Cherkauer added a comment -

          Vikas Chaudhary Thank you.

          Note also that the rebalance that appeared to hang was still making progress up until it was killed, as evidenced by the GetTaskList and GetCurrentTopology version vectors incrementing every few seconds. For example, here are the last two call-response pairs of each from the GSI rebalance master node (709):

          2021-11-16T09:41:55.179-08:00 [Info] ServiceMgr::GetTaskList [0 0 0 0 0 0 10 227]
          2021-11-16T09:41:55.179-08:00 [Info] ServiceMgr::GetCurrentTopology [0 0 0 0 0 0 10 227]
          ...
          2021-11-16T09:42:00.116-08:00 [Info] ServiceMgr::GetTaskList returns &{[0 0 0 0 0 0 10 228] [{[0 0 0 0 0 0 0 0] prepare/41ac6a8a5e8914e73a2c4ebeaaad39f7 task-prepared task-running true 0 map[]   map[rebalanceId:41ac6a8a5e8914e73a2c4ebeaaad39f7]} {[0 0 0 0 0 0 10 225] rebalance/41ac6a8a5e8914e73a2c4ebeaaad39f7 task-rebalance task-running true 0.392 map[]   map[rebalanceId:41ac6a8a5e8914e73a2c4ebeaaad39f7]}]}
          2021-11-16T09:42:00.116-08:00 [Info] ServiceMgr::GetCurrentTopology returns &{[0 0 0 0 0 0 10 228] [1d9dfeaaa7ea802ca07444b333c23620 4fe6ee96cc7956482da5a4166b6adf38 a589c95e77863a7b4acc2407ddfb4069] true []}
           
           
          2021-11-16T09:42:00.118-08:00 [Info] ServiceMgr::GetTaskList [0 0 0 0 0 0 10 228]
          2021-11-16T09:42:00.118-08:00 [Info] ServiceMgr::GetCurrentTopology [0 0 0 0 0 0 10 228]
          ...
          2021-11-16T09:42:05.160-08:00 [Info] ServiceMgr::GetCurrentTopology returns &{[0 0 0 0 0 0 10 229] [1d9dfeaaa7ea802ca07444b333c23620 4fe6ee96cc7956482da5a4166b6adf38 a589c95e77863a7b4acc2407ddfb4069] true []}
          2021-11-16T09:42:05.160-08:00 [Info] ServiceMgr::GetTaskList returns &{[0 0 0 0 0 0 10 229] [{[0 0 0 0 0 0 0 0] prepare/41ac6a8a5e8914e73a2c4ebeaaad39f7 task-prepared task-running true 0 map[]   map[rebalanceId:41ac6a8a5e8914e73a2c4ebeaaad39f7]} {[0 0 0 0 0 0 10 226] rebalance/41ac6a8a5e8914e73a2c4ebeaaad39f7 task-rebalance task-running true 0.392 map[]   map[rebalanceId:41ac6a8a5e8914e73a2c4ebeaaad39f7]}]}
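
          As a quick way to verify this from the logs, one could extract the revision counter from the ServiceMgr::GetTaskList return lines in indexer.log and check that it keeps increasing; a rough sketch (the log file name and regex are assumptions based on the lines quoted above):

              import re

              # Matches the 8-byte rev at the start of the returned TaskList struct,
              # e.g. "ServiceMgr::GetTaskList returns &{[0 0 0 0 0 0 10 228] ...".
              REV_RE = re.compile(r"ServiceMgr::GetTaskList returns &\{\[([0-9 ]+)\]")

              def task_list_revs(log_path):
                  revs = []
                  with open(log_path) as f:
                      for line in f:
                          m = REV_RE.search(line)
                          if m:
                              rev_bytes = bytes(int(b) for b in m.group(1).split())
                              # Fold the 8 bytes into one integer for easy comparison.
                              revs.append(int.from_bytes(rev_bytes, "big"))
                  return revs

              # revs = task_list_revs("indexer.log")
              # A strictly increasing tail means ns_server and the GSI service
              # manager were still exchanging state when the logs end.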
          


          People

            Assignee: vikas.chaudhary Vikas Chaudhary
            Reporter: vikas.chaudhary Vikas Chaudhary
            Votes: 0
            Watchers: 3
