Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-49827

[System Test] Query rebalance stuck for over 4 days.

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Blocker
    • 7.1.0
    • 7.1.0
    • query

    Description

      Build : 7.1.0-1773
      Test : -test tests/integration/neo/test_neo_couchstore_milestone3.yml -scope tests/integration/neo/scope_couchstore.yml
      Scale : 3
      Iteration : 3

      Rebalance started at 2021-11-25T13:09:40 to remove query node 172.23.121.77 from the cluster

      [2021-11-25T13:09:40-08:00, sequoiatools/couchbase-cli:7.1:03742a] rebalance -c 172.23.97.74:8091 --server-remove 172.23.121.77 -u Administrator -p password

      It has been over 4 days, but rebalance is not yet complete. On 172.23.121.77, the following can be seen in the logs :

      2021-11-25T13:10:34.403-08:00 [INFO] Graceful shutdown initiated.
      2021-11-25T13:10:34.403-08:00 [INFO] Shutdown: Waiting for 1 active request(s) and 0 active transaction(s) to complete.
      2021-11-25T13:10:34.436-08:00 [INFO] Incoming request from '172.23.104.254:54118' rejected during service shutdown.
      2021-11-25T13:10:34.907-08:00 [Info] serviceChangeNotifier: received PoolChangeNotification
      2021-11-25T13:10:34.913-08:00 [INFO] Incoming request from '172.23.104.254:53400' rejected during service shutdown.
      

      and even later

      2021-11-29T16:52:15.047-08:00 [INFO] Incoming request from '172.23.104.254:55748' rejected during service shutdown.
      2021-11-29T16:52:20.583-08:00 [INFO] Incoming request from '172.23.104.254:56002' rejected during service shutdown.
      2021-11-29T16:52:26.090-08:00 [INFO] Incoming request from '172.23.104.254:56234' rejected during service shutdown.
      2021-11-29T16:52:37.283-08:00 [INFO] Incoming request from '172.23.104.254:56730' rejected during service shutdown.
      

      2021-11-29T16:52:14.627-08:00 [INFO] Shutdown: Waiting for 1 active request(s) and 0 active transaction(s) to complete.
      2021-11-29T16:52:24.680-08:00 [INFO] Shutdown: Waiting for 1 active request(s) and 0 active transaction(s) to complete.
      2021-11-29T16:52:34.735-08:00 [INFO] Shutdown: Waiting for 1 active request(s) and 0 active transaction(s) to complete.
      2021-11-29T16:52:44.795-08:00 [INFO] Shutdown: Waiting for 1 active request(s) and 0 active transaction(s) to complete.
      2021-11-29T16:52:54.883-08:00 [INFO] Shutdown: Waiting for 1 active request(s) and 0 active transaction(s) to complete. 
      

      Upon querying the system:active_requests, I can find one query in timeout state.

      {
          "active_requests": {
            "clientContextID": "query_thread_4",
            "elapsedTime": "99h48m1.920782733s",
            "executionTime": "99h48m1.920713777s",
            "n1qlFeatCtrl": 76,
            "node": "172.23.121.77:8091",
            "phaseOperators": {
              "authorize": 1,
              "stream": 1
            },
            "phaseTimes": {
              "authorize": "38.83µs",
              "instantiate": "63.99µs",
              "parse": "2.179918ms",
              "plan": "3.920764ms",
              "stream": "3.652µs"
            },
            "remoteAddr": "172.23.104.254:45268",
            "requestId": "a295f87b-e361-4959-aaa8-74ed2b290c06",
            "requestTime": "2021-11-25T13:07:09.949-08:00",
            "scanConsistency": "scan_plus",
            "state": "timeout",
            "statement": "SELECT h.name, h.city, r.author FROM `default`:`bucket6`.`scope_2`.`coll_0` h UNNEST reviews AS r WHERE r.ratings.Rooms < 2 AND h.avg_rating >= 3 ORDER BY r.author DESC LIMIT 100;",
            "statementType": "SELECT",
            "useCBO": true,
            "userAgent": "couchbase-java-client/${project.version} (git: ${git.commit.id.describe}, core: ${git.commit.id.describe}) (Linux/3.10.0-1062.9.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_302-b08)",
            "users": "Administrator"
          }
        }
      

      So this query must be the one that is holding up the graceful shutdown.

      1. Shouldn't a query that has timed out not get listed as an active request.
      2. Graceful shutdown should wait only for requests in active state.

      Query nodes : 172.23.120.58, 172.23.121.77, 172.23.120.81

      The logs attached are from about 2 hrs after the rebalance started. If you want logs from any other time stamp, please let me know.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          People

            mihir.kamdar Mihir Kamdar (Inactive)
            mihir.kamdar Mihir Kamdar (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty