Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-49827

[System Test] Query rebalance stuck for over 4 days.

    XMLWordPrintable

Details

    • Bug
    • Status: Closed
    • Blocker
    • Resolution: Fixed
    • Neo
    • Neo
    • query

    Description

      Build : 7.1.0-1773
      Test : -test tests/integration/neo/test_neo_couchstore_milestone3.yml -scope tests/integration/neo/scope_couchstore.yml
      Scale : 3
      Iteration : 3

      Rebalance started at 2021-11-25T13:09:40 to remove query node 172.23.121.77 from the cluster

      [2021-11-25T13:09:40-08:00, sequoiatools/couchbase-cli:7.1:03742a] rebalance -c 172.23.97.74:8091 --server-remove 172.23.121.77 -u Administrator -p password

      It has been over 4 days, but rebalance is not yet complete. On 172.23.121.77, the following can be seen in the logs :

      2021-11-25T13:10:34.403-08:00 [INFO] Graceful shutdown initiated.
      2021-11-25T13:10:34.403-08:00 [INFO] Shutdown: Waiting for 1 active request(s) and 0 active transaction(s) to complete.
      2021-11-25T13:10:34.436-08:00 [INFO] Incoming request from '172.23.104.254:54118' rejected during service shutdown.
      2021-11-25T13:10:34.907-08:00 [Info] serviceChangeNotifier: received PoolChangeNotification
      2021-11-25T13:10:34.913-08:00 [INFO] Incoming request from '172.23.104.254:53400' rejected during service shutdown.
      

      and even later

      2021-11-29T16:52:15.047-08:00 [INFO] Incoming request from '172.23.104.254:55748' rejected during service shutdown.
      2021-11-29T16:52:20.583-08:00 [INFO] Incoming request from '172.23.104.254:56002' rejected during service shutdown.
      2021-11-29T16:52:26.090-08:00 [INFO] Incoming request from '172.23.104.254:56234' rejected during service shutdown.
      2021-11-29T16:52:37.283-08:00 [INFO] Incoming request from '172.23.104.254:56730' rejected during service shutdown.
      

      2021-11-29T16:52:14.627-08:00 [INFO] Shutdown: Waiting for 1 active request(s) and 0 active transaction(s) to complete.
      2021-11-29T16:52:24.680-08:00 [INFO] Shutdown: Waiting for 1 active request(s) and 0 active transaction(s) to complete.
      2021-11-29T16:52:34.735-08:00 [INFO] Shutdown: Waiting for 1 active request(s) and 0 active transaction(s) to complete.
      2021-11-29T16:52:44.795-08:00 [INFO] Shutdown: Waiting for 1 active request(s) and 0 active transaction(s) to complete.
      2021-11-29T16:52:54.883-08:00 [INFO] Shutdown: Waiting for 1 active request(s) and 0 active transaction(s) to complete. 
      

      Upon querying the system:active_requests, I can find one query in timeout state.

      {
          "active_requests": {
            "clientContextID": "query_thread_4",
            "elapsedTime": "99h48m1.920782733s",
            "executionTime": "99h48m1.920713777s",
            "n1qlFeatCtrl": 76,
            "node": "172.23.121.77:8091",
            "phaseOperators": {
              "authorize": 1,
              "stream": 1
            },
            "phaseTimes": {
              "authorize": "38.83µs",
              "instantiate": "63.99µs",
              "parse": "2.179918ms",
              "plan": "3.920764ms",
              "stream": "3.652µs"
            },
            "remoteAddr": "172.23.104.254:45268",
            "requestId": "a295f87b-e361-4959-aaa8-74ed2b290c06",
            "requestTime": "2021-11-25T13:07:09.949-08:00",
            "scanConsistency": "scan_plus",
            "state": "timeout",
            "statement": "SELECT h.name, h.city, r.author FROM `default`:`bucket6`.`scope_2`.`coll_0` h UNNEST reviews AS r WHERE r.ratings.Rooms < 2 AND h.avg_rating >= 3 ORDER BY r.author DESC LIMIT 100;",
            "statementType": "SELECT",
            "useCBO": true,
            "userAgent": "couchbase-java-client/${project.version} (git: ${git.commit.id.describe}, core: ${git.commit.id.describe}) (Linux/3.10.0-1062.9.1.el7.x86_64 amd64; OpenJDK 64-Bit Server VM 1.8.0_302-b08)",
            "users": "Administrator"
          }
        }
      

      So this query must be the one that is holding up the graceful shutdown.

      1. Shouldn't a query that has timed out not get listed as an active request.
      2. Graceful shutdown should wait only for requests in active state.

      Query nodes : 172.23.120.58, 172.23.121.77, 172.23.120.81

      The logs attached are from about 2 hrs after the rebalance started. If you want logs from any other time stamp, please let me know.

      Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

        Activity

          mihir.kamdar Mihir Kamdar added a comment -

          Thank you Donald Haggart for the quick turnaround.

          Re: do you have copies of the logs where you saw the messages ? I'm not seeing any logs for 77
          I think you were looking at the cbcollect for 172.23.120.77, while the query node that was being removed was 172.23.121.77

          Re: So a go-routine dump on 77 during the initial shutdown attempt would be most useful. I don't suppose one was gathered by any chance?
          Yes, we do not have a go-routine dumps separately. Does cbcollect collect them ?

          mihir.kamdar Mihir Kamdar added a comment - Thank you Donald Haggart for the quick turnaround. Re: do you have copies of the logs where you saw the messages ? I'm not seeing any logs for 77 I think you were looking at the cbcollect for 172.23.120.77, while the query node that was being removed was 172.23.121.77 Re: So a go-routine dump on 77 during the initial shutdown attempt would be most useful. I don't suppose one was gathered by any chance? Yes, we do not have a go-routine dumps separately. Does cbcollect collect them ?

          > I think you were looking at the cbcollect for 172.23.120.77, while the query node that was being removed was 172.23.121.77

          Doh!   Indeed... guess I need glasses.

          Still, there is nothing logged at least to suggest a problem with that request; the topology changed to add the node at 13:01:22.420 and the exit commenced at 13:10:32.276 and there is nothing untoward in the intervening period.

          Unfortunately cbcollect doesn't; the closest being the pprof but that's a sample of activity and won't readily highlight something stuck.  I don't see anything to highlight something stuck, nor though do I see anything to highlight current activity - I can see a request is executing, waiting on the necessary channel for results/completion, but not the servicing routines.

          Donald.haggart Donald Haggart added a comment - > I think you were looking at the cbcollect for 172.23.120.77, while the query node that was being removed was 172.23.121.77 Doh!   Indeed... guess I need glasses. Still, there is nothing logged at least to suggest a problem with that request; the topology changed to add the node at 13:01:22.420 and the exit commenced at 13:10:32.276 and there is nothing untoward in the intervening period. Unfortunately cbcollect doesn't; the closest being the pprof but that's a sample of activity and won't readily highlight something stuck.  I don't see anything to highlight something stuck, nor though do I see anything to highlight current activity - I can see a request is executing, waiting on the necessary channel for results/completion, but not the servicing routines.

          Resolving the issue with the noted code changes. 

          If it is possible to reproduce the stray application not getting its shut-down, we should investigate that in a new bug.

          Donald.haggart Donald Haggart added a comment - Resolving the issue with the noted code changes.  If it is possible to reproduce the stray application not getting its shut-down, we should investigate that in a new bug.

          It was in 7.1.0-1809. We had a hiccup with the build database around that time which caused some Jira notifications to be missed; apologies for that.

          ceej Chris Hillery added a comment - It was in 7.1.0-1809. We had a hiccup with the build database around that time which caused some Jira notifications to be missed; apologies for that.

          Closing as this issue wasnt seen after the fix. Last checked on 7.1.0-1997.

          mihir.kamdar Mihir Kamdar added a comment - Closing as this issue wasnt seen after the fix. Last checked on 7.1.0-1997.

          People

            mihir.kamdar Mihir Kamdar
            mihir.kamdar Mihir Kamdar
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes

                PagerDuty