Couchbase Server / MB-61066

[upgrade] Query service crashing continuously on upgrading a node from 7.2.1 to 7.6.0

Details

    Description

      While upgrading the greenboard cluster, the cluster hit the following errors. Steps taken during the upgrade:
      1. Upgrade the query-only nodes (there are 2 in the cluster). This upgrade went fine without any issues. I am upgrading by failing over each node, doing a package upgrade, and then adding the node back to the cluster.
      2. Upgrade node 172.23.104.153, which runs the data, index and query services. The package upgrade went fine, but on adding the node back using full recovery, the rebalance fails. The following logs are seen after the failure.

      Service 'n1ql' exited with status 1. Restarting. Messages:
        2024-03-08T03:13:54.567-08:00 [INFO] FFDC: Found 0 existing dump file(s); 0 bytes.
        2024-03-08T03:13:54.568-08:00 [Info] GSIClient DeploymentModel is set to: default, input: default
        2024-03-08T03:13:54.568-08:00 [INFO] Temporary file path set to: /tmp, quota: 0
        2024-03-08T03:13:54.568-08:00 [INFO] Initialization of cbauth succeeded
        2024-03-08T03:13:55.552-08:00 DEBU REGU.impl.init.0.func1() at aggrecorder.go:57 [id 68] will report aggregate recorder stats every 5m0s
        2024-03-08T03:14:02.251-08:00 [ERROR] Cannot connect to default pool: Namespace not found in CB datastore: default - cause: HTTP error 500 Internal Server Error getting "http://127.0.0.1:8091/pools/default?uuid=4f2dc47e3983e463d1e69fbd638d091b": ["Unexpected server error, request logged."]
        2024-03-08T03:14:02.251-08:00 [ERROR] Namespace not found in CB datastore: default - cause: HTTP error 500 Internal Server Error getting "http://127.0.0.1:8091/pools/default?uuid= ...
      ns_log 000
      ns_1@172.23.104.153
      3:14:02 AM 8 Mar, 2024

      Service 'n1ql' exited with status 1. Restarting. Messages:
        2024-03-08T03:13:46.552-08:00 [INFO] n1fty: onTopologyChange, indexerAvail:false
        2024-03-08T03:13:46.553-08:00 [INFO] n1fty: onTopologyChange, keepHosts:[], removedHosts:[], addedHosts:[172.23.121.84:9130]
        2024-03-08T03:13:46.553-08:00 [INFO] n1fty: updateConnPools done, hosts:[172.23.121.84:9130]
        2024-03-08T03:13:47.534-08:00 DEBU REGU.impl.init.0.func1() at aggrecorder.go:57 [id 54] will report aggregate recorder stats every 5m0s
        2024-03-08T03:13:54.218-08:00 [ERROR] Cannot connect to default pool: Namespace not found in CB datastore: default - cause: HTTP error 500 Internal Server Error getting "http://127.0.0.1:8091/pools/default?uuid=4f2dc47e3983e463d1e69fbd638d091b": ["Unexpected server error, request logged."]
        2024-03-08T03:13:54.218-08:00 [ERROR] Namespace not found in CB datastore: default - cause: HTTP error 500 Internal Server Error getting "http://127.0.0.1:8091/pools/default?uuid=4f2dc47e3983e463d1e69fbd638d091b": ...
      ns_log 000
      ns_1@172.23.104.153
      3:13:54 AM 8 Mar, 2024

      Service 'n1ql' exited with status 1. Restarting. Messages:
        2024-03-08T03:13:38.502-08:00 [INFO] FFDC: Capture path: /opt/couchbase/var/lib/couchbase/logs
        2024-03-08T03:13:38.503-08:00 [INFO] FFDC: Found 0 existing dump file(s); 0 bytes.
        2024-03-08T03:13:38.503-08:00 [INFO] Temporary file path set to: /tmp, quota: 0
        2024-03-08T03:13:38.504-08:00 [INFO] Initialization of cbauth succeeded
        2024-03-08T03:13:39.485-08:00 DEBU REGU.impl.init.0.func1() at aggrecorder.go:57 [id 33] will report aggregate recorder stats every 5m0s
        2024-03-08T03:13:46.184-08:00 [ERROR] Cannot connect to default pool: Namespace not found in CB datastore: default - cause: HTTP error 500 Internal Server Error getting "http://127.0.0.1:8091/pools/default?uuid=4f2dc47e3983e463d1e69fbd638d091b": ["Unexpected server error, request logged."]
        2024-03-08T03:13:46.184-08:00 [ERROR] Namespace not found in CB datastore: default - cause: HTTP error 500 Internal Server Error getting "http://127.0.0.1:8091/pools/default?uuid=4f2 ...
      ns_log 000
      ns_1@172.23.104.153
      3:13:46 AM 8 Mar, 2024

      Service 'n1ql' exited with status 1. Restarting. Messages:
        2024-03-08T03:13:30.463-08:00 [INFO] n1fty: onTopologyChange, keepHosts:[], removedHosts:[], addedHosts:[172.23.121.84:9130]
        2024-03-08T03:13:30.463-08:00 [INFO] n1fty: updateConnPools done, hosts:[172.23.121.84:9130]
        2024/03/08 03:13:30 cfg_metakv: metaKVCallback, path: /fts/cbgt/cfg/version, key: version, deletion: false
        2024-03-08T03:13:30.464-08:00 [INFO] Initialization of cbauth succeeded
        2024-03-08T03:13:31.450-08:00 DEBU REGU.impl.init.0.func1() at aggrecorder.go:57 [id 33] will report aggregate recorder stats every 5m0s
        2024-03-08T03:13:38.151-08:00 [ERROR] Cannot connect to default pool: Namespace not found in CB datastore: default - cause: HTTP error 500 Internal Server Error getting "http://127.0.0.1:8091/pools/default?uuid=4f2dc47e3983e463d1e69fbd638d091b": ["Unexpected server error, request logged."]
        2024-03-08T03:13:38.151-08:00 [ERROR] Namespace not found in CB datastore: default - cause: HTTP error 500 Internal ...
      ns_log 000
      ns_1@172.23.104.153
      3:13:38 AM 8 Mar, 2024
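
      For triage, ns_log delivers each "Messages:" payload as one concatenated string, so it can help to split it into individual entries and keep only the [ERROR] ones. The following is a minimal sketch, not part of any Couchbase tooling: the function names and the shortened SAMPLE excerpt are illustrative assumptions, and entries that use a different timestamp style (such as the "2024/03/08 03:13:30 cfg_metakv" line) are simply left attached to the preceding entry.

```python
import re
from datetime import datetime

# ISO-8601 timestamp with milliseconds and a numeric UTC offset,
# e.g. 2024-03-08T03:14:02.251-08:00, as seen in the entries above.
TIMESTAMP = re.compile(r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{2}:\d{2}")


def split_messages(blob):
    """Split a concatenated 'Messages:' string into (timestamp, text) pairs.

    Text that precedes the first timestamp is ignored. Lines with another
    timestamp style stay attached to the entry before them.
    """
    matches = list(TIMESTAMP.finditer(blob))
    entries = []
    for i, match in enumerate(matches):
        # Each entry runs until the next timestamp (or the end of the string).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(blob)
        text = blob[match.end():end].strip()
        entries.append((datetime.fromisoformat(match.group()), text))
    return entries


def error_entries(blob):
    """Return only the entries whose text starts with the [ERROR] level tag."""
    return [(ts, text) for ts, text in split_messages(blob)
            if text.startswith("[ERROR]")]


# Hypothetical, shortened excerpt of the first entry above (error text cut short).
SAMPLE = (
    "2024-03-08T03:13:54.568-08:00 [INFO] Initialization of cbauth succeeded "
    "2024-03-08T03:14:02.251-08:00 [ERROR] Cannot connect to default pool: "
    "Namespace not found in CB datastore: default"
)

if __name__ == "__main__":
    for ts, text in error_entries(SAMPLE):
        print(ts.isoformat(), text)
```

      Comparing the timestamps of the first [ERROR] entry across successive restarts (03:13:38, 03:13:46, 03:13:54, 03:14:02 in the entries above) shows the service failing roughly every 8 seconds.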
      

      After this, I tried removing the node and adding it back, but that also failed. I then failed over the node again, restarted the couchbase service, and added the node back, but the issue persisted. Currently the cluster is in a bad state and indexes are warming up on the other index nodes too, rendering the cluster unusable for querying.

      Note that the same upgrade path was tested on a similar (albeit smaller) XDCR cluster on RC3 and worked well, so I am marking the bug as a regression. The last known good build is 7.6.0 RC3 (7.6.0-2090).

            People

              dfinlay Dave Finlay
              bharath.gp Bharath G P