MB-44612

Analytics crash with query service terminates client connection


Details

    Description

      System test: there are two questions:

      a) Why did the server shut down at this point in time?
      b) The client connection was broken; of course, if the port for communication was not available, this is expected behaviour.

      Output from the step that runs queries

      2021-02-26 06:31:30,617 - root - INFO - CouchbaseException from Java SDK. com.couchbase.client.core.ServiceNotAvailableException: The Analytics service is not enabled or no node in the cluster supports it.
      Traceback (most recent call last):
        File "/AnalyticsQueryApp/Query/load_queries.py", line 1060, in <module>
          main()
        File "/AnalyticsQueryApp/Query/load_queries.py", line 1009, in main
          queries = load.generate_queries_for_analytics(options.analytics_queries, bucket_list,
        File "/AnalyticsQueryApp/Query/load_queries.py", line 865, in generate_queries_for_analytics
          output = self.execute_statement_on_cbas(statement, pretty=True, client_context_id=None,
        File "/AnalyticsQueryApp/Query/load_queries.py", line 538, in execute_statement_on_cbas
          raise Exception("CouchbaseException")
      Exception: CouchbaseException
      
      

      Logs on node 172.23.120.74

      2021-02-26T00:22:31.355-08:00 ERRO CBAS.nc.NodeControllerService [ShutdownHook-a83c5d9735862de47f511c4b9eb9413d] Some jobs failed to exit, continuing with abnormal shutdown
      2021-02-26T00:22:34.677-08:00 ERRO CBAS.impl.IPCConnectionManager [IPC Network Listener Thread [/0:0:0:0:0:0:0:0:9115]] TCP read error from /0:0:0:0:0:0:0:0:9112
      java.util.concurrent.RejectedExecutionException: Task java.util.concurrent.FutureTask@4a35c8d4[Not completed, task = java.util.concurrent.Executors$RunnableAdapter@7bc65160[Wrapped task = org.apache.hyracks.control.nc.task.HeartbeatAckTask@420d6731]] rejected from org.apache.hyracks.util.MaintainedThreadNameExecutorService@1db3d2be[Shutting down, pool size = 291, active threads = 291, queued tasks = 0, completed tasks = 2806893]
              at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(Unknown Source) ~[?:?]
              at java.util.concurrent.ThreadPoolExecutor.reject(Unknown Source) ~[?:?]
              at java.util.concurrent.ThreadPoolExecutor.execute(Unknown Source) ~[?:?]
              at java.util.concurrent.AbstractExecutorService.submit(Unknown Source) ~[?:?]
              at org.apache.hyracks.control.nc.NodeControllerIPCI.deliverIncomingMessage(NodeControllerIPCI.java:147) ~[hyracks-control-nc.jar:7.0.0-4547]
              at org.apache.hyracks.ipc.impl.IPCSystem.deliverIncomingMessage(IPCSystem.java:102) ~[hyracks-ipc.jar:7.0.0-4547]
              at org.apache.hyracks.ipc.impl.IPCHandle.processIncomingMessages(IPCHandle.java:189) ~[hyracks-ipc.jar:7.0.0-4547]
              at org.apache.hyracks.ipc.impl.IPCConnectionManager$NetworkThread.read(IPCConnectionManager.java:440) [hyracks-ipc.jar:7.0.0-4547]
              at org.apache.hyracks.ipc.impl.IPCConnectionManager$NetworkThread.processSelectedKeys(IPCConnectionManager.java:261) [hyracks-ipc.jar:7.0.0-4547]
              at org.apache.hyracks.ipc.impl.IPCConnectionManager$NetworkThread.doRun(IPCConnectionManager.java:231) [hyracks-ipc.jar:7.0.0-4547]
              at org.apache.hyracks.ipc.impl.IPCConnectionManager$NetworkThread.run(IPCConnectionManager.java:213) [hyracks-ipc.jar:7.0.0-4547]
      2021-02-26T00:22:40.101-08:00 ERRO CBAS.lifecycle.LifeCycleComponentManager [ShutdownHook-a83c5d9735862de47f511c4b9eb9413d] Stopping instance
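The RejectedExecutionException in the log above is the standard behaviour of a JVM executor service that is already shutting down when a late task (here, the HeartbeatAckTask) is submitted. The same pattern can be shown with Python's concurrent.futures as an analogy (this is an illustration of the executor behaviour, not the Hyracks code):

```python
from concurrent.futures import ThreadPoolExecutor

pool = ThreadPoolExecutor(max_workers=2)
pool.shutdown(wait=True)  # the node-controller pool is shutting down

try:
    # A late task arrives after shutdown, like the heartbeat ack above.
    pool.submit(print, "heartbeat ack")
except RuntimeError as exc:
    # Python's analogue of Java's RejectedExecutionException.
    print("rejected:", exc)
```

So the exception itself is a symptom of the shutdown already being in progress, not its cause.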
      

      • It is not clear why the nodes had to shut down.


          Activity

            steve.watanabe Steve Watanabe added a comment (edited)

            Here are the steps I use to reproduce this on my macbook.

            • Start two nodes (./cluster_run -n 2 --dont-rename)
            • From the UI configure the first node with all services
            • set up node-to-node encryption
              • ./couchbase-cli setting-autofailover -c localhost:9000 -u Administrator -p asdasd --enable-auto-failover 0
              • ./couchbase-cli node-to-node-encryption -c localhost:9000 -u Administrator -p asdasd --enable
              • ./couchbase-cli setting-security -c localhost:9000 -u Administrator -p asdasd --set --cluster-encryption-level all
              • ./couchbase-cli setting-autofailover -c localhost:9000 -u Administrator -p asdasd --enable-auto-failover 1 --auto-failover-timeout 120 --enable-failover-of-server-groups 1 --max-failovers 2 --can-abort-rebalance 1
              • ./couchbase-cli node-to-node-encryption -c localhost:9000 -u Administrator -p asdasd --get
            • From the UI add the second node (with all services) - no rebalance is done in the reproduction
            • Look at the cbas logs on the first node...why did the cluster encryption go from true to false to true during the addition of the second node?

            $ grep "cluster encrypt" analytics_info.log
            2021-03-04T14:29:39.072-08:00 INFO CBAS.cbas initialized cluster encryption with data encryption as false
            2021-03-04T14:29:39.084-08:00 WARN CBAS.cbas cluster encryption has been refreshed by ns server
            2021-03-04T14:29:45.922-08:00 WARN CBAS.cbas cluster encryption has been refreshed by ns server
            2021-03-04T14:29:45.922-08:00 WARN CBAS.cbas cluster encryption has changed to true
            2021-03-04T14:30:13.745-08:00 WARN CBAS.cbas cluster encryption has been refreshed by ns server
            2021-03-04T14:30:13.745-08:00 WARN CBAS.cbas cluster encryption has changed to false
            2021-03-04T14:30:18.809-08:00 WARN CBAS.cbas cluster encryption has been refreshed by ns server
            2021-03-04T14:30:18.809-08:00 WARN CBAS.cbas cluster encryption has changed to true
            

             

            Abhijeeth.Nuthan Abhijeeth Nuthan added a comment (edited)

            Based on Steve's repro steps, I was able to reproduce this, and by adding some sleeps and extending the engage_cluster timeout, I was able to gather the info below.

            It turns out we have a situation where nodes_wanted is updated on the node (n_0) that is adding the new node (n_3), but the n_3 node config, which includes the node_encryption flag, has not yet been synced to node n_0.

            (n_0@127.0.0.1)8> ns_config:search_node('n_3@127.0.0.1', ns_config:latest(), node_encryption). 
            false
            (n_0@127.0.0.1)9> ns_node_disco:nodes_wanted().
            ['n_0@127.0.0.1','n_1@127.0.0.1','n_2@127.0.0.1']
             
            (n_0@127.0.0.1)12> ns_config:search_node('n_3@127.0.0.1', ns_config:latest(), node_encryption).
            false
            (n_0@127.0.0.1)13> ns_node_disco:nodes_wanted().
            ['n_0@127.0.0.1','n_1@127.0.0.1','n_2@127.0.0.1','n_3@127.0.0.1']
             
            After a brief moment, once the config sync from n_3 has propagated to n_0:
            (n_0@127.0.0.1)1> ns_config:search_node('n_3@127.0.0.1', ns_config:latest(), node_encryption). 
            {value,true}
            (n_0@127.0.0.1)2> ns_node_disco:nodes_wanted().
            ['n_0@127.0.0.1','n_1@127.0.0.1','n_2@127.0.0.1','n_3@127.0.0.1'] 

            Because of this delay, the computation of misc:is_cluster_encryption_fully_enabled() goes from true -> false (until the n_3 config is synced to n_0) -> true.
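The race described above can be modeled with a minimal sketch (Python; the function name mirrors the Erlang call, and the defaulting behaviour is an assumption based on ns_config:search_node returning false for n_3 in the transcript above — this is an illustration of the race, not the ns_server implementation):

```python
# Model: cluster encryption is "fully enabled" only if every node in
# nodes_wanted has node_encryption set in the locally synced config.
def is_cluster_encryption_fully_enabled(nodes_wanted, node_config):
    # Nodes whose config has not synced yet default to False,
    # mirroring ns_config:search_node returning false for n_3.
    return all(node_config.get(node, False) for node in nodes_wanted)

node_config = {"n_0": True, "n_1": True, "n_2": True}
nodes_wanted = ["n_0", "n_1", "n_2"]
assert is_cluster_encryption_fully_enabled(nodes_wanted, node_config)   # true

nodes_wanted.append("n_3")  # nodes_wanted updated before n_3's config arrives
assert not is_cluster_encryption_fully_enabled(nodes_wanted, node_config)  # flips to false

node_config["n_3"] = True   # config sync from n_3 finally propagates
assert is_cluster_encryption_fully_enabled(nodes_wanted, node_config)   # back to true
```

The three assertions reproduce the true -> false -> true flicker seen in the analytics_info.log grep above.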


            meni.hillel Meni Hillel (Inactive) added a comment

            Added to 6.6.2 as per severity.

            build-team Couchbase Build Team added a comment

            Build couchbase-server-6.6.2-9575 contains ns_server commit a080b6a with commit message:
            MB-44612 Do not derive encryption level from node encryption

            build-team Couchbase Build Team added a comment

            Build couchbase-server-7.0.0-4741 contains ns_server commit a080b6a with commit message:
            MB-44612 Do not derive encryption level from node encryption
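The commit message "Do not derive encryption level from node encryption" suggests the fix keeps the encryption level as an explicit cluster-wide setting rather than recomputing it from per-node flags. A hypothetical sketch of that idea (all names and the data model here are illustrative, not the actual ns_server code):

```python
# Hypothetical contrast between the two approaches. With a lagging
# per-node config sync, the derived value flickers while the explicit
# cluster-wide setting stays stable.
def encryption_level(cluster_config, nodes_wanted, node_config):
    # Derived from per-node flags (pre-fix, illustrative): a newly added
    # node whose config has not synced yet flips the result to "none".
    derived = "all" if all(node_config.get(n, False) for n in nodes_wanted) else "none"
    # Explicit cluster-wide setting (post-fix, illustrative).
    explicit = cluster_config.get("cluster_encryption_level", "none")
    return derived, explicit

derived, explicit = encryption_level(
    {"cluster_encryption_level": "all"},
    ["n_0", "n_1", "n_2", "n_3"],
    {"n_0": True, "n_1": True, "n_2": True},  # n_3's config not yet synced
)
assert derived == "none"   # the flicker the bug report describes
assert explicit == "all"   # stable under the explicit-setting approach
```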

            People

              ritam.sharma Ritam Sharma