Couchbase Server / MB-48103

[BP 7.0.2 MB-47946] - [Eventing][enforce-tls]: handler stuck in deploying state

Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical
    • Resolution: Fixed
    • Affects Version/s: 7.1.0, 7.0.2
    • Fix Version/s: 7.0.2
    • Component/s: eventing

    Description

      Build - 7.0.2 - 6504

      STEPS TO REPRODUCE

      • Disable auto failover and enable n2n encryption.
      • Create and deploy eventing handler.
      • Load docs into source bucket, mutations are processed.
      • Pause eventing handler.
      • Resume eventing handler.
      • While handler is in deploying state, enforce tls by changing ClusterEncryptionLevel to strict.

      Handler is forever stuck in deploying state.
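
A "stuck in deploying" check like the one observed above can be automated with a bounded poll. The following is a minimal sketch, not part of the original report. It assumes the status document returned by /api/v1/status carries an "apps" list whose entries have "name" and "composite_status" fields, and that a healthy handler eventually reports "deployed". `fetch_status` is a hypothetical callable that performs the GET and returns the parsed JSON.

```python
import time
from typing import Callable, Dict, Optional


def composite_status(status_doc: Dict, handler: str) -> Optional[str]:
    """Return the handler's composite_status from a status document, or None."""
    # "apps" may be missing or null when no functions exist (assumed shape).
    for app in status_doc.get("apps") or []:
        if app.get("name") == handler:
            return app.get("composite_status")
    return None


def wait_for_deployed(
    fetch_status: Callable[[], Dict],
    handler: str,
    timeout_s: float = 300.0,
    poll_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until the handler reports "deployed".

    Returns False if the timeout expires first, which is how a handler
    stuck in the deploying state would show up.
    """
    deadline = clock() + timeout_s
    while True:
        try:
            state = composite_status(fetch_status(), handler)
        except OSError:
            # The REST endpoint may be briefly unreachable while the
            # HTTP server restarts; treat that as "not deployed yet".
            state = None
        if state == "deployed":
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

The clock and sleep functions are injectable so the timeout logic can be exercised without waiting in real time.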

      On 10.112.190.102

      • eventing.log

        2021-08-13T11:25:15.146+00:00 [Info] Cluster Encryption Settings have been changed by ns server.
        2021-08-13T11:25:15.146+00:00 [Info] Updating node-to-node encryption level: {EncryptData:true DisableNonSSLPorts:true}
        2021-08-13T11:25:15.146+00:00 [Info] Attempting to restart HTTP server to listen on loopback interface
        2021-08-13T11:25:15.146+00:00 [Info] Successfully stopped running HTTP server
        2021-08-13T11:25:15.146+00:00 [Info] ServiceMgr::initService Got a signal to stop running HTTP server
        2021-08-13T11:25:15.146+00:00 [Info] serviceChangeNotifier: received EncryptionLevelChangeNotification
        2021-08-13T11:25:15.146+00:00 [Info] ServiceMgr::initService Admin HTTP server started: 127.0.0.1:8096
        2021-08-13T11:25:15.146+00:00 [Error] ServiceMgr::initService metakv observe error for primary store, err: unexpected EOF. Retrying...
        2021-08-13T11:25:15.146+00:00 [Error] ServiceMgr::initService metakv observe error for temp store, err: unexpected EOF. Retrying...
        2021-08-13T11:25:15.146+00:00 [Error] Eventing::main metakv observe error for rebalance token, err: unexpected EOF. Retrying...
        2021-08-13T11:25:15.146+00:00 [Error] ServiceMgr::initService metakv observe error for setting store, err: unexpected EOF. Retrying...
        2021-08-13T11:25:15.146+00:00 [Error] Eventing::main metakv observe error for global config, err: unexpected EOF. Retrying...
        2021-08-13T11:25:15.146+00:00 [Error] Eventing::main metakv observe error for debugger, err: unexpected EOF. Retrying.
        2021-08-13T11:25:15.146+00:00 [Error] Eventing::main metakv observe error for apps retry, err: unexpected EOF. Retrying.
        2021-08-13T11:25:15.146+00:00 [Error] Eventing::main metakv observe error for event handler code, err: unexpected EOF. Retrying...
        2021-08-13T11:25:15.146+0000 [WARN] Store::ObserveChanges : Unable to observe metakv unexpected EOF ..Retrying
        2021-08-13T11:25:15.146+00:00 [Warn] servicesChangeNotifier: Connection terminated for pool notifier instance of http://%40eventing-cbauth@127.0.0.1:8091, default (unexpected EOF). Retrying...
        2021-08-13T11:25:15.146+00:00 [Warn] servicesChangeNotifier: Connection terminated for services notifier instance of http://%40eventing-cbauth@127.0.0.1:8091, default (unexpected EOF). Retrying...
        2021-08-13T11:25:15.262+00:00 [Error] [gocb] memdClient read failure: EOF
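
When triaging a log excerpt like the one above, it can help to group the "metakv observe error" lines by the component that emitted them. The following is a minimal sketch that assumes the line layout seen in this excerpt; the regular expression and the function name are illustrative and are not taken from any Couchbase tool.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, List

# Matches lines such as:
#   ... [Error] ServiceMgr::initService metakv observe error for primary store, err: unexpected EOF. Retrying...
OBSERVE_ERR = re.compile(
    r"\[Error\] (?P<component>\S+) metakv observe error for "
    r"(?P<path>.+?), err: (?P<err>.+?)\. Retrying"
)


def observe_errors_by_component(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Map each logging component to the metakv paths it failed to observe."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for line in lines:
        match = OBSERVE_ERR.search(line)
        if match:
            grouped[match.group("component")].append(match.group("path"))
    return dict(grouped)
```

Lines in other formats, such as the `[WARN] Store::ObserveChanges` entry, are skipped rather than guessed at.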
        

          Activity

            jeelan.poola Jeelan Poola added a comment -

            Ritam Sharma, Chanabasappa Ghali: We need to take this fix into 7.0.2 as it's related to enforce-tls. It's a race condition, but it causes functions to get stuck. Dev has a fix in the works.
            wayne Wayne Siu added a comment -

            Jeelan Poola
            Can you provide an ETA on this? Thanks.


            jeelan.poola Jeelan Poola added a comment -

            Chanabasappa Ghali, Sujay Gad: Would be great to get some early feedback on an official build with this fix. Regression/system/longevity feedback is needed. Volume if possible. Thank you!

            abhishek.jindal Abhishek Jindal added a comment -

            Sujay Gad Change http://review.couchbase.org/c/eventing/+/160594 has been submitted. To test, please do the following:

            • For deployment hung scenario:

            1. Take a handler that is currently undeployed, with encryption level set to "control".
            2. Deploy.
            3. While deployment is running, change encryption level to "strict".
            4. Deployment will take longer than usual, because if a change in encryption level is detected the handler state is cleaned up and the handler is re-deployed.
            5. Push some mutations and verify that bucket ops, timers and N1QL queries are firing in "strict" mode.

            ------

            1. Now change encryption level from "strict" back to "control".
            2. Pause handler. (Handler is in plain-text, non-TLS mode now.)
            3. Resume handler.
            4. While resume is in progress, change encryption level from "control" to "strict".
            5. Resume will take longer than usual, because if a change in encryption level is detected the handler state is cleaned up and the handler is re-deployed.
            6. Push some mutations and verify that bucket ops, timers and N1QL queries are firing in "strict" mode.

            • For rebalance hung scenario:

            1. Change encryption level while rebalance in, rebalance out or failover is happening. Eventing rebalance should fail with an "encryption level changed" message.
            2. Retry the rebalance now and it should succeed.
            3. Pause and resume handlers as described in the scenarios above.
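
The behaviour expected in the deploy and resume scenarios above (clean up and re-deploy when the encryption level changes mid-operation) can be pictured with a toy model. This is only an illustrative sketch under stated assumptions; it is not the code from change 160594, and every name in it (`run_lifecycle_op`, `read_level`, `cleanup`, the step callables) is hypothetical.

```python
from typing import Callable, Sequence


def run_lifecycle_op(
    steps: Sequence[Callable[[str], None]],
    read_level: Callable[[], str],
    cleanup: Callable[[], None],
    max_restarts: int = 3,
) -> int:
    """Run the steps in order, restarting if the encryption level changes.

    Each step receives the encryption level that was in force when the
    current attempt started. If the level read after a step differs, the
    partial state is cleaned up and the whole operation starts over, which
    is why the operation takes longer than usual. Returns the number of
    restarts that were needed.
    """
    restarts = 0
    while True:
        level_at_start = read_level()
        changed = False
        for step in steps:
            step(level_at_start)
            if read_level() != level_at_start:
                changed = True
                break
        if not changed:
            return restarts
        if restarts == max_restarts:
            raise RuntimeError("encryption level kept changing")
        cleanup()
        restarts += 1
```

In this model a single switch from "control" to "strict" during the operation costs one cleanup and one extra pass, after which all steps run under "strict".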

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.2-6637 contains eventing commit da2b841 with commit message:
            MB-48103 : Repair metadata handles, dcp streams, restart goroutines if
            sujay.gad Sujay Gad added a comment -

            Reopening as issue still persists.

            Build - 7.0.2 - 6637
            STEPS TO REPRODUCE

            • Enable n2n encryption.
            • Create and deploy handler.
            • While handler is in deploying stage, update clusterEncryptionLevel to strict.

              2021-09-02T15:52:13.513+05:30 [Info] Updating node-to-node encryption level: {EncryptData:true DisableNonSSLPorts:true}
              

            OBSERVATION
            Handler deployment does not complete. REST calls, e.g. /api/v1/status and /api/v1/functions/<function name>/<lifecycle operation>, fail.

            PF logs attached.
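
For anyone re-checking the failing REST calls, the endpoint URLs can be built as follows. This is a minimal sketch: it assumes the default Eventing ports (8096 for plain HTTP, 18096 for TLS) and the lifecycle operation names deploy, undeploy, pause and resume. The helper names are hypothetical. Per the log in the description, once the level is strict the plain HTTP server listens on 127.0.0.1:8096 only, so a remote client would presumably need the TLS form.

```python
from urllib.parse import quote

# Lifecycle operations assumed to be accepted by the Eventing REST API.
LIFECYCLE_OPS = ("deploy", "undeploy", "pause", "resume")


def eventing_base(host: str, tls: bool) -> str:
    """Base URL of the Eventing admin endpoint (default ports assumed)."""
    return f"https://{host}:18096" if tls else f"http://{host}:8096"


def status_url(host: str, tls: bool) -> str:
    """URL of the status endpoint named in the observation."""
    return f"{eventing_base(host, tls)}/api/v1/status"


def lifecycle_url(host: str, function_name: str, op: str, tls: bool) -> str:
    """URL of /api/v1/functions/<function name>/<lifecycle operation>."""
    if op not in LIFECYCLE_OPS:
        raise ValueError(f"unknown lifecycle operation: {op}")
    name = quote(function_name, safe="")  # escape characters unsafe in a path
    return f"{eventing_base(host, tls)}/api/v1/functions/{name}/{op}"
```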


            abhishek.jindal Abhishek Jindal added a comment -

            An issue was introduced while fixing MB-48268 which is http server to spawn. Reopening MB-48268 too.

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.2-6638 contains eventing commit cfca429 with commit message:
            Revert "MB-48103 : Repair metadata handles, dcp streams, restart goroutines if"

            build-team Couchbase Build Team added a comment -

            Build couchbase-server-7.0.2-6646 contains eventing commit 5d3ed35 with commit message:
            MB-48103 : Repair function during lifecycle operation if
            sujay.gad Sujay Gad added a comment -

            Verified using 7.0.2 - 6646/ toy build.


            People

              sujay.gad Sujay Gad
              jeelan.poola Jeelan Poola