  1. Couchbase Server
  2. MB-54699

BP 6.6.6 - Repair LCB handles on AUTH error

Details

    Description

      Build: 7.0.0-5238

      Scenario:

      Rebalancing an Eventing node out of a cluster with multiple services enabled.

      (Operation Id = b8f8038679f6dd16dc26c2e7eb755ba3)

      +----------------+-------------+-----------------------+---------------+--------------+
      | Nodes          | Services    | Version               | CPU           | Status       |
      +----------------+-------------+-----------------------+---------------+--------------+
      | 172.23.107.142 | eventing    | 7.0.0-5238-enterprise | 15.3928202393 | Cluster node |
      | 172.23.106.116 | backup      | 7.0.0-5238-enterprise | 6.95321744638 | Cluster node |
      | 172.23.107.127 | cbas        | 7.0.0-5238-enterprise | 2.52666666667 | Cluster node |
      | 172.23.107.129 | kv          | 7.0.0-5238-enterprise | 39.6083333333 | Cluster node |
      | 172.23.107.126 | cbas        | 7.0.0-5238-enterprise | 7.94460276986 | Cluster node |
      | 172.23.104.247 | kv          | 7.0.0-5238-enterprise | 47.6291271521 | Cluster node |
      | 172.23.105.137 | kv          | 7.0.0-5238-enterprise | 49.554159236  | Cluster node |
      | 172.23.105.1   | index, n1ql | 7.0.0-5238-enterprise | 17.5264594289 | Cluster node |
      | 172.23.105.183 | eventing    | 7.0.0-5238-enterprise | 42.136226522  | --- OUT ---> |
      | 172.23.107.131 | index, n1ql | 7.0.0-5238-enterprise | 8.54319094682 | Cluster node |
      +----------------+-------------+-----------------------+---------------+--------------+

      Observation:

      Eventing rebalance is stuck at around 79% and has not progressed for 2.5 hours.

      Failures and timeouts are also seen in the deployed Eventing function "a3_users_search".

      Note: Possible regression due to MB-46543

          Activity

            Steps to reproduce:

            1. Set up a cluster with 2 KV nodes (DataNode-A, DataNode-B) and 1 Eventing node (EvtNode-C).
            2. Create 3 buckets: source, metadata, and destination.
            3. Create a function that listens to the source bucket (src), keeps its metadata in the metadata bucket (meta), and binds the destination bucket as "dst".
            4. Set the OnUpdate code to:

            function OnUpdate(meta, doc) {
                dst[meta.id] = doc
            }
            

            5. Deploy the function.
            6. Push 5-10 documents to the source bucket. The OnUpdate handler creates the same number of documents in the destination bucket.
            7. Rebalance out both KV nodes (DataNode-A, DataNode-B) and rebalance in DataNode-D. Make sure no operations are pushed to the source bucket while this topology change is going on.
            8. Once the rebalance-in of DataNode-D is complete, push another 5-10 documents to the source bucket.
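
The expected end state after step 8 can be illustrated with a small local simulation. This is only a sketch that runs in plain Node.js, not in Eventing itself: the "dst" bucket binding is modeled as a plain object (an assumption), the document IDs and bodies are made up, and simulateMutations and missingInDestination are hypothetical helper names introduced here.

```javascript
// Local stand-in for the destination bucket binding (assumption: a plain object).
const dst = {};

// Handler from step 4, unchanged in behavior.
function OnUpdate(meta, doc) {
    dst[meta.id] = doc;
}

// Hypothetical helper: feed a batch of source documents through the handler.
function simulateMutations(docs) {
    for (const [id, body] of Object.entries(docs)) {
        OnUpdate({ id }, body);
    }
}

// Hypothetical helper: list source IDs that never reached the destination.
function missingInDestination(docs) {
    return Object.keys(docs).filter((id) => !(id in dst));
}

// Made-up sample documents standing in for the pushes in steps 6 and 8.
const firstBatch = { "user::1": { name: "a" }, "user::2": { name: "b" } };
const secondBatch = { "user::3": { name: "c" }, "user::4": { name: "d" } };

simulateMutations(firstBatch);
simulateMutations(secondBatch);

// When the handler runs for every mutation, no ID from either batch is missing.
console.log(missingInDestination({ ...firstBatch, ...secondBatch }));
```

On a real cluster the same check would amount to comparing the document keys in the source and destination buckets after step 8; without the fix, the keys from the second batch would be the ones missing.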

            Observation without the fix:

            The Eventing function fails continuously with LCB_AUTH_ERR while processing the mutations from step 8. The libcouchbase instance holds a stale cluster map and still assumes that DataNode-A and DataNode-B are part of the cluster.

            Observation with the fix:

            No error is observed. On the first LCB_AUTH_ERR, Eventing detects the error and repairs the connections using the latest cluster map.
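
The repair-on-auth-error behavior can be sketched roughly as follows. This is an illustrative JavaScript model only, not the Eventing or libcouchbase code: RepairingClient, createHandle, the topologyVersion counter, and the string error code are all hypothetical, and the single retry after the rebuild is an assumption about how "no error observed" is achieved.

```javascript
// Hypothetical error code string mirroring the LCB_AUTH_ERR named in this ticket.
const AUTH_ERR = "LCB_AUTH_ERR";

// Hypothetical wrapper that owns a connection handle and can rebuild it.
class RepairingClient {
    // createHandle: caller-supplied factory returning an object with set(key, value).
    constructor(createHandle) {
        this.createHandle = createHandle;
        this.handle = createHandle();
        this.repairs = 0;
    }

    set(key, value) {
        try {
            return this.handle.set(key, value);
        } catch (err) {
            if (err.code !== AUTH_ERR) {
                throw err; // unrelated errors are not masked
            }
            // Rebuild the handle so it sees the current topology, then retry once.
            this.handle = this.createHandle();
            this.repairs += 1;
            return this.handle.set(key, value);
        }
    }
}

// Toy cluster state: a handle snapshots the topology at creation time.
let topologyVersion = 1;
const store = {};

function createHandle() {
    const seenVersion = topologyVersion; // stand-in for the cached cluster map
    return {
        set(key, value) {
            if (seenVersion !== topologyVersion) {
                const err = new Error("stale cluster map");
                err.code = AUTH_ERR;
                throw err;
            }
            store[key] = value;
            return true;
        },
    };
}

const client = new RepairingClient(createHandle);
client.set("doc1", { n: 1 }); // succeeds on the original topology
topologyVersion = 2;          // KV nodes swapped out, as in step 7
client.set("doc2", { n: 2 }); // first attempt fails, handle is rebuilt, retry succeeds
console.log(client.repairs, Object.keys(store));
```

In this model, a client without the catch branch would keep throwing the auth error on every write after the topology change, which corresponds to the "without the fix" observation.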

            (Comment by Abhishek Jindal, abhishek.jindal)

            Build couchbase-server-6.6.6-10557 contains eventing commit a7d408f with commit message:
            MB-54699 : Recreate lcb_Instance upon AUTH error during bucket op

            (Comment by Couchbase Build Team, build-team)
            Reproduced the issue on 6.6.6-10556 and verified the fix on 6.6.6-10566 using the steps mentioned above.

            (Comment by Sujay Gad, sujay.gad; edited)

            People

              sujay.gad Sujay Gad
              abhishek.jindal Abhishek Jindal