Couchbase Kubernetes / K8S-2069

STIME - rebalance fails on FTS or cbas or index

Details

    Description

      While trying to deploy AO 2.1 on OpenShift, the rebalance operation fails.

      See the attached cbopinfo logs and couchbase-cluster.yaml file.

      Note: there is an unused volumeClaimTemplate inside the couchbase-cluster.yaml file because we thought the problem might come from PV usage. That is not the case: the problem is still present without it.

      Attachments

        1. Capture d’écran 2021-03-11 à 18.36.42.png (942 kB)
        2. cbcollect_info_KO.png (560 kB)
        3. cbopinfo-20210311T170058+0100.tar.gz (105 kB)
        4. couchbase-cluster.yaml (0.9 kB)
        5. logs.tar.gz (34.05 MB)
        6. logs1.tar.gz (2.15 MB)

          Activity

            simon.murray Simon Murray added a comment - For future reference https://access.redhat.com/solutions/5366631
            fabrice.leray Fabrice Leray added a comment - - edited

            This kind of fix may also work (no feedback from the customer yet, but it should be the fix):

              limits:
                nproc:
                  hard: _put-something-big-here_
                  soft: _put-something-big-here_
            

            where put-something-big-here is a value greater than 10000, I would say.


            fabrice.leray Fabrice Leray added a comment -

            Apparently the customer was successful using this procedure:

             https://access.redhat.com/solutions/5305611

            To be checked tomorrow...
            fabrice.leray Fabrice Leray added a comment - - edited

            After the customer applied this procedure on the OCP4 cluster, I can confirm:

            • rebalance is now OK
            • cbcollect_info is now OK.

            However, this workaround may be too broad, because it sets a pidsLimit (65536) at the ContainerRuntimeConfig level.
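
            For reference, a minimal sketch of what such a ContainerRuntimeConfig could look like on OCP4. The object name and the pool label are hypothetical placeholders, not values taken from the customer cluster; the label has to match one set on the target MachineConfigPool.

```yaml
# Sketch only: raises the CRI-O PID limit for every container
# on the nodes of the selected MachineConfigPool.
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: set-pids-limit              # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-crio: high-pid-limit   # hypothetical label, must exist on the MachineConfigPool
  containerRuntimeConfig:
    pidsLimit: 65536                # value reported in this ticket
```

            The setting applies to every container scheduled on the selected pool, not only to the Couchbase pods.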

             

            The other workaround (apparently not used by the customer) would have been to follow this other procedure:

            1. Set pidsLimit inside ContainerRuntimeConfig to -1, so that CRI-O enforces no limit and the limit set by the kubelet is honored.

            2. Set podPidsLimit inside KubeletConfig to 65536.

            But in both cases, the workaround overrides the enforced limit of 1024 for all pods, right?

            So I am not sure this is the ideal scenario...
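
            The two-step alternative could be sketched roughly as follows. Object names and pool labels are hypothetical; only pidsLimit: -1 and podPidsLimit: 65536 come from the procedure described above.

```yaml
# Step 1 (sketch): remove the CRI-O limit so that the kubelet limit is the one honored.
apiVersion: machineconfiguration.openshift.io/v1
kind: ContainerRuntimeConfig
metadata:
  name: unset-crio-pids-limit          # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-crio: no-pid-limit        # hypothetical label, must exist on the MachineConfigPool
  containerRuntimeConfig:
    pidsLimit: -1
---
# Step 2 (sketch): set the per-pod PID limit at the kubelet level.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-pod-pids-limit             # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: high-pid-limit   # hypothetical label, must exist on the MachineConfigPool
  kubeletConfig:
    podPidsLimit: 65536
```

            Both objects select a whole MachineConfigPool, so the limit still changes for every pod on those nodes, which is the concern raised above.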

            simon.murray Simon Murray added a comment -

            Fabrice Leray, can this be closed out now?


            People

              simon.murray Simon Murray
              fabrice.leray Fabrice Leray