Couchbase Server / MB-42274

Rebalance failure with error {badmatch,failed}


Details

    • Bug
    • Status: Closed
    • Major
    • Resolution: Incomplete
    • 6.6.1
    • None
    • qe
    • Couchbase server build 6.6.1-9143
    • Untriaged
    • Centos 64-bit
    • 1
    • Unknown

    Description

      Error while rebalancing nodes in cluster.


        Activity

          owend Daniel Owen added a comment -

          Hi Umang, please could you provide the steps to reproduce?
          Looks like /opt/couchbase/var/lib/couchbase/config/memcached-cert.pem is missing.

          In ns_server.error.log we see

          [ns_server:error,2020-10-25T13:28:11.270Z,ns_1@cb.local:service_status_keeper_worker<0.504.0>:rest_utils:get_json:59]Request to (indexer) getIndexStatus failed: {error,
                                                       {econnrefused,
          

          In ns_server.indexer.log we see

          [Fatal] Fail to due generate SSL certificate: open /opt/couchbase/var/lib/couchbase/config/memcached-cert.pem: no such file or directory
          

          We see a similar error in ns_server.fts.log

          2020-10-25T13:28:19.151+00:00 [FATA] init_grpc: LoadX509KeyPair, err: open /opt/couchbase/var/lib/couchbase/config/memcached-cert.pem: no such file or directory -- main.getGrpcOpts() at init_grpc.go:190
          

          and in ns_server.query.log

          _time=2020-10-25T13:28:09.193+00:00 _level=INFO _msg=ERROR: Starting TLS listener - open /opt/couchbase/var/lib/couchbase/config/memcached-cert.pem: no such file or directory 
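
The same "no such file or directory" text shows up in the indexer, FTS and query logs, so one pass over all of them can confirm which paths are being reported missing. The following is a minimal sketch, assuming the error always has the form `open <path>: no such file or directory` as in the excerpts above; the function name `missing_files` is hypothetical and not part of any Couchbase tooling.

```python
import re
from collections import Counter

# Matches the error text seen in the indexer, FTS and query log excerpts.
# Assumption: the path contains no whitespace.
MISSING_FILE = re.compile(r"open (\S+): no such file or directory")


def missing_files(lines):
    """Count how often each path is reported missing across log lines."""
    counts = Counter()
    for line in lines:
        for path in MISSING_FILE.findall(line):
            counts[path] += 1
    return counts
```

Passing an open log file (or several chained together) as `lines` yields a per-path count, which makes it easy to see whether only memcached-cert.pem is affected or other files as well.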
          

          dfinlay Dave Finlay added a comment - - edited

          There has been an epidemic of this kind of thing on 6.6.1 tests recently.

          E.g.
          MB-41865 - missing memcached-cert.pem
          MB-41783 - missing memcached.json file
          MB-41892 - missing encrypted_data_keys file

          I took a closer look at this issue. On startup, the certs should be generated and all of the certificate and private key files created. We see the server start up:

          [ns_server:info,2020-10-25T13:26:44.337Z,nonode@nohost:<0.116.0>:ns_server:init_logging:151]Started & configured logging
          ...
          

          Cluster CA gets generated:

          [ns_server:debug,2020-10-25T13:26:44.756Z,ns_1@cb.local:ns_ssl_services_setup<0.213.0>:ns_server_cert:generate_cert_and_pkey:84]Generated certificate and private key in 328137 us
          [ns_server:debug,2020-10-25T13:26:44.757Z,ns_1@cb.local:ns_config_log<0.199.0>:ns_config_log:log_common:229]config change:
          cert_and_pkey ->
          [{'_vclock',[{<<"a2df954ab79b443d1c880d0742a68202">>,{1,63770851604}}]}|
           {<<"-----BEGIN CERTIFICATE-----\nMIIDAjCCAeqgAwIBAgIIFkE/LPsu2zIwDQYJKoZIhvcNAQELBQAwJDEiMCAGA1UE\nAxMZQ291Y2hiYXNlIFNlcnZlciA3MWRkNWY2ZjAeFw0xMzAxMDEwMDAwMDBaFw00\nOTEyMzEyMzU5NTlaMCQxIjAgBgNVBAMTGUNvdWNoYmFzZSBTZXJ2ZXIgNzFkZDVm\nNmYwggEiMA0GCSqGSIb3DQEBAQUAA4IBDwAwggEKAoIBAQDLpi/pvvLCIDttSlFT\nwtmtPI0bsl7XNi5ADiTzuQxoXkG8KuuoZ0chqrWOpgd+EiU+LRLAsMBtqql7pvTw\nw4/RPhkIy1jFgYug6iAYTLnoCOfaGWj"...>>,
            <<"*****">>}]
          

          We check for a node cert, but it's not there, so we generate it and save it to disk.

          [ns_server:info,2020-10-25T13:26:44.757Z,ns_1@cb.local:ns_ssl_services_setup<0.213.0>:ns_ssl_services_setup:maybe_generate_local_cert:636]Failed to read node certificate. Perhaps it wasn't created yet. Error: {error, {badmatch,    {error, enoent}}}
          ...
          [ns_server:info,2020-10-25T13:26:45.019Z,ns_1@cb.local:ns_ssl_services_setup<0.213.0>:ns_ssl_services_setup:do_generate_local_cert:624]Saved local cert for node 'ns_1@cb.local'
          

          By this point there should be a collection of certificate and private key files on disk, including memcached-cert.pem. However, less than a minute and a half later, indexing starts to complain that no cert file is present. FTS follows shortly thereafter:

          [user:info,2020-10-25T13:28:08.896Z,ns_1@cb.local:<0.2387.0>:indexing:unknown:-1]Fail to due generate SSL certificate: open /opt/couchbase/var/lib/couchbase/config/memcached-cert.pem: no such file or directory
          ...
          [user:info,2020-10-25T13:28:08.999Z,ns_1@cb.local:<0.2463.0>:indexing:unknown:-1]Fail to due generate SSL certificate: open /opt/couchbase/var/lib/couchbase/config/memcached-cert.pem: no such file or directory
          ...
          2020-10-25T13:28:09.006+00:00 [FATA] init_grpc: LoadX509KeyPair, err: open /opt/couchbase/var/lib/couchbase/config/memcached-cert.pem: no such file or directory -- main.getGrpcOpts() at init_grpc.go:190
          ...
          

          When the cbcollect runs, here's what we see in the couchbase/config directory:

          /opt/couchbase/var/lib/couchbase/config:
          total 44
          68633084 drwxr-xr-x 2 couchbase couchbase    44 Oct 25 13:30 .
          35282234 drwxr-xr-x 8 couchbase couchbase  4096 Oct 25 13:31 ..
          68633085 -rw-rw---- 1 couchbase couchbase 36044 Oct 25 13:30 config.dat
          68633086 -rw-rw---- 1 couchbase couchbase  1645 Oct 25 13:28 memcached.rbac
          

          Here's what we should see:

          20971712 -rw-rw---- 1 couchbase couchbase     322 Oct  9 09:31 audit.json
          20971720 -rw-rw---- 1 couchbase couchbase 1284044 Oct  9 13:46 config.dat
          20971812 -rw-rw---- 1 couchbase couchbase     138 Oct  1 18:03 dist_cfg
          20971716 -rw-rw---- 1 couchbase couchbase      63 Oct  1 18:02 encrypted_data_keys
          20971811 -rw-rw---- 1 couchbase couchbase    1139 Oct  1 18:08 local-ssl-cert.pem
          20971717 -rw-rw---- 1 couchbase couchbase      39 Oct  1 18:08 local-ssl-meta
          20971807 -rw-rw---- 1 couchbase couchbase    1675 Oct  1 18:08 local-ssl-pkey.pem
          20973106 -rw-rw---- 1 couchbase couchbase    2242 Oct  9 09:31 memcached-cert.pem
          20971798 -rw-rw---- 1 couchbase couchbase    1675 Oct  9 09:31 memcached-key.pem
          20971719 -rw-rw---- 1 couchbase couchbase    2122 Oct  9 09:31 memcached.json
          20973107 -rw-rw---- 1 couchbase couchbase    2633 Oct  9 09:31 memcached.rbac
          20971710 -rw-rw---- 1 couchbase couchbase    2814 Oct  9 09:31 ssl-cert-key.pem
          20971711 -rw-rw---- 1 couchbase couchbase    1103 Oct  9 09:31 ssl-cert-key.pem-ca
          20971808 -rw-rw---- 1 couchbase couchbase   22906 Oct  2 10:07 users.dets
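
The comparison between the two listings can be automated. This is a minimal sketch, assuming the file names in the healthy listing above are the ones to expect; whether every one of them is mandatory on every node is an assumption, and `missing_config_files` is a hypothetical helper name.

```python
from pathlib import Path

# File names taken from the healthy listing above. Treating all of them as
# required on every node is an assumption made for illustration.
EXPECTED_FILES = {
    "audit.json", "config.dat", "dist_cfg", "encrypted_data_keys",
    "local-ssl-cert.pem", "local-ssl-meta", "local-ssl-pkey.pem",
    "memcached-cert.pem", "memcached-key.pem", "memcached.json",
    "memcached.rbac", "ssl-cert-key.pem", "ssl-cert-key.pem-ca", "users.dets",
}


def missing_config_files(config_dir, expected=EXPECTED_FILES):
    """Return the sorted expected file names that are absent from config_dir."""
    present = {p.name for p in Path(config_dir).iterdir() if p.is_file()}
    return sorted(set(expected) - present)
```

Run against /opt/couchbase/var/lib/couchbase/config on a suspect node, this would list every expected file that is not on disk, rather than only the one a particular service happened to complain about.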
          

          So, you can see there are real problems with this deployment and there's no point doing any investigation on the rebalance failure.

          My guess is that rogue processes are hanging around, holding on to file or directory handles and compromising our ability to write files to disk. However, I don't see anything suspicious in the process list. If this happens again, it would be good to get lsof output for the config directory and see whether older processes that are not part of the current server deployment are holding on to files. If it's not that, then there are serious problems with the disks on these machines: they report that files are saved correctly, and then the files disappear into a black hole.
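
Where lsof is not installed on a test machine, a rough equivalent can be read from /proc. The following is a Linux-only sketch, not a replacement for lsof: it assumes it is run with enough privilege to read other processes' fd directories, and the name `holders_of` and the `proc_root` parameter are hypothetical, added so the function can be exercised against a fake directory tree.

```python
import os

DELETED_SUFFIX = " (deleted)"  # Linux appends this to links for unlinked files


def holders_of(config_dir, proc_root="/proc"):
    """Map PID -> sorted open paths under config_dir, read from <proc_root>/<pid>/fd."""
    target = os.path.realpath(config_dir)
    result = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        fd_dir = os.path.join(proc_root, entry, "fd")
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # process exited, or we lack permission
        paths = set()
        for fd in fds:
            try:
                link = os.readlink(os.path.join(fd_dir, fd))
            except OSError:
                continue
            clean = link[:-len(DELETED_SUFFIX)] if link.endswith(DELETED_SUFFIX) else link
            if clean == target or clean.startswith(target + os.sep):
                paths.add(link)
        if paths:
            result[int(entry)] = sorted(paths)
    return result
```

Any PID in the result that does not belong to the current server deployment, and in particular any path ending in " (deleted)", would be worth attaching to the ticket.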

          But we need to get to the bottom of this as it's a pretty big waste of time for everyone involved.

          CC: Ritam Sharma, Raju Suravarjjala

          wayne Wayne Siu added a comment -

          Ritam Sharma, Umang

          I'm marking the ticket incomplete. Please (re)open it if we can reproduce the issue in a proper setup. Thanks.

          umang.agrawal Umang added a comment -

          Closing as not reproducible any more.


          People

            umang.agrawal Umang
            umang.agrawal Umang