Couchbase Server / MB-45893

Windows : Unable to add a new 7.0 node to a mixed mode cluster - pre_rebalance_janitor_run_failed


Details

    Description

      Steps to repro:
      1. Set up a 2-node cluster running 6.6.2-9588 with kv+n1ql+index+fts services on both nodes
      2. Install the sample buckets and create FTS indexes
      3. Swap rebalance node2 with a spare node running 7.0.0-5016
      4. The cluster will now be in mixed mode
      5. Add a new 7.0 node to the cluster and rebalance (a couchbase-cli sketch of steps 3 and 5 follows below)
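
      Steps 3 and 5 map roughly to couchbase-cli as in the sketch below (a hedged sketch only; node addresses, credentials and the service list are placeholders, not the exact values used in this run):

      # Step 3: swap rebalance - add the spare 7.0.0-5016 node, then rebalance node2 out
      couchbase-cli server-add -c <cluster-node>:8091 -u Administrator -p password \
        --server-add <spare-7.0-node>:8091 --server-add-username Administrator \
        --server-add-password password --services data,index,query,fts
      couchbase-cli rebalance -c <cluster-node>:8091 -u Administrator -p password \
        --server-remove <node2>:8091

      # Step 5: add another 7.0 node and rebalance it in
      couchbase-cli server-add -c <cluster-node>:8091 -u Administrator -p password \
        --server-add <new-7.0-node>:8091 --server-add-username Administrator \
        --server-add-password password --services data,index,query,fts
      couchbase-cli rebalance -c <cluster-node>:8091 -u Administrator -p password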

      The rebalance fails with the error:

      Rebalance exited with reason {pre_rebalance_janitor_run_failed,
      "travel-sample",
      {error,wait_for_memcached_failed,
      ['ns_1@172.23.107.142']}}.
      Rebalance Operation Id = fda046d63e6f8a8b8014af2288bb470a
      

      6. Retried the rebalance. That too failed, as ns_server on the existing 7.0 node 172.23.107.142 had crashed with the following error:

      Service 'ns_server' exited with status 1. Restarting. Messages:
      working as port
      4092: Booted. Waiting for shutdown request
      working as port
      eheap_alloc: Cannot allocate 1573200 bytes of memory (of type "heap").
       
      Crash dump is being written to: erl_crash.dump...done
      [os_mon] win32 supervisor port (win32sysinfo): Erlang has closed
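
      The crash dump named above can be inspected directly for the out-of-memory reason and the emulator's memory summary (a sketch; erl_crash.dump is written under the Couchbase Server installation's log directory on Windows, and the exact path may vary):

      # Print the crash reason ("Slogan") recorded at the top of the Erlang crash dump
      grep -m1 "^Slogan:" erl_crash.dump

      # Show the "=memory" summary section (total, processes, binary, ets, ...);
      # the section is only a handful of lines, so a fixed-size window is enough
      grep -A 12 "^=memory" erl_crash.dump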
      

      Attachments

        1. screenshot-1.png (32 kB)
        2. screenshot-2.png (37 kB)
        3. screenshot-3.png (31 kB)
        4. screenshot-4.png (48 kB)


          Activity

            bharath.gp Bharath G P added a comment -

            I saw the same error when trying to do an upgrade of a cluster with all the services. Attaching the logs from the cluster for further help. I was upgrading from 6.6.2 to the 7.0.0-5017 build. Saw the following error when doing a swap rebalance of a data node.

             Rebalance exited with reason {pre_rebalance_janitor_run_failed,"test2",
             {error,wait_for_memcached_failed,
             ['ns_1@172.23.107.126']}}.
             Rebalance Operation Id = 8b15f8583462c83750f575400df9c22

             

            dfinlay Dave Finlay added a comment - edited

            The proximate cause of the rebalance failing is that ns_server isn't able to get a reply from memcached in time. I think the root cause is that XDCR memory usage has spiked, and this, together with its heavy use of the ns_server REST APIs, has pushed the system into swap, causing the slowness.

            Prior to the rebalance running, it's already the case that janitor calls from ns_server are timing out and the bucket is classed as "not ready". Here we can see that the vbucket stats details call is taking more than 7 seconds.

            [ns_server:info,2021-04-23T15:55:20.650-07:00,ns_1@172.23.107.142:<0.13271.34>:ns_janitor:cleanup_with_membase_bucket_vbucket_map:126]Bucket "travel-sample" not yet ready on ['ns_1@172.23.107.142']
            [ns_server:debug,2021-04-23T15:55:23.386-07:00,ns_1@172.23.107.142:<0.14201.13>:ns_memcached:verify_report_long_call:343]call {get_vbucket_details_stats,all,["state"]} took too long: 7745000 us
            

            The rebalance starts:

            [user:info,2021-04-23T15:56:02.411-07:00,ns_1@172.23.107.142:<0.15047.8>:ns_orchestrator:idle:773]Starting rebalance, KeepNodes = ['ns_1@172.23.107.126','ns_1@172.23.107.129',
                                             'ns_1@172.23.107.142'], EjectNodes = [], Failed over and being ejected nodes = []; no delta recovery nodes; Operation Id = fda046d63e6f8a8b8014af2288bb470a
            

            And immediately runs into the same issue that KV is not responding quickly enough to ns_server. The janitor timeout is 10s when run as part of rebalance:

            [ns_server:info,2021-04-23T15:56:13.288-07:00,ns_1@172.23.107.142:<0.16897.34>:ns_janitor:cleanup_with_membase_bucket_vbucket_map:126]Bucket "travel-sample" not yet ready on ['ns_1@172.23.107.142']
            

            There are a bunch of slow stats calls logged in memcached at this time:

            $ grep -i "T15:5[567].*slow operation" memcached.log | grep travel-sample | grep "peer.*142"
            2021-04-23T15:55:04.963712-07:00 WARNING 1628: Slow operation. {"cid":"172.23.107.142:54752/e03de","duration":"9281 ms","trace":"request=253235101470000:9280711 execute=253235101470000:9280711","command":"STAT","peer":"{"ip":"172.23.107.142","port":54752}","bucket":"travel-sample","packet":{"bodylen":15,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>vbucket-details</ud>","keylen":15,"magic":"ClientRequest","opaque":3724742144,"opcode":"STAT","vbucket":0}}
            2021-04-23T15:55:45.263788-07:00 WARNING 1628: Slow operation. {"cid":"172.23.107.142:54752/e03de","duration":"12 s","trace":"request=253272956075056:11729227 execute=253272956075056:11729227","command":"STAT","peer":"{"ip":"172.23.107.142","port":54752}","bucket":"travel-sample","packet":{"bodylen":15,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>vbucket-details</ud>","keylen":15,"magic":"ClientRequest","opaque":3724742144,"opcode":"STAT","vbucket":0}}
            2021-04-23T15:56:28.180883-07:00 WARNING 1628: Slow operation. {"cid":"172.23.107.142:54752/e03de","duration":"9615 ms","trace":"request=253317986687808:9615250 execute=253317986687808:9615250","command":"STAT","peer":"{"ip":"172.23.107.142","port":54752}","bucket":"travel-sample","packet":{"bodylen":15,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>vbucket-details</ud>","keylen":15,"magic":"ClientRequest","opaque":3724742144,"opcode":"STAT","vbucket":0}}
            2021-04-23T15:57:36.859451-07:00 WARNING 1628: Slow operation. {"cid":"172.23.107.142:54752/e03de","duration":"9903 ms","trace":"request=253386379486928:9902962 execute=253386379486928:9902962","command":"STAT","peer":"{"ip":"172.23.107.142","port":54752}","bucket":"travel-sample","packet":{"bodylen":15,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>vbucket-details</ud>","keylen":15,"magic":"ClientRequest","opaque":3724742144,"opcode":"STAT","vbucket":0}}
            2021-04-23T15:57:59.360788-07:00 WARNING 1628: Slow operation. {"cid":"172.23.107.142:54752/e03de","duration":"6481 ms","trace":"request=253412302868096:6481450 execute=253412302868096:6481450","command":"STAT","peer":"{"ip":"172.23.107.142","port":54752}","bucket":"travel-sample","packet":{"bodylen":15,"cas":0,"datatype":"raw","extlen":0,"key":"<ud>vbucket-details</ud>","keylen":15,"magic":"ClientRequest","opaque":3724742144,"opcode":"STAT","vbucket":0}}
            

            However, none of these calls seems to be the call from the rebalance. In ns_server we see:

            $ grep "T15:5[567].*get_vbucket_details_stats,all" ns_server.debug.log 
            [ns_server:debug,2021-04-23T15:55:04.997-07:00,ns_1@172.23.107.142:<0.30187.15>:ns_memcached:verify_report_long_call:343]call {get_vbucket_details_stats,all,["state","topology"]} took too long: 9297000 us
            [ns_server:debug,2021-04-23T15:55:23.386-07:00,ns_1@172.23.107.142:<0.14201.13>:ns_memcached:verify_report_long_call:343]call {get_vbucket_details_stats,all,["state"]} took too long: 7745000 us
            [ns_server:debug,2021-04-23T15:56:18.408-07:00,ns_1@172.23.107.142:<0.14201.13>:ns_memcached:verify_report_long_call:343]call {get_vbucket_details_stats,all,["state"]} took too long: 22029000 us
            

            It's hard to correlate these calls, but certainly it seems that memcached is slow in replying and it may be that ns_server processing is also impacted.
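
            For a rough sense of how widespread the memcached slowness is, the reported durations can be tallied from the same memcached.log (a sketch using the "Slow operation" log format shown above):

            $ grep "Slow operation" memcached.log | grep -o '"duration":"[^"]*"' | sort | uniq -c | sort -rn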

            What's noticeable is that system memory use is close to the limit and that ns_server has trouble responding to Prometheus around the time of the rebalance failure; the attached screenshots show that XDCR is using most of the memory and that swap usage increases across this time period.

            In addition, there's a spike in the number of REST requests at this time, increasing the load on ns_server:

            $ cat ns_server.http_access_internal.log| grep -E -o "2./.../2021:15:5[567]:." | uniq -c
             390 23/Apr/2021:15:55:0
             388 23/Apr/2021:15:55:1
             315 23/Apr/2021:15:55:2
             356 23/Apr/2021:15:55:3
             609 23/Apr/2021:15:55:4
             511 23/Apr/2021:15:55:5
             405 23/Apr/2021:15:56:0
             487 23/Apr/2021:15:56:1
             333 23/Apr/2021:15:56:2
             307 23/Apr/2021:15:56:3
             188 23/Apr/2021:15:56:4
             292 23/Apr/2021:15:56:5
             427 23/Apr/2021:15:57:0
             221 23/Apr/2021:15:57:1
             340 23/Apr/2021:15:57:2
             324 23/Apr/2021:15:57:3
             348 23/Apr/2021:15:57:4
             292 23/Apr/2021:15:57:5
            

            Note the increase to ~500 requests per 10-second interval around the time of the rebalance failure. These come predominantly from XDCR, indexing and cbq-engine:

            $ cat ns_server.http_access_internal.log | egrep -E "@goxdcr|@cbq|@index" | grep -E -o "2./.../2021:15:5[567]:." | uniq -c
             332 23/Apr/2021:15:55:0
             338 23/Apr/2021:15:55:1
             280 23/Apr/2021:15:55:2
             291 23/Apr/2021:15:55:3
             528 23/Apr/2021:15:55:4
             433 23/Apr/2021:15:55:5
             322 23/Apr/2021:15:56:0
             398 23/Apr/2021:15:56:1
             278 23/Apr/2021:15:56:2
             274 23/Apr/2021:15:56:3
             149 23/Apr/2021:15:56:4
             252 23/Apr/2021:15:56:5
             349 23/Apr/2021:15:57:0
             182 23/Apr/2021:15:57:1
             279 23/Apr/2021:15:57:2
             288 23/Apr/2021:15:57:3
             292 23/Apr/2021:15:57:4
             244 23/Apr/2021:15:57:5
            

            In particular, XDCR issues 84 pools/nodes requests at exactly 15:56:00. Combined with the overall high memory usage and the fact that the system is in swap, I think this is sufficient to slow the system enough to cause the rebalance to fail.

            $ cat ns_server.http_access_internal.log | egrep -E 15:56:00 | grep @goxdcr | wc -l
                  84
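
            To break down which endpoints goxdcr is hitting, the quoted request line can be pulled apart (a sketch that assumes ns_server.http_access_internal.log uses the usual common-log-style quoted request field, e.g. "GET /pools/nodes HTTP/1.1"):

            $ grep "@goxdcr" ns_server.http_access_internal.log | awk -F'"' '{split($2, r, " "); print r[2]}' | sort | uniq -c | sort -rn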
            

            XDCR should look into its excessive use of the REST APIs and its high memory usage.

            neil.huang Neil Huang added a comment -

            The error that leads to the pipeline restarts is due to MB-45892 - where KV wouldn't allow XDCR to establish DCP connections with HELO collections. This leads to multiple pipeline restarts. Each restart causes two things:
            1. Memory to leak, as Dave has pointed out
            and
            2. Inefficiencies in how XDCR uses the pools/nodes endpoint to surface as well.

            As the error list builds up, the faulty code is in {{populateReplInfos}}, where it attempts to retrieve {{XDCRCompTopologyService().MyHost()}}, and each attempt ends up calling pools/nodes as part of getHostInfo.

            The XDCR Topology Svc has had its share of inefficiencies (e.g. MB-44823), and this is another one that has been around for a while.

            The two issues above need to be looked at and fixed for CC.
            However, the original blocking issue should have been addressed by the fix for MB-45892, which should allow the pipelines to run in mixed mode and prevent them from restarting. I think if you try 7.0.0-5038, it can at the very least unblock the upgrade and get the cluster going without all these pipeline restarts, which turn out to be problematic.

            I will open up a different MB marked critical to address the memory leak and also address the pools/default inefficiencies (MB-44823).

            Please try out 7.0.0-5038 for the mixed-mode upgrade and see if this unblocks you. Thanks.
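
            Once the nodes have been upgraded, one way to confirm that every node reports the expected build (a sketch; credentials and the node address are placeholders, jq is assumed to be available, and the hostname/version fields are as returned by the pools/nodes REST endpoint):

            $ curl -s -u Administrator:password http://<any-node>:8091/pools/nodes | jq -r '.nodes[] | "\(.hostname) \(.version)"'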


            mihir.kamdar Mihir Kamdar (Inactive) added a comment -

            Issue not seen on 7.0.0-5038. The fixes for XDCR are verified.
