Couchbase Server - MB-49147

User queries slow due to usage of dets - 6.6.4


Details

    Triaged: Yes

    Description

      As seen in a case in the field, a customer needs to create and delete a large number of users in short succession. The customer is using a secret-management solution (such as HashiCorp) in Kubernetes, where containers are created and destroyed in quick succession, and for each container an ephemeral user is created and destroyed as well.

      Current user CRUD API latency increases as the number of users increases, due to the nature of the replication implementation (replicated_dets).
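
      To make the symptom concrete, below is a minimal sketch (in Python) of how this latency growth can be observed. It drives the documented RBAC REST endpoint (PUT /settings/rbac/users/local/<name>); the node address, credentials, password, and role are placeholder assumptions:

      import time
      import requests

      NODE = "http://127.0.0.1:8091"        # assumption: admin port on a local node
      AUTH = ("Administrator", "password")  # assumption: placeholder admin credentials

      def create_user(name):
          # PUT /settings/rbac/users/local/<name> creates (or updates) a local RBAC user.
          resp = requests.put(
              f"{NODE}/settings/rbac/users/local/{name}",
              auth=AUTH,
              data={"password": "Passw0rd!", "roles": "ro_admin"},
          )
          resp.raise_for_status()

      # Time each creation; with replicated_dets behind this endpoint, the
      # per-call latency tends to climb as the total user count grows.
      for i in range(6000):
          start = time.perf_counter()
          create_user(f"ephemeral-{i}")
          elapsed = time.perf_counter() - start
          if i % 1000 == 0:
              print(f"user #{i}: create took {elapsed:.3f} s")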



          Activity

            Bryan McCoid added a comment -

            Great work, Sumedh Basarkod... results are pretty interesting already. Check out this .dets file size! Keep the test going; we want to hit 2 GB if we can. Just wanted to give a little update from the box for the ticket, but my real worry here is still this growing file size, and not so much any of the other stuff.

            ***********************************
            835M
            real    0m0.083s
            user    0m0.006s
            sys     0m0.010s
             
            real    0m0.016s
            user    0m0.005s
            sys     0m0.004s
            ***********************************
            838M
            real    0m0.054s
            user    0m0.002s
            sys     0m0.006s
             
            real    0m0.020s
            user    0m0.005s
            sys     0m0.006s
            ***********************************
            841M
            real    0m0.060s
            user    0m0.002s
            sys     0m0.006s
             
            real    0m0.016s
            user    0m0.003s
            sys     0m0.005s
            ***********************************

            I think whether this will work for customers using HashiCorp depends on whether they are using dynamic users and whether the rate at which they create them is high enough.


            Sumedh Basarkod added a comment -

            Bryan McCoid Would you mind taking a look at the cluster now? It seems one of the nodes went down and was auto-failed-over, possibly because of high RAM usage on the nodes. And I see:

            Service 'ns_server' exited with status 1. Restarting. Messages:
            working as port
            75953: Booted. Waiting for shutdown request
            working as port
            eheap_alloc: Cannot allocate 1318267840 bytes of memory (of type "old_heap").
            Crash dump is being written to: erl_crash.dump.1641436795.75853.ns_server.

            And it seems the file size was:

            ***********************************
            985M
             
             
            real	0m0.134s
            user	0m0.003s
            sys	0m0.006s
             
             
            real	0m0.077s
            user	0m0.004s
            sys	0m0.016s
            ***********************************


            Bryan McCoid added a comment -

            Sumedh Basarkod - Yeah, it looks like one node crashed and was failed over. This is a graph of the ets table [attached image not shown]; it clearly just grows and grows. The boxes these tests are on must be too small to reach the .dets file maximum.

            This looks like the most obvious reason the process crashed. 

            It has seemed (from my eyeballing it) that latency wasn't massively affected by the growing ets table until we obviously ran out of memory, and maybe that can cause a bit of a problem. Is this your assessment of the numbers regarding latency? That was the primary issue on this ticket, and I think we are probably done figuring that part out, so we can probably close this completely. There are other issues, but we are well aware of them and they won't be solved in the scope of this ticket (there are structural changes that will have to happen). What are your thoughts, Sumedh? Do you agree with that assessment? If so, feel free to close this issue; if not, please present any information you have in that area. I also want to thank you for helping look into this ancillary file-size/memory-usage stuff! It's what I thought would happen theoretically, and it seems to be borne out in the results.

            Thanks!
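
            For reference, a minimal sketch (in Python, under stated assumptions) of how the in-memory ets footprint could be sampled during such a run. It uses the admin-only /diag/eval diagnostic endpoint to evaluate an Erlang expression on the node; the table name users_storage is an assumption about ns_server internals, and the node address and credentials are placeholders:

            import requests

            NODE = "http://127.0.0.1:8091"        # assumption: node address
            AUTH = ("Administrator", "password")  # assumption: admin credentials

            # ets:info(Tab, memory) returns the table size in machine words;
            # 'users_storage' is an assumed name for the replicated user store table.
            expr = "ets:info(users_storage, memory)."
            resp = requests.post(f"{NODE}/diag/eval", auth=AUTH, data=expr)
            print("ets memory (words):", resp.text)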

            Bryan McCoid added a comment -

            Just for future reference, here are the logs from the test system after one node crashed:

            s3://cb-engineering/bryanmccoid/collectinfo-2022-01-11T184342-ns_1@172.23.105.215.zip
            s3://cb-engineering/bryanmccoid/collectinfo-2022-01-11T184342-ns_1@172.23.105.219.zip
            s3://cb-engineering/bryanmccoid/collectinfo-2022-01-11T184342-ns_1@172.23.106.237.zip

            Sumedh Basarkod added a comment -

            Dave Finlay: The sleep time is 0.09 s, i.e.:

            1. Create 6K users in 6 equal iterations and measure the latency (initial step)
            2. Loop the following cycle indefinitely (printing every one-thousandth create time, delete time, and file size; a sketch of this cycle follows at the end of this comment):
                Create user
                Sleep for 0.09 seconds
                Delete user

            Also, please note that when the .dets file size was around ~280M, I aborted that job and continued the experiment with the bash script that I mentioned in my previous comments. The bash script runs the same second step as above, except that there is no 0.09 s sleep (and stats get printed every 10K cycles instead of every 1K cycles). I have attached "nohup.out", which is the output from the bash script. The node crashed when the file size was ~985M.
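
            For clarity, here is a minimal sketch (in Python) of step 2 above, the create/sleep/delete cycle with per-thousandth-cycle reporting. The node address, credentials, role, and the .dets file path are placeholder assumptions:

            import itertools
            import os
            import time
            import requests

            NODE = "http://127.0.0.1:8091"        # assumption: node address
            AUTH = ("Administrator", "password")  # assumption: admin credentials
            DETS_PATH = "/opt/couchbase/var/lib/couchbase/config/users.dets"  # assumption

            def put_user(name):
                requests.put(f"{NODE}/settings/rbac/users/local/{name}", auth=AUTH,
                             data={"password": "Passw0rd!", "roles": "ro_admin"}).raise_for_status()

            def delete_user(name):
                requests.delete(f"{NODE}/settings/rbac/users/local/{name}", auth=AUTH).raise_for_status()

            for i in itertools.count():
                t0 = time.perf_counter()
                put_user("churn-user")
                create_t = time.perf_counter() - t0

                time.sleep(0.09)  # the 0.09 s sleep between create and delete

                t0 = time.perf_counter()
                delete_user("churn-user")
                delete_t = time.perf_counter() - t0

                if i % 1000 == 0:  # report every one-thousandth cycle
                    size_mb = os.path.getsize(DETS_PATH) / 2**20
                    print(f"cycle {i}: create {create_t:.3f}s  delete {delete_t:.3f}s  dets {size_mb:.0f}M")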


            People

              Sumedh Basarkod (Inactive)
              Meni Hillel (Inactive)
              Votes: 0
              Watchers: 11


