Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Affects Version: 3.0.1
- Sprints: SDK48: FTS Score/Incl, Docs.; SDK51: FLEpt1, Txns Test Compl; SDK2: FLEpt2, TxnsTest; SDK4: 3.1GA, Ruby3.1, TxnTst, FLE
Description
Comparing SDK3 to SDK2 kv get/set latency, we see that across all latency percentiles, SDK3 is ~20-30% higher than SDK2. Here are the two tests for comparison:
SDK 2.5.0 - 24 workers:
http://perf.jenkins.couchbase.com/job/ares/17051/
SDK 3 - 24 workers:
http://perf.jenkins.couchbase.com/job/ares/17049/
In both tests, kv ops are against default scope/collection.
SDK3 is this patch: git+http://review.couchbase.org/couchbase-python-client@refs/changes/91/135991/3
Here is a comparison graph:
The only difference in these tests, other than the SDK version, is the number of curr_connections to the bucket and ESTABLISHED connections to the master node 172.23.133.13. Perhaps these "extra" connections are causing the slowdown, as there are roughly 30% more of them than with SDK2, which corresponds to the increase in latency.
Workaround
Remove any externally installed libcouchbase releases from the system. When run with the built-in libcouchbase, the higher latency is not observed.
Attachments
- profile003.gif (185 kB)
- sdk2.kv.reads.repeated.perf.data (31.04 MB)
- sdk2.perf.data (34.09 MB)
- sdk3.perf.data (32.77 MB)
- sdk3withlcb293.data (425 kB)
- sdk3withlcb306.data (150 kB)
For Gerrit Dashboard: PYCBC-1037

| # | Subject | Branch | Project | Status | CR | V |
|---|---|---|---|---|---|---|
| 166865,6 | Stop installing SDKs if no version specified | master | perfrunner | MERGED | +2 | +1 |
Activity
Matt Ingenthron the test is run from the perfrunner master branch, with no modifications. The Python SDK 3 is installed from the Gerrit commit: git+http://review.couchbase.org/couchbase-python-client@refs/changes/91/135991/3.
You can probably see this issue with a simple script that loads some documents and then does gets on the doc keys, taking the time right before the get call and right after to calculate the latency (a rough sketch follows this comment). Nothing fancy, but you probably need a perf machine to see the ~0.5 ms latency increase.
I'll be running some additional experiments and grabbing profiles to try to narrow in on where the extra time in SDK 3 is being spent.
If there's anything specific you need from my side, please let me know.
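For reference, a minimal sketch of that kind of repro script, assuming the SDK 3 API (connection string, credentials, bucket name, and document size are placeholders; the PasswordAuthenticator import path shown is the 3.0.x location, and later releases move it to couchbase.auth):

```python
# Minimal latency probe: load some documents, then time individual gets.
import time
from couchbase.cluster import Cluster, ClusterOptions
from couchbase_core.cluster import PasswordAuthenticator  # couchbase.auth in later 3.x releases

cluster = Cluster("couchbase://127.0.0.1",
                  ClusterOptions(PasswordAuthenticator("Administrator", "password")))
collection = cluster.bucket("bucket-1").default_collection()

keys = ["doc-%d" % i for i in range(1_000)]
for key in keys:
    collection.upsert(key, {"field": "x" * 1024})   # ~1 KB docs, roughly matching the perf test

latencies = []
for key in keys * 100:                               # repeated gets over the same key set
    start = time.perf_counter()
    collection.get(key)
    latencies.append((time.perf_counter() - start) * 1000.0)  # milliseconds

latencies.sort()
for p in (50, 90, 99, 99.9):
    print("p%s: %.3f ms" % (p, latencies[int(len(latencies) * p / 100) - 1]))
```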
Did some profiling and found some interesting things. Two tests were run, one with PySDK 2.5.0 and one with PySDK patch set 3.0.5.dev3+g5a04cfd, where a single worker does gets on 100 keys in an infinite loop. The servers were loaded with 20M docs and compacted before the test. I have uploaded the profiles as:
sdk2.kv.reads.repeated.perf.data
sdk3.kv.reads.repeated.3.perf.data
you can view them with these commands:
perf report -i sdk2.kv.reads.repeated.perf.data
perf report -i sdk3.kv.reads.repeated.3.perf.data
perf diff sdk3.kv.reads.repeated.3.perf.data sdk2.kv.reads.repeated.perf.data
SDK3 is spending 88%+ of its time in _PyLong_FromNbInt and PyUnicode_DecodeUTF8Stateful, whereas SDK2 spent 97%+ of its time in epoll_wait. It looks like Python SDK3 is consuming more CPU cycles, mostly decoding UTF-8 strings. Possibly related to this file: https://github.com/couchbase/couchbase-python-client/blob/master/src/pycbc.h
And this issue might be relevant: https://stackoverflow.com/questions/36098984/python-3-3-c-api-and-utf-8-strings
Excellent sleuthing Korrigan Clark! I'll definitely have a look at this in more detail. Just so you know, at the moment that probably means later next week.
Not at the moment Kamini Jagtiani. The primary engineer for this is OOO and we have a few other higher priority issues.
If you want to assist, you might experiment with logging. David Kelly did check, and IIRC there is no direct usage of PyUnicode_DecodeUTF8Stateful; the thought was that it might come in indirectly through logging. Lowering the logging level (DEBUG probably isn't needed) may avoid this issue.
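A minimal sketch of that kind of logging change; LCB_LOGLEVEL is the libcouchbase environment variable used in the test run further down this thread, while the Python logger name is an assumption:

```python
import logging
import os

# Keep libcouchbase console logging to fatal errors only. Set this before any
# Cluster/Bucket objects are created so the underlying LCB picks it up.
os.environ["LCB_LOGLEVEL"] = "0"

# If LCB logs are also routed through Python logging, quiet that path as well
# (logger name assumed here; adjust to whatever the application configures).
logging.getLogger("couchbase").setLevel(logging.WARNING)
```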
Thanks Matt Ingenthron.
Can we try the tests with a lower logging level, as suggested by Matt?
Thanks
Yes, I will give it a try shortly, Kamini Jagtiani Matt Ingenthron.
Matt Ingenthron Kamini Jagtiani I ran the tests with LCB_LOGLEVEL set to 0, which should log only in cases of fatal errors. The results show no difference. Here is a graph comparing sdk3 (default log level) to sdk3 (log level 0): http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=ares_700-3090_access_ae94&snapshot=ares_700-3090_access_80cf
Korrigan Clark
Can we try using REST API and see if you still see the performance difference?
The reason I say this: when I was trying out transactions, I noticed that
- When I moved from buckets to collections, some of my joins took over 8 minutes instead of a few seconds.
- It goes back to seconds if I specifically set "scan_consistency" to "not_bounded". If it is not specifically set, it assumes "request_plus" and takes much longer.
Raised MB-42655
That's expected for scan consistency if there is load on the system. Perfrunner does not have REST query support right now and I currently do not have time to add it; however, it shouldn't be too hard to add. The performance difference here is only visible for very low latency ops. If the query is long running, the impact from the SDK issue will be negligible. The SDK actually calls the REST API underneath. The reason we are seeing SDK slowness is the decoding of the returned results within the SDK before returning; with REST we would skip the decode phase, so latency will look better. I don't think it's of much value to run the REST API for this scenario, but it would be good to have support in the future. I'll discuss with Wayne Siu.
Korrigan Clark I can repro it stand-alone - with no load on the system.
Take a look at MB-42655 for details
I'm looking at this a bit now. My assumption is, if 88% of the time is spent in _PyLong_FromNbInt and PyUnicode_DecodeUTF8Stateful vs. epoll, I should be able to see that right away using typical profiling tools like yep, which uses google_perftools under the hood. So locally I have a script that uses yep to profile a loop of 200K get calls against 20K documents, figuring I'd see the same hot spots. But I don't: running on my Mac and using pprof to output a call graph with times, I'm not seeing the decode calls take much of the time.
FWIW, I don't see how 88% of the time could be consumed by decoding strings while latency only changes by a small percentage, since the network latency should be the same. Perhaps I'm completely missing something here. I'll do more today, have questions, etc.
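For anyone trying to reproduce this kind of profile, a minimal sketch of the yep-based loop described above (connection details, bucket name, and key counts are placeholders; yep writes a google_perftools profile that pprof can render as a call graph):

```python
# Profile a tight loop of KV gets with yep (a google_perftools-based sampler).
import yep
from couchbase.cluster import Cluster, ClusterOptions
from couchbase_core.cluster import PasswordAuthenticator  # couchbase.auth in later 3.x releases

cluster = Cluster("couchbase://127.0.0.1",
                  ClusterOptions(PasswordAuthenticator("Administrator", "password")))
collection = cluster.bucket("bucket-1").default_collection()

keys = ["doc-%d" % i for i in range(20_000)]
for key in keys:                            # seed the documents once
    collection.upsert(key, {"v": 1})

yep.start("sdk3.gets.prof")                 # begin sampling to this file
for i in range(200_000):
    collection.get(keys[i % len(keys)])
yep.stop()                                  # flush the profile

# Render a call graph with, for example:
#   google-pprof --pdf "$(which python)" sdk3.gets.prof > callgraph.pdf
```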
David Kelly I think what your info plus my info could be implying is that OSX and Linux cause the SDK to behave differently with regard to Python unicode decoding. Could you try on a Linux box and see if there is a difference?
Korrigan Clark I agree – I'll fire up a linux box and check it out, and get some data on that Monday. Also - could be the python version - LMK what version you were using and I'll match that. When I installed the performance tools, brew updated me to python3.9 (which I didn't realize until later), and that could be something.
Korrigan Clark - My ubuntu image came with python 3.6 (3.6.9), so I used it for this first try. The call graph looks roughly the same as above. I'm curious about 3.5.6, so I will install that on my image later and try again.
Korrigan Clark Sorry – completely forgot to update this. The profiling I've done shows pretty much exactly the same as the above, even on the same python version, etc... It would be good for me to understand how you got your profile info. Then at least, I should be able to see the same thing. That seems like a good place to start.
Can you let me know what tools you use to profile? I'd like to use the same tools, as I can't see why we'd see such different things from the profiler.
Kamini Jagtiani, since you've observed this too, can you provide some details on where you've seen it? Do you have a profile to share?
Korrigan Clark - I used 3.5.10 I believe, on ubuntu18. Sadly, that virtualbox instance won't come up for me to check that (not 3.5.6, but close), but in any case, I can't really reproduce the same sort of profile - spending the vast majority of the time decoding strings. I think it is worth pairing up – I'll ping you on slack, and perhaps we can figure out some time to get together and look at this.
Korrigan Clark and David Kelly Are you able to connect with each other? What are the next steps here?
Was working with David Kelly and found some potential issues with the SDK3 dependency on LCB. David is taking a deeper look now and I am running some experiments. We should have some more info in the morning to update with.
Update: it appears the performance for python sdk 3 is dependent on which preinstalled version of LCB is present prior to installing python sdk 3. We ran two tests, one where LCB 2.9.3 was installed prior to pysdk 3.0.6 and one where LCB 3.0.2 was installed prior to pysdk 3.0.6. Here are the graphs: http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=ares_700-4078_access_d7de&snapshot=ares_700-4078_access_f0c2
You will see that latency is higher for LCB 3.0.2 + pysdk3 3.0.6 (orange). That is also the only difference visible in the comparison graphs.
This is odd because it was my understanding that the pysdk3 install should build its own isolated LCB and not use any preinstalled LCB.
This may also be due to the install of libcouchbase3-libevent (or maybe even libcouchbase3-tools). I'm also going to look at what is actually loaded at runtime - I sort of expect an issue with the eventing backend conflicting somehow, but we will see.
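One quick way to check what is actually loaded at runtime is sketched below; the extension module name used here (couchbase_core._libcouchbase) is an assumption and differs between SDK releases, and ldd only lists dynamic dependencies, so a statically built-in LCB would show no libcouchbase.so line:

```python
# Sketch: see whether the installed Python SDK links against a system
# libcouchbase or carries its own build. Module name is release-dependent.
import subprocess
import couchbase_core._libcouchbase as lcb_ext

print("extension:", lcb_ext.__file__)
# On Linux, ldd lists the shared libraries the extension resolves at load time.
print(subprocess.check_output(["ldd", lcb_ext.__file__]).decode())
# A line like "libcouchbase.so.3 => /usr/lib/..." points at a preinstalled LCB;
# no libcouchbase entry suggests the SDK's bundled/static build is in use.
```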
Yes, David Kelly is looking into the underlying issue, which we believe is caused by the SDK loading incorrect libraries when libcouchbase is preinstalled before the SDK. We have a workaround, which I will implement early next week. In earlier 3.0.x releases, LCB needed to be installed before the SDK, and we built some scripts to take care of this before each perf test. This has changed, and LCB no longer needs to be preinstalled in the most recent releases. The client install scripts need to be updated to remove LCB before any test using Python SDK 3. Once I have updated the scripts we should be able to get clean perf runs. I will update again when the scripts have been updated and we have new numbers, early next week.
Korrigan Clark Thanks. Let me know once you have the new numbers
Raju Suravarjjala David Kelly I ran kv and n1ql tests with the sdk 3.0.8 and no lcb preinstalled. Here are the results with green being old runs and orange being new runs:
| Metric | Old | New |
|---|---|---|
| 90th percentile query latency (ms), Q1, Key-Value Lookup, 10K queries/sec, default collection | 1.2 | 1.2 |
| 90th percentile query latency (ms), Q1, Key-Value Lookup, 10K queries/sec, s=1 c=1 | 1.2 | 1.2 |
| 90th percentile query latency (ms), Q1, Key-Value Lookup, 10K queries/sec, s=1 c=1000 | 1.2 | 1.2 |
| Avg. query throughput (queries/sec), Q1, Key-Value Lookup, s=1 c=1 | 127,774 | 129,814 |
| Avg. query throughput (queries/sec), Q1, Key-Value Lookup, default collection | 127,943 | 130,978 |
| 99.9th percentile GET/SET latency (ms), 4 nodes, 1 bucket x 20M x 1KB, 10K ops/sec, default scope/collection | Get 0.85 / Set 0.98 | Get 1.0 / Set 1.12 |
| 99.9th percentile GET/SET latency (ms), 4 nodes, 1 bucket x 20M x 1KB, 10K ops/sec, s=1 c=1 | Get 0.96 / Set 1.18 | Get 0.91 / Set 1.12 |
There are some things to note:
1 - N1ql latency remains about the same as before: default, 1 collection, and 1000 collections all have the same latency. The main difference in the n1ql test is the number of connections in the timewait state, which is significantly lower with the new runs. N1ql also sees a mild bump in throughput, and the ops and CPU utilization are smoother and more consistent throughout the tests.
2 - KV default-collection latency seems to have increased, both get and set. KV 1-collection latency seems to have decreased, both get and set. We do not see any difference in the number of connections in the timewait state in the kv tests.
Comparing SDK2 to SDK3 default collection:
There is still a difference (~30%) between sdk2 and sdk3 default collection in the latency tests. However, the n1ql throughput numbers for sdk2 and sdk3 are now roughly the same. This could possibly be the effect of the new libevent IO in the server.
Thanks Korrigan Clark Did you publish all the numbers into showfast? Any other bugs/issues you are tracking?
Raju Suravarjjala yes, these results should be reflected on showfast now for the latest build, and going forward the n1ql and kv latency tests will all be clean with the 3.0.8 SDK version. I am in the process of doing some retuning for these tests, as the performance has changed slightly, so there may be some quirks for a test or two. I am in the process of analyzing the rest of the tests and will open a bug if I see any issues. The latency increase is still present, but there appears to be increased throughput as well, so this issue may come down to tradeoffs.
Nice work Korrigan Clark and David Kelly.
One thing I would note from the results you linked to: through the 99th percentile, the results are exactly the same. For gets/sets, at least as I read it, above the 99th percentile it is not consistently one test that has lower latency across sets and gets. It's just a guess, but multiple runs might show things flipping in these upper percentiles; looking at the scatterplot seems to confirm that. If you were to update that chart to plot only the points beyond 1σ, we could probably see whether it's noise or an actual difference. Then again, we'd expect higher latency as throughput increases, and the fact that we don't see that seems to mean we've made things better.
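Purely as an illustration of that suggestion, a small sketch of filtering a latency sample set down to the points beyond one standard deviation (the sample data here is made up and only stands in for the per-operation GET latencies behind the scatterplot):

```python
# Keep only latency samples more than one standard deviation above the mean,
# to judge whether upper-percentile differences are noise or systematic.
import numpy as np

def tail_beyond_one_sigma(latencies_ms):
    samples = np.asarray(latencies_ms, dtype=float)
    return samples[samples > samples.mean() + samples.std()]

# Made-up example data (milliseconds).
rng = np.random.default_rng(0)
sdk2 = rng.lognormal(mean=-0.30, sigma=0.25, size=100_000)
sdk3 = rng.lognormal(mean=-0.25, sigma=0.25, size=100_000)

for name, data in (("sdk2", sdk2), ("sdk3", sdk3)):
    tail = tail_beyond_one_sigma(data)
    print(f"{name}: {tail.size} samples beyond 1 sigma, p99.9 = {np.percentile(tail, 99.9):.3f} ms")
```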
From our side, it looks like this bug is resolved. If you have anything else in mind, let me know and we can keep it open; otherwise I think we should close it. Will leave it up to David.
Thanks Korrigan Clark. I had a look at the params in use and was looking into how to repro/extract a test case.
Question: is this from the master branch of perfrunner, or is it one with some modifications?