Details
-
Bug
-
Status: Closed
-
Major
-
Resolution: Fixed
-
6.5.0
-
Ares (KV) cluster
-
Untriaged
-
Centos 64-bit
-
Yes
Description
Spdlog migration mistake caused us to allocate a buffer far bigger than we need (2GB).
Original Description
====================================================================
We have few memory-oriented tests like memory fragmentation, max .memcached RSS, memory overhead. Most of them show degradation in Mad-Hatter: http://showfast.sc.couchbase.com/#/timeline/Linux/kv/memory/all
I wasn't able to identify the root cause or the code change of this issue yet.
The issue was introduced somewhere between 6.5.0-1398 (good build) and 6.5.0-1405. Can't find exact build as we didn't have any centos enterprise builds in between. And this issue doesn't reproduce neither in CE not in Windows for some reason.
Logs from "memory fragmentation" test on 1405.
https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-8930/172.23.133.11.zip
Also comparison between 1398 and 1405:
Attachments
Issue Links
For Gerrit Dashboard: MB-32061 | ||||||
---|---|---|---|---|---|---|
# | Subject | Branch | Project | Status | CR | V |
101953,2 | MB-32061: Lower spdlog buffer size | master | kv_engine | Status: MERGED | +2 | +1 |
Activity
Field | Original Value | New Value |
---|---|---|
Assignee | Trond Norbye [ trond ] | Ben Huddleston [ ben.huddleston ] |
Attachment | Screen Shot 2018-11-20 at 13.30.18.png [ 61803 ] |
Description |
We have few memory-oriented tests like memory fragmentation, max .memcached RSS, memory overhead. Most of them show degradation in Mad-Hatter: [http://showfast.sc.couchbase.com/#/timeline/Linux/kv/memory/all]
I wasn't able to identify the root cause or the code change of this issue yet. The issue was introduced somewhere between 6.5.0-1398 (good build) and 6.5.0-1405. Can't find exact build as we didn't have any centos enterprise builds in between. And this issue doesn't reproduce neither in CE not in Windows for some reason. Logs from "memory fragmentation" test on 1405. [https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-8930/172.23.133.11.zip] Also comparison between 1398 and 1405: [http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=ares_650-1405_load_and_append_a3f4&label=6.5.0-1405&snapshot=ares_650-1398_load_and_append_b045&label=6.5.0-1398] !image-2018-11-16-13-31-05-208.png! |
Original Description ==================================================================== We have few memory-oriented tests like memory fragmentation, max .memcached RSS, memory overhead. Most of them show degradation in Mad-Hatter: [http://showfast.sc.couchbase.com/#/timeline/Linux/kv/memory/all] I wasn't able to identify the root cause or the code change of this issue yet. The issue was introduced somewhere between 6.5.0-1398 (good build) and 6.5.0-1405. Can't find exact build as we didn't have any centos enterprise builds in between. And this issue doesn't reproduce neither in CE not in Windows for some reason. Logs from "memory fragmentation" test on 1405. [https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-8930/172.23.133.11.zip] Also comparison between 1398 and 1405: [http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=ares_650-1405_load_and_append_a3f4&label=6.5.0-1405&snapshot=ares_650-1398_load_and_append_b045&label=6.5.0-1398] !image-2018-11-16-13-31-05-208.png! |
Description |
Original Description ==================================================================== We have few memory-oriented tests like memory fragmentation, max .memcached RSS, memory overhead. Most of them show degradation in Mad-Hatter: [http://showfast.sc.couchbase.com/#/timeline/Linux/kv/memory/all] I wasn't able to identify the root cause or the code change of this issue yet. The issue was introduced somewhere between 6.5.0-1398 (good build) and 6.5.0-1405. Can't find exact build as we didn't have any centos enterprise builds in between. And this issue doesn't reproduce neither in CE not in Windows for some reason. Logs from "memory fragmentation" test on 1405. [https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-8930/172.23.133.11.zip] Also comparison between 1398 and 1405: [http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=ares_650-1405_load_and_append_a3f4&label=6.5.0-1405&snapshot=ares_650-1398_load_and_append_b045&label=6.5.0-1398] !image-2018-11-16-13-31-05-208.png! |
Spdlog migration mistake caused us to allocate a buffer far bigger than we need (2GB).
Original Description ==================================================================== We have few memory-oriented tests like memory fragmentation, max .memcached RSS, memory overhead. Most of them show degradation in Mad-Hatter: [http://showfast.sc.couchbase.com/#/timeline/Linux/kv/memory/all] I wasn't able to identify the root cause or the code change of this issue yet. The issue was introduced somewhere between 6.5.0-1398 (good build) and 6.5.0-1405. Can't find exact build as we didn't have any centos enterprise builds in between. And this issue doesn't reproduce neither in CE not in Windows for some reason. Logs from "memory fragmentation" test on 1405. [https://s3-us-west-2.amazonaws.com/perf-artifacts/jenkins-ares-8930/172.23.133.11.zip] Also comparison between 1398 and 1405: [http://cbmonitor.sc.couchbase.com/reports/html/?snapshot=ares_650-1405_load_and_append_a3f4&label=6.5.0-1405&snapshot=ares_650-1398_load_and_append_b045&label=6.5.0-1398] !image-2018-11-16-13-31-05-208.png! |
Assignee | Ben Huddleston [ ben.huddleston ] | Alex Gyryk [ oleksandr.gyryk ] |
Attachment | Screen Shot 2018-11-27 at 10.20.33 AM.png [ 61940 ] |
Assignee | Alex Gyryk [ oleksandr.gyryk ] | Ben Huddleston [ ben.huddleston ] |
Assignee | Ben Huddleston [ ben.huddleston ] | Alex Gyryk [ oleksandr.gyryk ] |
Resolution | Fixed [ 1 ] | |
Status | Open [ 1 ] | Resolved [ 5 ] |
Actual End | 2019-02-07 10:12 (issue has been resolved) |
Assignee | Alex Gyryk [ oleksandr.gyryk ] | Wayne Siu [ wayne ] |
Attachment | MB-32061.png [ 77394 ] |
Attachment | MB-32061.png [ 77395 ] |
Status | Resolved [ 5 ] | Closed [ 6 ] |
Thanks Alex.
Changelog between 1398 & 1405 shows a number of changes; including a bunch of forward merges from alice: http://172.23.123.43:8000/getchangelog?product=couchbase-server&fromb=6.5.0-1398&tob=6.5.0-1405
CHANGELOG for kv_engine
* Commit: 81226d895ba8b167e080bbe5648a852c20b0fd98 in build: 6.5.0-1405
Merge MB-31481 from branch 'alice'
See http://review.couchbase.org/100241 for commit details
Merge also updates test to work in master branch
Change-Id: I7101e376b4609cb5e4f7e176760e8ccdb92515f3
* Commit: 66985a89b3f93c0dc981f5f43fe35055f99f84d0 in build: 6.5.0-1405
MB-31481: Don't skip STREAM_END msg after cursor drop during backfill
For DCP producers needing backfill, There's an edge-case in handling
cursor dropping:
- When a backfill is re-scheduled due to a slow stream trigger (cursor
dropping), if the second backfill isn't necessary (as the required
items are still in the checkpoing manager) then the second backfill
is skipped and the stream transition to in-memory. - If the
in-memory stream then doesn't need to fetch anymore data (as no more
mutations have occurred), then the stream advances directly to
endStream - see ActiveStream::transitionState():
case StreamState::InMemory:
// Check if the producer has sent up till the last requested
// sequence number already, if not - move checkpoint items into
// the ready queue.
if (lastSentSeqno.load() >= end_seqno_) {
// Stream transitioning to DEAD state
endStream(END_STREAM_OK); /// <--- HERE
notifyStreamReady();
}
However, there is a bug where this END_STREAM message is never
transmitted to the DCP client, resulting in the stream hanging.
The problem is that stream end message gets stuck in the readyQ, as
the stream is never notified as ready to send it. This is due to the
expectation of Stream::next() dealing with queued items being broken -
if next() retuns null then the itemsReady flag is cleared as it
assumes a null retudn means no more items are available. However in
this situation there /are/ items ready, but not until
transitionState() is called.
Following from this, the stream is never notified so doesnt
send a closure message, which means that DCP consumers can hang
waiting for this closure message.
This patch triggers nextQueuedItem inside the pendingBackfill case to
ensure that the end stream message is notified and executed. The
patch also includes a test to follow this problematic sequence.
Change-Id: I28885cca1e853b72cc886f756000190b903d1d53
Reviewed-on: http://review.couchbase.org/100241
Tested-by: Build Bot <build@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>
Well-Formed: Build Bot <build@couchbase.com>
* Commit: 5a6eed4e0ea7153b4225bddef893d719d13c87b8 in build: 6.5.0-1401
MB-30041: Migrate audit and event descriptor to nlohmann::json
Migrate the auditd eventdescriptor file to nlohmann::json
and make the necessary changes to audit parsing to do so.
Change-Id: I5847553adfd24abddbe65d0562fb2c2127bd059a
Reviewed-on: http://review.couchbase.org/99887
Reviewed-by: Dave Rigby <daver@couchbase.com>
Tested-by: Build Bot <build@couchbase.com>
* Commit: 851d6b182d9feaffc5fc2c5082b694faa867637e in build: 6.5.0-1400
unmerged-commits.py: Add alice branch
Change-Id: Ie5d1c953886a63d728b0db1c9b249c2339e6c7b1
Reviewed-on: http://review.couchbase.org/100350
Tested-by: Dave Rigby <daver@couchbase.com>
Tested-by: Build Bot <build@couchbase.com>
Reviewed-by: Daniel Owen <owend@couchbase.com>
* Commit: 2657a2bc8e515c58e09a11ca149eadb9c3dd745a in build: 6.5.0-1400
Merge remote-tracking branch 'couchbase/alice'
* couchbase/alice:
MB-31495: Fix bug in getRandomKey
MB-31175: Ephemeral HTTombstonePurger underflows age check
MB-31327: Ephemeral backfill does not respect purge seqno.
MB-31141: Don't reject snappy|raw DCP deletes
MB-31141: Account for nmeta in deleteWithMeta
Change-Id: I15638b53f28908581011a852798494c4e76c54d2
* Commit: 0238430fdd69a2e78a064ea7a683d3981468968f in build: 6.5.0-1399
MB-30798: Update spdlog to version 1.1
As we need to make changes to spdlog we should first update
our version of it to 1.1. Make the necessary changes to
our code to use the new spdlog api.
Use the new buffer/size style string returned by fmtlib
where we perform custom formatting.
Change-Id: I616628d50008758e21a54003cf676c1eb3109cc9
Reviewed-on: http://review.couchbase.org/100224
Reviewed-by: Dave Rigby <daver@couchbase.com>
Tested-by: Build Bot <build@couchbase.com>
* Commit: 9444e64e9ada06b9fd6c1aac365a49d084e64465 in build: 6.5.0-1400
MB-31495: Fix bug in getRandomKey
This patch fixes a bug where getRandomKey would loop indefinitely if
start == 0.
The bug is that we check for curr == start before checking if
curr == max. See the original code below.
start = 0;
curr = start;
VBucketPtr vb = getVBucket(Vbid(curr++));
while (!vb || vb->getState() != vbucket_state_active) {
if (curr == start) {
return GetValue(NULL, ENGINE_KEY_ENOENT);
}
if (curr == max) {
curr = 0;
}
vb = getVBucket(Vbid(curr++));
}
Assume the bucket is empty. The first time we enter the while loop
curr == 1. curr != start and curr != max and so we will call
getVBucket and increment curr to 2. We repeat until curr == 1024.
On going round the while loop again curr != start, but curr == max
and so set curr = 0.
We then call getVbucket and increment curr to 1. On going round the
while loop again curr != start (as it is 1), and hence we loop
indefinitely.
Change-Id: I0de040627c481fb53e0faeddd34f44f31055c241
Reviewed-on: http://review.couchbase.org/100280
Well-Formed: Build Bot <build@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>
Reviewed-by: Trond Norbye <trond.norbye@gmail.com>
Tested-by: Build Bot <build@couchbase.com>
* Commit: 69082c1564d91be98467d63f57ec4045b90b1bd8 in build: 6.5.0-1400
MB-31175: Ephemeral HTTombstonePurger underflows age check
There exists a bug in the HTTombstonePurger such that we can cause
an integer underflow when determining if a deleted item should be
marked stale (for removal). This occurs because the task records and
uses a start time when it is created. When the task runs slowly due
to heavy write locking on the seqList an item can be deleted seemingly
in the future to the HTTombstonePurger. This causes an integer
underflow when we perform the "now - osv->getDeletedTime() >= purgeAge"
check and the subsequent removal of the tombstone before the purgeAge.
Fix this bug by checking that now >= the deleted time of an osv.
Change-Id: Ibffdb88820d5335dfb7b75c2cc250fc766847cf7
Reviewed-on: http://review.couchbase.org/99744
Well-Formed: Build Bot <build@couchbase.com>
Reviewed-by: David Haikney <david.haikney@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>
Tested-by: Dave Rigby <daver@couchbase.com>
* Commit: 90eadc1556511a7f35118d45890070bd53df67b1 in build: 6.5.0-1400
MB-31327: Ephemeral backfill does not respect purge seqno.
There exists a bug in Ephemeral backfill_memory such that we can
start a backfill for a startSeqno > purgeSeqno of a given vBucket.
This leads to data inconsistencies as we may not replciate a delete.
Fix the bug by checking that the purgeSeqno < startSeqno of a
backfill when startSeqno != 1.
Only fix for buffered memory backfill as the non-buffered backfill
is not used and will be deleted in master.
Change-Id: Ia91d6d1e82fcfb7ea2d6463c76a531ea0d8aa182
Reviewed-on: http://review.couchbase.org/99713
Well-Formed: Build Bot <build@couchbase.com>
Reviewed-by: David Haikney <david.haikney@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>
Tested-by: Build Bot <build@couchbase.com>
* Commit: 0540f14ebf4cb7a0f1de71c717b4fb0141547b02 in build: 6.5.0-1400
MB-31141: Merge couchbase/vulcan to couchbase/alice
* couchbase/vulcan:
MB-31141: Don't reject snappy|raw DCP deletes
MB-31141: Account for nmeta in deleteWithMeta
Change-Id: Iaaa6e99ff0c868f8fe8874abd7673022bb9c4e87
* Commit: 84af6315742fadb27d0d76473ff1391104222073 in build: 6.5.0-1400
MB-31141: Don't reject snappy|raw DCP deletes
A related bug means that is possible for 5.x to create
deletes with a non-zero raw value. When 5.5x backfills such
an item for transmission to another 5.5x node (and snappy
is enabled), the delete gets sent with a snappy datatype
and rejected by the target node, which triggers a rebalance
failure.
Change-Id: Ib91453f96882ef4e01ee0e2a3e5e51ed0786a325
Reviewed-on: http://review.couchbase.org/99414
Well-Formed: Build Bot <build@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>
Tested-by: Dave Rigby <daver@couchbase.com>
* Commit: bb60376f9ad3adba5741c6fcffcd87a551c2c61a in build: 6.5.0-1400
MB-31141: Account for nmeta in deleteWithMeta
When calculating the item size in deleteWithMeta we must
ensure nmeta is removed from the calculated size.
Tests added for both set and del withMeta
Change-Id: Iab31591c521e9556d436905be0177da773afc058
Reviewed-on: http://review.couchbase.org/99152
Tested-by: Build Bot <build@couchbase.com>
Well-Formed: Build Bot <build@couchbase.com>
Reviewed-by: Dave Rigby <daver@couchbase.com>