memcached-memcached horizontal DCP connections
Activity
Alexander Petrossian (PAF) June 27, 2019 at 11:45 AM
David, colleagues, a humble reminder that here we can save a considerable number of CPUs.
I feel the saving is much larger than what many-hour profiling sessions would suggest.
But for some reason we're not doing it.
Alexander Petrossian (PAF) December 19, 2016 at 6:51 AM
David, thanks a lot for this insight.
This coordination can be done differently, with ns_server doing all the magic from meta-data about replication, going through some notification mechanism (maybe one already exists), and doing just the same thing – coordinating well.
I sincerely hope you'll reconsider at some point in the future.
Because everything you're talking about happens during rebalance... which is not the main use-case, right?
In the main use-case – what a cluster does most of the time, when everything is stable and no rebalance is going on – we experience a considerable (to us; in our workload pattern) waste of CPU power.
Others also experience it... but don't complain?

Dave Finlay December 19, 2016 at 12:01 AM
Alexander:
As I mentioned in my previous comment, the essential reason is simplicity of implementation in memcached. Proxying DCP streams through ns_server allows memcached / ep-engine to do what it is good at, namely persisting and caching data and being a replication source and sink; and it asks ns_server to do what it is good at: coordination. For example, when active vbuckets are moving during rebalance, ns_server monitors replications and, when the replica is “almost caught up” with the active, it flips the DCP stream to a “takeover” stream, allowing the active transition to happen. Since ns_server does essentially all of the non-trivial coordination during a rebalance, this logic is well placed in ns_server.
Could it have been written differently? Sure. But we are happy with the decision right now and have no near- or medium-term plans to change it.
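The coordination described above can be sketched roughly as follows – a minimal Python model, not the real Erlang ns_server code; the function name and the lag threshold are illustrative assumptions:

```python
# Minimal model of the takeover decision; all names/values are assumed.
TAKEOVER_LAG_THRESHOLD = 1000  # assumed tunable: items left before the flip

def stream_mode(active_high_seqno: int, replica_seqno: int) -> str:
    """Which kind of DCP stream should ns_server run for this vbucket now?"""
    lag = active_high_seqno - replica_seqno
    if lag <= TAKEOVER_LAG_THRESHOLD:
        return "takeover"    # replica is "almost caught up": swap the roles
    return "replication"     # keep streaming mutations until it catches up
```

The point is only that the decision needs a cluster-wide view of replication progress, which ns_server already has.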
Alexander Petrossian (PAF) December 8, 2016 at 6:59 AM
ping
Alexander Petrossian (PAF) November 11, 2016 at 6:57 AM
As a non-paying but happy customer, we would not expect this to suddenly become a top priority in light of what was discussed:
others not having the issue
and in-house tests showing no problems.
Obviously, we have some special workload pattern that differentiates our system from others.
And we know that, since nobody else in the world faced https://couchbasecloud.atlassian.net/browse/MB-14496#icft=MB-14496, and we did, and dug to the root of it, and now it is our proud little bit in the great Couchbase product.
What is special, I feel, is the multitude of small updates.
Probably others use Couchbase differently, and they have a smaller number of updates.
For us, these updates cause lots of bad things, starting with bad persistence, as discussed in https://couchbasecloud.atlassian.net/browse/MB-17525#icft=MB-17525.
BTW, we have patched that for ourselves by introducing a forced delay in persistence, to instigate coalescing of consecutive updates of the same element (configurable, static/dynamic).
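A hedged sketch of how such a coalescing delay could work – illustrative Python; the class and method names are assumptions, and the real patch would live inside ep-engine's flusher:

```python
import time
from collections import OrderedDict

class CoalescingQueue:
    """Hypothetical model of the forced persistence delay (not the real patch)."""

    def __init__(self, delay_s: float):
        self.delay_s = delay_s
        self.pending = OrderedDict()  # key -> (value, flush_deadline)

    def update(self, key, value, now=None):
        now = time.monotonic() if now is None else now
        if key in self.pending:
            # Same key updated again before its deadline: keep the original
            # deadline but overwrite the value, so both updates coalesce
            # into a single disk write.
            _, deadline = self.pending[key]
            self.pending[key] = (value, deadline)
        else:
            self.pending[key] = (value, now + self.delay_s)

    def flush_due(self, now=None):
        """Return the mutations whose delay has expired (one write per key)."""
        now = time.monotonic() if now is None else now
        due = [(k, v) for k, (v, d) in self.pending.items() if d <= now]
        for k, _ in due:
            del self.pending[k]
        return due
```

With a 1-second delay, two back-to-back updates of the same key produce one flushed write carrying the latest value.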
So, knowing all that, I do not really expect you guys to spend a lot of your time/money on our proved-to-be-specific problem.
All I'm after now is advice.
What is so special about this proxying that it was designed that way?
Reading the code gave very little insight as to why.
So far my shallow understanding is that ep-engine could very easily talk to the other ep-engines directly, without any proxying.
And all that the Erlang code does is sometimes force-break those links.
If so, that could have been done by a command from Erlang to the local ep-engine to break a given link, whenever Erlang feels it should be force-disconnected.
But I know my understanding is not very deep, and so I seek advice.
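The kind of control command imagined above might look like this – a hypothetical sketch in Python; none of these names exist in memcached or ns_server:

```python
class EpEngineControl:
    """Hypothetical local ep-engine control surface (illustrative only)."""

    def __init__(self):
        # Direct memcached-memcached DCP links this node knows about.
        self.links = set()  # (source_node, replica_node, vbucket_id)

    def connect(self, source, replica, vbucket):
        self.links.add((source, replica, vbucket))

    def handle_command(self, cmd: dict):
        """Small control message from ns_server, e.g. a forced disconnect."""
        if cmd["op"] == "break_link":
            self.links.discard((cmd["source"], cmd["replica"], cmd["vbucket"]))
        return sorted(self.links)
```

The data itself never touches the controller; it only sends tiny "break this link" messages when it wants a stream torn down.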
Details
Assignee: Dave Finlay
Reporter: Alexander Petrossian (PAF)
Priority: Major
Friends,
I feel that the current DCP stream flow...
no  language  module     action
1   C++       memcached  active stream flows out
2   erlang    dcp_proxy  gets proxied to the proper replica node
3   C++       memcached  consumed
...can be improved.
Currently the data passes through an erlang module (dcp_proxy), which is NOT what erlang does best, and it consumes an >>>>unreasonable proportion<<<< of CPU power (see first comment).
Please consider changing the flow to allow memcached-memcached horizontal DCP connections (port 11209).
That would:
clearly require a more complicated control protocol and changes on both sides (ns_server and memcached/ep);
but it would remove erlang from the data flow (a simplification), leaving it only on the control flow;
and considerably reduce CPU usage.
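A toy model (assumptions only, not a measurement) of why removing the erlang hop from the data path saves CPU: in the proxied flow every DCP message is received and re-sent by dcp_proxy, doubling the socket work per mutation.

```python
# Toy accounting of socket operations per replicated mutation; the
# per-message counts are illustrative assumptions, not measured figures.
def sends_and_receives(mutations: int, proxied: bool) -> int:
    """Total socket operations to replicate `mutations` items."""
    # proxied: source send + proxy recv + proxy send + replica recv = 4
    # direct:  source send + replica recv                          = 2
    per_message = 4 if proxied else 2
    return mutations * per_message

# sends_and_receives(1_000_000, proxied=True)  -> 4_000_000
# sends_and_receives(1_000_000, proxied=False) -> 2_000_000
```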