DCP needs correct ref-counted pointers
Activity

Jim Walker December 16, 2015 at 9:19 AM
Just for clarity.
This fix is needed in 4.x. However, the shipped 4 and 4.1 are not yet impacted, because a memory leak is currently preventing this defect from occurring. Once the memory leak is fixed, this bug can be triggered.
Plan is to merge both the memory leak fix and this fix together.
In 4.1 this defect is hidden by https://couchbasecloud.atlassian.net/browse/MB-16949#icft=MB-16949

Eric Cooper December 8, 2015 at 10:23 PM
@Jim Walker Thank you for the good description. I was able to repro the bug in 3.1.2 and verified the fix in 3.1.3. For good measure I also tested on 4.1.0, where the problem does not occur.
Here is the scenario that I used:
8 vbuckets
200 Meg bucket
Initial population
/usr/bin/cbc-pillowfight -I 120000 -m 2 -M 2000 -U couchbase://localhost/default -t 25 -p test-stop-stream -r 100
Disk usage is 180 M
Ongoing load (2500 ops/sec per the UI)
/usr/bin/cbc-pillowfight -I 50000000 -m 2 -M 2000 -U couchbase://localhost/default -t 25 -p test-stop-stream -r 5 --rate-limit=100
Then run the pydcp test - test_open_producer_connection_command_and_close_immediately:
def test_open_producer_connection_command_and_close_immediately(self):
    for i in range(100):
        self.dcp_client = DcpClient(self.host, self.port)
        response = self.dcp_client.open_producer("mystream")
        for j in range(8):
            response = self.dcp_client.stream_req(j, 0, 0, 100000, 0)
            response = self.dcp_client.close_stream(j)
I will be adding this to the pydcp test shortly.

Jim Walker December 4, 2015 at 9:41 AM
@Eric Cooper This problem may not manifest itself as a crash.
The memory which gets prematurely freed may or may not get reallocated.
Even if it does get reallocated, it needs to be reallocated to something that will cause a segmentation fault, e.g. something which is now read as an unterminated string that takes you to a page boundary.
So you need:
1) DGM (data greater than memory) to try to get longer-running backfills.
2) Lots of operations that will cause memory to be recycled. E.g. updating lots of keys.
3) Luck
If there's no crash, the race may still have occurred, and you may be able to see evidence of it in memcached.log: some log messages will contain junk.
E.g. a good message:
Fri Dec 4 11:22:55.821357 EST 3: (default) DCP (Producer) ep_dcpq:replication:ns_1@node1->ns_1@node2 (vb 1):default - Backfill task (5205478 to 315) finished. disk seqno 5206261 memory seqno 5206261
A bad/broken one showing evidence of a race. Note the junk in place of the log prefix:
Fri Dec 4 11:22:55.821357 EST 3: (default) <88><CF>ћ^R (vb 1) Backfill task (5205478 to 315) finished. disk seqno 5206261 memory seqno 5206261
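A rough way to look for this evidence is to scan memcached.log for lines containing anything other than printable ASCII. The sketch below is only a heuristic under that assumption. The helper names (`has_junk`, `find_junk_lines`, `scan_log`) are hypothetical, and a legitimate log line containing non-ASCII text would also be flagged.

```python
def has_junk(line):
    """Return True if the line holds anything other than printable ASCII or tab."""
    body = line.rstrip("\r\n")
    return any(not (ch == "\t" or 32 <= ord(ch) < 127) for ch in body)


def find_junk_lines(lines):
    """Return (line number, text) pairs for every line that looks corrupted."""
    return [
        (number, line.rstrip("\r\n"))
        for number, line in enumerate(lines, start=1)
        if has_junk(line)
    ]


def scan_log(path):
    """Scan a log file; the path is supplied by the caller."""
    # latin-1 maps every byte to a character, so raw garbage bytes
    # never raise a decode error and are flagged by has_junk instead.
    with open(path, encoding="latin-1") as log_file:
        return find_junk_lines(log_file)
```

Any hit is only a hint that freed memory was read; the absence of hits does not prove the race did not occur.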

Eric Cooper December 3, 2015 at 12:28 AM
I attempted to repro this as follows:
varied the number of v-buckets from 4 to 32
added up to 5,000,000 KVs
created a pydcp test which continually creates producer instances, add streams for all the v-buckets and then immediately closes them
changed the access scanner so that it runs more frequently
on the Couchbase server ran a CPU hogger which ran at a higher priority than memcached
This was running on a CentOS 6 VM on 3.1.2.
I was unable to reproduce the problem.
Tomorrow I will try with an actual physical server.

Wayne Siu December 2, 2015 at 10:50 PM
Maintenance Meeting 12.02.2015.
This is approved for 3.1.3.
Details
Assignee: Jim Walker
Reporter: Jim Walker
Is this a Regression?: No
Triage: Untriaged
Priority: Major

stream_t is used by both DcpBackfillTask and DcpProducer, so at least two threads access the value. However, stream_t is defined as a SingleThreadedRCPtr, not the concurrent-safe RCPtr.
ActiveStream, PassiveStream and NotifierStream use base (non-ref-counted) pointers to DcpProducer/DcpConsumer, yet they can schedule background tasks which may access those pointers after the object's deletion.
Change stream_t to RCPtr.
Add RCPtrs to the Stream classes so that object deletion occurs only once nothing references the producer.
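The lifetime rule behind the second change can be modelled in a few lines. This is a toy Python model, not the ep-engine C++ code: `RefCounted`, `acquire`, `release` and `demo` are hypothetical names, and the lock simply stands in for whatever makes the real count safe to update from two threads. It shows that an object shared with a background task is destroyed only when the last holder lets go, even if the connection closes first.

```python
import threading


class RefCounted:
    """Toy model of a reference count that is safe to update from several threads."""

    def __init__(self, name, on_destroy):
        self.name = name
        self._count = 0
        self._lock = threading.Lock()
        self._on_destroy = on_destroy

    def acquire(self):
        with self._lock:
            self._count += 1
        return self

    def release(self):
        with self._lock:
            self._count -= 1
            last = self._count == 0
        if last:
            # Only the holder that drops the final reference destroys the object.
            self._on_destroy(self.name)


def demo():
    events = []
    producer = RefCounted("producer", events.append)
    owner_ref = producer.acquire()  # held by the connection
    task_ref = producer.acquire()   # held by a background task

    def backfill_task():
        events.append("task used " + task_ref.name)
        task_ref.release()

    owner_ref.release()             # the connection goes away first
    events.append("connection closed")
    worker = threading.Thread(target=backfill_task)
    worker.start()
    worker.join()
    return events
```

In this model, `demo()` records the destruction of the producer only after the background task has finished with it. That ordering is what a raw pointer cannot guarantee.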