DCP needs correct ref-counted pointers

Description

  1. The stream_t is used by DcpBackfillTask and the DcpProducer, so at least two threads access the value concurrently. However, stream_t is defined as a SingleThreadedRCPtr, not the thread-safe RCPtr.

  2. ActiveStream, PassiveStream and NotifierStream hold raw base pointers to DcpProducer/DcpConsumer, yet can schedule background tasks which may access those pointers after the objects have been deleted.

Proposed fixes:

  1. Change stream_t to RCPtr

  2. Add RCPtrs to the Stream classes so that deletion of the producer occurs only once nothing references it.
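The distinction behind fix 1 can be sketched outside of ep-engine: a SingleThreadedRCPtr-style smart pointer keeps a plain (non-atomic) reference count, so two threads adjusting it concurrently is a data race that can corrupt the count and free the stream early. A minimal illustration using std::atomic (this is a conceptual sketch, not ep-engine's actual RCPtr implementation; the function name is made up for the example):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hammer a reference count from two threads, mimicking DcpProducer and
// DcpBackfillTask both holding and releasing the same stream_t.
// Returns the final count, which must be zero if counting is correct.
int hammerRefCount() {
    std::atomic<int> refs{0}; // RCPtr-style: atomic, safe across threads
    // int refs{0};           // SingleThreadedRCPtr-style: racy here

    std::vector<std::thread> threads;
    for (int t = 0; t < 2; ++t) {
        threads.emplace_back([&refs] {
            for (int i = 0; i < 100000; ++i) {
                refs.fetch_add(1, std::memory_order_relaxed); // acquire ref
                refs.fetch_sub(1, std::memory_order_relaxed); // release ref
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return refs.load();
}
```

With the atomic count the result is always balanced at zero; with a plain int the same loop can end up nonzero (a leak) or transiently hit zero early, triggering the premature free this ticket describes.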

Components

Fix versions

Labels

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Activity

Jim Walker December 16, 2015 at 9:19 AM

Just for clarity.

This fix is needed in 4.x; however, the shipped 4.0 and 4.1 releases are not yet impacted, because a memory leak is currently preventing this defect from occurring. Once the memory leak is fixed, this bug can be triggered.

Plan is to merge both the memory leak fix and this fix together.

In 4.1 this defect is hidden by https://couchbasecloud.atlassian.net/browse/MB-16949

Eric Cooper December 8, 2015 at 10:23 PM

Thank you for the good description. I was able to repro the bug in 3.1.2 and verified the fix in 3.1.3; for good measure I also tested on 4.1.0, where the problem does not occur.

Here is the scenario that I used:
8 vbuckets
200 Meg bucket

Initial population
/usr/bin/cbc-pillowfight -I 120000 -m 2 -M 2000 -U couchbase://localhost/default -t 25 -p test-stop-stream -r 100

Disk usage is 180 M

Ongoing load (2500 ops/sec per the UI)
/usr/bin/cbc-pillowfight -I 50000000 -m 2 -M 2000 -U couchbase://localhost/default -t 25 -p test-stop-stream -r 5 --rate-limit=100

Then run the pydcp test - test_open_producer_connection_command_and_close_immediately:

def test_open_producer_connection_command_and_close_immediately(self):
    for i in range(100):
        self.dcp_client = DcpClient(self.host, self.port)
        response = self.dcp_client.open_producer("mystream")
        for j in range(8):
            response = self.dcp_client.stream_req(j, 0, 0, 100000, 0)
            response = self.dcp_client.close_stream(j)

I will be adding this to the pydcp test shortly.

Jim Walker December 4, 2015 at 9:41 AM

@Eric Co This problem may not manifest itself as a crash.

The memory which gets prematurely freed may or may not get reallocated.

Even if it does get reallocated, it needs to be reallocated to something that will cause a segmentation fault, e.g. what is now an unterminated string that runs off to a page boundary.

So you need:

1) DGM to try and get longer running backfills.
2) Lots of operations that will cause memory to be recycled. E.g. updating lots of keys.
3) Luck

If there's no crash, the race may still have occurred and you may be able to observe that fact from evidence in memcached.log. Some log messages will contain junk.

E.g. a good message:

Fri Dec 4 11:22:55.821357 EST 3: (default) DCP (Producer) ep_dcpq:replication:ns_1@node1->ns_1@node2 (vb 1):default - Backfill task (5205478 to 315) finished. disk seqno 5206261 memory seqno 5206261

A bad/broken one showing evidence of the race; note the junk where the connection name should be:

Fri Dec 4 11:22:55.821357 EST 3: (default) <88><CF>ћ^R (vb 1) Backfill task (5205478 to 315) finished. disk seqno 5206261 memory seqno 5206261
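The junk connection name above is the symptom of problem 2: the backfill task logs through a raw pointer to a producer that has already been deleted. A minimal sketch of the fix pattern, using std::shared_ptr as a stand-in for ep-engine's RCPtr (Producer, BackfillTask, and finishBackfill are simplified hypothetical types for illustration, not the real classes):

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for DcpProducer: owns the connection name
// that backfill tasks print into memcached.log.
struct Producer {
    std::string name;
    explicit Producer(std::string n) : name(std::move(n)) {}
};

// Hypothetical stand-in for DcpBackfillTask. The strong reference
// keeps the producer (and its name) alive for as long as the task
// itself lives, even after the connection drops its own reference.
struct BackfillTask {
    std::shared_ptr<Producer> producer;
    std::string logName() const { return producer->name; }
};

// Simulate the connection closing while the backfill task is running.
std::string finishBackfill() {
    BackfillTask task;
    {
        auto p = std::make_shared<Producer>(
            "ep_dcpq:replication:node1->node2");
        task.producer = p;
    } // the connection's reference (p) is released here
    // With a raw Producer* this read would touch freed memory, which is
    // the junk name seen in the log; the shared_ptr keeps it valid.
    return task.logName();
}
```

The same idea applies to ActiveStream, PassiveStream and NotifierStream holding references to DcpConsumer: the last strong reference, wherever it lives, is what triggers deletion.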

Eric Cooper December 3, 2015 at 12:28 AM

I attempted to repro this as follows:

  • varied the number of v-buckets from 4 to 32

  • added up to 5,000,000 KVs

  • created a pydcp test which continually creates producer instances, adds streams for all the v-buckets and then immediately closes them

  • changed the access scanner so that it runs more frequently

  • on the Couchbase server ran a CPU hogger which ran at a higher priority than memcached

This was running on a CentOS 6 VM on 3.1.2.

I was unable to reproduce the problem.

Tomorrow I will try with an actual physical server.

Wayne Siu December 2, 2015 at 10:50 PM

Maintenance Meeting 12.02.2015.
This is approved for 3.1.3.

Fixed
Details

Assignee

Reporter

Is this a Regression?

No

Triage

Untriaged

Priority

Created November 28, 2015 at 5:39 PM
Updated January 19, 2017 at 5:09 PM
Resolved December 8, 2015 at 10:24 PM