DCP needs correct ref-counted pointers

Description

  1. The stream_t is used by DcpBackfillTask and the DcpProducer, so at least two threads access the value concurrently. However, stream_t is defined as a SingleThreadedRCPtr, not the thread-safe RCPtr.

  2. ActiveStream, PassiveStream and NotifierStream hold raw base pointers to DcpProducer/DcpConsumer, yet can schedule background tasks which may access those pointers after the objects have been deleted.

Proposed fixes:

  1. Change stream_t to RCPtr

  2. Add RCPtrs to the Stream classes so that deletion of the producer occurs only once nothing references it.
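The distinction behind fix 1 can be sketched outside of ep-engine: a SingleThreadedRCPtr-style smart pointer keeps a plain (non-atomic) reference count, so two threads adjusting it concurrently is a data race that can corrupt the count and free the stream early. A minimal illustration using std::atomic (this is a conceptual sketch, not ep-engine's actual RCPtr implementation; the function name is made up for the example):

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hammer a reference count from two threads, mimicking DcpProducer and
// DcpBackfillTask both holding and releasing the same stream_t.
// Returns the final count, which must be zero if counting is correct.
int hammerRefCount() {
    std::atomic<int> refs{0}; // RCPtr-style: atomic, safe across threads
    // int refs{0};           // SingleThreadedRCPtr-style: racy here

    std::vector<std::thread> threads;
    for (int t = 0; t < 2; ++t) {
        threads.emplace_back([&refs] {
            for (int i = 0; i < 100000; ++i) {
                refs.fetch_add(1, std::memory_order_relaxed); // acquire ref
                refs.fetch_sub(1, std::memory_order_relaxed); // release ref
            }
        });
    }
    for (auto& th : threads) {
        th.join();
    }
    return refs.load();
}
```

With the atomic count the result is always balanced at zero; with a plain int the same loop can end up nonzero (a leak) or transiently hit zero early, triggering the premature free this ticket describes.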

Components

Fix versions

Labels

Environment

None

Link to Log File, atop/blg, CBCollectInfo, Core dump

None

Release Notes Description

None

Activity

Jim Walker December 16, 2015 at 9:19 AM

Just for clarity.

This fix is needed in 4.x; however, the shipped 4.0 and 4.1 releases are not yet impacted, because a memory leak is currently preventing this defect from occurring. Once the memory leak is fixed, this bug can be triggered.

Plan is to merge both the memory leak fix and this fix together.

In 4.1 this defect is hidden by https://couchbasecloud.atlassian.net/browse/MB-16949

Eric Cooper December 8, 2015 at 10:23 PM

Thank you for the good description. I was able to repro the bug in 3.1.2 and verified the fix in 3.1.3; for good measure I also tested on 4.1.0, where the problem does not occur.

Here is the scenario that I used:
8 vbuckets
200 Meg bucket

Initial population
/usr/bin/cbc-pillowfight -I 120000 -m 2 -M 2000 -U couchbase://localhost/default -t 25 -p test-stop-stream -r 100

Disk usage is 180 M

Ongoing load (2500 ops/sec per the UI)
/usr/bin/cbc-pillowfight -I 50000000 -m 2 -M 2000 -U couchbase://localhost/default -t 25 -p test-stop-stream -r 5 --rate-limit=100

Then run the pydcp test - test_open_producer_connection_command_and_close_immediately:

def test_open_producer_connection_command_and_close_immediately(self):
    for i in range(100):
        self.dcp_client = DcpClient(self.host, self.port)
        response = self.dcp_client.open_producer("mystream")
        for j in range(8):
            response = self.dcp_client.stream_req(j, 0, 0, 100000, 0)
            response = self.dcp_client.close_stream(j)

I will be adding this to the pydcp test shortly.

Jim Walker December 4, 2015 at 9:41 AM

@Eric Co This problem may not manifest itself as a crash.

The memory which gets prematurely freed may or may not get reallocated.

Even if it does get reallocated, it needs to be reallocated to something that will cause a segmentation fault, e.g. what is now an unterminated string that runs off to a page boundary.

So you need:

1) DGM to try and get longer running backfills.
2) Lots of operations that will cause memory to be recycled. E.g. updating lots of keys.
3) Luck

If there's no crash, the race may still have occurred and you may be able to observe that fact from evidence in memcached.log. Some log messages will contain junk.

E.g. a good message:

Fri Dec 4 11:22:55.821357 EST 3: (default) DCP (Producer) ep_dcpq:replication:ns_1@node1->ns_1@node2 (vb 1):default - Backfill task (5205478 to 315) finished. disk seqno 5206261 memory seqno 5206261

A bad/broken one showing evidence of the race; note the junk where the connection name should be:

Fri Dec 4 11:22:55.821357 EST 3: (default) <88><CF>ћ^R (vb 1) Backfill task (5205478 to 315) finished. disk seqno 5206261 memory seqno 5206261
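The junk connection name above is the symptom of problem 2: the backfill task logs through a raw pointer to a producer that has already been deleted. A minimal sketch of the fix pattern, using std::shared_ptr as a stand-in for ep-engine's RCPtr (Producer, BackfillTask, and finishBackfill are simplified hypothetical types for illustration, not the real classes):

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for DcpProducer: owns the connection name
// that backfill tasks print into memcached.log.
struct Producer {
    std::string name;
    explicit Producer(std::string n) : name(std::move(n)) {}
};

// Hypothetical stand-in for DcpBackfillTask. The strong reference
// keeps the producer (and its name) alive for as long as the task
// itself lives, even after the connection drops its own reference.
struct BackfillTask {
    std::shared_ptr<Producer> producer;
    std::string logName() const { return producer->name; }
};

// Simulate the connection closing while the backfill task is running.
std::string finishBackfill() {
    BackfillTask task;
    {
        auto p = std::make_shared<Producer>(
            "ep_dcpq:replication:node1->node2");
        task.producer = p;
    } // the connection's reference (p) is released here
    // With a raw Producer* this read would touch freed memory, which is
    // the junk name seen in the log; the shared_ptr keeps it valid.
    return task.logName();
}
```

The same idea applies to ActiveStream, PassiveStream and NotifierStream holding references to DcpConsumer: the last strong reference, wherever it lives, is what triggers deletion.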

Eric Cooper December 3, 2015 at 12:28 AM

I attempted to repro this as follows:

  • varied the number of v-buckets from 4 to 32

  • added up to 5,000,000 KVs

  • created a pydcp test which continually creates producer instances, adds streams for all the v-buckets and then immediately closes them

  • changed the access scanner so that it runs more frequently

  • on the Couchbase server ran a CPU hogger which ran at a higher priority than memcached

This was running on a CentOS 6 VM on 3.1.2.

I was unable to reproduce the problem.

Tomorrow I will try with an actual physical server.

Wayne Siu December 2, 2015 at 10:50 PM

Maintenance Meeting 12.02.2015.
This is approved for 3.1.3.

Fixed
Details

Assignee

Reporter

Is this a Regression?

No

Triage

Untriaged

Priority

Created November 28, 2015 at 5:39 PM
Updated January 19, 2017 at 5:09 PM
Resolved December 8, 2015 at 10:24 PM