Details
- Bug
- Resolution: Fixed
- Critical
- 6.0.3
- Untriaged
- Unknown
Description
ns_server uses dcp-vbtakeover estimates to decide whether the bulk of the data has been transferred to all replicas during rebalance. While looking at the logs from a recent performance run, I noticed that this step takes an anomalously short amount of time on vbuckets that actually contain a lot of items.
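For context, the decision being made is roughly the one sketched below. This is an illustrative Python sketch, not the actual ns_server implementation (which is Erlang, in ns_single_vbucket_mover / dcp_replicator); the stats-client interface, threshold and polling interval are hypothetical.

# Hypothetical sketch of the data-move wait during rebalance; illustration only.
import time

def estimated_items_remaining(stats_client, vbucket):
    # Ask the destination for "dcp-vbtakeover <vb>" stats; "estimate" is the
    # producer's guess at how many items are still left to replicate.
    stats = stats_client.stats(f"dcp-vbtakeover {vbucket}")
    return int(stats["estimate"])

def wait_for_dcp_data_move(stats_client, vbucket, threshold=1000, poll_interval=1.0):
    # Keep polling until (nearly) everything has reached the replica. If the
    # estimate is bogus -- e.g. 0 right after the stream was created -- this
    # returns immediately even though the backfill has barely started, which is
    # exactly the behaviour described below.
    while estimated_items_remaining(stats_client, vbucket) > threshold:
        time.sleep(poll_interval)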
For instance, from that run:
[rebalance:debug,2019-07-18T07:09:33.658-07:00,ns_1@172.23.96.100:<0.25140.41>:ns_single_vbucket_mover:wait_dcp_data_move:455]Will wait for backfill on all opened streams for bucket = "bucket-1" partition 912 src node = 'ns_1@172.23.96.108' dest nodes = ['ns_1@172.23.96.109', 'ns_1@172.23.96.100']

[rebalance:debug,2019-07-18T07:09:33.660-07:00,ns_1@172.23.96.100:<0.25140.41>:ns_single_vbucket_mover:wait_dcp_data_move:460]DCP data is up to date for bucket = "bucket-1" partition 912 src node = 'ns_1@172.23.96.108' dest nodes = ['ns_1@172.23.96.109', 'ns_1@172.23.96.100']
Yet later I can see that the data had not actually been transferred:
[ns_server:debug,2019-07-18T07:09:38.143-07:00,ns_1@172.23.96.100:<0.29085.20>:ns_rebalance_observer:docs_left_updater_loop:436]Starting docs_left_updater_loop:"bucket-1"
...
{912,
 {vbucket_info,['ns_1@172.23.96.108','ns_1@172.23.96.100'],
     ['ns_1@172.23.96.109','ns_1@172.23.96.100'],
     [{replica_building_stats,'ns_1@172.23.96.109',976680,976680},
      {replica_building_stats,'ns_1@172.23.96.100',5,5}],
     {stat_info,{1563,458973,619895},false},
     {stat_info,{1563,458973,658136},false},
     {stat_info,false,false},
     {stat_info,{1563,458973,661269},false}}},
...

[ns_server:debug,2019-07-18T07:12:08.143-07:00,ns_1@172.23.96.100:<0.29085.20>:ns_rebalance_observer:docs_left_updater_loop:436]Starting docs_left_updater_loop:"bucket-1"
...
{912,
 {vbucket_info,['ns_1@172.23.96.108','ns_1@172.23.96.100'],
     ['ns_1@172.23.96.109','ns_1@172.23.96.100'],
     [{replica_building_stats,'ns_1@172.23.96.109',976680,459977},
      {replica_building_stats,'ns_1@172.23.96.100',5,5}],
     {stat_info,{1563,458973,619895},false},
     {stat_info,{1563,458973,658136},false},
     {stat_info,false,false},
     {stat_info,{1563,458973,661269},false}}},
...
And finally the data gets moved 4.5 minutes later:
[ns_server:debug,2019-07-18T07:13:58.159-07:00,ns_1@172.23.96.100:ns_rebalance_observer<0.29086.20>:ns_rebalance_observer:handle_cast:231]Got update_stats: 912, [{'ns_1@172.23.96.109',5800}]
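For scale: taking the two docs_left_updater_loop snapshots above at face value (reading the last two fields of replica_building_stats as total and remaining items, which is my assumption), the replica on .109 still had 976680 items outstanding when the move was declared up to date, and 459977 items 2.5 minutes later. The gap between the "DCP data is up to date" message and this final update can be checked directly from the log timestamps:

# Quick check of the gap between "up to date" and the final update for partition 912.
from datetime import datetime

fmt = "%Y-%m-%dT%H:%M:%S.%f%z"                       # matches the ns_server log timestamps
declared_done = datetime.strptime("2019-07-18T07:09:33.660-07:00", fmt)
actually_done = datetime.strptime("2019-07-18T07:13:58.159-07:00", fmt)
print(actually_done - declared_done)                 # 0:04:24.499000, i.e. roughly 4.5 minutes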
I tried reproducing this locally, and it turned out that the dcp-vbtakeover stats return an estimate of 0 items to replicate. ns_server therefore deems the data sufficiently up to date on the replica and proceeds to the later stages of the move:
[rebalance:debug,2019-08-18T23:20:45.871-07:00,n_0@127.0.0.1:<0.5161.0>:ns_single_vbucket_mover:wait_dcp_data_move:455]Will wait for backfill on all opened streams for bucket = "default" partition 1 src node = 'n_0@127.0.0.1' dest nodes = ['n_1@127.0.0.1']

[ns_server:debug,2019-08-18T23:20:45.871-07:00,n_0@127.0.0.1:<0.5162.0>:dcp_replicator:wait_for_data_move_on_one_node:182]items remaining on connection "replication:n_0@127.0.0.1->n_1@127.0.0.1:default", partition 1: 0

[ns_server:debug,2019-08-18T23:20:45.872-07:00,n_0@127.0.0.1:janitor_agent-default<0.2041.0>:janitor_agent:handle_info:950]Got done message from subprocess: <0.5162.0> (ok)

[rebalance:debug,2019-08-18T23:20:45.872-07:00,n_0@127.0.0.1:<0.5161.0>:ns_single_vbucket_mover:wait_dcp_data_move:460]DCP data is up to date for bucket = "default" partition 1 src node = 'n_0@127.0.0.1' dest nodes = ['n_1@127.0.0.1']
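Note the estimate of 0 arrives less than a millisecond after the stream was set up. My reading (an assumption, but consistent with the kv_engine change listed in the Gerrit dashboard below, "Don't return estimate=0 for dcp-takeover stats before backfill") is that the takeover estimate is derived from the backfill's remaining-item count, which simply has not been computed yet at that point. A toy illustration of that race, in Python and with entirely made-up names:

# Toy model of the suspected race; this is not kv_engine code.
import threading, time

class FakeTakeoverStream:
    def __init__(self):
        self.backfill_remaining = 0          # nothing scheduled yet

    def start_backfill(self, items_on_disk):
        # In the real server the backfill is scheduled asynchronously, some time
        # after the stream request has been accepted.
        time.sleep(0.05)
        self.backfill_remaining = items_on_disk

    def takeover_estimate(self):
        # Pre-fix behaviour: report whatever is known, which is 0 until the
        # backfill has been sized.
        return self.backfill_remaining

stream = FakeTakeoverStream()
threading.Thread(target=stream.start_backfill, args=(976680,)).start()
print(stream.takeover_estimate())            # prints 0: the stats were read first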
Evidence that the replication stream had been successfully created before the stats were requested:
[ns_server:debug,2019-08-18T23:20:45.865-07:00,n_1@127.0.0.1:<0.3590.0>:dcp_commands:add_stream:87]Add stream for partition 1, opaque = 0x1, type = add

[ns_server:debug,2019-08-18T23:20:45.865-07:00,n_1@127.0.0.1:<0.3590.0>:dcp_consumer_conn:handle_call:199]Setup DCP streams:
Current [0]
Streams to open [1]
Streams to close []
[ns_server:debug,2019-08-18T23:20:45.866-07:00,n_1@127.0.0.1:<0.3590.0>:dcp_proxy:handle_packet:246]Proxy packet: REQUEST: 0x53 (dcp_stream_req) vbucket = 1 opaque = 0xE000000
80 53 00 00
30 00 00 01
00 00 00 30
0E 00 00 00
00 00 00 00
00 00 00 00
00 00 00 00
00 00 00 00
00 00 00 00
00 00 00 00
FF FF FF FF
FF FF FF FF
00 00 00 00
00 00 00 00
00 00 00 00
00 00 00 00
00 00 00 00
00 00 00 00
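For what it is worth, that request can be decoded by hand. The first 24 bytes are the standard memcached binary protocol header and the remaining 48 bytes are the dcp_stream_req extras (flags, reserved, start seqno, end seqno, vbucket UUID, snapshot start/end, as I read them); the decoded values match what dcp_proxy printed, and show the stream asking for everything from seqno 0 to the maximum, i.e. a fresh stream whose backfill is yet to run:

# Decode the dcp_stream_req packet dumped above (bytes copied verbatim from the log).
import struct

raw = bytes.fromhex(
    "80530000" "30000001" "00000030" "0E000000"   # magic/opcode/keylen, extlen/datatype/vbucket, bodylen, opaque
    "00000000" "00000000"                         # cas
    "00000000" "00000000"                         # flags, reserved
    "0000000000000000"                            # start seqno
    "FFFFFFFFFFFFFFFF"                            # end seqno
    "0000000000000000"                            # vbucket UUID
    "0000000000000000" "0000000000000000"         # snapshot start, snapshot end
)

magic, opcode, keylen, extlen, datatype, vbucket, bodylen, opaque = struct.unpack_from(">BBHBBHII", raw, 0)
flags, reserved, start_seqno, end_seqno, vb_uuid, snap_start, snap_end = struct.unpack_from(">IIQQQQQ", raw, 24)

print(hex(opcode), vbucket, hex(opaque))          # 0x53 1 0xe000000 -- matches the log line
print(start_seqno, hex(end_seqno))                # 0 0xffffffffffffffff -- full-range stream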
[ns_server:debug,2019-08-18T23:20:45.869-07:00,n_1@127.0.0.1:<0.3591.0>:dcp_proxy:handle_packet:246]Proxy packet: RESPONSE: 0x53 (dcp_stream_req) vbucket = 0 opaque = 0xE000000 status = 0x0 (success)
81 53 00 00
00 00 00 00
00 00 00 10
0E 00 00 00
00 00 00 00
00 00 00 00
00 00 44 16
F9 4D FB B9
00 00 00 00
00 00 00 00
[ns_server:debug,2019-08-18T23:20:45.870-07:00,n_1@127.0.0.1:<0.3590.0>:dcp_proxy:handle_packet:246]Proxy packet: RESPONSE: 0x51 (dcp_add_stream) vbucket = 0 opaque = 0x1 status = 0x0 (success)
81 51 00 00
04 00 00 00
00 00 00 04
00 00 00 01
00 00 00 00
00 00 00 00
00 00 00 0E
[rebalance:debug,2019-08-18T23:20:45.870-07:00,n_1@127.0.0.1:<0.3590.0>:dcp_consumer_conn:handle_packet:75]Stream has been added for partition 1, stream opaque = 0xE000000

[ns_server:debug,2019-08-18T23:20:45.870-07:00,n_1@127.0.0.1:<0.3590.0>:dcp_consumer_conn:maybe_reply_setup_streams:418]Setup stream request completed with ok. Moving to idle state
Among other things, this affects rebalance scheduling and sync-write latencies during rebalance.
Attachments
Issue Links
- is a backport of MB-35594: dcp-vbtakeover stats are unreliable shortly after replication creation (Closed)
For Gerrit Dashboard: MB-36802

# | Subject | Branch | Project | Status | CR | V
---|---|---|---|---|---|---
118410,6 | [BP] MB-36802: Don't return estimate=0 for dcp-takeover stats before backfill | alice | kv_engine | MERGED | +2 | +1
118585,3 | MB-36802 [bp] Handle new dcp-vbtakeover status. | alice | ns_server | MERGED | +2 | +1
118601,1 | Merge commit 'couchbase/alice~3' into madhatter | mad-hatter | ns_server | MERGED | +2 | +1
118750,1 | Merge remote-tracking branch 'couchbase/mad-hatter' | master | ns_server | MERGED | +2 | +1
120079,3 | Merge alice into mad-hatter | mad-hatter | kv_engine | MERGED | +2 | +1
120283,1 | Merge branch 'mad-hatter' | master | kv_engine | MERGED | +2 | +1
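Reading the change subjects above, the fix is two-sided: kv_engine stops returning estimate=0 from dcp-vbtakeover before the backfill has run, reporting a new status instead, and ns_server is taught to handle that status by continuing to wait. A minimal, purely illustrative sketch of the consumer-side logic (not the actual ns_server code; the "backfill_pending" status name is invented here):

# Illustrative only; the real handling lives in ns_server (Erlang).
def data_move_done(takeover_stats, threshold=1000):
    # Hypothetical post-fix behaviour: if the producer says the takeover
    # backfill has not been sized yet, the estimate is meaningless, so keep
    # waiting rather than treating 0 as "nothing left to move".
    if takeover_stats.get("status") == "backfill_pending":
        return False
    return int(takeover_stats["estimate"]) <= threshold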