Couchbase Server / MB-29147

System RBAC causing intermittent rebalance hangs


Details

    • Type: Bug
    • Resolution: Fixed
    • Priority: Blocker
    • Affects Version/s: 5.5.0
    • Fix Version/s: 5.5.0
    • Component/s: clients
    • Build: 5.5.0-2388

    Description

      See the mail thread below for the ongoing discussion, which gives the context for why this bug was filed. Meanwhile, I will also do a few more runs with cbcollect logs before the cluster is dismantled, as requested by Abhishek Singh.

      I have attached the existing cbcollect_info in the logs section.

      From: Brett Lawson <>
      Date: Friday, April 6, 2018 at 2:20 PM
      To: Abhishek Singh <>, Sriram Melkote <>, Matt Ingenthron <>, Dave Finlay <>
      Cc: Balakumaran Gopal <>
      Subject: Re: MB-28968 - System RBAC causing intermittent rebalance hangs
       
      Hey Abhishek,
       
      I would suggest looking at [2] instead as it has slightly more information and comes from the latest documentation, but it is effectively the same thing.
       
      ====
       
      As I mentioned above, NO_ACCESS is simply a notification that you do not have access to a specific resource (whether that be a specific bucket, document, operation, etc.). According to the spec, it would be a bug if libcouchbase were throwing away connections based on that status (though that would explain how it resolves the connection issue so quickly). I’ll speak with Sergey, our libcouchbase maintainer, in the morning and see if that might be the reason for our differences. Additionally, I’ll try to think up some potential fixes overnight, and hopefully we can get a Wireshark capture that concretely points at this being the cause.
       
      Cheers, Brett
       
      [2] https://developer.couchbase.com/documentation/server/5.1/sdk/go/collecting-information-and-logging.html
       
      From: Abhishek Singh <>
      Date: Friday, April 6, 2018 at 1:39 AM
      To: Brett Lawson <>, Sriram Melkote <>, Matt Ingenthron <>, Dave Finlay <>
      Cc: Balakumaran Gopal <>
      Subject: Re: MB-28968 - System RBAC causing intermittent rebalance hangs
       
      Hi Brett,
       
      > It may also be helpful to enable gocb/gocbcore logging if you have not already.
       
      In Eventing master builds, we haven’t made any specific change w.r.t. gocb logging; we simply use it as a package with the default configuration. Based on the messages in the Eventing log files, I suspect gocb/gocbcore-related log messages are absent. Please share how to enable logging for gocb/gocbcore to capture debugging info from the SDK side, or is it OK to just follow the steps mentioned in [1]?
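 
      For reference, a minimal sketch of enabling SDK-side logging in a Go program that embeds gocb, assuming the gocb 1.x logging API (SetLogger plus the built-in stdio loggers); the connection string is a placeholder, not taken from this cluster:
 
      // Minimal sketch, assuming gocb 1.x: route gocb/gocbcore log messages
      // to stderr at verbose (trace) level before connecting.
      package main
 
      import (
          "gopkg.in/couchbase/gocb.v1"
      )
 
      func main() {
          gocb.SetLogger(gocb.VerboseStdioLogger())
 
          cluster, err := gocb.Connect("couchbase://127.0.0.1") // placeholder endpoint
          if err != nil {
              panic(err)
          }
          _ = cluster
      }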
       
      ====
       
      Just based on the Eventing stats dump, it seems this problem isn’t manifesting via LCB. With LCB, I think it internally keeps retrying until it gets valid RBAC auth to perform the bucket/N1QL op, so even if cbauth provided stale creds or Eventing’s cached creds were stale, it would hit the script timeout. The script timeout is the execution-time threshold that Eventing enforces on JS code execution, and in this case the stats from the V8 sandbox (which uses LCB) don’t report a non-zero “timeout_count”.
       
      2018-04-04T02:16:14.598-07:00 [Info] eventing-consumer [worker_Function_900706978_test_kv_rebalance_in_when_existing_eventing_node_is_processing_mutation_7:33998:9666] v8worker failure stats : {"bucket_op_exception_count":0, "n1ql_op_exception_count":0, "timeout_count":0, "checkpoint_failure_count":0,"dcp_events_lost": 0,"v8worker_events_lost": 0,"app_worker_setting_events_lost": 0,"timer_events_lost": 0,"debugger_events_lost": 0,"mutation_events_lost": 0,"delete_events_lost": 0,"cron_timer_events_lost": 0,"doc_timer_events_lost": 0,"timestamp" : "2018-04-04T09:16:14Z"}
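 
      Purely for contrast with the LCB behaviour described above, here is a hedged sketch of what an application-level retry-until-valid-auth loop would look like with gocb 1.x; the gocb.ErrAccessError matching, timings, and names are assumptions for illustration, not Eventing or SDK code:
 
      // Illustrative only: keep retrying a KV get while the server returns
      // "no access", mimicking the retry-until-valid-RBAC behaviour attributed
      // to LCB above. gocb.ErrAccessError and the timings are assumptions.
      package main
 
      import (
          "fmt"
          "time"
 
          "gopkg.in/couchbase/gocb.v1"
      )
 
      func getWithAuthRetry(bucket *gocb.Bucket, key string, deadline time.Duration) (interface{}, error) {
          var value interface{}
          start := time.Now()
          for {
              _, err := bucket.Get(key, &value)
              if err == nil {
                  return value, nil
              }
              // Only retry while the failure looks like stale RBAC creds and
              // the overall deadline (cf. Eventing's script timeout) allows it.
              if err != gocb.ErrAccessError || time.Since(start) > deadline {
                  return nil, err
              }
              time.Sleep(500 * time.Millisecond)
          }
      }
 
      func main() {
          fmt.Println("see getWithAuthRetry; cluster/bucket wiring is omitted in this sketch")
      }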
       
      Also, there is a bit of a difference in the period over which RBAC creds are requested by LCB vs. GoCB (note: the KV rebalance had finished before 2018-04-04T02:18:02.975-07:00):
       
      LCB
      431 2018-04-04T02:15
      272 2018-04-04T02:16
       
      GoCB
        51 2018-04-04T02:15
        89 2018-04-04T02:16
        48 2018-04-04T02:17
         5 2018-04-04T02:23
       
      [1] https://developer.couchbase.com/documentation/server/4.0/sdks/go-beta/logging.html
       
       
      Thanks,
      Abhishek
       
       
      From: Brett Lawson <>
      Date: Friday, April 6, 2018 at 1:27 PM
      To: Sriram Melkote <>, Abhishek Singh <>, Matt Ingenthron <>, Dave Finlay <>
      Cc: Balakumaran Gopal <>
      Subject: Re: MB-28968 - System RBAC causing intermittent rebalance hangs
       
      Hey Siri,
       
      Any chance we could gather a Wireshark capture and eventing log (as below) that can be roughly correlated in time, so that I can investigate what the SDK is doing and seeing when this starts to occur? This is probably the quickest path to identifying the specific cause of the SDK's inability to perform operations. It may also be helpful to enable gocb/gocbcore logging if you have not already.
       
      A possible avenue to investigate here in the packet trace: NO_ACCESS cannot be treated as a ‘connection fatal’ event, because it is also seen in various valid RBAC situations. I’m not certain about the architecture of cbauth and credentials passing, but we’ve previously had issues related to gocb reacting to rebalances very quickly, even before the appropriate permissions had been propagated (this came up in MB-26144 or one of its spin-offs). It’s possible we are seeing that again here, and the SDK doesn’t have enough context to know it needs to establish the connection again to pick up up-to-date permissions. You can check whether this is happening by verifying that the SDK is correctly performing SASL_AUTH and SELECT_BUCKET, and then watching to see whether the SDK keeps getting NO_ACCESS right up until it creates a new connection (which it will occasionally do during cluster map change notifications).
       
      Cheers, Brett
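 
      For anyone reading the capture Brett asks for above, the following is a small, purely illustrative Go helper that checks the pattern he describes, i.e. NO_ACCESS responses clearing only after a fresh SASL_AUTH/SELECT_BUCKET on a new connection. The opcode and status values are from the memcached binary protocol; everything else here is hypothetical:
 
      // Illustrative only: given (opcode, status) pairs extracted from a packet
      // capture, report whether NO_ACCESS runs end only at a re-authentication.
      package main
 
      import "fmt"
 
      const (
          opGet          = 0x00 // GET
          opSASLAuth     = 0x21 // SASL_AUTH
          opSelectBucket = 0x89 // SELECT_BUCKET
          statusNoAccess = 0x24 // NO_ACCESS / EACCESS
      )
 
      type event struct {
          opcode uint8
          status uint16
      }
 
      // noAccessClearsOnlyAtReauth returns true if every run of NO_ACCESS
      // responses ends only at the next SASL_AUTH or SELECT_BUCKET request.
      func noAccessClearsOnlyAtReauth(events []event) bool {
          inRun := false
          for _, ev := range events {
              switch {
              case ev.status == statusNoAccess:
                  inRun = true
              case ev.opcode == opSASLAuth || ev.opcode == opSelectBucket:
                  inRun = false
              case inRun:
                  // An op succeeded without re-authentication, so the pattern
                  // described above does not hold for this capture.
                  return false
              }
          }
          return true
      }
 
      func main() {
          // Hypothetical sample: NO_ACCESS until a new connection re-auths.
          events := []event{
              {opcode: opGet, status: statusNoAccess},
              {opcode: opGet, status: statusNoAccess},
              {opcode: opSASLAuth, status: 0x00},
              {opcode: opSelectBucket, status: 0x00},
              {opcode: opGet, status: 0x00},
          }
          fmt.Println("NO_ACCESS cleared only by re-auth:", noAccessClearsOnlyAtReauth(events))
      }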
       
      From: Sriram Melkote <>
      Date: Friday, April 6, 2018 at 12:32 AM
      To: Abhishek Singh <>, Matt Ingenthron <>, Brett Lawson <>, Dave Finlay <>
      Cc: Balakumaran Gopal <>
      Subject: Re: MB-28968 - System RBAC causing intermittent rebalance hangs
       
      Hi Brett/Dave,
       
      Bala is filing a bug with cbcollects, but I'm emailing to get this thread started since this issue appears in several bugs.
       
      My understanding is:
      1. It is gocb's responsibility to refresh the node maps when a rebalance occurs
      2. It is ns_server's responsibility to provide us with working credentials to access a remote node
 
      Both of these are fairly old code paths, so I'd be surprised if either has a bug. But I'm puzzled as to what else could lead to what we're seeing below: access being denied for SET/GET operations for extended durations (6 minutes).
       
      Thanks,
      Siri
       
       
      From: Abhishek Singh <>
      Date: Thursday, April 5, 2018 at 9:36 PM
      To: Sriram Melkote <>
      Cc: eventing <>
      Subject: MB-28968 - System RBAC causing intermittent rebalance hangs
       
      Hi Siri,
       
      QE test case that happens to replicate the issue (though it's intermittent):
       
      ./testrunner -i /tmp/testexec.17209.ini -p get-cbcollect-info=True,GROUP=bucket_op_with_timers -t eventing.eventing_rebalance.EventingRebalance.test_kv_rebalance_in_when_existing_eventing_node_is_processing_mutations,nodes_init=6,services_init=kv-kv-eventing-eventing-eventing-index:n1ql,dataset=default,groups=simple,reset_services=True,doc-per-day=10,sock_batch_size=8,worker_count=8,cpp_worker_thread_count=8,handler_code=bucket_op_with_timers,GROUP=bucket_op_with_timers
       
      Relevant cbcollect(log snippets from it referenced below): https://s3.amazonaws.com/bugdb/jira/MB-28968/test_11.zip
      Another cbcollect where same problem manifested: https://s3.amazonaws.com/bugdb/jira/MB-28968/test_16.zip
       
      The cluster had 3 Eventing nodes [172.23.109.71:8096 172.23.109.70:8096 172.23.109.69:8096], and in the test a new KV node is rebalanced in.
       
      Eventing node 172.23.109.70:8096 receives a notification from ns_server that rebalance is starting, and 172.23.109.70:8096 then notifies the other Eventing nodes in the cluster about it. Snippet from node .70 showing that the Eventing rebalance started at 2018-04-04T02:18:02.975-07:00:
       
      2018-04-04T02:18:02.975-07:00 [Info] ServiceMgr::StartTopologyChange change: service.TopologyChange{ID:"13d3d8939120d769b51395838c7d4de5", CurrentTopologyRev:service.Revision(nil), Type:"topology-change-rebalance", KeepNodes:[]struct { NodeInfo service.NodeInfo "json:\"nodeInfo\""; RecoveryType service.RecoveryType "json:\"recoveryType\"" }{struct { NodeInfo service.NodeInfo "json:\"nodeInfo\""; RecoveryType service.RecoveryType "json:\"recoveryType\"" }{NodeInfo:service.NodeInfo{NodeID:"da8e09e209d20b328d6667e85eb27af9", Priority:0, Opaque:interface {}(nil)}, RecoveryType:"recovery-full"}, struct { NodeInfo service.NodeInfo "json:\"nodeInfo\""; RecoveryType service.RecoveryType "json:\"recoveryType\"" }{NodeInfo:service.NodeInfo{NodeID:"bbdd26f7126754137fc4955609f0dd86", Priority:0, Opaque:interface {}(nil)}, RecoveryType:"recovery-full"}, struct { NodeInfo service.NodeInfo "json:\"nodeInfo\""; RecoveryType service.RecoveryType "json:\"recoveryType\"" }{NodeInfo:service.NodeInfo{NodeID:"971ab380217bab731ddbdf226a0de842", Priority:0, Opaque:interface {}(nil)}, RecoveryType:"recovery-full"}}, EjectNodes:[]service.NodeInfo{}}
      2018-04-04T02:18:03.101-07:00 [Info] ServiceMgr::StartTopologyChange Active Eventing nodes in the cluster: [172.23.109.71:8096 172.23.109.70:8096 172.23.109.69:8096]
      2018-04-04T02:18:03.101-07:00 [Info] ServiceMgr::StartTopologyChange Querying nodes: [172.23.109.71:8096 172.23.109.70:8096 172.23.109.69:8096] for bootstrap status
      2018-04-04T02:18:03.110-07:00 [Info] ServiceMgr::StartTopologyChange Status of app bootstrap across all Eventing nodes: false
      2018-04-04T02:18:03.132-07:00 [Info] ServiceMgr::StartTopologyChange Starting up rebalancer
       
      All 6 workers on node 172.23.109.69 got the notification that rebalance has started and begin the vbucket migration work as per the plan:
       
      1227010:2018-04-04T02:18:08.491-07:00 [Info] Consumer::controlRoutine [worker_Function_900706978_test_kv_rebalance_in_when_existing_eventing_node_is_processing_mutation_2:43729:9602] Got notification that cluster state has changed
      1227013:2018-04-04T02:18:08.493-07:00 [Info] Consumer::controlRoutine [worker_Function_900706978_test_kv_rebalance_in_when_existing_eventing_node_is_processing_mutation_6:36338:9651] Got notification that cluster state has changed
      1227016:2018-04-04T02:18:08.495-07:00 [Info] Consumer::controlRoutine [worker_Function_900706978_test_kv_rebalance_in_when_existing_eventing_node_is_processing_mutation_0:35565:9595] Got notification that cluster state has changed
      1227019:2018-04-04T02:18:08.498-07:00 [Info] Consumer::controlRoutine [worker_Function_900706978_test_kv_rebalance_in_when_existing_eventing_node_is_processing_mutation_1:34750:9621] Got notification that cluster state has changed
      1227021:2018-04-04T02:18:08.499-07:00 [Info] Consumer::controlRoutine [worker_Function_900706978_test_kv_rebalance_in_when_existing_eventing_node_is_processing_mutation_3:41311:9612] Got notification that cluster state has changed
      1237730:2018-04-04T02:18:23.398-07:00 [Info] Consumer::controlRoutine [worker_Function_900706978_test_kv_rebalance_in_when_existing_eventing_node_is_processing_mutation_4:42526:9637] Got notification that cluster state has changed
      1238065:2018-04-04T02:18:24.158-07:00 [Info] Consumer::controlRoutine [worker_Function_900706978_test_kv_rebalance_in_when_existing_eventing_node_is_processing_mutation_5:33579:9633] Got notification that cluster state has changed
       
      Since ns_server rebalances KV + Views first, before informing any MDS role about the rebalance, in this case too Eventing started seeing STREAMEND messages, caused by vbucket movement during the KV rebalance, before the Eventing rebalance was kicked off.
      Around that time, I also see GoCB making calls to request RBAC auth for the KV nodes (log snippet):
       
      2018-04-04T02:15:14.350-07:00 [Info] DynamicAuthenticator::Credentials Authenticating endpoint: 172.23.109.49:11210 bucket: metadata
      2018-04-04T02:15:14.356-07:00 [Info] DynamicAuthenticator::Credentials Authenticating endpoint: 172.23.109.67:11210 bucket: metadata
      2018-04-04T02:15:14.356-07:00 [Info] DynamicAuthenticator::Credentials Authenticating endpoint: 172.23.109.49:11210 bucket: metadata
      2018-04-04T02:15:14.520-07:00 [Info] DynamicAuthenticator::Credentials Authenticating endpoint: 172.23.109.67:11210 bucket: src_bucket
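 
      A hedged sketch of the cbauth-backed gocb authenticator implied by the "DynamicAuthenticator::Credentials" log lines above; it is not copied from the Eventing source and assumes only the gocb 1.x Authenticator interface and cbauth.GetMemcachedServiceAuth:
 
      // Sketch only: a gocb 1.x Authenticator that fetches fresh credentials
      // from cbauth every time gocb asks for them, which would produce log
      // lines like the ones above on each (re)connection.
      package auth
 
      import (
          "log"
 
          "github.com/couchbase/cbauth"
          "gopkg.in/couchbase/gocb.v1"
      )
 
      type dynamicAuthenticator struct{}
 
      func (dynamicAuthenticator) Credentials(req gocb.AuthCredsRequest) ([]gocb.UserPassPair, error) {
          log.Printf("DynamicAuthenticator::Credentials Authenticating endpoint: %s bucket: %s",
              req.Endpoint, req.Bucket)
 
          // KV endpoints are host:11210; non-KV endpoints may carry a scheme
          // that would need stripping before the cbauth lookup.
          user, pass, err := cbauth.GetMemcachedServiceAuth(req.Endpoint)
          if err != nil {
              return nil, err
          }
          return []gocb.UserPassPair{{Username: user, Password: pass}}, nil
      }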
       
       
      But starting at 2018-04-04T02:16:14.757-07:00, bucket reads/writes via GoCB start throwing a "no access" error, which likely means memcached is rejecting the supplied RBAC auth:
       
      (first entry)
      2018-04-04T02:16:14.757-07:00 [Error] Consumer::getMetaOpCallback [worker_Function_900706978_test_kv_rebalance_in_when_existing_eventing_node_is_processing_mutation_6:36338:9651] Key: <ud>Function_900706978_test_kv_rebalance_in_when_existing_eventing_node_is_processing_mutation::vb::273</ud> path: <ud>last_processed_seq_no</ud> subdoc lookup failed, err: no access
       
      (last entry from supplied cbcollect)
      2018-04-04T02:22:54.898-07:00 [Error] Consumer::getOpCallback [worker_Function_900706978_test_kv_rebalance_in_when_existing_eventing_node_is_processing_mutation_7:33998:9666] Bucket fetch failed for key: <ud>Function_900706978_test_kv_rebalance_in_when_existing_eventing_node_is_processing_mutation::vb::318</ud>, err: no access
       
      So, the "no access" error lasts for a good 6 minutes and the Eventing workers aren't able to update metadata, which can essentially hang the KV rebalance-in in this test case:
       
      ➜  cbcollect_info_ns_1@172.23.109.69_20180404-092733 grep "no access" ns_server.eventing.log | wc -l
          4461
       
      Thanks,
      Abhishek
      


            People

              Assignee: Balakumaran Gopal
              Reporter: Balakumaran Gopal
