Details

Type: Bug
Resolution: Fixed
Priority: Blocker
Fix Version/s: 2.1.0
Affects Version/s: 2.1.0
Component/s: couchbase-bucket
Security Level: Public
Labels:
None
Environment:
linux, windows

Description

On the EC2 cluster I used for cbrecovery, the disk write queue did not flush for a very long time (this symptom was observed twice).

It is an XDCR destination cluster, with no ops other than incoming XDCR traffic.

Filed as a blocker, since it blocks cbrecovery testing. The clusters have been left intact for a developer to diagnose. Please email me when the problem is resolved so that I can continue the test. Thanks!

Ronnie

Destination:
http://ec2-184-169-190-197.us-west-1.compute.amazonaws.com:8091

And the source:
http://ec2-23-21-15-103.compute-1.amazonaws.com:8091

User: Administrator
Passwd: password

SSH:
user: root
passwd: couchbase

The original email thread, with the diagnosis from Jin and Junyi, is attached below:

This should be a bug. There are a lot of crashes in the ns_server/baby_sitter and couchdb/writer processes. Couchdb is unable to spawn a writer process for some reason.

I am not 100% sure which caused which, but it should be unrelated to XDCR.

Please file a 2.0.2 bug and assign it to Filipe to take a first look. Thanks.

Junyi

On May 13, 2013, at 10:56 AM, Jin Lim wrote:

Logs from ec2-184-169-190-197.us-west-1.compute.amazonaws.com: I see the following error. After the error, I see that couch_db/updater kept crashing. I will wait until Junyi has taken a quick look at it from the XDCR side, and then figure out what is next.

Thanks,
Jin

[ns_server:info,2013-05-10T18:00:22.665,ns_1@ec2-184-169-190-197.us-west-1.compute.amazonaws.com:ns_config_log<0.562.0>:ns_config_log:handle_info:57]config change: rest_creds -> ********
[user:info,2013-05-10T18:00:22.718,ns_1@ec2-184-169-190-197.us-west-1.compute.amazonaws.com:<0.570.0>:ns_log:crash_consumption_loop:64]Port server moxi on node 'babysitter_of_ns_1@127.0.0.1' exited with status 0. Restarting. Messages: WARNING: curl error: transfer closed with outstanding read data remaining from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming
WARNING: curl error: couldn't connect to host from: http://127.0.0.1:8091/pools/default/saslBucketsStreaming
ERROR: could not contact REST server(s): http://127.0.0.1:8091/pools/default/saslBucketsStreaming
EOL on stdin. Exiting
[user:info,2013-05-10T18:00:32.169,ns_1@ec2-184-169-190-197.us-west-1.compute.amazonaws.com:<0.570.0>:ns_log:crash_consumption_loop:64]Port server moxi on node 'babysitter_of_ns_1@127.0.0.1' exited with status 0. Restarting. Messages: 2013-05-10 18:00:22: (cproxy_config.c.317) env: MOXI_SASL_PLAIN_USR (13)
2013-05-10 18:00:22: (cproxy_config.c.326) env: MOXI_SASL_PLAIN_PWD (8)

On May 13, 2013, at 10:41 AM, Ronnie Sun wrote:

Hi Jin, Junyi

On the EC2 cluster I used for cbrecovery, the disk write queue did not flush for a very long time (this symptom was observed twice).

It is an XDCR destination cluster, with no ops other than incoming XDCR traffic.

Everything looks normal; I am not sure if it is related to the recent rebalance regression. Would you please take a look at it? Thanks!

Ronnie

Destination:
http://ec2-184-169-190-197.us-west-1.compute.amazonaws.com:8091

And the source:
http://ec2-23-21-15-103.compute-1.amazonaws.com:8091

User: Administrator
Passwd: password

SSH:
user: root
passwd: couchbase
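
When triaging logs like the excerpt in the thread above, it can help to tally how often each port server was restarted. The following is only a minimal sketch, assuming the log lines follow the "Port server ... exited with status ..." format shown in the excerpt. The function name `count_port_exits`, the regular expression, and the abbreviated sample lines are illustrative and are not part of ns_server.

```python
import re
from collections import Counter

# Matches entries such as:
#   Port server moxi on node 'babysitter_of_ns_1@127.0.0.1' exited with status 0. Restarting.
# The pattern is an assumption derived from the excerpt, not an ns_server definition.
PORT_EXIT = re.compile(
    r"Port server (?P<name>\S+) on node '(?P<node>[^']+)' "
    r"exited with status (?P<status>-?\d+)"
)


def count_port_exits(lines):
    """Count port-server exits per (server name, exit status)."""
    counts = Counter()
    for line in lines:
        match = PORT_EXIT.search(line)
        if match:
            counts[(match.group("name"), int(match.group("status")))] += 1
    return counts


if __name__ == "__main__":
    # Sample lines abbreviated from the excerpt (log prefix and trailing messages dropped).
    sample = [
        "config change: rest_creds -> ********",
        "Port server moxi on node 'babysitter_of_ns_1@127.0.0.1' exited with status 0. Restarting.",
        "Port server moxi on node 'babysitter_of_ns_1@127.0.0.1' exited with status 0. Restarting.",
    ]
    for (name, status), n in sorted(count_port_exits(sample).items()):
        print(f"{name} exited with status {status}: {n} time(s)")
```

To run it against a real log, pass an open file object instead of the sample list, since the function accepts any iterable of lines.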

Attachments

ns-diag-20130513235834.txt.zip	14.94 MB	13/May/13 5:53 PM
Screen Shot 2013-05-13 at 4.58.24 PM.png	67 kB	13/May/13 4:59 PM
Screen Shot 2013-05-16 at 2.29.11 PM.png	33 kB	16/May/13 2:31 PM

Gerrit Reviews


For Gerrit Dashboard: MB-8259
#	Subject	Branch	Project	Status	CR	V
26612,3	MB-8259 Maintain a disk queue size stat per shard.	2.0.2	ep-engine	ABANDONED	0	0
26621,2	MB-8259 Reflect checkpoint collasping impact on disk queue size.	2.0.2	ep-engine	MERGED	+2	+1

Activity

People

Assignee: Ronnie Sun (Inactive)

Reporter: Ronnie Sun (Inactive)

Votes: 0

Watchers: 8

Dates

Created: 13/May/13 4:55 PM

Updated: 29/Jun/13 12:55 PM

Resolved: 02/Jun/13 10:15 PM

Disk not flush on xdcr dest node
