Couchbase Server / MB-8394

Views and Couchbase HTTP API really slow or timing out when XDCR is activated | Nodes in pending state during XDCR initial load


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: 3.0
    • Affects Version/s: 2.0.1
    • Component/s: RESTful-APIs, UI, view-engine, XDCR
    • Security Level: Public
    • Triage: Untriaged
    • Environment: CentOS 64-bit

    Description

      Hello Couchbase team,

      We are experiencing really slow view response times and HTTP API calls on our "source" Couchbase cluster since we activated unidirectional XDCR.

      Our configuration is detailed at the end of this post.

      1. Symptoms:

      1.1. The symptom is most visible while XDCR is first activated and the "initial items load" takes place from source to dest:

      • Couchbase GUI:8091 on the source cluster is unresponsive from time to time (more than 30-60 sec from click to display, or displaying ~"XHR Failed to connect")
      • Calls to views take ages or time out (HTTP timeout, not the "connection_timeout" query-string parameter of the view call)
      • Couchbase API calls to ":8091/pools/default/buckets/bucketName", ":8091/pools/default/tasks" or ":8091/nodeStatuses" are really slow or timing out (HTTP timeout set to 60 sec; see the timing sketch after this list)
      • Calls to ":8091/nodeStatuses" report nodes flapping back and forth between healthy and unhealthy
      • In the GUI:8091, nodes see each other as pending for a few seconds and then healthy again, but no node crashes (no erl_crash.dump or core file in /opt/couchbase/var/lib/couchbase)
      • No impact detected on the get/set/etc. "memcached" operations: same rate of operations per second and same response times (always under an average of 500µs for each bucket)
        (<- Because the API call to ":8091/pools/default/buckets/bucketName" was timing out, we had to use moxi's "stats proxy timings" to verify this point)
      • No Couchbase connection or set/get/etc. errors reported in our application logs
      • No impact detected on the dest cluster (GUI and APIs appear to be OK): the XDCR ongoing replication reports gets and sets
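      For reference, a minimal sketch of the kind of probe that can put numbers on this REST slowness (the host name, credentials and bucket name are placeholders to adapt):

          import time
          import urllib.request

          # Placeholders: adapt the host, credentials and bucket name.
          BASE = "http://source-node:8091"
          ENDPOINTS = [
              "/pools/default/buckets/bucketName",
              "/pools/default/tasks",
              "/nodeStatuses",
          ]

          # Basic-auth opener for the Couchbase REST API.
          mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
          mgr.add_password(None, BASE, "Administrator", "password")
          opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))

          for path in ENDPOINTS:
              start = time.time()
              try:
                  # Same 60 sec HTTP timeout as our own API calls.
                  opener.open(BASE + path, timeout=60).read()
                  print("%-40s %7.2f s" % (path, time.time() - start))
              except Exception as exc:  # timeout or connection error
                  print("%-40s FAILED after %.2f s: %s" % (path, time.time() - start, exc))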

      1.2. The symptom is less visible, but still present, once source and dest buckets are in sync and XDCR is only replicating ongoing mutations:

      • Views take longer than usual to render
      • Couchbase GUI:8091 on the source cluster is a bit slower than usual (2-5 sec from click to display) and slower than the dest cluster
      • Couchbase APIs appear to be responsive, but we do not track their response times (we just know they no longer time out)
      • We see more errors on the source cluster in /opt/couchbase/var/lib/couchbase/logs/error.* than before

      <- Logs in /opt/couchbase/var/lib/couchbase/logs/error.* report:
      • Lots of 'topkeys took too long' => 9K/day since XDCR vs <1.5K before XDCR
      • Lots of '{stats,<<>>} took too long' => 10K/day since XDCR vs <20 before XDCR
      • A few "Exception in stats collector" => 0 before XDCR vs 5.5K/day during XDCR initial load vs <200 with XDCR after initial load
      • Some 'Mnesia is overloaded'

      <- Logs in /opt/couchbase/var/lib/couchbase/logs/xdcr.* fill up 10MB every minute, and each minute report roughly the following number of messages:

      Count  Message source (module function)
      1160   concurrency_throttle clean_concurr_throttle_state
      1160   concurrency_throttle signal_waiting
      1160   xdc_vbucket_rep_worker flush_docs
      1165   xdc_vbucket_rep_worker find_missing
      1166   xdc_vbucket_rep_worker local_process_batch
      1191   concurrency_throttle handle_call
      1199   xdc_vbucket_rep_ckpt start_timer
      2220   xdc_replication handle_info
      2321   xdc_vbucket_rep handle_call
      2359   xdc_vbucket_rep_ckpt cancel_timer
      2382   xdc_vbucket_rep handle_info
      4798   xdc_vbucket_rep_worker start_link
      5996   xdc_vbucket_rep start_replication
      7120   xdc_vbucket_rep_worker queue_fetch_loop
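      A minimal sketch of how such a breakdown can be produced, assuming the usual ale log-line layout "[xdcr:<level>,<timestamp>,<node>:<pid>:<module>:<function>:<line>]...":

          import re
          import glob
          from collections import Counter

          # Assumed line layout: [xdcr:debug,<ts>,<node>:<pid>:<module>:<function>:<line>]...
          HEADER = re.compile(r"^\[xdcr:\w+,[^,]+,[^:]+:<[^>]+>:(\w+):(\w+):\d+\]")

          counts = Counter()
          for path in glob.glob("/opt/couchbase/var/lib/couchbase/logs/xdcr.*"):
              with open(path, errors="replace") as f:
                  for line in f:
                      m = HEADER.match(line)
                      if m:
                          counts[m.group(1) + " " + m.group(2)] += 1

          for source, n in sorted(counts.items(), key=lambda kv: kv[1]):
              print("%6d %s" % (n, source))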

      • We have a lot of errors on the dest cluster in /opt/couchbase/var/lib/couchbase/logs/error.*:
        ns_memcached:verify_report_long_call:297]call {get_meta,<<"idXXX">>,793} took too long: 1122684 us
        ns_memcached-bucketName<0.8810.711>:ns_memcached:handle_info:630]handle_info(ensure_bucket,..) took too long
        ale_error_logger_handler:log_msg:76]Mnesia('ns_1@127.0.0.1'): ** WARNING ** Mnesia is overloaded
        Uncaught error in HTTP request: {error,{badmatch,{error,time_out_polling}}}

      • We have a lot of errors in /opt/couchbase/var/lib/couchbase/logs/xdcr_errors.*:
        capi_replication:get_missing_revs:58][Bucket:"XXX", Vb:844]: conflict resolution for 500 docs takes too long to finish!(total time spent: 181 secs, default connection time out: 180 secs)
        capi_replication:update_replicated_docs:100][Bucket:"XXX", Vb:351]: update 157 docs takes too long to finish!(total time spent: 185 secs, default connection time out: 180 secs)
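      For what it's worth, XDCR concurrency can apparently be tuned down to reduce this pressure; a hedged sketch, assuming the /internalSettings REST endpoint and its xdcrMaxConcurrentReps parameter (default 32) exist on our 2.0.1 version:

          import urllib.parse
          import urllib.request

          # Assumption: Couchbase 2.x exposes XDCR tuning knobs on /internalSettings.
          BASE = "http://source-node:8091"  # placeholder host

          mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
          mgr.add_password(None, BASE, "Administrator", "password")  # placeholder credentials
          opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))

          # Halve the number of concurrent vbucket replicators (assumed default: 32).
          body = urllib.parse.urlencode({"xdcrMaxConcurrentReps": 16}).encode()
          print(opener.open(BASE + "/internalSettings", data=body, timeout=60).read())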

      We found bug MB-6787, whose symptoms look a bit like ours, but it is supposed to be fixed.
      We didn't find anything else in the forum or issue tracker.

      We have uploaded the "Diagnostic report" from our source cluster and a cbcollect_info from our dest cluster to your S3 storage into the zbo/ directory :

      "Diagnostic report" from source : zbo/source*_Couchbase_cluster_ns-diag-20130603133644.txt.zip
      "cbcollect_info" from dest : zbo/AWS_dest*_Couchbase_cluster.zip

      Both of them have been taken while XDCR was activated, but after "initial load".

      2. Questions:
      • Did we miss something while configuring our clusters or XDCR?
      • Is this a known issue, and is there a fix date?
      • What could we do to investigate this issue further?
      • Do you confirm that if we had activated auto-failover, the situation would have been worse, with nodes taking over each other's vBuckets while they were HTTP-unresponsive and reported as pending by the others?

      3. Our configuration:

        3.1. Source cluster:

      We are running a 6-node Couchbase source cluster, v2.0.1 Community, on CentOS 6.4 x86_64.
      Each Couchbase node runs inside an OpenVZ (2.6.32-042stab076.5) container with 6 "dedicated" CPUs and 87GB RAM.
      We have applied the erlang memsup patch: cf. MB-4376.
      These OpenVZ containers are spread over 3 hardware servers with plenty of CPU, RAM and disk space to spare (those servers are dedicated to Couchbase and the application servers that query the cluster; nothing else runs on them).
      None of the Couchbase nodes reports more than 40% CPU usage (even during view calls, indexing or compaction, with or without XDCR).
      None of the Couchbase nodes reports memory swapping, and they all have free memory available.
      The reported disk I/O is well under what the underlying SSDs are specified to deliver (we still have 100% active items for all buckets and a 0% miss rate).
      There are 5 buckets in this Couchbase cluster:

      • 2 replicas are activated for each bucket
      • 2 buckets have few items (25K items, 62 items)
      • 3 buckets have a few million items (41M, 3.5M, 1M)
      • 2 buckets have views (the 3.5M-item bucket has 2 design docs with 5 views and 1 view; the 25K bucket has 1 design doc with 2 views)
      • Most views are called once a minute (or more often) with "?stale=false"
      • 2 of those buckets average 300 ops/sec
      • The remaining 3 buckets average between 10 and 50 ops/sec
      • There is one peak period at night, due to batch activity, when Couchbase ops increase to ~5K ops/sec
      • This peak period lasts less than 5 minutes without XDCR, and 60 minutes or more when XDCR is activated
        <- The time-consuming step when XDCR is ON is the repeated "paginated" view calls our batch performs with "&limit=${step2}&skip=${step1}" (see the pagination sketch at the end of this section)
      • All 5 buckets are XDCR-replicated to the destination cluster
      • No XDCR errors are reported in the Couchbase GUI:8091/#sec=replications "Last error" column
      • A lot of "web request failed" errors are reported in the Couchbase GUI:8091/#sec=log, but all of them (I think) are related to the slow Couchbase HTTP API response times during the initial XDCR load:
        Server error during processing: ["web request failed", {path,"/pools/default"}, (...)]
        Server error during processing: ["web request failed", {path,"/pools/default/tasks"}, (...)]
        Server error during processing: ["web request failed", {"/pools/default/bucketsStreaming/settings"}, (...)]
        etc.
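      For context, the batch's paginated view walk is essentially the following (a simplified sketch; node, bucket, design doc, view and page size are placeholders, and 8092 is the 2.0 view port). Every stale=false call forces the index to catch up before returning, and a growing skip makes each successive page more expensive to serve:

          import json
          import urllib.request

          # Placeholders: node, bucket, design doc and view names.
          VIEW = ("http://source-node:8092/bucketName/_design/ddoc/_view/by_type"
                  "?stale=false&limit=%d&skip=%d")
          PAGE = 1000

          skip = 0
          while True:
              # stale=false: wait for the index; skip: scan past prior pages.
              with urllib.request.urlopen(VIEW % (PAGE, skip), timeout=60) as resp:
                  rows = json.load(resp)["rows"]
              for row in rows:
                  pass  # process the row
              if len(rows) < PAGE:
                  break
              skip += PAGE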

        3.2. Destination cluster:

      The XDCR destination cluster is a single dedicated Couchbase node on an AWS EC2 m2.4xlarge.
      There are no "memcached operations" on this cluster.
      There are no views on this cluster.
      There is no XDCR defined on this cluster.
      CPU usage is below 60% on this node (most of it being beam.smp).
      There is still 1GB of free memory.
      Disk I/O is write-only (5 to 10MB/s) <- Disks are instance store only (no EBS).

        3.3. Interconnection between source and dest clusters:

      Source and destination clusters are connected through a VPN.
      One OpenVPN server runs on the source cluster side.
      Our one-node AWS cluster connects as a client to the OpenVPN server over the public internet.
      The traffic is routed (not "OpenVPN-bridged") to and from the different Couchbase nodes' OpenVZ containers.
      Our internet connectivity on the source side is 1Gb/s, and we average a modest 20Mb/s TX+RX (most of it being XDCR traffic).

        3.4. Our application:

      node.js with node-memcache and client-side moxi v1.8.1

      4. Bonus questions:
      • What happens if the "Cluster Reference" Couchbase node used when creating the initial XDCR configuration is later removed from the cluster (during a legitimate cluster-shrink operation on the destination cluster, for example)?
      • Couchbase appears to be very verbose when logging to /opt/couchbase/var/lib/couchbase/logs/xdcr.* <- Is this normal behaviour, and can we turn it off during standard production operation, outside of any debugging needs? (see the sketch after this list)
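      For reference, a hedged sketch of how the xdcr logger's verbosity might be lowered at runtime, assuming the (undocumented) /diag/eval endpoint and the 'xdcr' ale logger name are valid on our version; the change would not persist across restarts:

          import urllib.request

          # Assumptions: ns_server exposes /diag/eval and an ale logger named 'xdcr'.
          BASE = "http://source-node:8091"  # placeholder host

          mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
          mgr.add_password(None, BASE, "Administrator", "password")  # placeholder credentials
          opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))

          # Raise the log threshold from debug to error for the xdcr logger.
          expr = b"ale:set_loglevel(xdcr, error)."
          print(opener.open(BASE + "/diag/eval", data=expr, timeout=60).read())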

      Regards,

      Xavier.


          People

            Assignee: Andrei Baranouski
            Reporter: Xav M
            Votes: 0
            Watchers: 8
