Details
Improvement
Resolution: Fixed
Major
None
2.0.1
None
Security Level: Public
None
Description
Hi Karen,
Here are 2 sections to incorporate into the manual somewhere. Feel free to edit the text as necessary, and if any of this doesn't make sense let me know.
marty
Understanding Performance of the Couchbase plugin for Elasticsearch in Practice
-------------------------------------------------------------------------------
The Couchbase plugin for Elasticsearch uses XDCR for the transport of data. One of the
most important parameters controlling the performance of XDCR is "xdcrMaxConcurrentReps".
This value represents the maximum number of replication operations that will take place
concurrently from each node in the Couchbase cluster and it defaults to 32.
In practice, this means that replicating from a 5-node Couchbase cluster to a 1-node
Elasticsearch cluster can produce up to 160 (5 x 32) concurrent replications targeting a
single Elasticsearch node. Each replication may require multiple TCP connections, and this
can end up overwhelming the Elasticsearch node.
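The arithmetic above can be sketched as a small helper. This is only an illustration: the
function name is hypothetical, and the default of 32 is the "xdcrMaxConcurrentReps" default
stated above.

```python
def max_concurrent_replications(source_nodes: int, max_concurrent_reps: int = 32) -> int:
    """Upper bound on concurrent replications issued by the whole source cluster.

    Each Couchbase node may run up to `max_concurrent_reps` replications at once,
    so the cluster-wide ceiling is the product of the two values.
    """
    if source_nodes < 1 or max_concurrent_reps < 1:
        raise ValueError("both arguments must be positive integers")
    return source_nodes * max_concurrent_reps


# The 5-node example from the text, using the default setting.
print(max_concurrent_replications(5))
```

With a single Elasticsearch node, all of these replications target the same machine.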
Once an Elasticsearch node is overwhelmed, a variety of errors may occur. Some examples
are:
Error replicating vbucket 7:
{badmatch, {error,all_nodes_failed,
<<"Failed to grab remote bucket info from any of known nodes">>}}
Error replicating vbucket 7:
{error,
}}
These errors occur because Couchbase is unable to communicate with Elasticsearch in a
reasonable amount of time. XDCR can recover from these types of errors, but your
replication may take longer to complete, or operate with higher latency because these
operations must be retried at a later time.
In circumstances such as this, it may help to lower the "xdcrMaxConcurrentReps" value so
that the total number of concurrent replications across the whole cluster is one the
Elasticsearch nodes can keep up with.
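As a rough sketch of how a lower value might be chosen and applied: the helper names below
are hypothetical, and the "/internalSettings" REST endpoint on port 8091 is an assumption
based on the Couchbase Server 2.0 REST API, so check it against the documentation for your
version. The request is only built, not sent, and administrator credentials are omitted.

```python
import urllib.parse
import urllib.request


def per_node_limit(source_nodes: int, target_total: int) -> int:
    """Per-node setting that keeps the cluster-wide total at or below target_total.

    Never returns less than 1, so replication is not stopped entirely.
    """
    if source_nodes < 1 or target_total < 1:
        raise ValueError("both arguments must be positive integers")
    return max(1, target_total // source_nodes)


def build_settings_request(base_url: str, value: int) -> urllib.request.Request:
    """Build (but do not send) a form-encoded POST that sets xdcrMaxConcurrentReps.

    ASSUMPTION: the setting is exposed at <base_url>/internalSettings.
    """
    body = urllib.parse.urlencode({"xdcrMaxConcurrentReps": value}).encode("ascii")
    return urllib.request.Request(f"{base_url}/internalSettings", data=body, method="POST")


# Example: aim for about 40 concurrent replications from a 5-node cluster.
limit = per_node_limit(source_nodes=5, target_total=40)
request = build_settings_request("http://localhost:8091", limit)
print(limit, request.get_method(), request.full_url, request.data)
```

A suitable target total depends on the capacity of the Elasticsearch cluster, so treat the
value 40 here as a placeholder to tune, not a recommendation.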
Initial Elasticsearch Indexing of an Existing Couchbase Bucket
--------------------------------------------------------------
Often you have an existing Couchbase bucket with a large number of documents in
production. When you first start to index this data with Elasticsearch, a large
number of documents will be transferred in bulk. While this should work with the default
settings, some Elasticsearch settings can be tuned to make this initial indexing phase
complete faster.
The "refresh_interval" setting in Elasticsearch controls how frequently newly indexed
items become available in search results. During a bulk load, you can disable index
refresh, trading immediate access to the newly indexed items for faster overall indexing
time, and re-enable it once the load completes.
For full details about disabling and re-enabling index refresh, see this section of the
Elasticsearch guide:
http://www.elasticsearch.org/guide/reference/api/admin-indices-update-settings.html
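A minimal sketch of the disable and re-enable steps, using the Elasticsearch update
settings API. The host, port, and index name ("my_index") are placeholders, the helper name
is hypothetical, and "1s" is assumed to be the refresh interval you want to restore. The
requests are only built here, not sent.

```python
import json
import urllib.request


def build_refresh_request(es_url: str, index: str, interval: str) -> urllib.request.Request:
    """Build (but do not send) a PUT to <es_url>/<index>/_settings that sets refresh_interval."""
    body = json.dumps({"index": {"refresh_interval": interval}}).encode("utf-8")
    return urllib.request.Request(
        f"{es_url}/{index}/_settings",
        data=body,
        method="PUT",
        headers={"Content-Type": "application/json"},
    )


# Before the initial bulk transfer: "-1" disables periodic refresh.
disable = build_refresh_request("http://localhost:9200", "my_index", "-1")

# After the transfer completes: restore a periodic refresh (assumed here to be "1s").
restore = build_refresh_request("http://localhost:9200", "my_index", "1s")

for req in (disable, restore):
    print(req.get_method(), req.full_url, req.data.decode("utf-8"))
```

To apply a request, pass it to urllib.request.urlopen() against a reachable Elasticsearch
node. Documents indexed while refresh is disabled will not appear in search results until
refresh is re-enabled or triggered manually.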