Couchbase Server / MB-8394

Views and Couchbase HTTP API really slow or timing out when XDCR is activated | Nodes in pending state during XDCR initial load


Details

    • Type: Bug
    • Resolution: Cannot Reproduce
    • Priority: Major
    • Fix Version/s: 3.0
    • Affects Version/s: 2.0.1
    • Component/s: RESTful-APIs, UI, view-engine, XDCR
    • Security Level: Public
    • Triage: Untriaged
    • Environment: CentOS 64-bit

    Description

      Hello Couchbase team,

      We are experiencing really slow view response times and HTTP API calls on our "source" Couchbase cluster since we activated unidirectional XDCR.

      Our configuration is detailed at the end of this post.

      1. Symptoms:

      1.1. The symptom is most visible while XDCR is first activated and the "initial items load" takes place from source to dest:

      • Couchbase GUI:8091 on the source cluster is unresponsive from time to time (more than 30-60 sec from click to display, or displaying ~"XHR Failed to connect")
      • Calls to views take ages or time out (HTTP timeout, not the "connection_timeout" query-string parameter of the view call)
      • Couchbase API calls to ":8091/pools/default/buckets/bucketName", ":8091/pools/default/tasks" or ":8091/nodeStatuses" are really slow or timing out (HTTP timeout set to 60 sec; see the timing sketch after this list)
      • Calls to ":8091/nodeStatuses" report nodes flapping back and forth between healthy and unhealthy
      • In the GUI:8091, nodes see each other as pending for a few seconds and then healthy again, but no node crashes (no erl_crash.dump or core file in /opt/couchbase/var/lib/couchbase)
      • No impact detected on the get/set/etc. "memcached" operations: same rate of operations per second and same response times (always under an average of 500µs for each bucket)
        (<- Because the API call to ":8091/pools/default/buckets/bucketName" was timing out, we had to use moxi's "stats proxy timings" to verify this point)
      • No Couchbase connection or set/get/etc. errors reported in our application logs
      • No impact detected on the dest cluster (GUI and APIs appear to be OK): the XDCR ongoing replication reports gets and sets
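      For reference, a minimal sketch of the kind of probe that can put numbers on this REST slowness (the host name, credentials and bucket name are placeholders to adapt):

          import time
          import urllib.request

          # Placeholders: adapt the host, credentials and bucket name.
          BASE = "http://source-node:8091"
          ENDPOINTS = [
              "/pools/default/buckets/bucketName",
              "/pools/default/tasks",
              "/nodeStatuses",
          ]

          # Basic-auth opener for the Couchbase REST API.
          mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
          mgr.add_password(None, BASE, "Administrator", "password")
          opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))

          for path in ENDPOINTS:
              start = time.time()
              try:
                  # Same 60 sec HTTP timeout as our own API calls.
                  opener.open(BASE + path, timeout=60).read()
                  print("%-40s %7.2f s" % (path, time.time() - start))
              except Exception as exc:  # timeout or connection error
                  print("%-40s FAILED after %.2f s: %s" % (path, time.time() - start, exc))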

      1.2. The symptom is less visible, but still present, once source and dest buckets are in sync and XDCR is only replicating ongoing mutations:

      • Views take longer than usual to render
      • Couchbase GUI:8091 on the source cluster is a bit slower than usual (2-5 sec from click to display) and slower than the dest cluster
      • Couchbase APIs appear to be responsive, but we do not track their response times (we just know they no longer time out)
      • We see more errors on the source cluster in /opt/couchbase/var/lib/couchbase/logs/error.* than before

      <- Logs in /opt/couchbase/var/lib/couchbase/logs/error.* report:
      • Lots of 'topkeys took too long' => 9K/day since XDCR vs <1.5K before XDCR
      • Lots of '{stats,<<>>} took too long' => 10K/day since XDCR vs <20 before XDCR
      • A few "Exception in stats collector" => 0 before XDCR vs 5.5K/day during XDCR initial load vs <200 with XDCR after initial load
      • Some 'Mnesia is overloaded'

      <- Logs in /opt/couchbase/var/lib/couchbase/logs/xdcr.* fill up 10MB every minute, and each minute report roughly the following number of messages:

      Count  Message source (module function)
      1160   concurrency_throttle clean_concurr_throttle_state
      1160   concurrency_throttle signal_waiting
      1160   xdc_vbucket_rep_worker flush_docs
      1165   xdc_vbucket_rep_worker find_missing
      1166   xdc_vbucket_rep_worker local_process_batch
      1191   concurrency_throttle handle_call
      1199   xdc_vbucket_rep_ckpt start_timer
      2220   xdc_replication handle_info
      2321   xdc_vbucket_rep handle_call
      2359   xdc_vbucket_rep_ckpt cancel_timer
      2382   xdc_vbucket_rep handle_info
      4798   xdc_vbucket_rep_worker start_link
      5996   xdc_vbucket_rep start_replication
      7120   xdc_vbucket_rep_worker queue_fetch_loop
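      A minimal sketch of how such a breakdown can be produced, assuming the usual ale log-line layout "[xdcr:<level>,<timestamp>,<node>:<pid>:<module>:<function>:<line>]...":

          import re
          import glob
          from collections import Counter

          # Assumed line layout: [xdcr:debug,<ts>,<node>:<pid>:<module>:<function>:<line>]...
          HEADER = re.compile(r"^\[xdcr:\w+,[^,]+,[^:]+:<[^>]+>:(\w+):(\w+):\d+\]")

          counts = Counter()
          for path in glob.glob("/opt/couchbase/var/lib/couchbase/logs/xdcr.*"):
              with open(path, errors="replace") as f:
                  for line in f:
                      m = HEADER.match(line)
                      if m:
                          counts[m.group(1) + " " + m.group(2)] += 1

          for source, n in sorted(counts.items(), key=lambda kv: kv[1]):
              print("%6d %s" % (n, source))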

      • We have a lot of errors on the dest cluster in /opt/couchbase/var/lib/couchbase/logs/error.*:
        ns_memcached:verify_report_long_call:297]call {get_meta,<<"idXXX">>,793} took too long: 1122684 us
        ns_memcached-bucketName<0.8810.711>:ns_memcached:handle_info:630]handle_info(ensure_bucket,..) took too long
        ale_error_logger_handler:log_msg:76]Mnesia('ns_1@127.0.0.1'): ** WARNING ** Mnesia is overloaded
        Uncaught error in HTTP request: {error,{badmatch,{error,time_out_polling}}}

      • We have a lot of errors in /opt/couchbase/var/lib/couchbase/logs/xdcr_errors.*:
        capi_replication:get_missing_revs:58][Bucket:"XXX", Vb:844]: conflict resolution for 500 docs takes too long to finish!(total time spent: 181 secs, default connection time out: 180 secs)
        capi_replication:update_replicated_docs:100][Bucket:"XXX", Vb:351]: update 157 docs takes too long to finish!(total time spent: 185 secs, default connection time out: 180 secs)
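      For what it's worth, XDCR concurrency can apparently be tuned down to reduce this pressure; a hedged sketch, assuming the /internalSettings REST endpoint and its xdcrMaxConcurrentReps parameter (default 32) exist on our 2.0.1 version:

          import urllib.parse
          import urllib.request

          # Assumption: Couchbase 2.x exposes XDCR tuning knobs on /internalSettings.
          BASE = "http://source-node:8091"  # placeholder host

          mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
          mgr.add_password(None, BASE, "Administrator", "password")  # placeholder credentials
          opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))

          # Halve the number of concurrent vbucket replicators (assumed default: 32).
          body = urllib.parse.urlencode({"xdcrMaxConcurrentReps": 16}).encode()
          print(opener.open(BASE + "/internalSettings", data=body, timeout=60).read())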

      We found bug MB-6787, whose symptoms look a bit like ours, but it is supposed to be fixed.
      We didn't find anything else in the forum or issue tracker.

      We have uploaded the "Diagnostic report" from our source cluster and a cbcollect_info from our dest cluster to your S3 storage into the zbo/ directory :

      "Diagnostic report" from source : zbo/source*_Couchbase_cluster_ns-diag-20130603133644.txt.zip
      "cbcollect_info" from dest : zbo/AWS_dest*_Couchbase_cluster.zip

      Both of them have been taken while XDCR was activated, but after "initial load".

      2. Questions:
      • Did we miss something while configuring our clusters or XDCR?
      • Is this a known issue, and is there a fix date?
      • What could we do to investigate this issue further?
      • Do you confirm that if we had activated auto-failover, the situation would have been worse, with nodes taking over each other's vBuckets while they were HTTP-unresponsive and reported as pending by the others?

      3. Our configuration:

        3.1. Source cluster:

      We are running a 6-node Couchbase source cluster, v2.0.1 Community, on CentOS 6.4 x86_64.
      Each Couchbase node runs inside an OpenVZ (2.6.32-042stab076.5) container with 6 "dedicated" CPUs and 87GB RAM.
      We have applied the erlang memsup patch: cf. MB-4376.
      These OpenVZ containers are spread over 3 hardware servers with plenty of CPU, RAM and disk space to spare (those servers are dedicated to Couchbase and the application servers that query the cluster; nothing else runs on them).
      None of the Couchbase nodes reports more than 40% CPU usage (even during view calls, indexing or compaction, with or without XDCR).
      None of the Couchbase nodes reports memory swapping, and they all have free memory available.
      The reported disk I/O is well under what the underlying SSDs are specified to deliver (we still have 100% active items for all buckets and a 0% miss rate).
      There are 5 buckets in this Couchbase cluster:

      • 2 replicas are activated for each bucket
      • 2 buckets have few items (25K items, 62 items)
      • 3 buckets have a few million items (41M, 3.5M, 1M)
      • 2 buckets have views (the 3.5M-item bucket has 2 design docs with 5 views and 1 view; the 25K bucket has 1 design doc with 2 views)
      • Most views are called once a minute (or more often) with "?stale=false"
      • 2 of those buckets average 300 ops/sec
      • The remaining 3 buckets average between 10 and 50 ops/sec
      • There is one peak period at night, due to batch activity, when Couchbase ops increase to ~5K ops/sec
      • This peak period lasts less than 5 minutes without XDCR, and 60 minutes or more when XDCR is activated
        <- The time-consuming step when XDCR is ON is the repeated "paginated" view calls our batch performs with "&limit=${step2}&skip=${step1}" (see the pagination sketch at the end of this section)
      • All 5 buckets are XDCR-replicated to the destination cluster
      • No XDCR errors are reported in the Couchbase GUI:8091/#sec=replications "Last error" column
      • A lot of "web request failed" errors are reported in the Couchbase GUI:8091/#sec=log, but all of them (I think) are related to the slow Couchbase HTTP API response times during the initial XDCR load:
        Server error during processing: ["web request failed", {path,"/pools/default"}, (...)]
        Server error during processing: ["web request failed", {path,"/pools/default/tasks"}, (...)]
        Server error during processing: ["web request failed", {"/pools/default/bucketsStreaming/settings"}, (...)]
        etc.
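      For context, the batch's paginated view walk is essentially the following (a simplified sketch; node, bucket, design doc, view and page size are placeholders, and 8092 is the 2.0 view port). Every stale=false call forces the index to catch up before returning, and a growing skip makes each successive page more expensive to serve:

          import json
          import urllib.request

          # Placeholders: node, bucket, design doc and view names.
          VIEW = ("http://source-node:8092/bucketName/_design/ddoc/_view/by_type"
                  "?stale=false&limit=%d&skip=%d")
          PAGE = 1000

          skip = 0
          while True:
              # stale=false: wait for the index; skip: scan past prior pages.
              with urllib.request.urlopen(VIEW % (PAGE, skip), timeout=60) as resp:
                  rows = json.load(resp)["rows"]
              for row in rows:
                  pass  # process the row
              if len(rows) < PAGE:
                  break
              skip += PAGE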

        3.2. Destination cluster:

      The XDCR destination cluster is a single dedicated Couchbase node on an AWS EC2 m2.4xlarge.
      There are no "memcached operations" on this cluster.
      There are no views on this cluster.
      There is no XDCR defined on this cluster.
      CPU usage is below 60% on this node (most of it being beam.smp).
      There is still 1GB of free memory.
      Disk I/O is write-only (5 to 10MB/s) <- Disks are instance store only (no EBS).

        3.3. Interconnection between source and dest clusters:

      Source and destination clusters are connected through a VPN.
      One OpenVPN server runs on the source cluster side.
      Our one-node AWS cluster connects as a client to the OpenVPN server over the public internet.
      The traffic is routed (not "OpenVPN-bridged") to and from the different Couchbase nodes' OpenVZ containers.
      Our internet connectivity on the source side is 1Gb/s, and we average a modest 20Mb/s TX+RX (most of it being XDCR traffic).

        3.4. Our application:

      node.js with node-memcache and client-side moxi v1.8.1

      4. Bonus questions:
      • What happens if the "Cluster Reference" Couchbase node used when creating the initial XDCR configuration is later removed from the cluster (during a legitimate cluster-shrink operation on the destination cluster, for example)?
      • Couchbase appears to be very verbose when logging to /opt/couchbase/var/lib/couchbase/logs/xdcr.* <- Is this normal behaviour, and can we turn it off during standard production operation, outside of any debugging needs? (see the sketch after this list)
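      For reference, a hedged sketch of how the xdcr logger's verbosity might be lowered at runtime, assuming the (undocumented) /diag/eval endpoint and the 'xdcr' ale logger name are valid on our version; the change would not persist across restarts:

          import urllib.request

          # Assumptions: ns_server exposes /diag/eval and an ale logger named 'xdcr'.
          BASE = "http://source-node:8091"  # placeholder host

          mgr = urllib.request.HTTPPasswordMgrWithDefaultRealm()
          mgr.add_password(None, BASE, "Administrator", "password")  # placeholder credentials
          opener = urllib.request.build_opener(urllib.request.HTTPBasicAuthHandler(mgr))

          # Raise the log threshold from debug to error for the xdcr logger.
          expr = b"ale:set_loglevel(xdcr, error)."
          print(opener.open(BASE + "/diag/eval", data=expr, timeout=60).read())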

      Regards,

      Xavier.


          People

            Assignee: Andrei Baranouski
            Reporter: Xav M
            Votes: 0
            Watchers: 8
