  1. Couchbase Server
  2. MB-27159

Sometimes, when a hard KV node failover happens, CBAS has to resync data from the KV node from scratch.

Details

    • Task
    • Resolution: Fixed
    • Major
    • 6.0.0
    • CBAS DP4
    • analytics
    • hwanalytics.spec
    • CX Sprint 117, CX Sprint 118, CX Sprint 119, CX Sprint 120

    Description

      Sometimes, when a hard KV node failover happens, CBAS has to resync data from the KV node from scratch. Queries return incorrect results during that time, which could confuse customers.

      In the following test, while a KV node is failing over, the query against GleambookMessagesbucket-1 and GleambookUsersbucket-1 returns the wrong result: it should return between 4500 and 5500 rows (the range reported in the log below), but it returns 1235.

      Analyzed the trace with Abdullah. It looks like the GleambookMessagesbucket-1 table on the CBAS node is entirely in memory. In this case, when the KV node failover happened, a KV rollback (of a very small amount of data) occurred, and because the table is entirely in memory, the rollback caused a full resync of the GleambookMessagesbucket-1 table from the KV node. As a result, queries issued during the resync return wrong results.

      http://perf.jenkins.couchbase.com/job/hw/536/

      The test run log at the link above shows the issue:
      22:02:54 Exception: Q13 Select left-outer equi-join indexnl (User message nested join user_since range 14 years send_time range 14 years skip user_since index) has invalid result count 1235.0 not in range 4500.0 - 5500.0
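
The failure above is a result-count range check. A minimal sketch of such a check follows, assuming an inclusive range; the names `ResultCountError` and `check_result_count` are hypothetical and are not taken from the test framework.

```python
class ResultCountError(Exception):
    """Hypothetical error type for a query whose row count is out of range."""


def check_result_count(query_name, count, low, high):
    """Return count if low <= count <= high, otherwise raise ResultCountError.

    The inclusive bounds are an assumption; the real framework may differ.
    """
    if not (low <= count <= high):
        raise ResultCountError(
            f"{query_name} has invalid result count {count} "
            f"not in range {low} - {high}"
        )
    return count
```

With the values from the log, `check_result_count("Q13", 1235.0, 4500.0, 5500.0)` raises, which is the behaviour seen while the dataset is being resynced.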
       

      This test runs the following steps:

      Populate the bigfun dataset (at the 10k gbook user scale level) in Couchbase.

      Connect the Couchbase buckets to datasets in Analytics and wait for them to be populated in Analytics.

      Start 3 threads; each thread runs all bigfun queries (3 times each) and reports the server-side execution latency.

      Meanwhile, start a background mutation process that performs 6k mutations per second (80% update, 10% insert, 10% delete).

      Meanwhile, hard-failover 1 non-master KV node, then add it back and rebalance.

      Meanwhile, a background process keeps inserting new documents into Couchbase and queries Analytics to verify that each one appears on the Analytics side within 10 minutes.
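
Two of the background steps above can be sketched in a few lines: choosing operations according to the 80/10/10 mutation mix, and polling until an inserted document becomes visible or a 10 minute deadline passes. This is only an illustration. `pick_mutation`, `wait_until_visible`, the `is_visible` callback and the 5 second poll interval are all hypothetical and are not taken from the actual test code.

```python
import random
import time

# Mutation mix from the test description: 80% update, 10% insert, 10% delete.
MUTATION_MIX = {"update": 0.8, "insert": 0.1, "delete": 0.1}


def pick_mutation(rng):
    """Pick one operation name at random, weighted by MUTATION_MIX."""
    ops = list(MUTATION_MIX)
    weights = list(MUTATION_MIX.values())
    return rng.choices(ops, weights=weights, k=1)[0]


def wait_until_visible(is_visible, timeout_s=600.0, poll_s=5.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll is_visible() until it returns True or timeout_s elapses.

    is_visible stands in for "query Analytics for the new document".
    Returns True if the document showed up in time, False otherwise.
    clock and sleep are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout_s
    while True:
        if is_visible():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)
```

A full resync of the kind described in this issue would show up in such a check as `wait_until_visible` returning False, or as a much longer wait before it returns True.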

       

       

            People

              dmitry.lychagin Dmitry Lychagin (Inactive)
              hui.wang Hui Wang (Inactive)