Details
-
Task
-
Resolution: Fixed
-
Major
-
CBAS DP4
-
hwanalytics.spec
-
CX Sprint 117, CX Sprint 118, CX Sprint 119, CX Sprint 120
Description
Sometimes, when a hard KV node failover happens, CBAS has to resync data from the KV node from scratch. Queries return incorrect results while the resync is in progress, which could confuse customers.
In the following test, while a KV node is failing over, the query against GleambookMessagesbucket-1 and GleambookUsersbucket-1 returns a wrong result: the row count should be within 4500 to 5500 (the range reported in the log below), but the query returns 1235.
Abdullah and I analyzed the trace. It looks like the CBAS node's GleambookMessagesbucket-1 table was entirely in memory. When the KV node failed over, a KV rollback (of a very small amount of data) happened, and because the table was entirely in memory, the rollback caused a full resync of GleambookMessagesbucket-1 from the KV node. Queries issued during the resync therefore returned wrong results.
http://perf.jenkins.couchbase.com/job/hw/536/
The log of the test run linked above shows the issue:
22:02:54 Exception: Q13 Select left-outer equi-join indexnl (User message nested join user_since range 14 years send_time range 14 years skip user_since index) has invalid result count 1235.0 not in range 4500.0 - 5500.0
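The failure above is a row-count range check. A minimal sketch of such a check follows; the function name `check_result_count` and the inclusive bounds are assumptions for illustration, not taken from the actual test harness.

```python
def check_result_count(query_name, count, low, high):
    """Raise if count falls outside the range [low, high].

    Hypothetical helper: the inclusive bounds are an assumption,
    the real harness may treat the limits differently.
    """
    if not (low <= count <= high):
        raise Exception(
            f"{query_name} has invalid result count {count} "
            f"not in range {low} - {high}"
        )
    return count
```

With the values from the log, `check_result_count("Q13", 1235.0, 4500.0, 5500.0)` raises, while a count of 4800.0 would pass.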
The test runs the following steps:
Populate the bigfun dataset (at the 10k gbook user scale level) in Couchbase.
Connect the Couchbase buckets to datasets in Analytics and wait for them to be populated in Analytics.
Start 3 threads; each thread runs all bigfun queries (3 times each) and reports the server-side execution latency.
Meanwhile, start a background mutation process that performs 6k mutations per second (80% update, 10% insert, 10% delete).
Meanwhile, hard-failover 1 non-master KV node, then add it back and rebalance.
Meanwhile, a background process keeps inserting new documents into Couchbase and queries Analytics to make sure each one appears on the Analytics side within 10 minutes.
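Two of the background steps above can be sketched in a few lines: choosing an operation according to the 80/10/10 mutation mix, and polling until an inserted document becomes visible or the 10-minute limit expires. This is only an illustration under assumptions; `pick_mutation`, `wait_until_visible`, and the `is_visible` callback are hypothetical names, and the real test code may be structured quite differently.

```python
import random
import time

# Operation mix from the test description: 80% update, 10% insert, 10% delete.
MUTATION_MIX = {"update": 0.8, "insert": 0.1, "delete": 0.1}


def pick_mutation(rng):
    """Pick one operation name according to MUTATION_MIX (hypothetical helper)."""
    ops = list(MUTATION_MIX)
    weights = list(MUTATION_MIX.values())
    return rng.choices(ops, weights=weights, k=1)[0]


def wait_until_visible(is_visible, timeout_s=600.0, interval_s=1.0,
                       clock=time.monotonic, sleep=time.sleep):
    """Poll is_visible() until it returns True or timeout_s elapses.

    is_visible is a caller-supplied callback (an assumption here), for example
    one that queries Analytics for the key of a freshly inserted document.
    The default timeout of 600 seconds matches the 10-minute limit.
    """
    deadline = clock() + timeout_s
    while True:
        if is_visible():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

The `clock` and `sleep` parameters exist only so the polling loop can be exercised with a fake clock instead of waiting in real time.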