  1. Couchbase Server
  2. MB-7169

Missing items after rebalancing out 2 failed-over nodes with replica=2 in a 5-node cluster

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker
    • Resolution: Fixed
    • Affects Version/s: 2.0
    • Fix Version/s: 2.0
    • Component/s: couchbase-bucket, ns_server
    • Security Level: Public
    • Labels:
      None

      Description

      version=2.0.0-1949-rel
      http://qa.hq.northscale.net/job/centos-64-2.0-failover-tests/448/consoleFull
      testrunner -i resources/jenkins/centos-64-5node-failover.ini get-logs=True,GROUP=BAT -t failovertests.FailoverTests.test_failover_firewall,replica=2,keys_count=20000,GROUP=BAT

      steps:
      1. 5 nodes in cluster with default bucket replica=2,keys_count=20000
      2. failover 2 nodes

      [2012-11-13 02:31:42,760] - [rest_client:849] INFO - fail_over successful
      [2012-11-13 02:31:42,761] - [failovertests:278] INFO - failed over node : ns_1@10.1.3.117
      [2012-11-13 02:31:43,328] - [rest_client:849] INFO - fail_over successful
      [2012-11-13 02:31:43,328] - [failovertests:278] INFO - failed over node : ns_1@10.1.3.115

      2012-11-12 11:14:51,928 - root - INFO - node 10.1.3.117:8091 is 'unhealthy' as expected
      2012-11-12 11:15:22,229 - root - INFO - fail_over successful
      2012-11-12 11:15:22,229 - root - INFO - failed over node : ns_1@10.1.3.117

      2012-11-12 11:15:26,041 - root - INFO - node 10.1.3.115:8091 is 'unhealthy' as expected
      2012-11-12 11:15:31,055 - root - INFO - fail_over successful
      2012-11-12 11:15:31,055 - root - INFO - failed over node : ns_1@10.1.3.115

      3. rebalance out failover nodes

      [2012-11-13 02:31:43,328] - [failovertests:292] INFO - 10 seconds sleep after failover before invoking rebalance...
      [2012-11-13 02:31:53,328] - [rest_client:883] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.1.3.117%2Cns_1%4010.1.3.115&user=Administrator&knownNodes=ns_1%4010.1.3.114%2Cns_1%4010.1.3.117%2Cns_1%4010.1.3.115%2Cns_1%4010.1.3.118%2Cns_1%4010.1.3.116
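
For readability, the URL-encoded rebalance params logged above can be decoded with the Python standard library. This is only a sketch for reading the log, not part of testrunner; the variable names are made up for illustration.

```python
from urllib.parse import parse_qs

# Parameter string copied from the rest_client log line above.
params = (
    "password=password"
    "&ejectedNodes=ns_1%4010.1.3.117%2Cns_1%4010.1.3.115"
    "&user=Administrator"
    "&knownNodes=ns_1%4010.1.3.114%2Cns_1%4010.1.3.117"
    "%2Cns_1%4010.1.3.115%2Cns_1%4010.1.3.118%2Cns_1%4010.1.3.116"
)

decoded = parse_qs(params)  # %40 -> "@", %2C -> ","
ejected = decoded["ejectedNodes"][0].split(",")
known = decoded["knownNodes"][0].split(",")

# Nodes expected to stay in the cluster after the rebalance.
remaining = [node for node in known if node not in ejected]

print("ejected:  ", ejected)
print("remaining:", remaining)
```

The remaining nodes (10.1.3.114, 10.1.3.118, 10.1.3.116) are the three nodes that the verification step later queries.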
      result:
      rebalance completed successfully, but after reaching 59 % progress it finished instantly

      [2012-11-13 02:31:53,337] - [rest_client:890] INFO - rebalance operation started
      [2012-11-13 02:31:53,347] - [rest_client:986] INFO - rebalance percentage : 0 %
      [2012-11-13 02:31:55,376] - [rest_client:986] INFO - rebalance percentage : 0.415885290979 %
      [2012-11-13 02:31:57,385] - [rest_client:986] INFO - rebalance percentage : 0.966587084853 %
      [2012-11-13 02:31:59,394] - [rest_client:986] INFO - rebalance percentage : 1.57911268089 %
      [2012-11-13 02:32:01,401] - [rest_client:986] INFO - rebalance percentage : 2.05668833217 %
      [2012-11-13 02:32:03,417] - [rest_client:986] INFO - rebalance percentage : 2.6692139282 %
      [2012-11-13 02:32:05,421] - [rest_client:986] INFO - rebalance percentage : 3.08496577732 %
      [2012-11-13 02:32:07,437] - [rest_client:986] INFO - rebalance percentage : 3.41655602713 %.......
      ............
      [2012-11-13 02:39:22,147] - [rest_client:986] INFO - rebalance percentage : 58.7635239567 %
      [2012-11-13 02:39:24,159] - [rest_client:986] INFO - rebalance percentage : 59.0726429675 %
      [2012-11-13 02:39:26,205] - [rest_client:986] INFO - rebalance percentage : 59.3199381762 %
      [2012-11-13 02:39:28,212] - [rest_client:986] INFO - rebalance percentage : 59.5672333849 %
      [2012-11-13 02:39:30,226] - [rest_client:986] INFO - rebalance percentage : 59.8145285935 %
      [2012-11-13 02:39:34,237] - [rest_client:943] INFO - rebalance progress took 460.899084091 seconds
      [2012-11-13 02:39:34,237] - [rest_client:944] INFO - sleep for 10 seconds after rebalance...

      then verification part:

      [2012-11-13 02:39:44,236] - [failovertests:304] INFO - Begin VERIFICATION ...
      [2012-11-13 02:39:45,146] - [data_helper:289] INFO - creating direct client 10.1.3.114:11210 default
      [2012-11-13 02:39:45,591] - [data_helper:289] INFO - creating direct client 10.1.3.116:11210 default
      [2012-11-13 02:39:46,002] - [data_helper:289] INFO - creating direct client 10.1.3.118:11210 default
      [2012-11-13 02:39:46,420] - [task:329] WARNING - Not Ready: curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:39:46,446] - [data_helper:289] INFO - creating direct client 10.1.3.114:11210 default
      [2012-11-13 02:39:46,969] - [data_helper:289] INFO - creating direct client 10.1.3.116:11210 default
      [2012-11-13 02:39:47,409] - [data_helper:289] INFO - creating direct client 10.1.3.118:11210 default
      [2012-11-13 02:39:47,793] - [task:329] WARNING - Not Ready: vb_active_curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:39:47,820] - [data_helper:289] INFO - creating direct client 10.1.3.114:11210 default
      [2012-11-13 02:39:48,255] - [data_helper:289] INFO - creating direct client 10.1.3.116:11210 default
      [2012-11-13 02:39:48,676] - [data_helper:289] INFO - creating direct client 10.1.3.118:11210 default
      [2012-11-13 02:39:49,084] - [task:329] WARNING - Not Ready: vb_replica_curr_items 36034 == 40000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:39:49,108] - [data_helper:289] INFO - creating direct client 10.1.3.114:11210 default
      [2012-11-13 02:39:49,520] - [data_helper:289] INFO - creating direct client 10.1.3.116:11210 default
      [2012-11-13 02:39:49,939] - [data_helper:289] INFO - creating direct client 10.1.3.118:11210 default
      [2012-11-13 02:39:50,364] - [task:329] WARNING - Not Ready: curr_items_tot 54051 == 60000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:39:52,404] - [task:329] WARNING - Not Ready: curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:39:53,433] - [task:329] WARNING - Not Ready: vb_active_curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:39:54,461] - [task:329] WARNING - Not Ready: vb_replica_curr_items 36034 == 40000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:39:55,489] - [task:329] WARNING - Not Ready: curr_items_tot 54051 == 60000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:39:57,518] - [task:329] WARNING - Not Ready: curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:39:58,549] - [task:329] WARNING - Not Ready: vb_active_curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:39:59,581] - [task:329] WARNING - Not Ready: vb_replica_curr_items 36034 == 40000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:00,610] - [task:329] WARNING - Not Ready: curr_items_tot 54051 == 60000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:02,638] - [task:329] WARNING - Not Ready: curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:03,670] - [task:329] WARNING - Not Ready: vb_active_curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:04,702] - [task:329] WARNING - Not Ready: vb_replica_curr_items 36034 == 40000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:05,732] - [task:329] WARNING - Not Ready: curr_items_tot 54051 == 60000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:07,760] - [task:329] WARNING - Not Ready: curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:08,789] - [task:329] WARNING - Not Ready: vb_active_curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:09,818] - [task:329] WARNING - Not Ready: vb_replica_curr_items 36034 == 40000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:10,851] - [task:329] WARNING - Not Ready: curr_items_tot 54051 == 60000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:12,880] - [task:329] WARNING - Not Ready: curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:13,907] - [task:329] WARNING - Not Ready: vb_active_curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:14,937] - [task:329] WARNING - Not Ready: vb_replica_curr_items 36034 == 40000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:15,968] - [task:329] WARNING - Not Ready: curr_items_tot 54051 == 60000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:17,998] - [task:329] WARNING - Not Ready: curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:19,032] - [task:329] WARNING - Not Ready: vb_active_curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:20,062] - [task:329] WARNING - Not Ready: vb_replica_curr_items 36034 == 40000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:21,096] - [task:329] WARNING - Not Ready: curr_items_tot 54051 == 60000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:23,124] - [task:329] WARNING - Not Ready: curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:24,154] - [task:329] WARNING - Not Ready: vb_active_curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:25,186] - [task:329] WARNING - Not Ready: vb_replica_curr_items 36034 == 40000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:26,215] - [task:329] WARNING - Not Ready: curr_items_tot 54051 == 60000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:28,243] - [task:329] WARNING - Not Ready: curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:29,272] - [task:329] WARNING - Not Ready: vb_active_curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:30,301] - [task:329] WARNING - Not Ready: vb_replica_curr_items 36034 == 40000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:31,329] - [task:329] WARNING - Not Ready: curr_items_tot 54051 == 60000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:33,359] - [task:329] WARNING - Not Ready: curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:34,389] - [task:329] WARNING - Not Ready: vb_active_curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:35,419] - [task:329] WARNING - Not Ready: vb_replica_curr_items 36034 == 40000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:36,446] - [task:329] WARNING - Not Ready: curr_items_tot 54051 == 60000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:38,474] - [task:329] WARNING - Not Ready: curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:39,502] - [task:329] WARNING - Not Ready: vb_active_curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:40,531] - [task:329] WARNING - Not Ready: vb_replica_curr_items 36034 == 40000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:41,564] - [task:329] WARNING - Not Ready: curr_items_tot 54051 == 60000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      [2012-11-13 02:40:43,594] - [task:329] WARNING - Not Ready: curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
      This job did not collect cbcollect_info; I will add it.
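
To make the numbers in the warnings above explicit, here is a small sketch of the arithmetic behind the expected values, assuming each item has exactly one active copy and `replicas` replica copies across the remaining nodes. The helper name `expected_counts` is hypothetical and not taken from testrunner.

```python
def expected_counts(items, replicas):
    """Expected cluster-wide item counts for a bucket (illustrative helper)."""
    active = items
    replica = items * replicas
    return {
        "vb_active_curr_items": active,
        "vb_replica_curr_items": replica,
        "curr_items_tot": active + replica,
    }

# Values observed in the "Not Ready" warnings above.
observed = {
    "vb_active_curr_items": 18017,
    "vb_replica_curr_items": 36034,
    "curr_items_tot": 54051,
}

expected = expected_counts(items=20000, replicas=2)
shortfall = {name: expected[name] - observed[name] for name in expected}

for name, missing in shortfall.items():
    print(f"{name}: expected {expected[name]}, observed {observed[name]}, missing {missing}")
```

The replica shortfall (3966) is exactly twice the active shortfall (1983), so the same 1983 items appear to be missing from the active copy and from both replica copies.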

      Alk, my question is:
      is it possible for the rebalance to finish this quickly at the end?

      If you do not see any problem here, please assign the ticket back to me or Abhinav so that we can provide cbcollect_info next time for the bucket-engine team.

      1. 145691ce-6eb4-4a60-8b1c-e5bc792ae9e0-10.1.3.114-diag.txt.gz
        5.31 MB
        Andrei Baranouski
      2. 145691ce-6eb4-4a60-8b1c-e5bc792ae9e0-10.1.3.116-diag.txt.gz
        3.57 MB
        Andrei Baranouski
      3. 145691ce-6eb4-4a60-8b1c-e5bc792ae9e0-10.1.3.118-diag.txt.gz
        3.04 MB
        Andrei Baranouski

        Activity

        chiyoung Chiyoung Seo added a comment -

        Farshid, Andrei,

        In most cases, failover would cause data loss. Can you check the test case again?

        farshid Farshid Ghods (Inactive) added a comment -

        Andrei, please proceed with making the test do what I posted in the comment above.

        andreibaranouski Andrei Baranouski added a comment -

        steps of the last test:
        1. 5 nodes in cluster: 10.1.3.114, 10.1.3.117,10.1.3.115,10.1.3.118,10.1.3.116
        2. create 1 default bucket with replica=2 and 20K items
        3. verify/wait until all items drain on all 5 nodes
        [2012-11-13 07:21:28,128] - [data_helper:289] INFO - creating direct client 10.1.3.114:11210 default
        [2012-11-13 07:21:28,539] - [task:333] INFO - Saw ep_queue_size 0 == 0 expected on '10.1.3.114:8091'
        [2012-11-13 07:21:28,561] - [data_helper:289] INFO - creating direct client 10.1.3.114:11210 default
        [2012-11-13 07:21:28,961] - [task:333] INFO - Saw ep_flusher_todo 0 == 0 expected on '10.1.3.114:8091'
        4. wait until replication is completed (currently we wait until node.replication == 1.0 on all nodes; I will change this to use the backfill_completed and idle stats, as Chiyoung suggested)

        [2012-11-13 07:21:32,828] - [rest_client:148] INFO - replication state : True
        [2012-11-13 07:21:32,852] - [failovertests:246] INFO - replication state after waiting for up to 15 minutes : True

        5. failover 2 nodes
        [2012-11-13 07:21:33,378] - [rest_client:849] INFO - fail_over successful
        [2012-11-13 07:21:33,378] - [failovertests:278] INFO - failed over node : ns_1@10.1.3.117
        [2012-11-13 07:21:33,987] - [rest_client:849] INFO - fail_over successful
        [2012-11-13 07:21:33,987] - [failovertests:278] INFO - failed over node : ns_1@10.1.3.115
        6. rebalance out 2 nodes

        [2012-11-13 07:21:43,986] - [rest_client:883] INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.1.3.117%2Cns_1%4010.1.3.115&user=Administrator&knownNodes=ns_1%4010.1.3.114%2Cns_1%4010.1.3.117%2Cns_1%4010.1.3.115%2Cns_1%4010.1.3.118%2Cns_1%4.010.1.3.116
        ...
        [2012-11-13 07:27:24,697] - [rest_client:943] INFO - rebalance progress took 340.681210995 seconds
        [2012-11-13 07:27:24,697] - [rest_client:944] INFO - sleep for 10 seconds after rebalance...

        7. verify/wait until all items drain on all 3 nodes
        [2012-11-13 07:27:34,756] - [data_helper:289] INFO - creating direct client 10.1.3.114:11210 default
        [2012-11-13 07:27:35,125] - [task:333] INFO - Saw ep_queue_size 0 == 0 expected on '10.1.3.114:8091'
        [2012-11-13 07:27:35,147] - [data_helper:289] INFO - creating direct client 10.1.3.114:11210 default
        [2012-11-13 07:27:35,647] - [task:333] INFO - Saw ep_flusher_todo 0 == 0 expected on '10.1.3.114:8091'
        8. verify that the active/replica counts match the expected values that we had before failover and rebalance (5 nodes, 1 bucket with replica=2, 2 nodes failed over; there shouldn't be lost items?)
        test fails:
        [2012-11-13 07:27:43,469] - [task:329] WARNING - Not Ready: curr_items_tot 54051 == 60000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
        [2012-11-13 07:27:45,499] - [task:329] WARNING - Not Ready: curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
        [2012-11-13 07:27:46,527] - [task:329] WARNING - Not Ready: vb_active_curr_items 18017 == 20000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'
        [2012-11-13 07:27:47,557] - [task:329] WARNING - Not Ready: vb_replica_curr_items 36034 == 40000 expected on '10.1.3.114:8091''10.1.3.116:8091''10.1.3.118:8091'

        I'll add some more stats/wait verification, but I think the entire test is valid now.

        I would like to note that the failures with lost items started after I set the vBucket count to 1024 for the centos-64-2.0-failover-tests job. With 128 vBuckets the tests always passed.

        ketaki Ketaki Gangal added a comment -

        From Chiyoung: we should wait for replication to catch up on the nodes to be failed over (only) before failing over.
        @Andrei - Could you modify the test to wait only for replication to catch up, and not for the additional 5 minutes or for the disk write queue to drain?

        1. 5 nodes in cluster: 10.1.3.114, 10.1.3.117, 10.1.3.115, 10.1.3.118, 10.1.3.116
        2. create 1 default bucket with replica=2 and 20K items
        3. Only wait until replication is completed (wait on the backfill_completed and idle stats, as Chiyoung suggested)
           • Eliminate the wait for disk queue draining and any additional wait time added in the automation.
        4. Failover 2 nodes
        5. Rebalance out 2 nodes
        6. verify that the active/replica counts match the expected values that we had before failover and rebalance (5 nodes, 1 bucket with replica=2, 2 nodes failed over; there shouldn't be lost items?)
           test fails:

        rest as expected.
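
The "wait only for replication" step could be sketched as a generic polling loop like the one below. This is a sketch under assumptions, not the testrunner implementation: `wait_for_stats` and `fetch_stats` are hypothetical names, and the keys `backfill_completed` and `idle` stand in for whatever the real per-connection stat names are.

```python
import time

def wait_for_stats(fetch_stats, expected, timeout=60.0, interval=1.0,
                   sleep=time.sleep, clock=time.monotonic):
    """Poll fetch_stats() until every key in `expected` has the expected value.

    fetch_stats is a caller-supplied callable returning a dict of stats
    (hypothetical; how the stats are fetched is out of scope here).
    Returns True once all stats match, False if the timeout expires first.
    """
    deadline = clock() + timeout
    while True:
        stats = fetch_stats()
        pending = {key: stats.get(key) for key, want in expected.items()
                   if stats.get(key) != want}
        if not pending:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)

# Usage with a fake stats source that becomes ready on the third poll.
# The stat keys are placeholders, not confirmed ep-engine stat names.
polls = {"count": 0}

def fake_fetch_stats():
    polls["count"] += 1
    ready = polls["count"] >= 3
    return {"backfill_completed": ready, "idle": ready}

ok = wait_for_stats(fake_fetch_stats,
                    {"backfill_completed": True, "idle": True},
                    timeout=5.0, interval=0.0)
print("replication caught up:", ok, "after", polls["count"], "polls")
```

Polling only these replication stats, and only on the nodes about to be failed over, avoids the extra disk-queue and fixed-time waits that the comment above asks to remove.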

        andreibaranouski Andrei Baranouski added a comment -

        It no longer reproduces, either in the latest run or manually.


          People

          • Assignee: Andrei Baranouski
          • Reporter: Andrei Baranouski
          • Votes: 0
          • Watchers: 2
