  Couchbase Server / MB-6987

curr_items_tot does not correspond to sum of curr_items from nodes after swap rebalance of 2 nodes with data load

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: None
    • Fix Version/s: 2.0.1
    • Component/s: ns_server, storage-engine
    • Security Level: Public
    • Labels:
      None

      Description

      version=2.0.0-1884-rel
      http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/90/consoleFull
      ./testrunner -i /tmp/rebalance_regression.ini get-logs=True -t swaprebalance.SwapRebalanceFailedTests.test_failed_swap_rebalance,replica=2,num-buckets=1,num-swap=2,swap-orchestrator=False

      2012-10-23 01:44:52,802 - root - INFO - rebalance params : password=password&ejectedNodes=ns_1%4010.3.121.94%2Cns_1%4010.3.121.93&user=Administrator&knownNodes=ns_1%4010.3.121.93%2Cns_1%4010.3.121.94%2Cns_1%4010.3.121.92%2Cns_1%4010.3.121.95%2Cns_1%4010.3.121.96
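
      Decoding the URL-encoded rebalance parameters above confirms that the swap ejects nodes .94 and .93. A small sketch using Python's standard library (the dict comprehension is illustrative, not testrunner code):

      ```python
      from urllib.parse import parse_qs

      # The rebalance params string logged above, verbatim.
      params = ("password=password&ejectedNodes=ns_1%4010.3.121.94%2Cns_1%4010.3.121.93"
                "&user=Administrator&knownNodes=ns_1%4010.3.121.93%2Cns_1%4010.3.121.94"
                "%2Cns_1%4010.3.121.92%2Cns_1%4010.3.121.95%2Cns_1%4010.3.121.96")

      # Split the comma-separated node lists after URL decoding.
      decoded = {k: v[0].split(",") for k, v in parse_qs(params).items()}
      # decoded["ejectedNodes"] -> ['ns_1@10.3.121.94', 'ns_1@10.3.121.93']
      ```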
      ...

      2012-10-23 01:57:41,129 - root - INFO - curr_items from 10.3.121.95:8091 : 32540
      2012-10-23 01:57:41,130 - root - INFO - vb_pending_num from 10.3.121.95:8091 : 0
      2012-10-23 01:57:41,131 - root - INFO - vb_active_num from 10.3.121.95:8091 : 341
      2012-10-23 01:57:41,131 - root - INFO - vb_replica_num from 10.3.121.95:8091 : 683
      2012-10-23 01:57:41,131 - root - INFO - curr_items from 10.3.121.92:8091 : 32490
      2012-10-23 01:57:41,132 - root - INFO - vb_pending_num from 10.3.121.92:8091 : 0
      2012-10-23 01:57:41,132 - root - INFO - vb_active_num from 10.3.121.92:8091 : 342
      2012-10-23 01:57:41,133 - root - INFO - vb_replica_num from 10.3.121.92:8091 : 682
      2012-10-23 01:57:41,134 - root - INFO - curr_items from 10.3.121.96:8091 : 32704
      2012-10-23 01:57:41,134 - root - INFO - vb_pending_num from 10.3.121.96:8091 : 0
      2012-10-23 01:57:41,134 - root - INFO - vb_active_num from 10.3.121.96:8091 : 341
      2012-10-23 01:57:41,135 - root - INFO - vb_replica_num from 10.3.121.96:8091 : 683
      2012-10-23 01:57:41,135 - root - INFO - summation of vb_active_num : 1024 vb_pending_num : 0 vb_replica_num : 2048
      2012-10-23 01:57:41,136 - root - INFO - sum : 97734 and sum * (replica_factor + 1) (3) : 293202
      2012-10-23 01:57:41,483 - root - INFO - master_stats : 293215
      2012-10-23 01:57:41,484 - root - INFO - delta : 13 missing_percentage : 4.43380331648e-05 replica_factor : 2
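
      The check behind the sum/delta lines above can be sketched as follows. This is a minimal reconstruction of the arithmetic in the log, with hypothetical names; the real helper is linked from the comments below (rebalance_helper.py):

      ```python
      # Sketch of the items-count check; names are illustrative, not the
      # actual rebalance_helper code.
      def verify_items_count(per_node_curr_items, master_curr_items_tot, replica_factor):
          # Every item should exist once actively plus once per replica.
          active_sum = sum(per_node_curr_items)              # 32540 + 32490 + 32704 = 97734
          expected_tot = active_sum * (replica_factor + 1)   # 97734 * 3 = 293202
          delta = abs(master_curr_items_tot - expected_tot)  # |293215 - 293202| = 13
          missing_percentage = delta / float(expected_tot)
          return expected_tot, delta, missing_percentage

      expected, delta, missing = verify_items_count([32540, 32490, 32704], 293215, 2)
      # expected=293202, delta=13, missing~4.4338e-05, matching the log lines above
      ```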
      2012-10-23 01:57:43,570 - root - INFO - creating direct client 10.3.121.95:11210 default
      2012-10-23 01:57:43,975 - root - INFO - getting tap stats.. for 10.3.121.95
      2012-10-23 01:57:43,983 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.92:ack_window_full false
      2012-10-23 01:57:43,984 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.92:has_queued_item false
      2012-10-23 01:57:43,984 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.92:ack_log_size 0
      2012-10-23 01:57:43,984 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.96:pending_backfill false
      2012-10-23 01:57:43,984 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.96:ack_seqno 52343
      2012-10-23 01:57:43,985 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.92:paused 1
      2012-10-23 01:57:43,985 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.92:idle true
      2012-10-23 01:57:43,986 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.92:ack_seqno 41264
      2012-10-23 01:57:43,986 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.96:ack_log_size 0
      2012-10-23 01:57:43,986 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.96:has_queued_item false
      2012-10-23 01:57:43,987 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.96:idle true
      2012-10-23 01:57:43,987 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.96:pending_disk_backfill false
      2012-10-23 01:57:43,988 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.92:pending_disk_backfill false
      2012-10-23 01:57:43,988 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.96:ack_window_full false
      2012-10-23 01:57:43,988 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.96:recv_ack_seqno 52342
      2012-10-23 01:57:43,989 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.92:recv_ack_seqno 41263
      2012-10-23 01:57:43,989 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.96:paused 1
      2012-10-23 01:57:43,989 - root - INFO - TAP ns_1@10.3.121.95 :eq_tapq:replication_ns_1@10.3.121.92:pending_backfill false
      2012-10-23 01:57:44,018 - root - INFO - creating direct client 10.3.121.96:11210 default
      2012-10-23 01:57:44,430 - root - INFO - getting tap stats.. for 10.3.121.96
      2012-10-23 01:57:44,438 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.92:ack_window_full false
      2012-10-23 01:57:44,438 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.92:has_queued_item false
      2012-10-23 01:57:44,439 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.95:ack_log_size 0
      2012-10-23 01:57:44,439 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.92:ack_log_size 0
      2012-10-23 01:57:44,440 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.92:paused 1
      2012-10-23 01:57:44,440 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.95:ack_window_full false
      2012-10-23 01:57:44,440 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.92:idle true
      2012-10-23 01:57:44,441 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.92:ack_seqno 40878
      2012-10-23 01:57:44,441 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.95:paused 1
      2012-10-23 01:57:44,442 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.92:pending_disk_backfill false
      2012-10-23 01:57:44,442 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.95:ack_seqno 82461
      2012-10-23 01:57:44,442 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.95:has_queued_item false
      2012-10-23 01:57:44,443 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.95:idle true
      2012-10-23 01:57:44,443 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.95:pending_backfill false
      2012-10-23 01:57:44,444 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.92:recv_ack_seqno 40877
      2012-10-23 01:57:44,444 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.95:pending_disk_backfill false
      2012-10-23 01:57:44,445 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.92:pending_backfill false
      2012-10-23 01:57:44,445 - root - INFO - TAP ns_1@10.3.121.96 :eq_tapq:replication_ns_1@10.3.121.95:recv_ack_seqno 82460
      2012-10-23 01:57:44,475 - root - INFO - creating direct client 10.3.121.92:11210 default
      2012-10-23 01:57:44,886 - root - INFO - getting tap stats.. for 10.3.121.92
      2012-10-23 01:57:44,894 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.95:ack_log_size 0
      2012-10-23 01:57:44,894 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.96:pending_backfill false
      2012-10-23 01:57:44,895 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.95:pending_backfill false
      2012-10-23 01:57:44,895 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.96:ack_seqno 59391
      2012-10-23 01:57:44,896 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.95:ack_window_full false
      2012-10-23 01:57:44,897 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.96:ack_log_size 0
      2012-10-23 01:57:44,898 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.96:has_queued_item false
      2012-10-23 01:57:44,898 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.96:idle true
      2012-10-23 01:57:44,899 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.95:paused 1
      2012-10-23 01:57:44,899 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.96:pending_disk_backfill false
      2012-10-23 01:57:44,899 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.95:ack_seqno 59944
      2012-10-23 01:57:44,900 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.95:has_queued_item false
      2012-10-23 01:57:44,900 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.95:idle true
      2012-10-23 01:57:44,900 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.96:ack_window_full false
      2012-10-23 01:57:44,901 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.96:recv_ack_seqno 59390
      2012-10-23 01:57:44,901 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.95:recv_ack_seqno 59943
      2012-10-23 01:57:44,902 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.96:paused 1
      2012-10-23 01:57:44,902 - root - INFO - TAP ns_1@10.3.121.92 :eq_tapq:replication_ns_1@10.3.121.95:pending_disk_backfill false
      FAIL
      ======================================================================
      FAIL: test_failed_swap_rebalance (swaprebalance.SwapRebalanceFailedTests)
      ----------------------------------------------------------------------
      Traceback (most recent call last):
      File "pytests/swaprebalance.py", line 597, in test_failed_swap_rebalance
      SwapRebalanceBase._common_test_body_failed_swap_rebalance(self)
      File "pytests/swaprebalance.py", line 394, in _common_test_body_failed_swap_rebalance
      SwapRebalanceBase.items_verification(master, self)
      File "pytests/swaprebalance.py", line 148, in items_verification
      test.assertTrue(verified, "Lost items!!.. failing test in {0} secs".format(timeout))
      AssertionError: Lost items!!.. failing test in 600 secs

      [error_logger:error,2012-10-23T1:46:34.206,ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ebucketmigrator_srv:init/1
      pid: <0.15957.3>
      registered_name: []
      exception exit: downstream_closed
      in function gen_server:terminate/6
      ancestors: ['ns_vbm_new_sup-default','single_bucket_sup-default',
      <0.780.0>]
      messages: []
      links: [#Port<0.61066>,<0.806.0>,<0.15990.3>,#Port<0.61062>]
      dictionary: []
      trap_exit: true
      status: running
      heap_size: 4181
      stack_size: 24
      reductions: 129203
      neighbours:

      [error_logger:error,2012-10-23T1:46:34.207,ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_msg:76]** Generic server <0.15980.3> terminating

        ** Last message in was {tcp_closed,#Port<0.61086>}
        ** When Server state == {state,#Port<0.61079>,#Port<0.61060>,#Port<0.61086>,
               #Port<0.61065>,<0.15991.3>,<<>>,<<>>,
               {set,77,16,16,8,80,48,
                {[],[],[],[],[],[],[],[],[],[],[],[],[],[],[],[]},
                {{[670,654,638,622,606],
                  [673,657,641,625,609],
                  [676,660,644,628,612],
                  [679,663,647,631,615],
                  [682,666,650,634,618],
                  [669,653,637,621],
                  [672,656,640,624,608],
                  [675,659,643,627,611],
                  [678,662,646,630,614],
                  [681,665,649,633,617],
                  [668,652,636,620],
                  [671,655,639,623,607],
                  [674,658,642,626,610],
                  [677,661,645,629,613],
                  [680,664,648,632,616],
                  [667,651,635,619]}}},
               -1,false,false,0,
               {1350,981993,591548},
               not_started,undefined,
               <<"replication_ns_1@10.3.121.92">>,<0.15980.3>,
               {had_backfill,false,undefined,[]},
               false}
        ** Reason for termination ==
        ** downstream_closed

      [error_logger:error,2012-10-23T1:46:34.213,ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ebucketmigrator_srv:init/1
      pid: <0.15980.3>
      registered_name: []
      exception exit: downstream_closed
      in function gen_server:terminate/6
      ancestors: ['ns_vbm_new_sup-default','single_bucket_sup-default',
      <0.780.0>]
      messages: []
      links: [#Port<0.61065>,<0.806.0>,<0.15991.3>,#Port<0.61060>]
      dictionary: []
      trap_exit: true
      status: running
      heap_size: 4181
      stack_size: 24
      reductions: 38360
      neighbours:

      [error_logger:error,2012-10-23T1:46:34.213,ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_report:72]
      =========================SUPERVISOR REPORT=========================
      Supervisor: {local,'ns_vbm_new_sup-default'}
      Context: child_terminated
      Reason: downstream_closed
      Offender: [{pid,<0.15957.3>},
      {name,{new_child_id,[513,514,515,516,517,518,519,520,521,
      522,523,524,525,526,527,528,529,530,
      531,532,533,534,535,536,537,538,539,
      540,541,542,543,544,545,546,547,548,
      549,550,551,552,553,554,555,556,557,
      558,559,560,561,562,563,564,565,566,
      567,568,569,570,571,572,573,574,575,
      576,577,578,579,580,581,582,583,584,
      585,586,587,588,589,590,591,592,593,
      594,595,596,597,598,599,600,601,602,
      603,604,683,684,685,686,687,688,689,
      690,691,692,693,694,695,696,697,698,
      699,700,701,702,703,704,705,706,707,
      708,709,710,711,712,713,714,715,716,
      717,718,719,720,721,722,723,724,725,
      726,727,728,729,730,731,732,733,734,
      735,736,737,738,739,740,741,742,743,
      744,745,746,747,748,749,750,751,752,
      753,754,755,756,757,758,759,760,761,
      762,763,764,765,766,767,768,769,770,
      771,772,773,774,775,776,777,778,779,
      780,781,782,783,784,785,786,787,788,
      789,790,791,792,793,794,795,796,797,
      798,799,800,801,802,803,804,805,806,
      807,808,809,810,811,812,813,814,815,
      816,817,818,819,820,821,822,823,824,
      825,826,827,828,829,830,831,832,833,
      834,835,836,837,838,839,840,841,842,
      843,844,845,846,847,848,849,850,851,
      852,853],
      'ns_1@10.3.121.95'}},
      {mfargs,{ebucketmigrator_srv,start_link,undefined}},
      {restart_type,temporary},
      {shutdown,60000},
      {child_type,worker}]


      [ns_server:info,2012-10-23T1:46:34.219,ns_1@10.3.121.92:<0.15981.3>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.121.93': [<<"replication_building_605_'ns_1@10.3.121.95'">>,
      <<"replication_building_605_'ns_1@10.3.121.92'">>]
      [rebalance:info,2012-10-23T1:46:34.222,ns_1@10.3.121.92:<0.23130.2>:janitor_agent:bulk_set_vbucket_state:386]Doing bulk vbucket 605 state change
      [{'ns_1@10.3.121.94',replica,undefined,undefined},
      {'ns_1@10.3.121.95',replica,undefined,'ns_1@10.3.121.96'},
      {'ns_1@10.3.121.92',replica,undefined,'ns_1@10.3.121.95'}]
      [error_logger:error,2012-10-23T1:46:34.222,ns_1@10.3.121.92:error_logger:ale_error_logger_handler:log_report:72]
      =========================SUPERVISOR REPORT=========================
      Supervisor: {local,'ns_vbm_new_sup-default'}

      Context: child_terminated
      Reason: downstream_closed
      Offender: [{pid,<0.15980.3>},
      {name,{new_child_id,[606,607,608,609,610,611,612,613,614,
      615,616,617,618,619,620,621,622,623,
      624,625,626,627,628,629,630,631,632,
      633,634,635,636,637,638,639,640,641,
      642,643,644,645,646,647,648,649,650,
      651,652,653,654,655,656,657,658,659,
      660,661,662,663,664,665,666,667,668,
      669,670,671,672,673,674,675,676,677,
      678,679,680,681,682],
      'ns_1@10.3.121.94'}},
      {mfargs,{ebucketmigrator_srv,start_link,undefined}},
      {restart_type,temporary},
      {shutdown,60000},
      {child_type,worker}]

      Attachments

      1. 045dde1d-1eb1-4213-afc2-ddaf014a5997-10.3.121.92-diag.txt.gz
        8.77 MB
        Andrei Baranouski
      2. 10.3.121.92-8091-diag.txt.gz
        9.91 MB
        Andrei Baranouski
      3. 10.3.121.93-8091-diag.txt.gz
        2.79 MB
        Andrei Baranouski
      4. 10.3.121.94-8091-diag.txt.gz
        3.01 MB
        Andrei Baranouski
      5. 10.3.121.95-8091-diag.txt.gz
        3.96 MB
        Andrei Baranouski
      6. 10.3.121.96-8091-diag.txt.gz
        3.65 MB
        Andrei Baranouski
      7. 10.3.121.97-8091-diag.txt.gz
        55 kB
        Andrei Baranouski
      8. 10.3.121.98-8091-diag.txt.gz
        55 kB
        Andrei Baranouski
      9. 1df09542-e44c-40d6-abfc-2984ed3d7fc1-10.1.3.112-diag.txt.gz
        195 kB
        Andrei Baranouski
      10. 1df09542-e44c-40d6-abfc-2984ed3d7fc1-10.1.3.120-diag.txt.gz
        418 kB
        Andrei Baranouski
      11. 1df09542-e44c-40d6-abfc-2984ed3d7fc1-10.1.3.121-diag.txt.gz
        64 kB
        Andrei Baranouski
      12. 1df09542-e44c-40d6-abfc-2984ed3d7fc1-10.1.3.160-diag.txt.gz
        64 kB
        Andrei Baranouski
      13. 1df09542-e44c-40d6-abfc-2984ed3d7fc1-10.3.121.41-diag.txt.gz
        64 kB
        Andrei Baranouski

        Activity

        andreibaranouski Andrei Baranouski added a comment -

        Test code that verifies the items count (verify_items_count):
        https://github.com/couchbase/testrunner/blob/master/lib/membase/helper/rebalance_helper.py#L301

        2012-10-23 01:57:41,129 - root - INFO - curr_items from 10.3.121.95:8091 : 32540
        2012-10-23 01:57:41,131 - root - INFO - curr_items from 10.3.121.92:8091 : 32490
        2012-10-23 01:57:41,134 - root - INFO - curr_items from 10.3.121.96:8091 : 32704
        2012-10-23 01:57:41,136 - root - INFO - sum : 97734 and sum * (replica_factor + 1) (3) : 293202
        2012-10-23 01:57:41,483 - root - INFO - master_stats : 293215

        #32540 + 32490 + 32704 = 97734
        #97734*3=293202
        #293202!=293215

        farshid Farshid Ghods (Inactive) added a comment -

        Andrei,

        few things to verify
        1- did all items drain on all nodes
        2- is time in sync on all nodes ?

        farshid Farshid Ghods (Inactive) added a comment -

        please assign the ticket back to me after commenting on the ticket

        andreibaranouski Andrei Baranouski added a comment -

        python scripts/ssh.py -i centos-64-2.0-rebalance-regressions.ini "date"
        10.3.121.98
        Tue Oct 23 05:44:52 PDT 2012

        10.3.121.97
        Tue Oct 23 05:44:52 PDT 2012

        10.3.121.93
        Tue Oct 23 05:44:52 PDT 2012

        10.3.121.94
        Tue Oct 23 05:44:53 PDT 2012

        10.3.121.95
        Tue Oct 23 05:44:53 PDT 2012

        10.3.121.92
        Tue Oct 23 05:45:02 PDT 2012

        10.3.121.96
        Tue Oct 23 05:44:33 PDT 2012

        retest it after time sync

        andreibaranouski Andrei Baranouski added a comment -

        The problem is that the test loads 0.03% expired items using mcsoda (by default), and this causes curr_items_tot != sum(curr_items).
        I tried the same test with the param ratio-expiry=0:

        python testrunner -i /tmp/rebalance_regression.ini get-logs=True,wait_timeout=80 -t swaprebalance.SwapRebalanceFailedTests.test_failed_swap_rebalance,replica=2,num-buckets=1,num-swap=2,swap-orchestrator=False,ratio-expiry=0,skip-cleanup=True

        http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/94/

        Farshid, if it's not a bug, the test should be modified. Possible changes to fix it:
        1) increase the wait timeout (currently 10 min)
        2) set ratio-expiry=0
        3) modify the code so the test passes if the missing percentage is roughly equivalent to the ratio-expiry value
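
        Option 3 could be sketched like this (hypothetical names and slack factor; not the actual testrunner change):

        ```python
        # Hypothetical sketch of option 3: accept a small item delta when it is
        # explained by the fraction of expired items loaded (ratio-expiry).
        def items_verified(expected_tot, master_curr_items_tot, ratio_expiry, slack=2.0):
            delta = abs(master_curr_items_tot - expected_tot)
            missing_percentage = delta / float(expected_tot)
            # Pass when the discrepancy is no worse than the expired ratio (with slack).
            return missing_percentage <= ratio_expiry * slack
        ```

        With the numbers from this failure, items_verified(293202, 293215, 0.0003) would pass (about 4.4e-05 <= 6e-04), while ratio_expiry=0 would still fail it.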

        andreibaranouski Andrei Baranouski added a comment -

        The second time, with ratio-expiry=0 set, the test failed again: http://qa.hq.northscale.net/job/centos-64-2.0-rebalance-regressions/95/console

        Time is in sync:

        /opt/python2.7/bin/python scripts/ssh.py -i /tmp/rebalance_regression.ini date
        10.3.121.93
        Thu Oct 25 05:03:04 PDT 2012

        10.3.121.94
        Thu Oct 25 05:03:04 PDT 2012

        10.3.121.95
        Thu Oct 25 05:03:04 PDT 2012

        10.3.121.98
        Thu Oct 25 05:03:04 PDT 2012

        10.3.121.97
        Thu Oct 25 05:03:04 PDT 2012

        10.3.121.96
        Thu Oct 25 05:03:04 PDT 2012

        10.3.121.92
        Thu Oct 25 05:03:04 PDT 2012

        it's a bug

        farshid Farshid Ghods (Inactive) added a comment -

        Is this test passing now after syncing the time between VMs?

        andreibaranouski Andrei Baranouski added a comment -

        No, the test fails from time to time even with time in sync.
        For instance, on other VMs with slightly different params:
        testrunner -i /tmp/swaprebalance-32.ini get-logs=True,GROUP=P0 -t swaprebalance.SwapRebalanceFailedTests.test_failed_swap_rebalance,replica=1,num-buckets=1,num-swap=3,GROUP=P0
        http://qa.hq.northscale.net/job/ubuntu-32-2.0-swaprebalance-tests-P0/316/consoleFull

        2012-10-30 01:48:51,296 - root - INFO - sum : 97723 and sum * (replica_factor + 1) (2) : 195446
        2012-10-30 01:48:51,772 - root - INFO - master_stats : 195438
        2012-10-30 01:48:51,773 - root - INFO - delta : 8 missing_percentage : 4.09320221442e-05 replica_factor : 1

        steve Steve Yen added a comment -

        resolved per bug-scrub


          People

          • Assignee: Andrei Baranouski
          • Reporter: Andrei Baranouski
          • Votes: 0
          • Watchers: 0


              Gerrit Reviews

              There are no open Gerrit changes