Couchbase Server / MB-6595

[RN 2.0.1][longevity] Something unknown is causing severe timeouts in ns_server, particularly under view building and/or compaction, which causes rebalance to fail and other types of badness.

    Details

    • Flagged: Release Note

      Description

      Cluster information:

      • 11 CentOS 6.2 64-bit servers, each with a 4-core CPU
      • Each server has 10 GB RAM and 150 GB disk.
      • 8 GB RAM for Couchbase Server on each node (80% of total system memory)
      • Disk format ext3 on both data and root
      • Each server has its own drive; no disk sharing with other servers.
      • Load 9 million items into both buckets
      • Initial indexing, so CPU load is a little heavy
      • Cluster has 2 buckets, default (3GB) and saslbucket (3GB)
      • Each bucket has one design doc with 2 views per doc (default d1 and saslbucket d11)
      • Create a cluster with 10 nodes running Couchbase Server 2.0.0-1697

      10.3.121.13
      10.3.121.14
      10.3.121.15
      10.3.121.16
      10.3.121.17
      10.3.121.20
      10.3.121.22
      10.3.121.24
      10.3.121.25
      10.3.121.23

      • Data path /data
      • View path /data
      • Do a swap rebalance: add node 26 and remove node 25
      • Rebalance failed as in bug MB-6573
      • Then rebalance again. Rebalance failed again, with the error on the Logs page pointing to node 14

      Rebalance exited with reason {bulk_set_vbucket_state_failed,
      [{'ns_1@10.3.121.13',
      {'EXIT',
      {{{timeout,
      {gen_server,call,
      [ns_config,

      {eval, #Fun<cluster_compat_mode.0.45438860>}]}},
      {gen_server,call,
      ['tap_replication_manager-default', {change_vbucket_replication,726, undefined},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@10.3.121.13'},
      {if_rebalance,<0.16591.2>,
      {update_vbucket_state,726,replica,
      undefined,undefined}},
      infinity]}}}}]}
      ns_orchestrator002 ns_1@10.3.121.14 01:18:02 - Sat Sep 8, 2012

      <0.19004.2> exited with {bulk_set_vbucket_state_failed,
      [{'ns_1@10.3.121.13',
      {'EXIT',
      {{{timeout,
      {gen_server,call,
      [ns_config,{eval, #Fun<cluster_compat_mode.0.45438860>}

      ]}},
      {gen_server,call,
      ['tap_replication_manager-default',

      {change_vbucket_replication,726,undefined},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.3.121.13'},
      {if_rebalance,<0.16591.2>,
      {update_vbucket_state,726,replica,undefined,
      undefined}},
      infinity]}}}}]} ns_vbucket_mover000 ns_1@10.3.121.14 01:18:01 - Sat Sep 8, 2012
      Server error during processing: ["web request failed",


      * Going to node 14, I see many tap_replication_manager-default crashes right before rebalance failed at 01:18:01 - Sat Sep 8, 2012

      [error_logger:error,2012-09-08T1:18:01.836,ns_1@10.3.121.14:error_logger:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_single_vbucket_mover:mover/6
      pid: <0.19330.2>
      registered_name: []
      exception exit: {exited,
      {'EXIT',<0.16591.2>,
      {bulk_set_vbucket_state_failed,
      [{'ns_1@10.3.121.13',
      {'EXIT',
      {{{timeout,
      {gen_server,call,
      [ns_config,
      {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
      {gen_server,call,
      ['tap_replication_manager-default',
      {change_vbucket_replication,726,undefined}

      ,
      infinity]}},
      {gen_server,call,
      [

      {'janitor_agent-default','ns_1@10.3.121.13'},
      {if_rebalance,<0.16591.2>,
      {update_vbucket_state,726,replica,undefined,
      undefined}},
      infinity]}}}}]}}}
      in function ns_single_vbucket_mover:mover_inner_old_style/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
      ancestors: [<0.16591.2>,<0.22331.1>]
      messages: [{'EXIT',<0.16591.2>,
      {bulk_set_vbucket_state_failed,
      [{'ns_1@10.3.121.13',
      {'EXIT',
      {{{timeout,
      {gen_server,call,
      [ns_config,
      {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
      {gen_server,call,
      ['tap_replication_manager-default',
      {change_vbucket_replication,726,undefined},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.3.121.13'}

      ,
      {if_rebalance,<0.16591.2>,
      {update_vbucket_state,726,replica,undefined,
      undefined}},
      infinity]}}}}]}}]
      links: [<0.16591.2>]
      dictionary: [

      {cleanup_list,[<0.19392.2>]}

      ]
      trap_exit: true
      status: running
      heap_size: 4181
      stack_size: 24
      reductions: 4550
      neighbours:

      [ns_server:info,2012-09-08T1:18:01.835,ns_1@10.3.121.14:<0.19487.2>:ns_replicas_builder:build_replicas_main:94]Got exit not from child ebucketmigrator. Assuming it's our parent:

      {'EXIT', <0.19393.2>, shutdown}

      [ns_server:info,2012-09-08T1:18:01.880,ns_1@10.3.121.14:ns_config_rep:ns_config_rep:do_pull:341]Pulling config from: 'ns_1@10.3.121.13'

      [ns_server:info,2012-09-08T1:18:01.885,ns_1@10.3.121.14:<0.19328.2>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.121.22': [<<"replication_building_704_'ns_1@10.3.121.26'">>,
      <<"replication_building_704_'ns_1@10.3.121.24'">>,
      <<"replication_building_704_'ns_1@10.3.121.23'">>]
      [error_logger:error,2012-09-08T1:18:01.895,ns_1@10.3.121.14:error_logger:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_single_vbucket_mover:mover/6
      pid: <0.19276.2>
      registered_name: []
      exception exit: {exited,
      {'EXIT',<0.16591.2>,
      {bulk_set_vbucket_state_failed,
      [{'ns_1@10.3.121.13',
      {'EXIT',
      {{{timeout,
      {gen_server,call,
      [ns_config,

      {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
      {gen_server,call,
      ['tap_replication_manager-default',
      {change_vbucket_replication,726,undefined},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.3.121.13'},
      {if_rebalance,<0.16591.2>,
      {update_vbucket_state,726,replica,undefined,
      undefined}},
      infinity]}}}}]}}}
      in function ns_single_vbucket_mover:mover_inner_old_style/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
      ancestors: [<0.16591.2>,<0.22331.1>]
      messages: [{'EXIT',<0.16591.2>,
      {bulk_set_vbucket_state_failed,
      [{'ns_1@10.3.121.13',
      {'EXIT',
      {{{timeout,
      {gen_server,call,
      [ns_config,
      {eval,#Fun<cluster_compat_mode.0.45438860>}

      ]}},
      {gen_server,call,
      ['tap_replication_manager-default',

      {change_vbucket_replication,726,undefined},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.3.121.13'},
      {if_rebalance,<0.16591.2>,
      {update_vbucket_state,726,replica,undefined,
      undefined}},
      infinity]}}}}]}}]
      links: [<0.16591.2>]
      dictionary: [{cleanup_list,[<0.19328.2>]}]
      trap_exit: true
      status: running
      heap_size: 4181
      stack_size: 24
      reductions: 4434
      neighbours:

      [ns_server:info,2012-09-08T1:18:01.930,ns_1@10.3.121.14:<0.19487.2>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.121.16': [<<"replication_building_399_'ns_1@10.3.121.26'">>,
      <<"replication_building_399_'ns_1@10.3.121.24'">>,
      <<"replication_building_399_'ns_1@10.3.121.14'">>]
      [error_logger:error,2012-09-08T1:18:01.937,ns_1@10.3.121.14:error_logger:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_single_vbucket_mover:mover/6
      pid: <0.19393.2>
      registered_name: []
      exception exit: {exited,
      {'EXIT',<0.16591.2>,
      {bulk_set_vbucket_state_failed,
      [{'ns_1@10.3.121.13',
      {'EXIT',
      {{{timeout,
      {gen_server,call,
      [ns_config,
      {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
      {gen_server,call,
      ['tap_replication_manager-default',
      {change_vbucket_replication,726,undefined}

      ,
      infinity]}},
      {gen_server,call,
      [

      {'janitor_agent-default','ns_1@10.3.121.13'},
      {if_rebalance,<0.16591.2>,
      {update_vbucket_state,726,replica,undefined,
      undefined}},
      infinity]}}}}]}}}
      in function ns_single_vbucket_mover:mover_inner_old_style/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
      ancestors: [<0.16591.2>,<0.22331.1>]
      messages: [{'EXIT',<0.16591.2>,
      {bulk_set_vbucket_state_failed,
      [{'ns_1@10.3.121.13',
      {'EXIT',
      {{{timeout,
      {gen_server,call,
      [ns_config,
      {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
      {gen_server,call,
      ['tap_replication_manager-default',
      {change_vbucket_replication,726,undefined},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.3.121.13'}

      ,
      {if_rebalance,<0.16591.2>,
      {update_vbucket_state,726,replica,undefined,
      undefined}},
      infinity]}}}}]}}]
      links: [<0.16591.2>]
      dictionary: [

      {cleanup_list,[<0.19487.2>]}

      ]
      trap_exit: true
      status: running
      heap_size: 4181
      stack_size: 24
      reductions: 4435
      neighbours:

      [couchdb:info,2012-09-08T1:18:01.977,ns_1@10.3.121.14:<0.15832.2>:couch_log:info:39]10.3.121.22 - - POST /_view_merge/?stale=ok&limit=10 200
      [ns_server:error,2012-09-08T1:18:02.072,ns_1@10.3.121.14:<0.5850.0>:ns_memcached:verify_report_long_call:274]call topkeys took too long: 836560 us
      [rebalance:debug,2012-09-08T1:18:02.075,ns_1@10.3.121.14:<0.19493.2>:ns_single_vbucket_mover:mover_inner_old_style:195]child replicas builder for vbucket 138 is <0.19520.2>
      [ns_server:info,2012-09-08T1:18:02.077,ns_1@10.3.121.14:<0.19493.2>:ns_single_vbucket_mover:mover_inner_old_style:199]Got exit message (parent is <0.16591.2>). Exiting...
      {'EXIT',<0.16591.2>,
      {bulk_set_vbucket_state_failed,
      [{'ns_1@10.3.121.13',
      {'EXIT',
      {{{timeout,
      {gen_server,call,
      [ns_config,

      {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
      {gen_server,call,
      ['tap_replication_manager-default',
      {change_vbucket_replication,726,undefined},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.3.121.13'},
      {if_rebalance,<0.16591.2>,
      {update_vbucket_state,726,replica,undefined,
      undefined}},
      infinity]}}}}]}}
      [ns_server:debug,2012-09-08T1:18:02.115,ns_1@10.3.121.14:<0.19520.2>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 138 into 'ns_1@10.3.121.26' is <20326.5386.1>
      [ns_server:info,2012-09-08T1:18:02.125,ns_1@10.3.121.14:ns_port_memcached:ns_port_server:log:169]memcached<0.2005.0>: Sat Sep 8 08:18:01.920865 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.16 - Sending TAP_OPAQUE with command "complete_vb_filter_change" and vbucket 0

      [ns_server:debug,2012-09-08T1:18:02.142,ns_1@10.3.121.14:<0.3277.0>:mc_connection:do_delete_vbucket:118]Notifying mc_couch_events of vbucket deletion: default/137
      [views:info,2012-09-08T1:18:02.146,ns_1@10.3.121.14:'capi_set_view_manager-default':capi_set_view_manager:apply_index_states:459]
      Calling couch_set_view:set_partition_states([<<"default">>,<<"_design/d1">>,
      [103,104,105,106,107,108,109,110,
      111,112,113,114,115,116,117,118,
      119,120,121,122,123,124,125,126,
      127,128,129,130,131,132,138,139,
      140,141,142,143,144,145,146,147,
      148,149,150,151,152,153,154,155,
      156,157,158,159,160,161,162,163,
      164,165,166,167,168,169,170,171,
      172,173,174,175,176,177,178,179,
      180,181,182,183,184,185,186,187,
      188,189,190,191,192,193,194,195,
      196,197,198,199,200,201,202,203,
      204,205],
      [],
      [0,1,2,3,4,5,6,7,8,9,10,11,12,13,
      14,15,16,17,18,19,20,21,22,23,
      24,25,26,27,28,29,30,31,32,33,
      34,35,36,37,38,39,40,41,42,43,
      44,45,46,47,48,49,50,51,52,53,
      54,55,56,57,58,59,60,61,62,63,
      64,65,66,67,68,69,70,71,72,73,
      74,75,76,77,78,79,80,81,82,83,
      84,85,86,87,88,89,90,91,92,93,
      94,95,96,97,98,99,100,101,102,
      133,134,135,136,137,206,207,208,
      209,210,211,212,213,214,215,216,
      217,218,219,220,221,222,223,224,
      225,226,227,228,229,230,231,232,
      233,234,235,236,237,238,239,240,
      241,242,243,244,245,246,247,248,
      249,250,251,252,253,254,255,256,
      257,258,259,260,261,262,263,264,
      265,266,267,268,269,270,271,272,
      273,274,275,276,277,278,279,280,
      281,282,283,284,285,286,287,288,
      289,290,291,292,293,294,295,296,
      297,298,299,300,301,302,303,304,
      305,306,307,308,309,310,311,312,
      313,314,315,316,317,318,319,320,
      321,322,323,324,325,326,327,328,
      329,330,331,332,333,334,335,336,
      337,338,339,340,341,342,343,344,
      345,346,347,348,349,350,351,352,
      353,354,355,356,357,358,359,360,
      361,362,363,364,365,366,367,368,
      369,370,371,372,373,374,375,376,
      377,378,379,380,381,382,383,384,
      385,386,387,388,389,390,391,392,
      393,394,395,396,397,398,399,400,
      401,402,403,404,405,406,407,408,
      409,410,411,412,413,414,415,416,
      417,418,419,420,421,422,423,424,
      425,426,427,428,429,430,431,432,
      433,434,435,436,437,438,439,440,
      441,442,443,444,445,446,447,448,
      449,450,451,452,453,454,455,456,
      457,458,459,460,461,462,463,464,
      465,466,467,468,469,470,471,472,
      473,474,475,476,477,478,479,480,
      481,482,483,484,485,486,487,488,
      489,490,491,492,493,494,495,496,
      497,498,499,500,501,502,503,504,
      505,506,507,508,509,510,511,512,
      513,514,515,516,517,518,519,520,
      521,522,523,524,525,526,527,528,
      529,530,531,532,533,534,535,536,
      537,538,539,540,541,542,543,544,
      545,546,547,548,549,550,551,552,
      553,554,555,556,557,558,559,560,
      561,562,563,564,565,566,567,568,
      569,570,571,572,573,574,575,576,
      577,578,579,580,581,582,583,584,
      585,586,587,588,589,590,591,592,
      593,594,595,596,597,598,599,600,
      601,602,603,604,605,606,607,608,
      609,610,611,612,613,614,615,616,
      617,618,619,620,621,622,623,624,
      625,626,627,628,629,630,631,632,
      633,634,635,636,637,638,639,640,
      641,642,643,644,645,646,647,648,
      649,650,651,652,653,654,655,656,
      657,658,659,660,661,662,663,664,
      665,666,667,668,669,670,671,672,
      673,674,675,676,677,678,679,680,
      681,682,683,684,685,686,687,688,
      689,690,691,692,693,694,695,696,
      697,698,699,700,701,702,703,704,
      705,706,707,708,709,710,711,712,
      713,714,715,716,717,718,719,720,
      721,722,723,724,725,726,727,728,
      729,730,731,732,733,734,735,736,
      737,738,739,740,741,742,743,744,
      745,746,747,748,749,750,751,752,
      753,754,755,756,757,758,759,760,
      761,762,763,764,765,766,767,768,
      769,770,771,772,773,774,775,776,
      777,778,779,780,781,782,783,784,
      785,786,787,788,789,790,791,792,
      793,794,795,796,797,798,799,800,
      801,802,803,804,805,806,807,808,
      809,810,811,812,813,814,815,816,
      817,818,819,820,821,822,823,824,
      825,826,827,828,829,830,831,832,
      833,834,835,836,837,838,839,840,
      841,842,843,844,845,846,847,848,
      849,850,851,852,853,854,855,856,
      857,858,859,860,861,862,863,864,
      865,866,867,868,869,870,871,872,
      873,874,875,876,877,878,879,880,
      881,882,883,884,885,886,887,888,
      889,890,891,892,893,894,895,896,
      897,898,899,900,901,902,903,904,
      905,906,907,908,909,910,911,912,
      913,914,915,916,917,918,919,920,
      921,922,923,924,925,926,927,928,
      929,930,931,932,933,934,935,936,
      937,938,939,940,941,942,943,944,
      945,946,947,948,949,950,951,952,
      953,954,955,956,957,958,959,960,
      961,962,963,964,965,966,967,968,
      969,970,971,972,973,974,975,976,
      977,978,979,980,981,982,983,984,
      985,986,987,988,989,990,991,992,
      993,994,995,996,997,998,999,
      1000,1001,1002,1003,1004,1005,
      1006,1007,1008,1009,1010,1011,
      1012,1013,1014,1015,1016,1017,
      1018,1019,1020,1021,1022,1023]])
      [ns_server:debug,2012-09-08T1:18:02.161,ns_1@10.3.121.14:<0.19520.2>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 138 into 'ns_1@10.3.121.16' is <18036.10781.2>
      [couchdb:info,2012-09-08T1:18:02.162,ns_1@10.3.121.14:<0.18109.0>:couch_log:info:39]Stopping updater for set view `default`, main group `_design/d1`
      [ns_server:debug,2012-09-08T1:18:02.176,ns_1@10.3.121.14:<0.19520.2>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 138 into 'ns_1@10.3.121.24' is <18041.13682.2>
      [couchdb:info,2012-09-08T1:18:02.179,ns_1@10.3.121.14:<0.18109.0>:couch_log:info:39]Updater, set view `default`, main group `_design/d1`, stopped with reason: {updater_error, shutdown}
      [couchdb:info,2012-09-08T1:18:02.234,ns_1@10.3.121.14:<0.18109.0>:couch_log:info:39]Set view `default`, main group `_design/d1`, partition states updated
      active partitions before: [103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,137,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199]
      active partitions after: [103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199]
      passive partitions before: []
      passive partitions after: []
      cleanup partitions before: [133,200,201,202,203,204,205]
      cleanup partitions after: [133,137,200,201,202,203,204,205]
      unindexable partitions: []
      replica partitions before: [0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,58,59,60,61,62,63,80,81,82,83,84,85,92,93,94,95,96,97,98,99,100,101,102,206,207,208,209,210,211,218,219,220,221,222,223,224,225,226,227,228,229,281,282,283,284,285,286,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,356,357,358,359,360,361,374,375,376,377,378,395,396,397,398,399,423,424,425,426,427,428,429,430,431,432,433,514,515,516,517,518,519,525,526,527,528,529,530,531,532,533,534,535,627,628,629,630,631,632,633,634,635,636,637,729,730,731,732,733,734,735,736,737,738,739,832,833,834,835,836,837,838,839,840,841,842,854,855,856,857,858,933,934,935,936,937,938,939,940,941,942,943,944,1000,1001,1002,1003,1004,1005,1012,1013,1014,1015,1016,1017]
      replica partitions after: [0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,58,59,60,61,62,63,80,81,82,83,84,85,92,93,94,95,96,97,98,99,100,101,102,206,207,208,209,210,211,218,219,220,221,222,223,224,225,226,227,228,229,281,282,283,284,285,286,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,356,357,358,359,360,361,374,375,376,377,378,395,396,397,398,399,423,424,425,426,427,428,429,430,431,432,433,514,515,516,517,518,519,525,526,527,528,529,530,531,532,533,534,535,627,628,629,630,631,632,633,634,635,636,637,729,730,731,732,733,734,735,736,737,738,739,832,833,834,835,836,837,838,839,840,841,842,854,855,856,857,858,933,934,935,936,937,938,939,940,941,942,943,944,1000,1001,1002,1003,1004,1005,1012,1013,1014,1015,1016,1017]
      replicas on transfer before: []
      replicas on transfer after: []
      pending transition before:
      active: [200,201,202,203,204,205]
      passive: []
      pending transition after:
      active: [200,201,202,203,204,205]
      passive: []

      [ns_server:info,2012-09-08T1:18:02.238,ns_1@10.3.121.14:<0.19520.2>:ns_replicas_builder:build_replicas_main:94]Got exit not from child ebucketmigrator. Assuming it's our parent: {'EXIT', <0.19493.2>, shutdown}
      [couchdb:info,2012-09-08T1:18:02.241,ns_1@10.3.121.14:<0.18109.0>:couch_log:info:39]Starting updater for set view `default`, main group `_design/d1`
      [couchdb:info,2012-09-08T1:18:02.242,ns_1@10.3.121.14:<0.19533.2>:couch_log:info:39]Updater for set view `default`, main group `_design/d1` started
      Active partitions: [103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,138,139,140,141,142,143,144,145,146,147,148,149,150,151,152,153,154,155,156,157,158,159,160,161,162,163,164,165,166,167,168,169,170,171,172,173,174,175,176,177,178,179,180,181,182,183,184,185,186,187,188,189,190,191,192,193,194,195,196,197,198,199]
      Passive partitions: []
      Cleanup partitions: [133,137,200,201,202,203,204,205]
      Replicas to transfer: []
      Pending transition:
      active: [200,201,202,203,204,205]
      passive: []
      Initial build: false

      [views:info,2012-09-08T1:18:02.242,ns_1@10.3.121.14:'capi_set_view_manager-default':capi_set_view_manager:apply_index_states:460]
      couch_set_view:set_partition_states([<<"default">>,<<"_design/d1">>,
      [103,104,105,106,107,108,109,110,111,112,
      113,114,115,116,117,118,119,120,121,122,
      123,124,125,126,127,128,129,130,131,132,
      138,139,140,141,142,143,144,145,146,147,
      148,149,150,151,152,153,154,155,156,157,
      158,159,160,161,162,163,164,165,166,167,
      168,169,170,171,172,173,174,175,176,177,
      178,179,180,181,182,183,184,185,186,187,
      188,189,190,191,192,193,194,195,196,197,
      198,199,200,201,202,203,204,205],
      [],
      [0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,
      16,17,18,19,20,21,22,23,24,25,26,27,28,
      29,30,31,32,33,34,35,36,37,38,39,40,41,
      42,43,44,45,46,47,48,49,50,51,52,53,54,
      55,56,57,58,59,60,61,62,63,64,65,66,67,
      68,69,70,71,72,73,74,75,76,77,78,79,80,
      81,82,83,84,85,86,87,88,89,90,91,92,93,
      94,95,96,97,98,99,100,101,102,133,134,
      135,136,137,206,207,208,209,210,211,212,
      213,214,215,216,217,218,219,220,221,222,
      223,224,225,226,227,228,229,230,231,232,
      233,234,235,236,237,238,239,240,241,242,
      243,244,245,246,247,248,249,250,251,252,
      253,254,255,256,257,258,259,260,261,262,
      263,264,265,266,267,268,269,270,271,272,
      273,274,275,276,277,278,279,280,281,282,
      283,284,285,286,287,288,289,290,291,292,
      293,294,295,296,297,298,299,300,301,302,
      303,304,305,306,307,308,309,310,311,312,
      313,314,315,316,317,318,319,320,321,322,
      323,324,325,326,327,328,329,330,331,332,
      333,334,335,336,337,338,339,340,341,342,
      343,344,345,346,347,348,349,350,351,352,
      353,354,355,356,357,358,359,360,361,362,
      363,364,365,366,367,368,369,370,371,372,
      373,374,375,376,377,378,379,380,381,382,
      383,384,385,386,387,388,389,390,391,392,
      393,394,395,396,397,398,399,400,401,402,
      403,404,405,406,407,408,409,410,411,412,
      413,414,415,416,417,418,419,420,421,422,
      423,424,425,426,427,428,429,430,431,432,
      433,434,435,436,437,438,439,440,441,442,
      443,444,445,446,447,448,449,450,451,452,
      453,454,455,456,457,458,459,460,461,462,
      463,464,465,466,467,468,469,470,471,472,
      473,474,475,476,477,478,479,480,481,482,
      483,484,485,486,487,488,489,490,491,492,
      493,494,495,496,497,498,499,500,501,502,
      503,504,505,506,507,508,509,510,511,512,
      513,514,515,516,517,518,519,520,521,522,
      523,524,525,526,527,528,529,530,531,532,
      533,534,535,536,537,538,539,540,541,542,
      543,544,545,546,547,548,549,550,551,552,
      553,554,555,556,557,558,559,560,561,562,
      563,564,565,566,567,568,569,570,571,572,
      573,574,575,576,577,578,579,580,581,582,
      583,584,585,586,587,588,589,590,591,592,
      593,594,595,596,597,598,599,600,601,602,
      603,604,605,606,607,608,609,610,611,612,
      613,614,615,616,617,618,619,620,621,622,
      623,624,625,626,627,628,629,630,631,632,
      633,634,635,636,637,638,639,640,641,642,
      643,644,645,646,647,648,649,650,651,652,
      653,654,655,656,657,658,659,660,661,662,
      663,664,665,666,667,668,669,670,671,672,
      673,674,675,676,677,678,679,680,681,682,
      683,684,685,686,687,688,689,690,691,692,
      693,694,695,696,697,698,699,700,701,702,
      703,704,705,706,707,708,709,710,711,712,
      713,714,715,716,717,718,719,720,721,722,
      723,724,725,726,727,728,729,730,731,732,
      733,734,735,736,737,738,739,740,741,742,
      743,744,745,746,747,748,749,750,751,752,
      753,754,755,756,757,758,759,760,761,762,
      763,764,765,766,767,768,769,770,771,772,
      773,774,775,776,777,778,779,780,781,782,
      783,784,785,786,787,788,789,790,791,792,
      793,794,795,796,797,798,799,800,801,802,
      803,804,805,806,807,808,809,810,811,812,
      813,814,815,816,817,818,819,820,821,822,
      823,824,825,826,827,828,829,830,831,832,
      833,834,835,836,837,838,839,840,841,842,
      843,844,845,846,847,848,849,850,851,852,
      853,854,855,856,857,858,859,860,861,862,
      863,864,865,866,867,868,869,870,871,872,
      873,874,875,876,877,878,879,880,881,882,
      883,884,885,886,887,888,889,890,891,892,
      893,894,895,896,897,898,899,900,901,902,
      903,904,905,906,907,908,909,910,911,912,
      913,914,915,916,917,918,919,920,921,922,
      923,924,925,926,927,928,929,930,931,932,
      933,934,935,936,937,938,939,940,941,942,
      943,944,945,946,947,948,949,950,951,952,
      953,954,955,956,957,958,959,960,961,962,
      963,964,965,966,967,968,969,970,971,972,
      973,974,975,976,977,978,979,980,981,982,
      983,984,985,986,987,988,989,990,991,992,
      993,994,995,996,997,998,999,1000,1001,
      1002,1003,1004,1005,1006,1007,1008,1009,
      1010,1011,1012,1013,1014,1015,1016,1017,
      1018,1019,1020,1021,1022,1023]]) returned ok in 80ms
      [couchdb:info,2012-09-08T1:18:02.271,ns_1@10.3.121.14:<0.19539.2>:couch_log:info:39]Updater reading changes from active partitions to update main set view group `_design/d1` from set `default`
      [views:info,2012-09-08T1:18:02.282,ns_1@10.3.121.14:'capi_set_view_manager-default':capi_set_view_manager:apply_index_states:464]
      Calling couch_set_view:add_replica_partitions([<<"default">>,<<"_design/d1">>,
      [0,1,2,3,4,5,6,7,8,9,10,11,24,
      25,26,27,28,29,36,37,38,39,40,
      41,42,43,44,45,46,47,48,49,50,
      51,58,59,60,61,62,63,80,81,82,
      83,84,85,92,93,94,95,96,97,98,
      99,100,101,102,206,207,208,
      209,210,211,218,219,220,221,
      222,223,224,225,226,227,228,
      229,281,282,283,284,285,286,
      321,322,323,324,325,326,327,
      328,329,330,331,332,333,334,
      335,336,337,338,339,340,341,
      342,343,344,356,357,358,359,
      360,361,374,375,376,377,378,
      395,396,397,398,399,423,424,
      425,426,427,428,429,430,431,
      432,433,514,515,516,517,518,
      519,525,526,527,528,529,530,
      531,532,533,534,535,627,628,
      629,630,631,632,633,634,635,
      636,637,729,730,731,732,733,
      734,735,736,737,738,739,832,
      833,834,835,836,837,838,839,
      840,841,842,854,855,856,857,
      858,933,934,935,936,937,938,
      939,940,941,942,943,944,1000,
      1001,1002,1003,1004,1005,1012,
      1013,1014,1015,1016,1017]])
      [couchdb:info,2012-09-08T1:18:02.285,ns_1@10.3.121.14:<0.15832.2>:couch_log:info:39]10.3.121.22 - - POST /_view_merge/?stale=ok&limit=10 200
      [couchdb:info,2012-09-08T1:18:02.296,ns_1@10.3.121.14:<0.18109.0>:couch_log:info:39]Set view `default`, main group `_design/d1`, defined new replica partitions: [0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,58,59,60,61,62,63,80,81,82,83,84,85,92,93,94,95,96,97,98,99,100,101,102,206,207,208,209,210,211,218,219,220,221,222,223,224,225,226,227,228,229,281,282,283,284,285,286,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,356,357,358,359,360,361,374,375,376,377,378,395,396,397,398,399,423,424,425,426,427,428,429,430,431,432,433,514,515,516,517,518,519,525,526,527,528,529,530,531,532,533,534,535,627,628,629,630,631,632,633,634,635,636,637,729,730,731,732,733,734,735,736,737,738,739,832,833,834,835,836,837,838,839,840,841,842,854,855,856,857,858,933,934,935,936,937,938,939,940,941,942,943,944,1000,1001,1002,1003,1004,1005,1012,1013,1014,1015,1016,1017]
      New full set of replica partitions is: [0,1,2,3,4,5,6,7,8,9,10,11,24,25,26,27,28,29,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,58,59,60,61,62,63,80,81,82,83,84,85,92,93,94,95,96,97,98,99,100,101,102,206,207,208,209,210,211,218,219,220,221,222,223,224,225,226,227,228,229,281,282,283,284,285,286,321,322,323,324,325,326,327,328,329,330,331,332,333,334,335,336,337,338,339,340,341,342,343,344,356,357,358,359,360,361,374,375,376,377,378,395,396,397,398,399,423,424,425,426,427,428,429,430,431,432,433,514,515,516,517,518,519,525,526,527,528,529,530,531,532,533,534,535,627,628,629,630,631,632,633,634,635,636,637,729,730,731,732,733,734,735,736,737,738,739,832,833,834,835,836,837,838,839,840,841,842,854,855,856,857,858,933,934,935,936,937,938,939,940,941,942,943,944,1000,1001,1002,1003,1004,1005,1012,1013,1014,1015,1016,1017]

      [ns_server:info,2012-09-08T1:18:02.319,ns_1@10.3.121.14:<0.19520.2>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.121.14': [<<"replication_building_138_'ns_1@10.3.121.26'">>,
      <<"replication_building_138_'ns_1@10.3.121.16'">>,
      <<"replication_building_138_'ns_1@10.3.121.24'">>]
      [ns_server:info,2012-09-08T1:18:02.321,ns_1@10.3.121.14:'janitor_agent-default':janitor_agent:handle_info:646]Undoing temporary vbucket states caused by rebalance
      [user:info,2012-09-08T1:18:02.322,ns_1@10.3.121.14:<0.20487.1>:ns_orchestrator:handle_info:295]Rebalance exited with reason {bulk_set_vbucket_state_failed,
      [{'ns_1@10.3.121.13',
      {'EXIT',
      {{{timeout,
      {gen_server,call,
      [ns_config,
      {eval, #Fun<cluster_compat_mode.0.45438860>}]}},
      {gen_server,call,
      ['tap_replication_manager-default',
      {change_vbucket_replication,726, undefined},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default', 'ns_1@10.3.121.13'},
      {if_rebalance,<0.16591.2>,
      {update_vbucket_state,726,replica,
      undefined,undefined}},
      infinity]}}}}]}

      [ns_server:debug,2012-09-08T1:18:02.322,ns_1@10.3.121.14:<0.16604.2>:ns_pubsub:do_subscribe_link:134]Parent process of subscription {ns_node_disco_events,<0.16591.2>} exited with reason {bulk_set_vbucket_state_failed,
      [{'ns_1@10.3.121.13',
      {'EXIT',
      {{{timeout,
      {gen_server,
      call,
      [ns_config,
      {eval, #Fun<cluster_compat_mode.0.45438860>}]}},
      {gen_server,
      call,
      ['tap_replication_manager-default',
      {change_vbucket_replication, 726, undefined},
      infinity]}},
      {gen_server,
      call,
      [{'janitor_agent-default', 'ns_1@10.3.121.13'},
      {if_rebalance,
      <0.16591.2>,
      {update_vbucket_state,
      726,
      replica,
      undefined,
      undefined}},
      infinity]}}}}]}
      [ns_server:debug,2012-09-08T1:18:02.341,ns_1@10.3.121.14:<0.16604.2>:ns_pubsub:do_subscribe_link:149]Deleting {ns_node_disco_events,<0.16591.2>} event handler: ok
      [ns_server:debug,2012-09-08T1:18:02.345,ns_1@10.3.121.14:'capi_set_view_manager-saslbucket':capi_set_view_manager:handle_info:337]doing replicate_newnodes_docs
      [ns_server:info,2012-09-08T1:18:02.354,ns_1@10.3.121.14:<0.19587.2>:diag_handler:log_all_tap_and_checkpoint_stats:126]logging tap & checkpoint stats
      [error_logger:error,2012-09-08T1:18:02.343,ns_1@10.3.121.14:error_logger:ale_error_logger_handler:log_report:72]
      =========================CRASH REPORT=========================
      crasher:
      initial call: ns_single_vbucket_mover:mover/6
      pid: <0.19493.2>
      registered_name: []
      exception exit: {exited,
      {'EXIT',<0.16591.2>,
      {bulk_set_vbucket_state_failed,
      [{'ns_1@10.3.121.13',
      {'EXIT',
      {{{timeout,
      {gen_server,call,
      [ns_config,
      {eval,#Fun<cluster_compat_mode.0.45438860>}

      ]}},
      {gen_server,call,
      ['tap_replication_manager-default',

      {change_vbucket_replication,726,undefined},
      infinity]}},
      {gen_server,call,
      [{'janitor_agent-default','ns_1@10.3.121.13'},
      {if_rebalance,<0.16591.2>,
      {update_vbucket_state,726,replica,undefined,
      undefined}},
      infinity]}}}}]}}}
      in function ns_single_vbucket_mover:mover_inner_old_style/6
      in call from misc:try_with_maybe_ignorant_after/2
      in call from ns_single_vbucket_mover:mover/6
      ancestors: [<0.16591.2>,<0.22331.1>]
      messages: [{'EXIT',<0.16591.2>,
      {bulk_set_vbucket_state_failed,
      [{'ns_1@10.3.121.13',
      {'EXIT',
      {{{timeout,
      {gen_server,call,
      [ns_config,
      {eval,#Fun<cluster_compat_mode.0.45438860>}]}},
      {gen_server,call,
      ['tap_replication_manager-default',
      {change_vbucket_replication,726,undefined}

      ,
      infinity]}},
      {gen_server,call,
      [

      {'janitor_agent-default','ns_1@10.3.121.13'}

      ,
      {if_rebalance,<0.16591.2>,
      {update_vbucket_state,726,replica,undefined,
      undefined}},
      infinity]}}}}]}}]
      links: [<0.19498.2>,<0.16591.2>]
      dictionary: [

      {cleanup_list,[<0.19520.2>]}

      ]
      trap_exit: true
      status: running
      heap_size: 2584
      stack_size: 24
      reductions: 4491
      neighbours:

      Link to diags of all nodes

      https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/11nodes-1697-rebalance-failed-bulk_set_vbucket_state_failed-20120908.tgz

      Attachments

      1. erl_healthy-node24-crash.dump.gz
        3.15 MB
        Thuan Nguyen
      2. erl-over-1gb-node14_crash.dump.gz
        5.57 MB
        Thuan Nguyen
      3. ns-diag-20121031094231.txt.xz
        831 kB
        Aleksey Kondratenko
      4. report_atop_10.6.2.37_default_simple-view_test (6).pdf
        297 kB
        Ketaki Gangal

        Issue Links

          Activity

          thuan Thuan Nguyen created issue -
          karan Karan Kumar (Inactive) made changes -
          Summary: [longevity] rebalance failed due to tap_replication_manager-default crashed → [longevity] rebalance failed due to timeout in tap_replication_manager for default bucket
          karan Karan Kumar (Inactive) added a comment -

          Seems similar to: http://www.couchbase.com/issues/browse/MB-6058

          And I guess tap-replication-manager was recently added. Making sure this is not a regression?

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          This one is clearly different, and just from looking at the error it again looks like paging. We're clearly timing out on ns_config access.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          There must be a duplicate somewhere.

          In order to confirm paging I need atop output.

          alkondratenko Aleksey Kondratenko (Inactive) made changes -
          Assignee: Aleksey Kondratenko [ alkondratenko ] → Karan Kumar [ karan ]
          thuan Thuan Nguyen added a comment -

          Fresh install of Couchbase Server 2.0.0-1708 on a 10-node cluster. Verified swap = 0% and swappiness set to 10.
          Create 2 buckets, default (3GB) and saslbucket (3GB).
          Each bucket has one design doc with 2 views per doc (default d1 and saslbucket d11).

          10.3.121.13
          10.3.121.14
          10.3.121.15
          10.3.121.16
          10.3.121.17
          10.3.121.20
          10.3.121.22
          10.3.121.24
          10.3.121.25
          10.3.121.23

          Data path /data
          View path /data

          I hit this bug again in build 1708 when failing over and rebalancing out node 13.
          The failure error also points to node 14.

          Rebalance exited with reason {{bulk_set_vbucket_state_failed,
          [{'ns_1@10.3.121.22',
          {'EXIT',
          {{{timeout,
          {gen_server,call,
          [ns_config,

          {eval, #Fun<cluster_compat_mode.0.45438860>}

          ]}},
          {gen_server,call,
          ['tap_replication_manager-saslbucket',

          {change_vbucket_replication,470, undefined}

          ,
          infinity]}},
          {gen_server,call,
          [

          {'janitor_agent-saslbucket', 'ns_1@10.3.121.22'}

          ,
          {if_rebalance,<0.24261.36>,
          {update_vbucket_state,579,replica,
          undefined,undefined}},
          infinity]}}}}]},
          [

          {janitor_agent,bulk_set_vbucket_state,4}

          ,

          {ns_vbucket_mover, update_replication_post_move,3}

          ,

          {ns_vbucket_mover,handle_info,2}

          ,

          {gen_server,handle_msg,5}

          ,

          {proc_lib,init_p_do_apply,3}

          ]}
          ns_orchestrator002 ns_1@10.3.121.14 13:31:20 - Tue Sep 11, 2012

          Link to atop of all nodes https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-10nodes-1708-reb-failed-bulk_set_vbucket_state_failed-20120911-140230.tgz

          Link to diags of all nodes https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/10nodes-1708-reb-failed-bulk_set_vbucket_state_failed-20120911-140230.tgz

          karan Karan Kumar (Inactive) made changes -
          Assignee: Karan Kumar [ karan ] → Aleksey Kondratenko [ alkondratenko ]
          karan Karan Kumar (Inactive) added a comment -

          @Alk

          Tony attached the relevant atop output in the comment above.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Thanks. At last some useful data. I observed quite a severe increase in the resident size of the Erlang VM. So major page faults are not there; we quite possibly observe minor page faults being blocked on lack of memory.

          thuan Thuan Nguyen added a comment -

          Link to erl dumps taken when resident memory of beam.smp went over 1 GB on nodes 16 and 17 during rebalance out of node 13.

          https://s3.amazonaws.com/packages.couchbase/erlang/orange/201209/2-erl-dump-Rm-over-1GB.tgz

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          May I ask for logs from this run?

          karan Karan Kumar (Inactive) added a comment -

          Test

          karan Karan Kumar (Inactive) added a comment -

          Saw a sudden growth of Rsize for beam.smp from ~500 MB to ~1.2 GB on the node that timed out and caused the rebalance failure.

          This is also related to other timeouts we have been seeing (ddocs/ns_disco/ns_doctor) on the nodes that have a high Rsize value.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Give me logs please. I'm seeing very weird things in the crash dumps and would like to see the logs.

          thuan Thuan Nguyen added a comment -

          Generating diags. Will upload them soon.

          karan Karan Kumar (Inactive) added a comment -

          It's almost done.
          You can access the cluster if you wish: 10.3.121.24:8091

          thuan Thuan Nguyen added a comment -

          Link to diags of all nodes: https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/9nodes-1708-res-mem-beam-over-1gb-20120911.tgz

          karan Karan Kumar (Inactive) added a comment -

          Looking at the erl crash dump. It lists the following as taints:

          Taints "couch_view_parser,couch_ejson_compare,mapreduce,snappy,ejson,crypto"

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Taints are fine. It's basically some 3rd-party C code living in the same address space as Erlang; as far as I understand, any loaded NIF will add a taint.

          karan Karan Kumar (Inactive) added a comment -

          Ok.

          Another thing: the number of atoms is 20605. Seems large?

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          No. The really weird thing is the messages queued up in error_logger.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Logs of .17 are too old. Latest message is:

          [couchdb:info,2012-09-08T1:25:26.301,ns_1@10.3.121.17:<0.7480.2>:couch_log:info:39]10.3.121.13 - - POST /_view_merge/?stale=ok&limit=500 500
          [views:info,2012-09-08T1:25:26.342,ns_1@10.3.121.17:'capi_set_view_manager-default':capi_set_view_manager:apply_index_states:464]
          Calling couch_set_view:add_replica_partitions([<<"default">>,<<"_design/d1">>,
          [36,37,38,39,40,41,42,43,44,45,
          46,139,140,141,142,143,144,
          145,146,147,148,149,212,213,
          214,215,216,217,242,243,244,
          245,246,247,248,249,250,251,
          252,253,292,293,294,295,296,
          297,345,346,347,348,349,350,
          351,352,353,354,355,509,510,
          520,521,522,523,524,525,526,
          527,528,529,530,531,532,533,
          534,535,536,537,538,539,540,
          541,542,543,544,545,546,547,
          548,549,550,551,552,553,554,
          555,556,557,558,559,560,561,
          562,563,564,565,566,567,568,
          569,570,571,572,573,574,575,
          576,577,578,579,580,581,600,
          601,602,603,604,608,609,610,
          611,612,613,614,615,660,661,
          662,663,664,665,666,667,668,
          669,670,671,683,684,685,686,
          687,688,689,690,691,692,693,
          694,762,763,764,765,766,767,
          768,769,770,771,772,773,774,
          775,776,777,778,865,866,867,
          868,869,870,871,872,873,874,
          875,967,968,969,970,971,972,
          973,974,975,976,977,978,979,
          980,981,982,983]])
          [couchdb:info,2012-09-08T1:25:26.382,ns_1@10.3.121.17:<0.26843.2>:couch_log:info:39]Updater reading changes from active partitions to update main set view group `_design/d1` from set `default`

          thuan Thuan Nguyen added a comment -

          Here are 2 Erlang crash dumps: one from healthy node 24 and one from the problem node (14) with beam memory over 1 GB.

          thuan Thuan Nguyen made changes -
          Attachment erl_healthy-node24-crash.dump.gz [ 14966 ]
          Attachment erl-over-1gb-node14_crash.dump.gz [ 14967 ]
          thuan Thuan Nguyen added a comment -

          Link to diags of node 17 downloaded from the UI: https://s3.amazonaws.com/packages.couchbase/diag-logs/orange/201209/10.3.121.17-diags-20120911195436.txt.gz
          farshid Farshid Ghods (Inactive) made changes -
          Labels sblocker system-test
          farshid Farshid Ghods (Inactive) added a comment -

          Adding this to the current sprint since I see Chiyoung and Alk are already working on and debugging this issue.

          farshid Farshid Ghods (Inactive) made changes -
          Sprint Status Current Sprint
          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          May I ask you guys to re-run while periodically doing this:

          wget -O- --user=Administrator --password=asdasd --post-data="rpc:eval_everywhere(erlang, apply, [fun () -> [erlang:garbage_collect(P) || P <- erlang:processes()], ok end, []])" http://myhost:8091/diag/eval

          Doing it on just one node should be enough.

          I want to double check that this is caused by weird GC.

          karan Karan Kumar (Inactive) added a comment -

          At what frequency do you want us to run the Erlang GC?

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          30 seconds. And monitor memory usage.

          I've seen how a single run will halve memory usage. But it would be very interesting to see whether some growth happens during rebalance in your case.
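
          A minimal shell sketch of the periodic workaround being discussed here, for reference; the credentials, host placeholder, and /diag/eval request are the ones from the earlier comment, not values confirmed for this cluster:

          # Run the cluster-wide forced GC every 30 seconds from any one node.
          while true; do
            wget -qO- --user=Administrator --password=asdasd --post-data="rpc:eval_everywhere(erlang, apply, [fun () -> [erlang:garbage_collect(P) || P <- erlang:processes()], ok end, []])" http://myhost:8091/diag/eval
            sleep 30
          done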

          karan Karan Kumar (Inactive) added a comment -

          @Alk:

          Running this every 30 secs seems to work really well.
          The memory usage for beam is around 30-40% lower.

          On the same workload, we don't see suspicious rsize growth.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Great. Use this workaround for now. Once my memory usage optimizations land we'll re-test and then consider whether we still need this workaround in the product, or maybe some other tuning.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          I want you to try the following alternative:

          wget -O- --user=Administrator --password=asdasd --post-data="rpc:eval_everywhere(erlang, apply, [fun () -> erlang:system_flag(fullsweep_after, 0) end, []])" http://lh:9000/diag/eval

          Run this once on one node. It'll enable more aggressive GC on all nodes. I hope it'll have results very similar to the periodic full manual GC.

          After testing the setting of 0 (and only if it works reasonably OK) I want you to try 10.

          wget -O- --user=Administrator --password=asdasd --post-data="rpc:eval_everywhere(erlang, apply, [fun () -> erlang:system_flag(fullsweep_after, 10) end, []])" http://lh:9000/diag/eval

          That should provide very valuable information.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Ignore the message above. We'll need this in the couchbase-server script:

          export ERL_FULLSWEEP_AFTER=0
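
          For reference, a minimal way to check that a node's Erlang VM actually picked up the flag after a restart, using the same /diag/eval endpoint and placeholder credentials as the wget commands above (this check is a suggestion, not something from this ticket):

          # The VM should report fullsweep_after = 0 once the export is in the start script and the service has been restarted.
          wget -qO- --user=Administrator --password=asdasd --post-data="erlang:system_info(fullsweep_after)" http://myhost:8091/diag/eval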

          karan Karan Kumar (Inactive) added a comment -

          We are going to try with value 0 first, then with value 10.

          farshid Farshid Ghods (Inactive) made changes -
          Summary: [longevity] rebalance failed due to timeout in tap_replication_manager for default bucket → [longevity] rebalance fails with error "bulk_set_vbucket_state_failed" due to timeout in tap_replication_manager for default bucket
          Labels: sblocker system-test → 2.0-beta-release-notes sblocker system-test
          thuan Thuan Nguyen added a comment - - edited

          With ERL_FULLSWEEP_AFTER set to 0, I got a timeout again.
          Link to atop file of all nodes https://s3.amazonaws.com/packages.couchbase/atop-files/orange/201209/atop-12nodes-1728-reset-reb-status-since-not-running-20120917.tgz
          Link to collect info of all nodes https://s3.amazonaws.com/packages.couchbase/collect_info/orange/2_0_0/201209/12nodes-col-info-1728-reset-reb-status-since-not-running-20120917.tgz

          Set ERL_FULLSWEEP_AFTER to 10 and have not hit this bug yet. I will try more to confirm it.
          Second try: swap rebalance adding nodes 22 and 23 and removing nodes 14 and 15. Rebalance did not hit this bug.

          farshid Farshid Ghods (Inactive) made changes -
          Fix Version/s 2.0 [ 10114 ]
          Fix Version/s 2.0-beta [ 10113 ]
          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Thanks I'll take a look asap

          farshid Farshid Ghods (Inactive) made changes -
          Priority Major [ 3 ] Blocker [ 1 ]
          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          I looked at this and was unable to see anything specific that would cause that weird slowness.

          There are no signs of swapping, but we hit those timeouts.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          As noted above, I looked at the latest diags and could not spot anything specifically wrong in ns_server.

          There's still some suspicion about the environment, as well as a potential effect of CPU cycles spent in NIFs not being properly accounted for.

          Please note that I do enjoy investigating tough cases like this. But:

          *) apparently our system test use case has changed and I have no idea if this is still relevant

          *) we may have other folks better equipped, and maybe with more time than myself, to work on this.

          So I need an explicit "go ahead" in order to continue working on this.

          alkondratenko Aleksey Kondratenko (Inactive) made changes -
          Assignee Aleksey Kondratenko [ alkondratenko ] Peter Wansch [ peter ]
          peter peter added a comment -

          Farshid, can we hand this one to Mike to look at it from an ep_engine perspective?

          peter peter made changes -
          Assignee Peter Wansch [ peter ] Farshid Ghods [ farshid ]
          farshid Farshid Ghods (Inactive) made changes -
          Assignee Farshid Ghods [ farshid ] Mike Wiederhold [ mikew ]
          farshid Farshid Ghods (Inactive) added a comment -

          Mike,

          can you please sync up with the system test team and look at the timeout issues from an ep-engine perspective?

          mikew Mike Wiederhold added a comment -

          There is nothing in this issue that looks suspicious from the ep-engine side. If it is assigned back to the ep-engine team, then please include the memcached logs. Karan has asked that this bug be assigned back to him for a follow-up.

          mikew Mike Wiederhold made changes -
          Assignee Mike Wiederhold [ mikew ] Karan Kumar [ karan ]
          farshid Farshid Ghods (Inactive) made changes -
          Summary [longevity] rebalance fails with error "bulk_set_vbucket_state_failed" due to timeout in tap_replication_manager for default bucket [longevity] erlang garbage collection causes huge time outs in erlang vm and causes rebalance failures , ns_timeouts (rebalance fails with error "bulk_set_vbucket_state_failed" due to timeout in tap_replication_manager)
          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Forgot to ask when I saw this. What's the evidence that it's caused by garbage collection? The ticket clearly doesn't have any so far.
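
          One way to gather that kind of evidence, sketched here against the same assumed host and credentials as the earlier /diag/eval commands, would be to sample the VM-wide GC counters on a node before and after a timeout window and compare:

          # Sketch: dump cumulative GC activity (number of GCs, words reclaimed) on
          # one node; host and credentials are assumptions carried over from above.
          wget -O- --user=Administrator --password=asdasd \
            --post-data="erlang:statistics(garbage_collection)" \
            http://lh:9000/diag/eval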

          farshid Farshid Ghods (Inactive) added a comment -

          Karan is trying 10 and 10,000.
          The performance team has tried 16.

          farshid Farshid Ghods (Inactive) added a comment -

          Pavel,

          can you update the ticket with the results you have from the view perf tests where erl_sweep is set to 5, 10, 15 or other values?

          steve Steve Yen added a comment - - edited

          Some results here...

          https://docs.google.com/spreadsheet/ccc?key=0AgLUessE73UXdGd2NmhLOThCRUc0ekFkZ0FDeGdLRXc#gid=0

          Build 1782, aggressive workload

          ERL_FULLSWEEP_AFTER   Set latency (90th)   Get latency (90th)   Query latency (80th)   Query throughput   Avg. mem beam.smp (MB)   Avg. drain rate   Runtime (h)
          DEFAULT (64K)         1.39                 1.7                  17.35                  1152               325.73                   700.35            3.25
          5                     1.05                 1.38                 20.66                  984                232.05                   645.96            4.77
          10                    1.66                 1.18                 20.27                  1057.9             249.89                   668.04            4.37
          15                    1.64                 1.38                 20.07                  981.68             248.67                   746.57            4.87

          Build 1842, aggressive workload

          ERL_FULLSWEEP_AFTER   Query latency (80th)   Query throughput   Avg. mem beam.smp (MB)   Runtime (h)
          0                     32.82                  1017.95            151.41                   5.55
          512                   19.51                  1126.27            291.11                   4.04
          1K                    19.23                  1070.77            304.27                   4.43
          8K                    23.94                  1029.81            319.32                   5.37
          16K                   20.64                  1094.99            311.01                   4.45
          32K                   26.69                  1061.74            320.79                   4.2

          Build 1724, lite workload

          ERL_FULLSWEEP_AFTER   Query latency (80th)   Query throughput   Avg. mem beam.smp (MB)   Runtime (h)
          0                     10.95                  464.07             169.62                   5.04
          64                    9.89                   531.2              289.73                   3.93
          256                   10.31                  513.05             307.09                   4.19
          DEFAULT (64K)         10.26                  527.62             325.37                   3.78
          

          farshid Farshid Ghods (Inactive) added a comment -

          Karan,
          based on the performance test results, it seems like 512 is the best candidate that does not impact view latency and query throughput.

          Has the system test team run any test with 512?
          If we can use this value for different workloads and it passes system tests, we can assign this ticket to Alk and suggest the right value.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          It appears people are confusing timeouts (the software doesn't work) and performance numbers (how fast the software works).

          The only effect of fullsweep is supposedly full collections all the time. Coupled with Erlang's supposedly 'soft-realtime' design, where GC is run independently and separately on each Erlang process heap, that IMHO doesn't explain timeouts. At all.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          BTW, the spreadsheet linked above doesn't have definitions of its columns. I was unable to read people's minds on the meaning of the first column in particular.

          karan Karan Kumar (Inactive) added a comment -

          The first column is the value of ERL_FULLSWEEP_AFTER.

          steve Steve Yen added a comment - - edited

          Alk says there are 2 issues...

          • a better full-sweep setting (e.g., the "512" from the spreadsheet – I'll create a separate JIRA issue for that)
          • other mysterious Erlang timeouts that need root-cause understanding/analysis (Alk wants to own this, so assigning to him)

          A better full-sweep setting can help, but it does not fully solve the root cause.

          steve Steve Yen made changes -
          Assignee Karan Kumar [ karan ] Aleksey Kondratenko [ alkondratenko ]
          steve Steve Yen added a comment -

          see issue on MB-6974 for the full-sweep 512 setting

          steve Steve Yen added a comment -

          with MB-6974 (full-sweep 512 setting) change, is this bug (MB-6595) no longer blocker level priority?

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Updated the title to reflect reality. We have no evidence it's caused by GC.

          Steve, answering your question: we had those weird and maybe GC-related timeouts before. Now we also have a danger of timeouts caused by a lack of async threads. We need to get on top of this. I think this is blocker #1, personally.

          And BTW, I just hit an ns_config:get timeout on pretty harmless indexing of 9E6 simple items.
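
          The async-thread-pool size mentioned above can be inspected the same way; a sketch, again assuming the host and credentials used in the earlier commands:

          # Sketch: report the size of the VM's async thread pool. With 0 async
          # threads, driver file I/O runs on scheduler threads and can block them.
          # Host and credentials are assumptions carried over from above.
          wget -O- --user=Administrator --password=asdasd \
            --post-data="erlang:system_info(thread_pool_size)" \
            http://lh:9000/diag/eval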

          alkondratenko Aleksey Kondratenko (Inactive) made changes -
          Summary [longevity] erlang garbage collection causes huge time outs in erlang vm and causes rebalance failures , ns_timeouts (rebalance fails with error "bulk_set_vbucket_state_failed" due to timeout in tap_replication_manager) [longevity] something unknown is causing severe timeouts in ns_server. Particularly under views building and/or compaction. Which causes rebalance to fail and other types of badness.
          farshid Farshid Ghods (Inactive) added a comment -

          changing the priority to critical because with the new erlang gc setting we do not have reports of timeouts from system tests and performance tests.

          farshid Farshid Ghods (Inactive) made changes -
          Labels 2.0-beta-release-notes sblocker system-test 2.0-beta-release-notes system-test
          Priority Blocker [ 1 ] Critical [ 2 ]
          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Farshid, yet. You don't have reports, yet.

          thuan Thuan Nguyen added a comment -

          Tested with build 2.0.0-1862 with the Erlang GC setting at 10 and consistent views enabled; I don't see any timeouts during rebalance.

          farshid Farshid Ghods (Inactive) added a comment -

          Tony,

          please confirm that after running a 2.0 cluster on builds which have the MB-7002 fix (build 1888+), you don't see timeouts and rebalance failures.

          Also, please do not change the default value of erlang_sweep to 10 anymore, because those builds have the value set to 512 already.

          farshid Farshid Ghods (Inactive) made changes -
          Assignee Aleksey Kondratenko [ alkondratenko ] Thuan Nguyen [ thuan ]
          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Also, stop tuning the kernel's swappiness as well.

          Also, let me add that we decided QE's normal rebalance tests would be run on physical hardware with a durable disk setup. That would be a better test, IMHO.

          farshid Farshid Ghods (Inactive) added a comment -

          QE will run one of the system tests on a physical environment multiple times, but most tests will still be run on EC2 or on virtualized Xen or VMware environments.

          Unfortunately we don't have enough hardware to run all tests in a physical environment.

          Functional tests are run in a virtual environment. The statement applies only to system tests.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          I'm fine with that. But I just wanted to note that this bug is not only for system tests anymore. It's for all kinds of unknown timeouts happening, because we have no idea what's causing them. Hopefully it is indeed GC.

          farshid Farshid Ghods (Inactive) added a comment -

          Yes, that's a valid point. We don't see these timeouts happening at all in functional testing.

          Also, I heard that if the swappiness value is not set to 10%, there is a higher chance that Erlang uses some swap; we saw strange failures in XDCR and other system testing on EC2, and hence it is now set back to 10%.

          If you think this is not necessary after the garbage collection fix, I will let the system test team know.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          On swappiness: we should use whatever our customers will use. So if we are certain to recommend, or even demand, that our users decrease swappiness, I'm 100% fine.

          I just want to make sure we test with the same settings that people will use.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          I'm seeing occasional timeouts in the logs of system tests.

          Plus I just hit this again when trying to benchmark rebalance time with views.

          alkondratenko Aleksey Kondratenko (Inactive) made changes -
          Attachment ns-diag-20121031094231.txt.xz [ 15646 ]
          steve Steve Yen added a comment -

          Aliaksey A. is working on systemtap instrumentation.

          steve Steve Yen added a comment -

          Farshid reports not seeing timeouts from user viewpoint in system tests.

          Assigning to Alk per bug-scrub mtg.

          steve Steve Yen made changes -
          Assignee Thuan Nguyen [ thuan ] Aleksey Kondratenko [ alkondratenko ]
          steve Steve Yen made changes -
          Fix Version/s 2.0.1 [ 10399 ]
          Fix Version/s 2.0 [ 10114 ]
          dipti Dipti Borkar added a comment -

          This master bug has other similar bugs marked as sub-tasks.

          farshid Farshid Ghods (Inactive) made changes -
          Link This issue blocks MB-7234 [ MB-7234 ]
          farshid Farshid Ghods (Inactive) made changes -
          Link This issue blocks MB-7261 [ MB-7261 ]
          farshid Farshid Ghods (Inactive) made changes -
          Priority Critical [ 2 ] Blocker [ 1 ]
          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Aliaksey is looking at this

          alkondratenko Aleksey Kondratenko (Inactive) made changes -
          Assignee Aleksey Kondratenko [ alkondratenko ] Aliaksey Artamonau [ aliaksey artamonau ]
          farshid Farshid Ghods (Inactive) added a comment -

          Related fixes were merged and build 138-rel+ has those fixes. The system test team will update the ticket if those issues are discovered again.

          Aliaksey, is it okay to assign this to QE?

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          We're fine with this being assigned to QE. But given there's still some ongoing investigation of current problems, I'm not sure bouncing this ticket is the best idea.

          We hope our remaining issues are either Windows-specific (where none of our fixes apply) or down to our virtualization environment.

          And I hope the ongoing investigation will confirm that.

          farshid Farshid Ghods (Inactive) added a comment -

          We have not seen these timeouts on the Linux SSD Xen virtualized cluster.

          We did, however, see these behaviors significantly on the Windows VMware virtualized cluster, but I recommend resolving this ticket as it was opened for Linux clusters.

          jin Jin Lim (Inactive) added a comment -

          Trond and Sri will be working on many optimizations and enhancements that make Couchbase Server overall run better on Windows. As QE (Farshid) pointed out, we should add this to the list of things to enhance/address on Windows and move on at this point.

          Aliaksey - please advise if it is OK to resolve this issue and add it to the Windows enhancement bin. Also, this bug is flagged as a release note candidate. Please advise whether the timeout behavior on Windows should be described, and whether we recommend users decrease swappiness (and by how much?) in the release note. Thanks.

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Indeed, let's create a new ticket for the Windows work.

          Aliaksey figured out that the tuning option that helped decrease minor page faults on GNU/Linux does not apply to Windows, as the VM apparently does not use an equivalent of mmap to allocate memory on Windows.

          alkondratenko Aleksey Kondratenko (Inactive) made changes -
          Status Open [ 1 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          farshid Farshid Ghods (Inactive) added a comment -

          http://www.couchbase.com/issues/browse/MB-7658 is created already for addressing issues on windows

          farshid Farshid Ghods (Inactive) added a comment -

          QE has not seen this issue on Linux clusters on any of the 2.0.1 builds (135...148).

          farshid Farshid Ghods (Inactive) made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          jin Jin Lim (Inactive) added a comment -

          Alk, thanks for the quick update. Do you concur with the idea of us putting a suggestion for decreasing swappiness in the release note? If so, by how much do we recommend users decrease the setting?

          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          AFAIK all latest testing was done with default swappiness. I cannot recommend any particular value.

          farshid Farshid Ghods (Inactive) added a comment -

          Per the system test spec and other experiments, the swappiness is set to 10%.

          As observed in the system test graphs, swap usage growth is unbounded (Ketaki to attach some graphs) during the query and indexing phase, so reducing swappiness to 10% instead of the default of 60% has helped before.
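
          For completeness, the swappiness tuning referred to throughout this ticket is the standard Linux sysctl knob; a minimal sketch of checking and applying the 10% value at runtime (persisting it in /etc/sysctl.conf is a separate step):

          # Sketch: inspect and lower vm.swappiness as described above.
          cat /proc/sys/vm/swappiness        # default is typically 60
          sudo sysctl -w vm.swappiness=10    # value used in the system-test spec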

          ketaki Ketaki Gangal added a comment - Link to system-test results here (report_atop, page 5/6 for swap used): https://github.com/couchbaselabs/couchbase-qe-docs/blob/master/system-tests/linux-views-ssd/2013_24_01/report_atop_10.6.2.42_default_simple-view_test.pdf Link to system test specs for the test runs here: http://hub.internal.couchbase.com/confluence/display/QA/views-test
          ketaki Ketaki Gangal made changes -
          Resolution Fixed [ 1 ]
          Status Closed [ 6 ] Reopened [ 4 ]
          ketaki Ketaki Gangal added a comment -

          swap results file attached.

          ketaki Ketaki Gangal made changes -
          ketaki Ketaki Gangal made changes -
          Status Reopened [ 4 ] Closed [ 6 ]
          Resolution Fixed [ 1 ]
          jin Jin Lim (Inactive) added a comment -

          Assigning it to Karen quickly so she has time to capture what needs to go into the release note. Karen, please review the above comments about swappiness. Thanks!

          jin Jin Lim (Inactive) added a comment -

          Please close this ticket once you have collected all the information required for the release note. Thanks much!

          jin Jin Lim (Inactive) made changes -
          Resolution Fixed [ 1 ]
          Status Closed [ 6 ] Reopened [ 4 ]
          Assignee Aliaksey Artamonau [ aliaksey artamonau ] Karen Zeller [ kzeller ]
          kzeller kzeller made changes -
          Summary [longevity] something unknown is causing severe timeouts in ns_server. Particularly under views building and/or compaction. Which causes rebalance to fail and other types of badness. [RN 2.0.1]][longevity] something unknown is causing severe timeouts in ns_server. Particularly under views building and/or compaction. Which causes rebalance to fail and other types of badness.
          Labels 2.0-beta-release-notes system-test 2.0-beta-release-notes 2.0.1-release-notes system-test
          Flagged Release Note [ 10010 ]
          farshid Farshid Ghods (Inactive) made changes -
          Component/s documentation [ 10012 ]
          Component/s ns_server [ 10019 ]
          kzeller kzeller added a comment -

          RN: The server experienced severe timeouts during
          rebalance if views were being indexed or compacted at the same time.
          This caused the rebalance to fail. This has been fixed.

          kzeller kzeller made changes -
          Status Reopened [ 4 ] Resolved [ 5 ]
          Resolution Fixed [ 1 ]
          kzeller kzeller added a comment -

          RN: The server experienced severe timeouts during
          rebalance if views were being indexed or compacted at the same time.
          This caused the rebalance to fail. This has been fixed.

          Note from Alk: the guidance on swappiness % should not be part of the note.

          kzeller kzeller made changes -
          Status Resolved [ 5 ] Closed [ 6 ]
          alkondratenko Aleksey Kondratenko (Inactive) added a comment -

          Correction: this was fixed on GNU/Linux, but not on Windows.

          mikew Mike Wiederhold made changes -
          Planned End 2013-03-01 12:00 (generated: set to end of MB-7111)
          mikew Mike Wiederhold made changes -
          Planned End 2013-03-01 12:00 2013-03-04 12:00 (generated: set to end of MB-7111)
          maria Maria McDuff (Inactive) made changes -
          Planned End 2013-03-04 12:00 2013-03-05 12:00 (generated: set to end of MB-7111)
          wayne Wayne Siu made changes -
          Component/s documentation-don't-use-put-in-doc-project [ 10012 ]

            People

            • Assignee: kzeller
            • Reporter: Thuan Nguyen
            • Votes: 0
            • Watchers: 8
