Couchbase Server / MB-6494

[longevity] Rebalance fails with error "Got error while trying to send close confirmation: {error,enotconn}" during rebalance when adding a failed-over node back to the cluster and failing over another node

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Cannot Reproduce
    • Affects Version/s: 2.0-beta
    • Fix Version/s: 2.0.1
    • Component/s: ns_server
    • Security Level: Public
    • Labels:
      None
    • Environment:
      centos 6.2 64 bit

      Description

      Cluster information:

      • 11 CentOS 6.2 64-bit servers, each with a 4-core CPU
      • Each server has 10 GB RAM and a 150 GB disk.
      • 8 GB RAM is allocated to Couchbase Server on each node.
      • Each server has its own drive; no disk is shared with other servers.
      • The cluster has 2 buckets, default (3 GB) and saslbucket (3 GB).
      • Each bucket has one design doc with 2 views per design doc.
      • Loader: 10.3.2.4
      • Build 2.0.0-1645 on all 11 nodes
      • Created a cluster with 10 nodes:
        10.3.121.13
        10.3.121.14
        10.3.121.15
        10.3.121.16
        10.3.121.17
        10.3.121.20
        10.3.121.22
        10.3.121.24
        10.3.121.25
        10.3.121.23
      • Data path /data
      • View path /data
      • Test activities: failover, rebalance, and swap rebalance in and out.
      • The latest steps, which hit this bug, were:
      • Fail over node 14.
      • Rebalance and then stop the rebalance, so node 14 was kicked out of the cluster.
      • Add node 14 back to the cluster and fail over node 13.
      • Rebalance. The rebalance failed, and during the rebalance the following error pointing to node 16 appeared:

      "Got error while trying to send close confirmation:

      {error,enotconn} 13:56:20 - Thu Aug 30, 2012 " on node 16.

      Analyzing the diags of node 16 shows this error at 2012-08-30T13:56:20:

      [ns_server:debug,2012-08-30T13:56:20.614,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:kill_tapname:932]killing tap named: replication_building_30_'ns_1@10.3.121.16'
      [rebalance:info,2012-08-30T13:56:20.620,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:init:522]Starting tap stream:
      [{vbuckets,[]},
      {checkpoints,[]},
      {name,<<"replication_building_30_'ns_1@10.3.121.16'">>},
      {takeover,false}]
      {{"10.3.121.25",11209},
      {"10.3.121.16",11209},
      [{vbuckets,[30]},
      {takeover,false},
      {suffix,"building_30_'ns_1@10.3.121.16'"},
      {username,"saslbucket"},
      {password,"password"}]}
      [rebalance:debug,2012-08-30T13:56:20.623,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:init:555]upstream_sender pid: <0.13244.35>
      [rebalance:info,2012-08-30T13:56:20.665,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:process_upstream:880]Initial stream for vbucket 18
      [rebalance:info,2012-08-30T13:56:20.666,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:process_upstream:880]Initial stream for vbucket 24
      [rebalance:info,2012-08-30T13:56:20.666,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:process_upstream:880]Initial stream for vbucket 25
      [rebalance:error,2012-08-30T13:56:20.666,ns_1@10.3.121.16:<0.13236.35>:ebucketmigrator_srv:confirm_sent_messages:674]Got error while trying to send close confirmation: {error,enotconn}

      [views:info,2012-08-30T13:56:20.852,ns_1@10.3.121.16:'capi_ddoc_replication_srv-saslbucket':capi_set_view_manager:apply_index_states:351]
      Calling couch_set_view:set_partition_states([<<"saslbucket">>,
      <<"_design/d11">>,
      [12,13,14,15,16,17,19,20,21,22,
      23,24,25,26,36,37,38,39,40,41,
      42,69,70,71,72,73,84,85,86,87,
      92,93,94,103,104,105,106,107,
      108,109,110,111,112,113,114,142,
      143,150,151,152,153,154,155,156,
      178,179,180,181,182,196,197,218,
      219,220,221,222,223,224,225,226,
      227,228,229,230,231,232,233,234,
      235,236,237,242,243,244,245,246,
      247,248,249,250,251,252,253,265,
      266,267,268,288,289,290,291,309,
      310,311,312,313,314,315,321,322,
      323,333,334,335,336,337,338,339,
      340,341,342,343,344,347,348,349,
      350,351,352,353,354,355,362,363,
      364,365,366,367,379,380,381,382,
      383,384,385,395,396,406,407,408,
      409,410,411,446,447,448,449,450,
      451,452,453,503,504,505,506,507,
      508,509,510,511,517,518,519,520,
      521,522,523,524,603,604,613,614,
      615,632,633,634,635,636,637,695,
      696,748,749,750,811,812,813,843,
      844,845,846,847,848,854,876,916,
      917,918,919,920,927,928,929,930,
      931,932,939,940,945,946,947,948,
      949,950,951,965,966,978,979,
      1004,1005,1012,1013,1014,1015,
      1016,1017,1018,1019,1020,1021,
      1022,1023],
      [],
      [0,1,2,3,4,5,6,7,8,9,10,11,18,27,
      28,29,30,31,32,33,34,35,43,44,
      45,46,47,48,49,50,51,52,53,54,
      55,56,57,58,59,60,61,62,63,64,
      65,66,67,68,74,75,76,77,78,79,
      80,81,82,83,88,89,90,91,95,96,
      97,98,99,100,101,102,115,116,
      117,118,119,120,121,122,123,124,
      125,126,127,128,129,130,131,132,
      133,134,135,136,137,138,139,140,
      141,144,145,146,147,148,149,157,
      158,159,160,161,162,163,164,165,
      166,167,168,169,170,171,172,173,
      174,175,176,177,183,184,185,186,
      187,188,189,190,191,192,193,194,
      195,198,199,200,201,202,203,204,
      205,206,207,208,209,210,211,212,
      213,214,215,216,217,238,239,240,
      241,254,255,256,257,258,259,260,
      261,262,263,264,269,270,271,272,
      273,274,275,276,277,278,279,280,
      281,282,283,284,285,286,287,292,
      293,294,295,296,297,298,299,300,
      301,302,303,304,305,306,307,308,
      316,317,318,319,320,324,325,326,
      327,328,329,330,331,332,345,346,
      356,357,358,359,360,361,368,369,
      370,371,372,373,374,375,376,377,
      378,386,387,388,389,390,391,392,
      393,394,397,398,399,400,401,402,
      403,404,405,412,413,414,415,416,
      417,418,419,420,421,422,423,424,
      425,426,427,428,429,430,431,432,
      433,434,435,436,437,438,439,440,
      441,442,443,444,445,454,455,456,
      457,458,459,460,461,462,463,464,
      465,466,467,468,469,470,471,472,
      473,474,475,476,477,478,479,480,
      481,482,483,484,485,486,487,488,
      489,490,491,492,493,494,495,496,
      497,498,499,500,501,502,512,513,
      514,515,516,525,526,527,528,529,
      530,531,532,533,534,535,536,537,
      538,539,540,541,542,543,544,545,
      546,547,548,549,550,551,552,553,
      554,555,556,557,558,559,560,561,
      562,563,564,565,566,567,568,569,
      570,571,572,573,574,575,576,577,
      578,579,580,581,582,583,584,585,
      586,587,588,589,590,591,592,593,
      594,595,596,597,598,599,600,601,
      602,605,606,607,608,609,610,611,
      612,616,617,618,619,620,621,622,
      623,624,625,626,627,628,629,630,
      631,638,639,640,641,642,643,644,
      645,646,647,648,649,650,651,652,
      653,654,655,656,657,658,659,660,
      661,662,663,664,665,666,667,668,
      669,670,671,672,673,674,675,676,
      677,678,679,680,681,682,683,684,
      685,686,687,688,689,690,691,692,
      693,694,697,698,699,700,701,702,
      703,704,705,706,707,708,709,710,
      711,712,713,714,715,716,717,718,
      719,720,721,722,723,724,725,726,
      727,728,729,730,731,732,733,734,
      735,736,737,738,739,740,741,742,
      743,744,745,746,747,751,752,753,
      754,755,756,757,758,759,760,761,
      762,763,764,765,766,767,768,769,
      770,771,772,773,774,775,776,777,
      778,779,780,781,782,783,784,785,
      786,787,788,789,790,791,792,793,
      794,795,796,797,798,799,800,801,
      802,803,804,805,806,807,808,809,
      810,814,815,816,817,818,819,820,
      821,822,823,824,825,826,827,828,
      829,830,831,832,833,834,835,836,
      837,838,839,840,841,842,849,850,
      851,852,853,855,856,857,858,859,
      860,861,862,863,864,865,866,867,
      868,869,870,871,872,873,874,875,
      877,878,879,880,881,882,883,884,
      885,886,887,888,889,890,891,892,
      893,894,895,896,897,898,899,900,
      901,902,903,904,905,906,907,908,
      909,910,911,912,913,914,915,921,
      922,923,924,925,926,933,934,935,
      936,937,938,941,942,943,944,952,
      953,954,955,956,957,958,959,960,
      961,962,963,964,967,968,969,970,
      971,972,973,974,975,976,977,980,
      981,982,983,984,985,986,987,988,
      989,990,991,992,993,994,995,996,
      997,998,999,1000,1001,1002,1003,
      1006,1007,1008,1009,1010,1011]])
      [ns_server:info,2012-08-30T13:56:20.867,ns_1@10.3.121.16:ns_port_memcached:ns_port_server:log:169]memcached<0.674.0>: Thu Aug 30 20:56:20.666298 3: TAP (Consumer) eq_tapq:anon_56 - Failed to reset a vbucket 18. Force disconnect
      memcached<0.674.0>: Thu Aug 30 20:56:20.666374 3: TAP (Consumer) eq_tapq:anon_56 - disconnected
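      The memcached lines above show the TAP consumer on node 16 failing to reset vbucket 18 and force-disconnecting at what appears to be the same instant as the ebucketmigrator's enotconn error, which suggests both log lines describe the same torn-down TAP connection. A minimal sketch of that consumer-side control flow, with names and types that are illustrative assumptions rather than the actual ep-engine code:

      #include <cstdint>
      #include <cstdio>

      enum class VBucketState { Active, Replica, Pending, Dead };

      // Stand-in for the engine call that deletes and recreates a vbucket in a
      // non-active state so it can accept a fresh backfill; assumed to return
      // false when the reset cannot be performed.
      static bool resetVBucket(uint16_t vbid, VBucketState newState) {
          (void)vbid; (void)newState;
          return false;   // simulate the failure seen in the log
      }

      struct TapConsumerConn {
          const char *name;
          bool connected;

          void forceDisconnect() {
              connected = false;
              std::fprintf(stderr, "TAP (Consumer) %s - disconnected\n", name);
          }
      };

      // On an initial_vbucket_stream opaque the consumer must reset the target
      // vbucket before accepting backfill data; if that fails it force-disconnects,
      // and the peer that later writes a close confirmation on the dead connection
      // sees a socket error such as enotconn.
      static void onInitialVBucketStream(TapConsumerConn &conn, uint16_t vbid) {
          if (!resetVBucket(vbid, VBucketState::Pending)) {
              std::fprintf(stderr,
                           "TAP (Consumer) %s - Failed to reset a vbucket %u. "
                           "Force disconnect\n", conn.name, (unsigned)vbid);
              conn.forceDisconnect();
          }
      }

      int main() {
          TapConsumerConn conn{"eq_tapq:anon_56", true};
          onInitialVBucketStream(conn, 18);
          return 0;
      }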

      Link to diags of all nodes: https://s3.amazonaws.com/packages.couchbase/diag-logs/large_cluster_2_0/11ndoes-1645-reb-failed-close-confirmation-error-enotconn-20120830.tgz

      Link to atop files of all nodes: https://s3.amazonaws.com/packages.couchbase/atop-files/2.0.0/atop-11ndoes-1645-reb-failed-close-confirmation-error-enotconn-20120830.tgz


        Activity

        chiyoung Chiyoung Seo added a comment -

        We spent a lot of time debugging this issue but were not able to reproduce it. We added more logs to ep-engine. Let's open a new bug when we hit this issue again.

        thuan Thuan Nguyen added a comment -

        Integrated in github-ep-engine-2-0 #454 (See http://qa.hq.northscale.net/job/github-ep-engine-2-0/454/)
        MB-6494 Add more logs to vbucket reset and deletion functions (Revision 96d50b2038777e25f671475ef8fad76ba3a76683)

        Result = SUCCESS
        Chiyoung Seo :
        Files :

        • src/ep_engine.cc
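        The commit above only adds diagnostics to the vbucket reset and deletion paths. A rough sketch of that kind of instrumentation, where the LOG helper and function names are illustrative assumptions and not the contents of the actual src/ep_engine.cc revision:

        #include <cstdarg>
        #include <cstdint>
        #include <cstdio>

        // Minimal stand-in for ep-engine's logger.
        static void LOG(const char *fmt, ...) {
            va_list ap;
            va_start(ap, fmt);
            std::vfprintf(stderr, fmt, ap);
            va_end(ap);
        }

        // Hypothetical reset/deletion paths instrumented so that a stats dump
        // showing open_checkpoint_id == 0 on an active vbucket can be traced
        // back to whichever code path reset or deleted it.
        static void resetVBucketWithLogging(uint16_t vbid) {
            LOG("Resetting vbucket %u (checkpoint ids restart from 0)\n", (unsigned)vbid);
            // ... the actual reset work would go here ...
        }

        static void deleteVBucketWithLogging(uint16_t vbid) {
            LOG("Deleting vbucket %u\n", (unsigned)vbid);
            // ... the actual deletion work would go here ...
        }

        int main() {
            resetVBucketWithLogging(30);
            deleteVBucketWithLogging(30);
            return 0;
        }
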
        chiyoung Chiyoung Seo added a comment -

        This issue looks very strange to me. When the rebalance was stopped by the user, the current master should still hold the active vbucket 30 with a non-zero open checkpoint id. The open checkpoint id is reset to zero if and only if the vbucket gets reset by receiving a backfill stream from the master, or when it is deleted and recreated in a non-active state.

        I will add more logs to ep-engine, but I don't see anything suspicious in ep-engine at this time. We will see if this issue happens again in the 2.0 full regressions. I'm fine with moving this to post-2.0, given that it is not easily reproducible.
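        To make the invariant above concrete: an active vbucket should never report an open checkpoint id of zero, so the stats quoted in Aliaksey's comment below (vb_30 in state "active" with open_checkpoint_id 0) are exactly the anomalous combination. A small self-contained check, using assumed field names purely for illustration:

        #include <cassert>
        #include <cstdint>

        enum class VBucketState { Active, Replica, Pending, Dead };

        // Assumed shape of the per-vbucket checkpoint stats, for illustration only.
        struct VBucketCheckpointStats {
            VBucketState state;
            uint64_t openCheckpointId;
        };

        // True for the combination reported on node 25: the vbucket is still
        // active, yet its open checkpoint id has been reset to zero, which should
        // only happen after a backfill-driven reset or a delete/recreate into a
        // non-active state.
        static bool violatesCheckpointInvariant(const VBucketCheckpointStats &vb) {
            return vb.state == VBucketState::Active && vb.openCheckpointId == 0;
        }

        int main() {
            // vb_30 as captured right after the rebalance was stopped.
            VBucketCheckpointStats vb30{VBucketState::Active, 0};
            assert(violatesCheckpointInvariant(vb30));
            return 0;
        }
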

        farshid Farshid Ghods (Inactive) added a comment -

        Chiyoung,

        Do you require more information from QE to proceed? If so, please assign the bug back to Tony.

        Also, this issue is not easily reproduced in system testing, so I'm not sure Tony can reproduce it and collect the new logs. In that case, would you be okay with deferring this bug for now?

        Aliaksey Artamonau Aliaksey Artamonau added a comment -

        Here are my observations. Node 23 is the master node of the cluster. At some
        point during the rebalance it decides to build replicas of vbucket 30 on nodes
        13, 16 and 22. The current owner of this vbucket is node 25:

        [rebalance:debug,2012-08-29T19:11:20.428,ns_1@10.3.121.23:<0.31805.466>:ns_single_vbucket_mover:mover_inner_old_style:190]child replicas builder for vbucket 30 is <0.31941.466>
        [ns_server:debug,2012-08-29T19:11:20.436,ns_1@10.3.121.23:<0.31941.466>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 30 into 'ns_1@10.3.121.13' is <15197.23780.3>
        [ns_server:debug,2012-08-29T19:11:20.447,ns_1@10.3.121.23:<0.31941.466>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 30 into 'ns_1@10.3.121.16' is <15202.29015.2>
        [ns_server:debug,2012-08-29T19:11:20.453,ns_1@10.3.121.23:<0.31941.466>:ns_replicas_builder_utils:spawn_replica_builder:86]Replica building ebucketmigrator for vbucket 30 into 'ns_1@10.3.121.22' is <15205.9270.2>

        The destination nodes start ebucketmigrators for replica building. This means
        that at this particular moment node 25's vbucket 30 still has a non-zero
        checkpoint id:

        [ns_server:debug,2012-08-29T19:11:20.487,ns_1@10.3.121.13:<0.23780.3>:ebucketmigrator_srv:kill_tapname:932]killing tap named: replication_building_30_'ns_1@10.3.121.13'
        [rebalance:info,2012-08-29T19:11:20.507,ns_1@10.3.121.13:<0.23780.3>:ebucketmigrator_srv:init:522]Starting tap stream:
        [{vbuckets,[30]},
        {checkpoints,[{30,0}]},
        {name,<<"replication_building_30_'ns_1@10.3.121.13'">>},
        {takeover,false}]
        {{"10.3.121.25",11209},
        {"10.3.121.13",11209},
        [{vbuckets,[30]},
        {takeover,false},
        {suffix,"building_30_'ns_1@10.3.121.13'"},
        {username,"saslbucket"},
        {password,"password"}]}
        [rebalance:debug,2012-08-29T19:11:20.528,ns_1@10.3.121.13:<0.23780.3>:ebucketmigrator_srv:init:555]upstream_sender pid: <0.23783.3>
        [rebalance:info,2012-08-29T19:11:20.543,ns_1@10.3.121.13:<0.23780.3>:ebucketmigrator_srv:process_upstream:880]Initial stream for vbucket 30

        On node 25 we see that memcached accordingly schedules backfills to 13, 16 and 22:

        memcached<0.654.0>: Thu Aug 30 02:11:20.515187 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Schedule the backfill for vbucket 30
        memcached<0.654.0>: Thu Aug 30 02:11:20.515253 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0
        memcached<0.654.0>: Thu Aug 30 02:11:20.515279 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0
        memcached<0.654.0>: Thu Aug 30 02:11:20.515289 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 30
        memcached<0.654.0>: Thu Aug 30 02:11:20.545809 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Schedule the backfill for vbucket 30
        memcached<0.654.0>: Thu Aug 30 02:11:20.545856 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0
        memcached<0.654.0>: Thu Aug 30 02:11:20.545872 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0
        memcached<0.654.0>: Thu Aug 30 02:11:20.545882 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 30
        memcached<0.654.0>: Thu Aug 30 02:11:20.551288 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - Schedule the backfill for vbucket 30
        memcached<0.654.0>: Thu Aug 30 02:11:20.551456 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - Sending TAP_OPAQUE with command "opaque_enable_auto_nack" and vbucket 0
        memcached<0.654.0>: Thu Aug 30 02:11:20.551477 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - Sending TAP_OPAQUE with command "enable_checkpoint_sync" and vbucket 0
        memcached<0.654.0>: Thu Aug 30 02:11:20.551487 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - Sending TAP_OPAQUE with command "initial_vbucket_stream" and vbucket 30
        memcached<0.654.0>: Thu Aug 30 02:11:20.559238 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Suspend for 5.00 secs

        And sooner or later those backfills complete:

        [ns_server:info,2012-08-29T19:11:23.286,ns_1@10.3.121.25:ns_port_memcached:ns_port_server:log:169]memcached<0.654.0>: Thu Aug 30 02:11:23.086272 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Backfill is completed with VBuckets 30,
        memcached<0.654.0>: Thu Aug 30 02:11:23.086351 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.16' - Sending TAP_OPAQUE with command "close_backfill" and vbucket 30
        [ns_server:info,2012-08-29T19:11:24.864,ns_1@10.3.121.25:ns_port_memcached:ns_port_server:log:169]memcached<0.654.0>: Thu Aug 30 02:11:24.663901 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - Backfill is completed with VBuckets 30,
        memcached<0.654.0>: Thu Aug 30 02:11:24.670059 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.22' - Sending TAP_OPAQUE with command "close_backfill" and vbucket 30
        [ns_server:info,2012-08-29T19:11:25.887,ns_1@10.3.121.25:ns_port_memcached:ns_port_server:log:169]memcached<0.654.0>: Thu Aug 30 02:11:25.686315 3: TAP (Producer) eq_tapq:replication_ns_1@10.3.121.13 - Suspend for 5.00 secs

        [ns_server:info,2012-08-29T19:11:53.989,ns_1@10.3.121.25:ns_port_memcached:ns_port_server:log:169]memcached<0.654.0>: Thu Aug 30 02:11:53.786917 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Backfill is completed with VBuckets 30,
        memcached<0.654.0>: Thu Aug 30 02:11:53.787071 3: TAP (Producer) eq_tapq:replication_building_30_'ns_1@10.3.121.13' - Sending TAP_OPAQUE with command "close_backfill" and vbucket 30

        Then the rebalance gets stopped by the user:

        [ns_server:info,2012-08-29T19:11:59.054,ns_1@10.3.121.23:<0.31941.466>:ns_replicas_builder_utils:kill_a_bunch_of_tap_names:59]Killed the following tap names on 'ns_1@10.3.121.25': [<<"replication_building_30_'ns_1@10.3.121.13'">>,
        <<"replication_building_30_'ns_1@10.3.121.16'">>,
        <<"replication_building_30_'ns_1@10.3.121.22'">>]

        [user:info,2012-08-29T19:11:59.529,ns_1@10.3.121.23:<0.6719.464>:ns_orchestrator:handle_info:289]Rebalance stopped by user.

        As part of the code that handles rebalance completion (for any reason), we
        immediately log all tap and checkpoint stats on all the nodes. In these stats
        on node 25 we can see that vbucket 30 already has a checkpoint id of zero:

        [ns_server:info,2012-08-29T19:12:00.001,ns_1@10.3.121.25:<0.25668.2>:diag_handler:log_all_tap_and_checkpoint_stats:128]checkpoint:saslbucket:
        [
        {<<"vb_1007:persisted_checkpoint_id">>,<<"99">>},
        {<<"vb_1007:eq_tapq:replication_ns_1@10.3.121.13:cursor_checkpoint_id">>,
        .....
        {<<"vb_30:persisted_checkpoint_id">>,<<"0">>},
        {<<"vb_30:checkpoint_extension">>,<<"false">>},
        {<<"vb_30:num_items_for_persistence">>,<<"0">>},
        {<<"vb_30:num_checkpoints">>,<<"1">>},
        {<<"vb_30:num_open_checkpoint_items">>,<<"0">>},
        {<<"vb_30:num_checkpoint_items">>,<<"1">>},
        {<<"vb_30:num_tap_cursors">>,<<"0">>},
        {<<"vb_30:last_closed_checkpoint_id">>,<<"0">>},
        {<<"vb_30:open_checkpoint_id">>,<<"0">>},
        {<<"vb_30:state">>,<<"active">>},
        .....

        I could not find anything abnormal from the ns_server perspective during that
        period of time, so probably something is wrong in ep_engine. More logs from
        ep_engine would help here; in particular, it would be good if vbucket resets
        were logged.


          People

          • Assignee:
            chiyoung Chiyoung Seo
            Reporter:
            thuan Thuan Nguyen
          • Votes:
            0
            Watchers:
            2

            Dates

            • Created:
              Updated:
              Resolved:

              Gerrit Reviews

              There are no open Gerrit changes