Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-22997

Fix deadlock issue in when closing upr stream

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Major
    • 4.6.2, 5.0.0
    • 5.0.0
    • XDCR

    Description

      Build 5.0.0-2100.

      Setup:

      • 5 source nodes, 4 target nodes
      • 1B items (30% DGM)

      From the test output:

      3:48:12.733 [23/Feb/2017 15:51:45] INFO - replication_changes_left = 23,183,602
      03:48:14.771 [23/Feb/2017 15:51:47] INFO - replication_changes_left = 23,183,602
      03:48:16.809 [23/Feb/2017 15:51:49] INFO - replication_changes_left = 23,183,602
      03:48:18.849 [23/Feb/2017 15:51:51] INFO - replication_changes_left = 23,183,602
      03:48:20.887 [23/Feb/2017 15:51:53] INFO - replication_changes_left = 23,183,602
      03:48:22.923 [23/Feb/2017 15:51:55] INFO - replication_changes_left = 23,183,601
      03:48:24.957 [23/Feb/2017 15:51:57] INFO - replication_changes_left = -177,011,688
      03:48:26.998 [23/Feb/2017 15:51:59] INFO - replication_changes_left = 23,183,602
      03:48:29.034 [23/Feb/2017 15:52:01] INFO - replication_changes_left = 23,183,602
      03:48:31.072 [23/Feb/2017 15:52:03] INFO - replication_changes_left = 23,183,602
      03:48:33.111 [23/Feb/2017 15:52:05] INFO - replication_changes_left = 23,183,602
      03:48:35.148 [23/Feb/2017 15:52:07] INFO - replication_changes_left = 23,183,602
      03:48:37.187 [23/Feb/2017 15:52:09] INFO - replication_changes_left = 23,183,602
      03:48:39.226 [23/Feb/2017 15:52:11] INFO - replication_changes_left = 23,183,602
      03:48:41.263 [23/Feb/2017 15:52:13] INFO - replication_changes_left = 23,183,602
      03:48:43.301 [23/Feb/2017 15:52:15] INFO - replication_changes_left = 23,183,602
      03:48:45.337 [23/Feb/2017 15:52:17] INFO - replication_changes_left = 23,183,602
      03:48:47.376 [23/Feb/2017 15:52:19] INFO - replication_changes_left = 23,183,602
      03:48:49.411 [23/Feb/2017 15:52:21] INFO - replication_changes_left = 23,183,602
      03:48:51.446 [23/Feb/2017 15:52:23] INFO - replication_changes_left = 23,183,602
      03:48:53.482 [23/Feb/2017 15:52:25] INFO - replication_changes_left = 23,183,601
      03:48:55.521 [23/Feb/2017 15:52:27] INFO - replication_changes_left = 23,183,601
      03:48:57.560 [23/Feb/2017 15:52:29] INFO - replication_changes_left = 23,183,602
      03:48:59.596 [23/Feb/2017 15:52:31] INFO - replication_changes_left = 23,183,602
      03:49:01.630 [23/Feb/2017 15:52:34] INFO - replication_changes_left = 23,183,602
      03:49:03.670 [23/Feb/2017 15:52:36] INFO - replication_changes_left = 23,183,602
      03:49:05.706 [23/Feb/2017 15:52:38] INFO - replication_changes_left = 23,183,602
      03:49:07.742 [23/Feb/2017 15:52:40] INFO - replication_changes_left = 23,183,602
      03:49:09.783 [23/Feb/2017 15:52:42] INFO - replication_changes_left = 23,183,602
      03:49:11.822 [23/Feb/2017 15:52:44] INFO - replication_changes_left = 23,183,602
      03:49:13.857 [23/Feb/2017 15:52:46] INFO - replication_changes_left = 23,183,602
      03:49:15.894 [23/Feb/2017 15:52:48] INFO - replication_changes_left = 23,183,602
      03:49:17.931 [23/Feb/2017 15:52:50] INFO - replication_changes_left = 23,183,602
      03:49:19.966 [23/Feb/2017 15:52:52] INFO - replication_changes_left = 23,183,602
      03:49:21.997 [23/Feb/2017 15:52:54] INFO - replication_changes_left = 23,183,602
      03:49:24.033 [23/Feb/2017 15:52:56] INFO - replication_changes_left = 23,183,602
      03:49:26.068 [23/Feb/2017 15:52:58] INFO - replication_changes_left = 23,183,602
      03:49:28.105 [23/Feb/2017 15:53:00] INFO - replication_changes_left = 23,183,602
      03:49:30.141 [23/Feb/2017 15:53:02] INFO - replication_changes_left = 23,183,602
      03:49:32.177 [23/Feb/2017 15:53:04] INFO - replication_changes_left = 23,183,602
      03:49:34.211 [23/Feb/2017 15:53:06] INFO - replication_changes_left = 23,183,602
      03:49:36.247 [23/Feb/2017 15:53:08] INFO - replication_changes_left = 23,183,602
      03:49:38.286 [23/Feb/2017 15:53:10] INFO - replication_changes_left = 23,183,602
      03:49:40.321 [23/Feb/2017 15:53:12] INFO - replication_changes_left = 23,183,602
      03:49:42.360 [23/Feb/2017 15:53:14] INFO - replication_changes_left = 23,183,602
      03:49:44.396 [23/Feb/2017 15:53:16] INFO - replication_changes_left = 23,183,602
      03:49:46.432 [23/Feb/2017 15:53:18] INFO - replication_changes_left = 23,183,602
      03:49:48.468 [23/Feb/2017 15:53:20] INFO - replication_changes_left = 23,183,602
      03:49:50.504 [23/Feb/2017 15:53:22] INFO - replication_changes_left = 23,183,602
      03:49:52.543 [23/Feb/2017 15:53:24] INFO - replication_changes_left = 23,183,602
      03:49:54.578 [23/Feb/2017 15:53:26] INFO - replication_changes_left = 23,183,602
      03:49:56.614 [23/Feb/2017 15:53:29] INFO - replication_changes_left = 23,183,602
      03:49:58.655 [23/Feb/2017 15:53:31] INFO - replication_changes_left = 23,183,602
      03:50:00.693 [23/Feb/2017 15:53:33] INFO - replication_changes_left = 23,183,601
      

      XDCR logs on 172.23.96.106 report at least this error:

       "Errors": "[{\"time\":\"2017-02-23T15:07:37.465500702-08:00\",\"errMsg\":\"dcp_c9b656c9bdad469362874b06b3686bab/bucket-1/bucket-1_172.23.96.106:11210_2:DCP upr feed has been closed.\"}]"
      

      From memcached logs on 172.23.96.106:

      memcached.log.000017.txt:2017-02-23T14:45:12.176863-08:00 WARNING 462 Closing connection (0x7f822c231308) [ 172.23.96.106:49838 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T14:47:16.165660-08:00 WARNING 462 Closing connection (0x7f822c24a308) [ 172.23.96.106:49924 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T14:49:00.143635-08:00 WARNING 463 Closing connection (0x7f823b7c4308) [ 172.23.96.106:49746 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T14:49:20.139975-08:00 WARNING 465 Closing connection (0x7f822c26f308) [ 172.23.96.106:50010 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T14:59:00.125364-08:00 WARNING 465 Closing connection (0x7f822c2b9688) [ 172.23.96.106:50172 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:03:52.111474-08:00 WARNING 462 Closing connection (0x7f552f699588) [ 172.23.96.106:50638 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:05:56.112309-08:00 WARNING 406 Closing connection (0x7f5530252c88) [ 172.23.96.106:50724 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:08:16.791671-08:00 WARNING 463 Closing connection (0x7f822c368c88) [ 172.23.96.106:50604 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:08:17.177541-08:00 WARNING 253 Closing connection (0x7f819f180908) [ 172.23.96.106:49386 - 172.23.96.106:11210 (bucket-1) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:08:17.178722-08:00 WARNING 256 Closing connection (0x7f823b73e308) [ 172.23.96.106:49596 - 172.23.96.106:11210 (bucket-1) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:08:17.179063-08:00 WARNING 257 Closing connection (0x7f823b7ad308) [ 172.23.96.106:49662 - 172.23.96.106:11210 (bucket-1) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:12:23.897122-08:00 WARNING 270 Closing connection (0x7f5531dbb008) [ 172.23.96.106:51900 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:16:31.898586-08:00 WARNING 270 Closing connection (0x7f552cd63d08) [ 172.23.96.106:52072 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:20:40.899371-08:00 WARNING 270 Closing connection (0x7f819f085e88) [ 172.23.96.106:52246 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:28:16.892452-08:00 WARNING 4 Closing connection (0x7f819f0c3d08) [ 172.23.96.106:52314 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:28:57.899042-08:00 WARNING 233 Closing connection (0x7f819ee89908) [ 172.23.96.106:52588 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:31:01.898959-08:00 WARNING 4 Closing connection (0x7f819f180908) [ 172.23.96.106:52674 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:33:05.899161-08:00 WARNING 4 Closing connection (0x7f823b7ad308) [ 172.23.96.106:52760 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:35:09.898768-08:00 WARNING 4 Closing connection (0x7f823b7c4308) [ 172.23.96.106:52848 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:37:13.899243-08:00 WARNING 4 Closing connection (0x7f823b7ed308) [ 172.23.96.106:52932 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:38:16.892835-08:00 WARNING 233 Closing connection (0x7f823b73e308) [ 172.23.96.106:52728 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:41:21.898723-08:00 WARNING 4 Closing connection (0x7f822c24a308) [ 172.23.96.106:53112 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:43:25.900439-08:00 WARNING 4 Closing connection (0x7f822c290308) [ 172.23.96.106:53200 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:48:16.892215-08:00 WARNING 233 Closing connection (0x7f822c26f308) [ 172.23.96.106:53154 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:49:37.898894-08:00 WARNING 4 Closing connection (0x7f822c304308) [ 172.23.96.106:53456 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:51:41.900259-08:00 WARNING 4 Closing connection (0x7f5516213f88) [ 172.23.96.106:53542 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:53:45.899779-08:00 WARNING 4 Closing connection (0x7f551b4a1208) [ 172.23.96.106:53628 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:55:49.899133-08:00 WARNING 4 Closing connection (0x7f822c368c88) [ 172.23.96.106:53714 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      memcached.log.000017.txt:2017-02-23T15:57:53.899065-08:00 WARNING 4 Closing connection (0x7f5516213f88) [ 172.23.96.106:53798 - 172.23.96.106:11210 (_admin) ] due to read error: Connection reset by peer
      

      Please note, there where TMP OOM errors on the target side.

      2017-02-23T16:00:17.340-08:00 ERRO GOXDCR.XmemNozzle: xmem_c9b656c9bdad469362874b06b3686bab/bucket-1/bucket-1_172.23.96.100:11210_0 Received temporary error in setMeta response. Response status=TMPFAIL, err = <nil>, response=MCResponse status=TMPFAIL, opcode=0xa2, opaque=515, msg: 
      2017-02-23T16:00:21.220-08:00 ERRO GOXDCR.XmemNozzle: xmem_c9b656c9bdad469362874b06b3686bab/bucket-1/bucket-1_172.23.96.100:11210_0 Received temporary error in setMeta response. Response status=TMPFAIL, err = <nil>, response=MCResponse status=TMPFAIL, opcode=0xa2, opaque=506, msg: 
      2017-02-23T16:00:25.101-08:00 ERRO GOXDCR.XmemNozzle: xmem_c9b656c9bdad469362874b06b3686bab/bucket-1/bucket-1_172.23.96.100:11210_0 Received temporary error in setMeta response. Response status=TMPFAIL, err = <nil>, response=MCResponse status=TMPFAIL, opcode=0xa2, opaque=516, msg: 
      2017-02-23T16:00:28.991-08:00 ERRO GOXDCR.XmemNozzle: xmem_c9b656c9bdad469362874b06b3686bab/bucket-1/bucket-1_172.23.96.100:11210_0 Received temporary error in setMeta response. Response status=TMPFAIL, err = <nil>, response=MCResponse status=TMPFAIL, opcode=0xa2, opaque=507, msg: 
      2017-02-23T16:00:32.881-08:00 ERRO GOXDCR.XmemNozzle: xmem_c9b656c9bdad469362874b06b3686bab/bucket-1/bucket-1_172.23.96.100:11210_0 Received temporary error in setMeta response. Response status=TMPFAIL, err = <nil>, response=MCResponse status=TMPFAIL, opcode=0xa2, opaque=517, msg: 
      

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              yu Yu Sui (Inactive)
              pavelpaulau Pavel Paulau (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              4 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty