Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-51278

[System test upgrade] - Received error on DCP stream for vb: 272, err: document exists

    XMLWordPrintable

Details

    • Bug
    • Resolution: Fixed
    • Critical
    • 7.1.0
    • 7.0.3, 7.1.0
    • fts
    • 7.0.3-7031 --> 7.1.0-2412
    • Untriaged
    • Centos 64-bit
    • 1
    • Unknown

    Description

      Script to Repro
      1. Run CC longevity on 7.0.3 for 2-3 days

      ./sequoia -client 172.23.104.254:2375 -provider file:centos_second_cluster.yml -test tests/integration/cheshirecat/test_cheshirecat_kv_gsi_coll_xdcr_backup_sgw_fts_itemct_txns_eventing_cbas_scale3.yml -scope tests/integration/cheshirecat/scope_cheshirecat_with_backup.yml -scale 3 -repeat 0 -log_level 0 -version 7.0.3-7031 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
      

      2. Change the encryption level to strict.
      3. Add 2 7.1 kv nodes and 2 7.1 indexing nodes, Remove 2 7.0.3 kv and 2 7.0.3 indexing nodes and do swap rebalance. This was hung because of MB-51196.

      Noticed 100's of occurrences of following errors on all 3 fts nodes right around the same time frame as shown in the log below. Saw some bugs similar to this (MB-48513, MB-48023)
      172.23.106.136

      [root@s82711-w12r2 logs]# grep 'Received error on DCP stream for vb'  fts.log| grep 'err: document exists'
      2022-03-01T19:09:45.692-08:00 [ERRO] feed_dcp_gocbcore: [social_3e49564e3041dc55_f4e0a48a] Received error on DCP stream for vb: 264, err: document exists | {"status_code":2,"bucket":"default","error_name":"KEY_EEXISTS","error_description":"key already exists, or CAS mismatch","opaque":302,"last_dispatched_to":"172.23.120.86:11207","last_dispatched_from":"172.23.106.136:55600","last_connection_id":"e7b21ba58abb2b00/f31203876363c1c3"} -- cbgt.(*GocbcoreDCPFeed).initiateStreamEx.func1() at feed_dcp_gocbcore.go:964
      2022-03-01T19:09:45.692-08:00 [ERRO] feed_dcp_gocbcore: [social_3e49564e3041dc55_f4e0a48a] Received error on DCP stream for vb: 272, err: document exists | {"status_code":2,"bucket":"default","error_name":"KEY_EEXISTS","error_description":"key already exists, or CAS mismatch","opaque":303,"last_dispatched_to":"172.23.120.86:11207","last_dispatched_from":"172.23.106.136:55600","last_connection_id":"e7b21ba58abb2b00/f31203876363c1c3"} -- cbgt.(*GocbcoreDCPFeed).initiateStreamEx.func1() at feed_dcp_gocbcore.go:964
      2022-03-01T19:09:45.692-08:00 [ERRO] feed_dcp_gocbcore: [social_3e49564e3041dc55_f4e0a48a] Received error on DCP stream for vb: 263, err: document exists | {"status_code":2,"bucket":"default","error_name":"KEY_EEXISTS","error_description":"key already exists, or CAS mismatch","opaque":304,"last_dispatched_to":"172.23.120.86:11207","last_dispatched_from":"172.23.106.136:55600","last_connection_id":"e7b21ba58abb2b00/f31203876363c1c3"} -- cbgt.(*GocbcoreDCPFeed).initiateStreamEx.func1() at feed_dcp_gocbcore.go:964
      2022-03-01T19:09:45.692-08:00 [ERRO] feed_dcp_gocbcore: [social_3e49564e3041dc55_f4e0a48a] Received error on DCP stream for vb: 266, err: document exists | {"status_code":2,"bucket":"default","error_name":"KEY_EEXISTS","error_description":"key already exists, or CAS mismatch","opaque":305,"last_dispatched_to":"172.23.120.86:11207","last_dispatched_from":"172.23.106.136:55600","last_connection_id":"e7b21ba58abb2b00/f31203876363c1c3"} -- cbgt.(*GocbcoreDCPFeed).initiateStreamEx.func1() at feed_dcp_gocbcore.go:964
      

      Another interesting thing to note. All these fts nodes are on 7.0.3 and we hit into MB-51099 on this very same cluster because of which cbcollects failed on these nodes all the time. However since today morning cbcollects work on 2/3 nodes.

      I remember Sreekanth Sivasankaran mentioning that restart of service can help with this which I didn't initiate, so I am guessing some kind of restart or crash has happened in fts.

      Also following stats that I had collected couple of days ago to validate the bug MB-51099 does prove that. When I compare it with the stats I took today its gone down considerably lower.

      Stats from 2-3 days before

      Book-Pro-2:testrunner balakumaran.g$ curl -XGET https://172.23.106.136:18094/api/stats -u Administrator:password -k | jq | grep CurFiles
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100  389k    0  389k    0     0  78436      0 --:--:--  0:00:05 --:--:-- 89701
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 64388,
                "CurFilesIneligibleForRemoval": 62420,
                "CurFilesIneligibleForRemoval": 64361,
                "CurFilesIneligibleForRemoval": 61028,
      Balakumarans-MacBook-Pro-2:testrunner balakumaran.g$ 
      

      Stats from today

      Balakumarans-MacBook-Pro-2:TAF balakumaran.g$ curl -XGET https://172.23.106.136:18094/api/stats -u Administrator:password -k | jq | grep CurFiles
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
      100  358k    0  358k    0     0   101k      0 --:--:--  0:00:03 --:--:--  101k
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 0,
                "CurFilesIneligibleForRemoval": 509,
                "CurFilesIneligibleForRemoval": 496,
                "CurFilesIneligibleForRemoval": 476,
                "CurFilesIneligibleForRemoval": 535,
      Balakumarans-MacBook-Pro-2:TAF balakumaran.g$ 
      

      I grepped for Service 'fts' exited with status 137 thinking maybe OOM killer was involved, but I don't see any logs for that.

      cbcollect_info attached

      Attachments

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            People

              Balakumaran.Gopal Balakumaran Gopal
              Balakumaran.Gopal Balakumaran Gopal
              Votes:
              0 Vote for this issue
              Watchers:
              3 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty