Details
- Bug
- Resolution: Fixed
- Critical
- 7.0.3, 7.1.0
- 7.0.3-7031 --> 7.1.0-2412
- Untriaged
- Centos 64-bit
- 1
- Unknown
Description
Steps to reproduce:
1. Run CC longevity on 7.0.3 for 2-3 days:
./sequoia -client 172.23.104.254:2375 -provider file:centos_second_cluster.yml -test tests/integration/cheshirecat/test_cheshirecat_kv_gsi_coll_xdcr_backup_sgw_fts_itemct_txns_eventing_cbas_scale3.yml -scope tests/integration/cheshirecat/scope_cheshirecat_with_backup.yml -scale 3 -repeat 0 -log_level 0 -version 7.0.3-7031 -skip_setup=false -skip_test=false -skip_teardown=true -skip_cleanup=false -continue=false -collect_on_error=false -stop_on_error=false -duration=604800 -show_topology=true
2. Change the encryption level to strict.
3. Add two 7.1 kv nodes and two 7.1 indexing nodes, remove two 7.0.3 kv nodes and two 7.0.3 indexing nodes, and perform a swap rebalance. This rebalance hung because of MB-51196.
Noticed hundreds of occurrences of the following errors on all 3 fts nodes, all within the same time frame, as shown in the logs below. Some existing bugs look similar (MB-48513, MB-48023).
172.23.106.136
[root@s82711-w12r2 logs]# grep 'Received error on DCP stream for vb' fts.log| grep 'err: document exists'
2022-03-01T19:09:45.692-08:00 [ERRO] feed_dcp_gocbcore: [social_3e49564e3041dc55_f4e0a48a] Received error on DCP stream for vb: 264, err: document exists | {"status_code":2,"bucket":"default","error_name":"KEY_EEXISTS","error_description":"key already exists, or CAS mismatch","opaque":302,"last_dispatched_to":"172.23.120.86:11207","last_dispatched_from":"172.23.106.136:55600","last_connection_id":"e7b21ba58abb2b00/f31203876363c1c3"} -- cbgt.(*GocbcoreDCPFeed).initiateStreamEx.func1() at feed_dcp_gocbcore.go:964
2022-03-01T19:09:45.692-08:00 [ERRO] feed_dcp_gocbcore: [social_3e49564e3041dc55_f4e0a48a] Received error on DCP stream for vb: 272, err: document exists | {"status_code":2,"bucket":"default","error_name":"KEY_EEXISTS","error_description":"key already exists, or CAS mismatch","opaque":303,"last_dispatched_to":"172.23.120.86:11207","last_dispatched_from":"172.23.106.136:55600","last_connection_id":"e7b21ba58abb2b00/f31203876363c1c3"} -- cbgt.(*GocbcoreDCPFeed).initiateStreamEx.func1() at feed_dcp_gocbcore.go:964
2022-03-01T19:09:45.692-08:00 [ERRO] feed_dcp_gocbcore: [social_3e49564e3041dc55_f4e0a48a] Received error on DCP stream for vb: 263, err: document exists | {"status_code":2,"bucket":"default","error_name":"KEY_EEXISTS","error_description":"key already exists, or CAS mismatch","opaque":304,"last_dispatched_to":"172.23.120.86:11207","last_dispatched_from":"172.23.106.136:55600","last_connection_id":"e7b21ba58abb2b00/f31203876363c1c3"} -- cbgt.(*GocbcoreDCPFeed).initiateStreamEx.func1() at feed_dcp_gocbcore.go:964
2022-03-01T19:09:45.692-08:00 [ERRO] feed_dcp_gocbcore: [social_3e49564e3041dc55_f4e0a48a] Received error on DCP stream for vb: 266, err: document exists | {"status_code":2,"bucket":"default","error_name":"KEY_EEXISTS","error_description":"key already exists, or CAS mismatch","opaque":305,"last_dispatched_to":"172.23.120.86:11207","last_dispatched_from":"172.23.106.136:55600","last_connection_id":"e7b21ba58abb2b00/f31203876363c1c3"} -- cbgt.(*GocbcoreDCPFeed).initiateStreamEx.func1() at feed_dcp_gocbcore.go:964
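To gauge how widespread these errors are across vbuckets, a quick per-vbucket tally can be pulled from fts.log on each node. A minimal sketch, based on the log format in the entries above (it reads from stdin, so it can be pointed at any node's log):

```shell
# Count 'document exists' DCP stream errors per vbucket.
# Usage:  sh count_vb_errors.sh < fts.log
grep 'Received error on DCP stream for vb' \
  | grep 'err: document exists' \
  | sed -n 's/.*vb: \([0-9][0-9]*\),.*/\1/p' \
  | sort -n | uniq -c | sort -rn
```

The output is one line per vbucket, most-hit first, which makes it easy to see whether the errors cluster on a few vbuckets or spread across all of them.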
Another interesting thing to note: all these fts nodes are on 7.0.3, and we hit MB-51099 on this very same cluster, because of which cbcollects always failed on these nodes. However, since this morning cbcollects work on 2 of the 3 nodes.
I remember Sreekanth Sivasankaran mentioning that a restart of the service can help with this. I did not initiate one, so I am guessing some kind of restart or crash has happened in fts.
The following stats, which I had collected a couple of days ago to validate MB-51099, support that: compared with the stats I took today, CurFilesIneligibleForRemoval has gone down considerably.
Stats from 2-3 days before
Balakumarans-MacBook-Pro-2:testrunner balakumaran.g$ curl -XGET https://172.23.106.136:18094/api/stats -u Administrator:password -k | jq | grep CurFiles
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 389k 0 389k 0 0 78436 0 --:--:-- 0:00:05 --:--:-- 89701
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 64388,
"CurFilesIneligibleForRemoval": 62420,
"CurFilesIneligibleForRemoval": 64361,
"CurFilesIneligibleForRemoval": 61028,
Balakumarans-MacBook-Pro-2:testrunner balakumaran.g$
Stats from today
Balakumarans-MacBook-Pro-2:TAF balakumaran.g$ curl -XGET https://172.23.106.136:18094/api/stats -u Administrator:password -k | jq | grep CurFiles
% Total % Received % Xferd Average Speed Time Time Time Current
Dload Upload Total Spent Left Speed
100 358k 0 358k 0 0 101k 0 --:--:-- 0:00:03 --:--:-- 101k
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 0,
"CurFilesIneligibleForRemoval": 509,
"CurFilesIneligibleForRemoval": 496,
"CurFilesIneligibleForRemoval": 476,
"CurFilesIneligibleForRemoval": 535,
Balakumarans-MacBook-Pro-2:TAF balakumaran.g$
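Instead of eyeballing the grep output, the counters can be summed with jq for a single before/after number. A hedged sketch, assuming the stats JSON has been saved first (e.g. `curl -sk -u Administrator:password https://172.23.106.136:18094/api/stats > stats.json`); `CurFilesIneligibleForRemoval` is the field name from the output above:

```shell
# Sum every CurFilesIneligibleForRemoval counter found anywhere in the
# stats document, regardless of nesting depth (one counter per index partition).
jq '[.. | .CurFilesIneligibleForRemoval? | numbers] | add' stats.json
```

`..` walks every value recursively, `?` suppresses errors on non-objects, and `numbers` drops the nulls from objects that lack the field, so only the actual counters are summed.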
I grepped for "Service 'fts' exited with status 137", thinking maybe the OOM killer was involved, but I don't see any logs for that.
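The reason status 137 points at the OOM killer: POSIX shells report a signal-terminated child as 128 + signal number, and the OOM killer sends SIGKILL (signal 9). A quick demonstration:

```shell
# A child that SIGKILLs itself exits with status 128 + 9 = 137,
# the same status ns_server would log for an OOM-killed service.
sh -c 'kill -KILL $$'
echo $?   # prints 137
```

On the nodes themselves, a genuine OOM kill would also leave a "Killed process" entry in the kernel log (dmesg or /var/log/messages), which is another place worth checking.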
cbcollect_info attached