Couchbase Server
MB-44660

[System Test] Disk getting full on one of the KV nodes in the cluster


Details

    Description

      Build : 7.0.0-4547
      Test : -test tests/integration/cheshirecat/test_cheshirecat_kv_gsi_coll_xdcr_backup_sgw_fts_itemct_txns_eventing_cbas.yml -scope tests/integration/cheshirecat/scope_cheshirecat_with_backup.yml
      Scale :
      Iteration : 1st

      During the system test, the disk on KV node 172.23.120.77 fills up, after which the test does not proceed as expected.

      On 172.23.120.77, df shows 100% disk usage on the 100 GB /data partition, while du accounts for only 48 GB; the gap is worked out after the outputs below.
      [root@localhost bin]# df -kh
      Filesystem               Size  Used  Avail  Use%  Mounted on
      devtmpfs                  12G     0    12G    0%  /dev
      tmpfs                     12G     0    12G    0%  /dev/shm
      tmpfs                     12G  803M    11G    7%  /run
      tmpfs                     12G     0    12G    0%  /sys/fs/cgroup
      /dev/mapper/centos-root   31G  8.5G    23G   28%  /
      /dev/xvdb1               100G  100G    20K  100%  /data
      /dev/xvda1               497M  284M   214M   58%  /boot
      tmpfs                    2.4G     0   2.4G    0%  /run/user/0

      [root@localhost data]# pwd
      /data
      [root@localhost data]# du -sh *
      0 archive
      48G couchbase
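
      The gap between what df reports as used and what du can see under /data therefore works out to roughly 52 GB. A quick way to compute it directly (a minimal sketch; these commands are not part of the test harness):

      df_used_kb=$(df -k /data | awk 'NR==2 {print $3}')
      du_used_kb=$(du -sk /data | awk '{print $1}')
      echo "unaccounted: $(( (df_used_kb - du_used_kb) / 1024 / 1024 )) GB"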

      From lsof, memcached is seen to be holding 9822 deleted files open. Space belonging to files that are deleted but still held open is counted by df but not by du, so this could well be the cause; a follow-up check is sketched after the lsof output below.
      [root@localhost data]# /usr/sbin/lsof | grep deleted | grep memcached | wc -l
      9822
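
      As a follow-up check (a minimal sketch, not yet run on this node), the space pinned by those deleted-but-open files can be estimated by summing lsof's SIZE/OFF column; a figure close to the ~52 GB df/du gap would confirm the cause. Note that lsof can list the same file once per descriptor or thread, so this may overcount.

      [root@localhost data]# /usr/sbin/lsof -nP | grep memcached | grep deleted | \
          awk '{sum += $7} END {printf "%.1f GiB\n", sum / 1024 / 1024 / 1024}'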

      The disk_almost_full alarm first fired at 2021-03-01T03:34:16 and has been repeatedly set and cleared since then.

      [root@localhost logs]# zgrep -i "disk_almost_full" babysitter.log
      [ns_server:info,2021-03-01T03:34:16.379-08:00,babysitter_of_ns_1@cb.local:<0.121.0>:ns_port_server:log:224]ns_server<0.121.0>: 2021-03-01 03:34:16.165561 std_info            #{label=>{error_logger,info_report},report=>[{alarm_handler,{set,{{disk_almost_full,"/data"},[]}}}]}
      [ns_server:info,2021-03-01T03:36:16.376-08:00,babysitter_of_ns_1@cb.local:<0.121.0>:ns_port_server:log:224]ns_server<0.121.0>: 2021-03-01 03:36:16.175421 std_info            #{label=>{error_logger,info_report},report=>[{alarm_handler,{clear,{disk_almost_full,"/data"}}}]}
      [ns_server:info,2021-03-01T03:43:16.416-08:00,babysitter_of_ns_1@cb.local:<0.121.0>:ns_port_server:log:224]ns_server<0.121.0>: 2021-03-01 03:43:16.216190 std_info            #{label=>{error_logger,info_report},report=>[{alarm_handler,{set,{{disk_almost_full,"/data"},[]}}}]}
      [ns_server:info,2021-03-01T04:02:16.559-08:00,babysitter_of_ns_1@cb.local:<0.121.0>:ns_port_server:log:224]ns_server<0.121.0>: 2021-03-01 04:02:16.349853 std_info            #{label=>{error_logger,info_report},report=>[{alarm_handler,{clear,{disk_almost_full,"/data"}}}]}
      [ns_server:info,2021-03-01T04:04:16.578-08:00,babysitter_of_ns_1@cb.local:<0.121.0>:ns_port_server:log:224]ns_server<0.121.0>: 2021-03-01 04:04:16.376625 std_info            #{label=>{error_logger,info_report},report=>[{alarm_handler,{set,{{disk_almost_full,"/data"},[]}}}]}
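
      To see how long the alarm stays active between flaps, the set/clear events can be reduced to just a timestamp and a state (a minimal sketch, assuming the babysitter.log layout shown above, where the second comma-separated field is the timestamp):

      [root@localhost logs]# zgrep -hi "disk_almost_full" babysitter.log* | \
          awk -F',' '/alarm_handler.*set,/   {print $2, "set"}
                     /alarm_handler.*clear,/ {print $2, "clear"}'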
      

      We will continue to investigate, but this looks like MB-41924. Another cluster running 7.0.0-4554 hit the same issue after roughly the same duration of test run, whereas the run on 7.0.0-4539 did not, so this could be a regression.

      Attachments


        Activity

          People

            Assignee: mihir.kamdar Mihir Kamdar (Inactive)
            Reporter: mihir.kamdar Mihir Kamdar (Inactive)
            Votes: 0
            Watchers: 7

            Dates

              Created:
              Updated:
              Resolved:

              Gerrit Reviews

                There are no open Gerrit changes
