MB-44660

[System Test] Disk getting full on one of the KV nodes in the cluster


Details

    Description

      Build : 7.0.0-4547
      Test : -test tests/integration/cheshirecat/test_cheshirecat_kv_gsi_coll_xdcr_backup_sgw_fts_itemct_txns_eventing_cbas.yml -scope tests/integration/cheshirecat/scope_cheshirecat_with_backup.yml
      Scale :
      Iteration : 1st

      In the system test, the disk on one of the KV nodes, 172.23.120.77, fills up, after which the test does not proceed as expected.

      On 172.23.120.77, df shows 100% usage on the 100 GB /data partition, but du accounts for only 48 GB.
      [root@localhost bin]# df -kh
      Filesystem               Size  Used Avail Use% Mounted on
      devtmpfs                  12G     0   12G   0% /dev
      tmpfs                     12G     0   12G   0% /dev/shm
      tmpfs                     12G  803M   11G   7% /run
      tmpfs                     12G     0   12G   0% /sys/fs/cgroup
      /dev/mapper/centos-root   31G  8.5G   23G  28% /
      /dev/xvdb1               100G  100G   20K 100% /data
      /dev/xvda1               497M  284M  214M  58% /boot
      tmpfs                    2.4G     0  2.4G   0% /run/user/0

      [root@localhost data]# pwd
      /data
      [root@localhost data]# du -sh *
      0 archive
      48G couchbase

      From lsof, we can see that memcached is holding open 9822 deleted files, which might be the cause.
      [root@localhost data]# /usr/sbin/lsof | grep deleted | grep memcached | wc -l
      9822
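
      To gauge how much space those deleted-but-open files are pinning, their sizes can be summed from the lsof output. A rough sketch (assumes the default lsof column layout, where SIZE/OFF is the 7th field; files held via multiple descriptors are counted more than once):

      /usr/sbin/lsof -nP | awk '$1 == "memcached" && /deleted/ {sum += $7} END {printf "%.1f GiB pinned by deleted files\n", sum/1024/1024/1024}'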

      The disk_almost_full alarm started firing at 2021-03-01T03:34:16.

      [root@localhost logs]# zgrep -i "disk_almost_full" babysitter.log
      [ns_server:info,2021-03-01T03:34:16.379-08:00,babysitter_of_ns_1@cb.local:<0.121.0>:ns_port_server:log:224]ns_server<0.121.0>: 2021-03-01 03:34:16.165561 std_info            #{label=>{error_logger,info_report},report=>[{alarm_handler,{set,{{disk_almost_full,"/data"},[]}}}]}
      [ns_server:info,2021-03-01T03:36:16.376-08:00,babysitter_of_ns_1@cb.local:<0.121.0>:ns_port_server:log:224]ns_server<0.121.0>: 2021-03-01 03:36:16.175421 std_info            #{label=>{error_logger,info_report},report=>[{alarm_handler,{clear,{disk_almost_full,"/data"}}}]}
      [ns_server:info,2021-03-01T03:43:16.416-08:00,babysitter_of_ns_1@cb.local:<0.121.0>:ns_port_server:log:224]ns_server<0.121.0>: 2021-03-01 03:43:16.216190 std_info            #{label=>{error_logger,info_report},report=>[{alarm_handler,{set,{{disk_almost_full,"/data"},[]}}}]}
      [ns_server:info,2021-03-01T04:02:16.559-08:00,babysitter_of_ns_1@cb.local:<0.121.0>:ns_port_server:log:224]ns_server<0.121.0>: 2021-03-01 04:02:16.349853 std_info            #{label=>{error_logger,info_report},report=>[{alarm_handler,{clear,{disk_almost_full,"/data"}}}]}
      [ns_server:info,2021-03-01T04:04:16.578-08:00,babysitter_of_ns_1@cb.local:<0.121.0>:ns_port_server:log:224]ns_server<0.121.0>: 2021-03-01 04:04:16.376625 std_info            #{label=>{error_logger,info_report},report=>[{alarm_handler,{set,{{disk_almost_full,"/data"},[]}}}]}
      

      We will continue to investigate, but this looks like MB-41924. On another cluster running 7.0.0-4554 we hit the same issue after roughly the same test duration; however, the run with 7.0.0-4539 did not show it, so this could be a regression.

      Attachments


        Activity

          drigby Dave Rigby added a comment -

          Changelog between 4539 & 4547: http://changelog.build.couchbase.com/?product=couchbase-server&fromVersion=7.0.0&fromBuild=4539&toVersion=7.0.0&toBuild=4547&f_asterixdb=on&f_backup=on&f_cbas-core=on&f_cbft=on&f_cbgt=on&f_gometa=on&f_kv_engine=on&f_ns_server=on&f_plasma=on&f_query=on&f_query-ui=on&f_testrunner=on&f_tlm=on&f_voltron=on

          Nothing immediately obvious from the changelog which would suggest why this is now failing...
          drigby Dave Rigby added a comment - - edited

          Unfortunately there's no open-file information recorded in the cbcollect, because the machine doesn't have lsof installed:

          172.23.121.77 couchbase.log

          Relevant lsof output
          echo moxi memcached beam.smp couch_compact godu sigar_port cbq-engine indexer projector goxdcr cbft eventing-producer eventing-consumer | xargs -n1 pgrep | xargs -n1 -r -- lsof -n -p
           
          xargs: lsof: No such file or directory
          xargs: pgrep: terminated by signal 13
          

          Mihir Kamdar Could you update these machines / the template so that the lsof package is included in the standard set of installed packages, please?
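
          If the template is CentOS/RHEL based, as the centos-root mount in the df output above suggests, that should just be a one-liner in the provisioning step:

          yum install -y lsof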

          Edit: I was actually looking at a different node from the one the problem was reported on (.121.77 vs .120.77); however, the point still stands that we should have tools like lsof installed as standard on cluster nodes.

          Edit 2: However, looking at the node which was reported to have a full disk (.120.77), that also exhibits the same problem - lsof isn't found - so that node also needs fixing:

          172.23.120.77 couchbase.log

          echo moxi memcached beam.smp couch_compact godu sigar_port cbq-engine indexer projector goxdcr cbft eventing-producer eventing-consumer | xargs -n1 pgrep | xargs -n1 -r -- lsof -n -p
          ==============================================================================
          xargs: lsof: No such file or directory
          xargs: pgrep: terminated by signal 13
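
          Until lsof is available, a rough equivalent can be pulled from /proc. A minimal sketch (run as root; assumes the data files live under /data as on these nodes):

          # Count deleted-but-still-open files under /data by walking the /proc fd symlinks;
          # the symlink target of a deleted file ends in " (deleted)".
          find /proc/[0-9]*/fd -type l -lname '/data/*(deleted)' 2>/dev/null | wc -l
          # Group the holders by PID (the 3rd path component of /proc/<pid>/fd/<n>).
          find /proc/[0-9]*/fd -type l -lname '/data/*(deleted)' 2>/dev/null | awk -F/ '{print $3}' | sort | uniq -c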
          

          drigby Dave Rigby added a comment -

          There is a very large number of DCP streams backfilling on the affected node - in particular from FTS:

          ".120.77 stats.log

          1885074: eq_dcpq:fts:default:b402e957ef282aff8fcfa9d71d8d983b-57347cfb:num_streams:                                                                                15223
          

          Each stream in the backfilling state keeps a vbucket file open; the file will be logically deleted when compaction runs, but its physical space cannot be reclaimed until the backfill completes.
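
          For anyone re-checking this, the per-connection stream counts can be read straight from the DCP stats with cbstats (a sketch; host, credentials and bucket below are placeholders):

          /opt/couchbase/bin/cbstats localhost:11210 -u Administrator -p password -b default dcp | grep ':num_streams:'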

          Note also that this node has less overall disk space on the /data volume compared to others:

          .120.77 couchbase.log

          df -ha
          ==============================================================================
          Filesystem               Size  Used Avail Use% Mounted on
          /dev/xvdb1               100G  100G   20K 100% /data
          

          .121.77 couchbase.log

          /dev/xvdb1               150G   64G   87G  43% /data
          

          I believe this is why this node suffered the issue and the other didn't (even though FTS is connected to multiple nodes).

          FTS has a known bug where it keeps too many streams open - see MB-44562. Please re-run once that MB has been fixed.


          mihir.kamdar Mihir Kamdar (Inactive) added a comment -

          Resolving this as a duplicate of MB-44562. We are running the system test with build 7.0.0-4587 and did not see the issue for 11 hrs of the test run. Will let the test run further and close the issue if it's not seen.

