Couchbase Server / MB-59853

Auto-failover: Seeing "Janitor cleanup failed" error message during failover process

Details

    Description

      Steps:

      • 3-node KV cluster with one Magma bucket

        +---------------+--------+-----------+----------+----------------------+
        | Nodes         | CPU    | Mem_total | Mem_free | Swap_mem_used        |
        +---------------+--------+-----------+----------+----------------------+
        | 172.23.108.69 | 15.549 | 4.03 GiB  | 3.07 GiB | 23.25 MiB / 4.24 GiB |
        | 172.23.108.67 | 27.050 | 4.03 GiB  | 3.04 GiB | 20.50 MiB / 4.24 GiB |
        | 172.23.108.68 | 15.749 | 4.03 GiB  | 3.05 GiB | 4.75 MiB / 4.24 GiB  |
        +---------------+--------+-----------+----------+----------------------+
         
        +---------+-------------------+----------+-------+-----------------------+-----------+
        | Bucket  | Type / Storage    | Replicas | Items | RAM Quota / Used      | Disk Used |
        +---------+-------------------+----------+-------+-----------------------+-----------+
        | default | couchbase / magma | 2        | 3322  | 9.37 GiB / 299.40 MiB | 37.21 MiB |
        +---------+-------------------+----------+-------+-----------------------+-----------+
        

      • Induce a failure on one of the nodes (172.23.108.68) to trigger auto-failover

      Observation:

      Just after failover starts, a 'Janitor cleanup failed' error is logged for the failover of the node on which the failure was induced (172.23.108.68).

      Apart from this error, failover completes successfully as expected.

      Logs:

       

      [rebalance:error,2023-11-28T02:12:57.287-08:00,ns_1@172.23.108.67:<0.9458.848>:failover:janitor_buckets:615] Janitor cleanup of ["default"] failed after failover of ['ns_1@172.23.108.68']:
      {error, {badmatch, false},
       [{leader_activities, start_activity, 6,
         [{file, "src/leader_activities.erl"},
          {line, 185}]},
        {leader_activities, run_activity, 6,
         [{file, "src/leader_activities.erl"},
          {line, 141}]},
        {ns_janitor, run_buckets_cleanup_activity, 3,
         [{file, "src/ns_janitor.erl"},
          {line, 86}]},
        {ns_janitor, cleanup_buckets, 2,
         [{file, "src/ns_janitor.erl"},
          {line, 78}]},
        {failover, janitor_buckets, 2,
         [{file, "src/failover.erl"},
          {line, 597}]},
        {failover, janitor_membase_buckets_group, 2,
         [{file, "src/failover.erl"},
          {line, 324}]},
        {lists, flatmap_1, 2,
         [{file, "lists.erl"},
          {line, 1335}]},
        {failover, handle_buckets_failover, 2,
         [{file, "src/failover.erl"},
          {line, 369}]}]}
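For triage across several runs, the header of this log entry can be pulled apart with a short script. The following is a minimal sketch, not an ns_server tool: the regular expressions are modelled only on the single entry quoted above, and the function name `parse_janitor_failure` is hypothetical. Entries with a different layout may not match.

```python
import re

# Header line copied from the log entry in this report.
SAMPLE_LINE = (
    "[rebalance:error,2023-11-28T02:12:57.287-08:00,"
    "ns_1@172.23.108.67:<0.9458.848>:failover:janitor_buckets:615] "
    "Janitor cleanup of [\"default\"] failed after failover of "
    "['ns_1@172.23.108.68']:"
)

# Assumed layout: [category:level,timestamp,node:<pid>:module:function:line] message
HEADER_RE = re.compile(
    r"\[(?P<category>\w+):(?P<level>\w+),"
    r"(?P<timestamp>[^,]+),"
    r"(?P<node>[^:]+):<(?P<pid>[\d.]+)>:"
    r"(?P<module>\w+):(?P<function>\w+):(?P<line>\d+)\]\s*"
    r"(?P<message>.*)"
)

# Assumed message shape for the janitor failure reported here.
MESSAGE_RE = re.compile(
    r"Janitor cleanup of (?P<buckets>\[.*?\]) "
    r"failed after failover of (?P<nodes>\[.*?\])"
)


def parse_janitor_failure(line):
    """Return a dict describing a janitor-cleanup failure, or None if no match."""
    header = HEADER_RE.match(line.strip())
    if header is None:
        return None
    message = MESSAGE_RE.search(header.group("message"))
    if message is None:
        return None
    return {
        "timestamp": header.group("timestamp"),
        "logged_by": header.group("node"),
        "source": "{}:{}:{}".format(
            header.group("module"), header.group("function"), header.group("line")
        ),
        # Bucket names are double-quoted, node names single-quoted in the entry.
        "buckets": re.findall(r'"([^"]+)"', message.group("buckets")),
        "failed_over": re.findall(r"'([^']+)'", message.group("nodes")),
    }


if __name__ == "__main__":
    print(parse_janitor_failure(SAMPLE_LINE))
```

For the entry above, this shows that the error was logged by ns_1@172.23.108.67 about the failover of ns_1@172.23.108.68, which helps separate the reporting node from the failed-over node.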

       

       

      TAF test:

       

      guides/gradlew --refresh-dependencies testrunner -P jython=/opt/jython/bin/jython -P 'args=-i node.ini -p get-cbcollect-info=False,skip_cluster_reset=False,skip_collections_cleanup=True -t failover.AutoFailoverTests.AutoFailoverTests.test_autofailover,timeout=5,num_node_failures=1,nodes_init=3,failover_action=stop_server,num_items=10000,transaction_timeout=150,atomicity=True,durability=MAJORITY,replicas=2'
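The `-t` argument packs the test path and its parameters into one comma-separated string, which is hard to read at a glance. As a reading aid, the sketch below splits it into a test path and a parameter dictionary. It assumes that the first field is the test path and every later field is a `key=value` pair, as in the command above; `parse_test_spec` is a hypothetical helper and not part of TAF.

```python
# Value of the -t argument, copied from the command in this report.
TEST_SPEC = (
    "failover.AutoFailoverTests.AutoFailoverTests.test_autofailover,"
    "timeout=5,num_node_failures=1,nodes_init=3,failover_action=stop_server,"
    "num_items=10000,transaction_timeout=150,atomicity=True,"
    "durability=MAJORITY,replicas=2"
)


def parse_test_spec(spec):
    """Split a test spec into (test_path, params); all values stay strings."""
    test_path, *pairs = spec.split(",")
    # Split each pair on the first '=' only, in case a value contains '='.
    params = dict(pair.split("=", 1) for pair in pairs)
    return test_path, params


if __name__ == "__main__":
    path, params = parse_test_spec(TEST_SPEC)
    print(path)
    for key, value in params.items():
        print("  {} = {}".format(key, value))
```

Read this way, the run uses a 3-node cluster (`nodes_init=3`), fails one node by stopping the server (`failover_action=stop_server`, `num_node_failures=1`), and loads data with `durability=MAJORITY` against a bucket with `replicas=2`.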

       

      The issue is not seen on build 7.6.0-1767.

       

            People

              ashwin.govindarajulu Ashwin Govindarajulu