Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-40556

[Durability]: Seeing expiry docs not getting expired

    XMLWordPrintable

    Details

      Description

      Build: 6.6.0-7883

      Scenario: (All doc_ops with doc_durability=MAJORITY)

      • 2 node cluster, Couchbase bucket (replica=1)
      • Load 250K docs into the bucket
      • Rebalance-in 2 more nodes into the cluster. (4 node cluster with 250K docs)
      • Start following doc update ops in a thread:
        • Update with expiry=0 for doc indexes 0-125K
        • Update with expiry=10 for doc indexes 127K-250K
      • Rebalance-out one node from the cluster and rebalance-in the same node back
      • Wait for doc_updates to complete and then sleep 12 secs for docs to expire
      • Validate doc count=127K

      Observation:

      At the end of the above scenario, the doc_count is ~ 128K (Expected 127K)

      Test log: http://qa.sc.couchbase.com/job/oel6-4node-rebalance_in_jython/1107/consoleText

      SDK Client side pcap:https://cb-jira.s3.us-east-2.amazonaws.com/logs/expiry_docs/client.pcap

      Cluster nodes pcapshttp://qa.sc.couchbase.com/job/oel6-4node-rebalance_in_jython/1107/artifact/logs/testrunner-20-Jul-21_02-01-13/test_1/

        Attachments

        No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

          Hide
          drigby Dave Rigby added a comment -

          The doc_count is "lazy" - given that items are only expired on one of (access, hourly Expiry Pager (value-eviction), or on compaction), the doc_count will not necessarily be correct with respect to documents with TTLs until one of those criteria has happened.

          How are you performing the final step "validate doc count"? I would suggest one of:

          1. Read every document again (to trigger expiry on access)
          2. Ensure the Expiry Pager has run (value-eviction) - perhaps by reducing it's frequency to less than 1 hour?
          3. Perform a compaction (full-eviction).
          Show
          drigby Dave Rigby added a comment - The doc_count is "lazy" - given that items are only expired on one of (access, hourly Expiry Pager (value-eviction), or on compaction), the doc_count will not necessarily be correct with respect to documents with TTLs until one of those criteria has happened. How are you performing the final step "validate doc count"? I would suggest one of: Read every document again (to trigger expiry on access) Ensure the Expiry Pager has run (value-eviction) - perhaps by reducing it's frequency to less than 1 hour? Perform a compaction (full-eviction).
          Hide
          jwalker Jim Walker added a comment -

          This may not be a bug, will fully check it out, but docs will only expire by one of these activitys.

          1) compaction
          2) expiry pager
          3) read of an expired item

          If none of those processes touched the 1k docs before the validation phase, it is completely expected that the doc count might be higher than expected.

          Show
          jwalker Jim Walker added a comment - This may not be a bug, will fully check it out, but docs will only expire by one of these activitys. 1) compaction 2) expiry pager 3) read of an expired item If none of those processes touched the 1k docs before the validation phase, it is completely expected that the doc count might be higher than expected.
          Hide
          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          As per the test, expiry_pager time is set to 5 seconds during the start of the tests.

          And the doc count didn't update for almost 2 mins in the final validation.

          2020-07-21 02:31:37,476 | test  | WARNING | pool-2-thread-2 | [task:_get_all_stats_and_compare:2176] Not Ready: vb_replica_curr_items 127989 == 127000. Received: {'172.23.105.155': '32075', '172.23.105.206': '31752', '172.23.105.205': '32099', '172.23.105.159': '32063'} for bucket 'default'
          ...
          ...
          ...
          2020-07-21 02:33:36,049 | test  | WARNING | pool-2-thread-2 | [task:_get_all_stats_and_compare:2176] Not Ready: vb_replica_curr_items 127989 == 127000. Received: {'172.23.105.155': '32075', '172.23.105.206': '31752', '172.23.105.205': '32099', '172.23.105.159': '32063'} for bucket 'default'
          2020-07-21 02:33:39,127 | test  | WARNING | pool-2-thread-8 | [task:_get_all_stats_and_compare:2176] Not Ready: curr_items_tot 255978 == 254000. Received: {'172.23.105.155': '63830', '172.23.105.206': '64491', '172.23.105.205': '63849', '172.23.105.159': '63808'} for bucket 'default

          Show
          ashwin.govindarajulu Ashwin Govindarajulu added a comment - As per the test, expiry_pager time is set to 5 seconds during the start of the tests. And the doc count didn't update for almost 2 mins in the final validation. 2020-07-21 02:31:37,476 | test | WARNING | pool-2-thread-2 | [task:_get_all_stats_and_compare:2176] Not Ready: vb_replica_curr_items 127989 == 127000. Received: {'172.23.105.155': '32075', '172.23.105.206': '31752', '172.23.105.205': '32099', '172.23.105.159': '32063'} for bucket 'default' ... ... ... 2020-07-21 02:33:36,049 | test | WARNING | pool-2-thread-2 | [task:_get_all_stats_and_compare:2176] Not Ready: vb_replica_curr_items 127989 == 127000. Received: {'172.23.105.155': '32075', '172.23.105.206': '31752', '172.23.105.205': '32099', '172.23.105.159': '32063'} for bucket 'default' 2020-07-21 02:33:39,127 | test | WARNING | pool-2-thread-8 | [task:_get_all_stats_and_compare:2176] Not Ready: curr_items_tot 255978 == 254000. Received: {'172.23.105.155': '63830', '172.23.105.206': '64491', '172.23.105.205': '63849', '172.23.105.159': '63808'} for bucket 'default
          Hide
          jwalker Jim Walker added a comment -

          The expiry pager setting of 5 seconds is only showing on 3 of 4 nodes

          > find . -name stats.log | xargs grep ep_exp_pager_stime
          ./cbcollect_info_ns_1@172.23.105.206_20200721-093433/stats.log: ep_exp_pager_stime:                                    3600
          ./cbcollect_info_ns_1@172.23.105.206_20200721-093433/stats.log: ep_exp_pager_stime:                                    3600
          ./cbcollect_info_ns_1@172.23.105.159_20200721-093433/stats.log: ep_exp_pager_stime:                                    5
          ./cbcollect_info_ns_1@172.23.105.159_20200721-093433/stats.log: ep_exp_pager_stime:                                    5
          ./cbcollect_info_ns_1@172.23.105.205_20200721-093433/stats.log: ep_exp_pager_stime:                                    5
          ./cbcollect_info_ns_1@172.23.105.205_20200721-093433/stats.log: ep_exp_pager_stime:                                    5
          ./cbcollect_info_ns_1@172.23.105.155_20200721-093433/stats.log: ep_exp_pager_stime:                                    5
          ./cbcollect_info_ns_1@172.23.105.155_20200721-093433/stats.log: ep_exp_pager_stime:                                    5
          

          Ashwin Govindarajulu how was the setting applied?

          Show
          jwalker Jim Walker added a comment - The expiry pager setting of 5 seconds is only showing on 3 of 4 nodes > find . -name stats.log | xargs grep ep_exp_pager_stime ./cbcollect_info_ns_1@172.23.105.206_20200721-093433/stats.log: ep_exp_pager_stime: 3600 ./cbcollect_info_ns_1@172.23.105.206_20200721-093433/stats.log: ep_exp_pager_stime: 3600 ./cbcollect_info_ns_1@172.23.105.159_20200721-093433/stats.log: ep_exp_pager_stime: 5 ./cbcollect_info_ns_1@172.23.105.159_20200721-093433/stats.log: ep_exp_pager_stime: 5 ./cbcollect_info_ns_1@172.23.105.205_20200721-093433/stats.log: ep_exp_pager_stime: 5 ./cbcollect_info_ns_1@172.23.105.205_20200721-093433/stats.log: ep_exp_pager_stime: 5 ./cbcollect_info_ns_1@172.23.105.155_20200721-093433/stats.log: ep_exp_pager_stime: 5 ./cbcollect_info_ns_1@172.23.105.155_20200721-093433/stats.log: ep_exp_pager_stime: 5 Ashwin Govindarajulu how was the setting applied?
          Hide
          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          This is a testware issue again.Problem is we have to reset the expiry_timer everytime the node is added back to the cluster.

          Fixed and the test passed with the patch.

          Show
          ashwin.govindarajulu Ashwin Govindarajulu added a comment - This is a testware issue again.Problem is we have to reset the expiry_timer everytime the node is added back to the cluster. Fixed and the test passed with the patch.
          Hide
          ashwin.govindarajulu Ashwin Govindarajulu added a comment -

          Closing the ticket.

          Show
          ashwin.govindarajulu Ashwin Govindarajulu added a comment - Closing the ticket.

            People

            Assignee:
            ashwin.govindarajulu Ashwin Govindarajulu
            Reporter:
            ashwin.govindarajulu Ashwin Govindarajulu
            Votes:
            0 Vote for this issue
            Watchers:
            5 Start watching this issue

              Dates

              Created:
              Updated:
              Resolved:

                Gerrit Reviews

                There are no open Gerrit changes

                  PagerDuty