  1. Couchbase Server
  2. MB-62597

[System Test] Cluster unusable messages seen spanning over 1 hour

Details

    • Bug
    • Resolution: Fixed
    • Test Blocker
    • Columnar 1.0.0
    • Columnar 1.0.0
    • analytics
    • Columnar Edition 1.0.0 build 2190
    • Untriaged
    • 0
    • Unknown
    • Analytics Sprint 45, Analytics Sprint 46

    Description

      On day 3 of the system test, the cluster became unusable and has not returned to a healthy state. It has been in this state for over an hour.

      A couple of caveats:

      There are a large number of Kafka collections. I'm not sure whether https://issues.couchbase.com/browse/MB-61350 is the cause, but Kafka ingestion happened around

      (2024-07-03T13:25:04.694+00:00) 
      

      and the "cluster unusable" messages span from

      2024-07-04T06:35:32.116 to 2024-07-04T07:45:13.268+00:00 
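
As a quick cross-check of the timeline, the sketch below computes the two intervals implied by the timestamps above. It assumes the one timestamp without an offset is also UTC, like the others; the helper name `parse_utc` is made up for this illustration.

```python
from datetime import datetime, timezone

def parse_utc(ts: str) -> datetime:
    """Parse an ISO-8601 timestamp; treat a missing offset as UTC (assumption)."""
    dt = datetime.fromisoformat(ts)
    return dt if dt.tzinfo else dt.replace(tzinfo=timezone.utc)

# Timestamps copied from the description (all from node-001).
kafka_ingestion = parse_utc("2024-07-03T13:25:04.694+00:00")
unusable_start = parse_utc("2024-07-04T06:35:32.116")
unusable_end = parse_utc("2024-07-04T07:45:13.268+00:00")

# Gap between the Kafka ingestion timestamp and the first "unusable" message.
gap_before_messages = unusable_start - kafka_ingestion
# How long the "unusable" messages kept appearing.
unusable_window = unusable_end - unusable_start

print("Ingestion to first message:", gap_before_messages)
print("Span of unusable messages: ", unusable_window)
```

The second interval comes out slightly above one hour, which matches the ticket title.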
      

      All the timestamps are from node-001. The cluster was fine for around 15 hours. In those 15 hours, three rebalances were triggered: from 4 to 8 nodes, then 8 to 16, and finally 16 to 32. Because Kafka ingestion was slow, it is possible that the cluster went into a rebalance state while the ingestion was still incomplete. I'm alluding to this comment made by Ali here

      The workload is as follows -

      No. of remote collections	80 (* 50 million docs per collection)
      Standalone collections	50 (total count across these is 300 * 8 million docs; some collections have 8 million docs and others a multiple of 8 million)
      Kafka collections	72 (* 10 million docs per collection)

      Total doc count comes to roughly 7.12 billion documents (around 10.6 TB to 14 TB, assuming a doc size of approximately 1.5-2 KB)
      Number of links = around 7 (3 remote + 2 Kafka + 2 external).
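
The totals can be reproduced from the per-collection figures in the workload table. This is only a back-of-the-envelope sketch: the variable names are illustrative, and it assumes decimal units (1 KB = 1,000 bytes, 1 TB = 10^12 bytes), which the ticket does not specify.

```python
# Document counts taken from the workload table above.
remote_docs = 80 * 50_000_000        # 80 remote collections, 50 million each
standalone_docs = 300 * 8_000_000    # standalone total stated as 300 * 8 million
kafka_docs = 72 * 10_000_000         # 72 Kafka collections, 10 million each

total_docs = remote_docs + standalone_docs + kafka_docs

# Assumed document size range of 1.5-2 KB, in bytes (decimal units).
low_bytes = total_docs * 1_500
high_bytes = total_docs * 2_000

TB = 10**12
print(f"Total documents: {total_docs:,}")
print(f"Estimated size: {low_bytes / TB:.2f} TB to {high_bytes / TB:.2f} TB")
```

Under these assumptions the sum is 7.12 billion documents and roughly 10.7 TB to 14.2 TB, in line with the size range quoted in the description.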
      

      There are other exceptions, such as NullPointerException and IllegalStateException, on other nodes. I'm not sure whether they are related to this issue, but I'll file separate tickets for them since they seem to occur at a different time.

      cbcollect ->

      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-001.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-002.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-003.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-004.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-005.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-006.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-007.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-008.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-009.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-010.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-011.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-012.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-013.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-014.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-015.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-016.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-017.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-018.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-019.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-020.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-021.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-022.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-023.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-024.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-025.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-026.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-027.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-028.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-029.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-030.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-031.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
      https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/collectinfo-2024-07-04T074358-ns_1%40svc-da-node-032.b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip
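
The 32 cbcollect links above differ only in the node number, so they can be regenerated from a template when scripting a download. The sketch below only builds the URL strings; the template is copied from the links above, and the names `URL_TEMPLATE` and `cbcollect_urls` are illustrative.

```python
# URL pattern copied from the cbcollect links above; only the node number varies.
URL_TEMPLATE = (
    "https://cb-engineering.s3.amazonaws.com/SysTestColumnarJul3/"
    "collectinfo-2024-07-04T074358-ns_1%40svc-da-node-{node:03d}"
    ".b2yoytucmykunsrf.sandbox.nonprod-project-avengers.com.zip"
)

def cbcollect_urls(node_count: int = 32) -> list[str]:
    """Return one cbcollect URL per node, numbered from 001 upward."""
    return [URL_TEMPLATE.format(node=n) for n in range(1, node_count + 1)]

urls = cbcollect_urls()
print(len(urls), "URLs, first:", urls[0])
```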

      Supportal snapshot -> http://supportal.couchbase.com/snapshot/b18e5645ce6e07c85b98850b3221e5e4::31

            People

              pavan.pb Pavan PB
              pavan.pb Pavan PB