Couchbase Server / MB-43724

Index node did not recover after reducing and resetting system file size limit

Details

    • Untriaged
    • 1
    • Unknown

    Description

      Build: 7.0.0-4122

      • Cluster with 2 kv+n1ql, 1 index
      • Create a bucket, scope and collection
      • Create one index on this collection
      • Load documents till this index is in DGM

        2021-01-18 09:14:28 | INFO | MainProcess | test_thread | [base_gsi.check_if_indexes_in_dgm_node] Index : idx_test_scope_1_test_collection_1job_title0 , resident_ratio : 0.96791
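
The DGM check in this step can be sketched as pulling `resident_ratio` out of the log line and comparing it with 1.0. This is a rough illustration only: `parse_resident_ratio` and `is_in_dgm` are hypothetical helpers, and treating any ratio below 1.0 as DGM is an assumption. The real check lives in `base_gsi.check_if_indexes_in_dgm_node` and may use a different threshold.

```python
import re

# Log line copied from the step above.
LOG_LINE = (
    "2021-01-18 09:14:28 | INFO | MainProcess | test_thread | "
    "[base_gsi.check_if_indexes_in_dgm_node] Index : "
    "idx_test_scope_1_test_collection_1job_title0 , resident_ratio : 0.96791"
)


def parse_resident_ratio(line):
    """Return (index_name, resident_ratio) parsed from a DGM check log line."""
    match = re.search(r"Index : (\S+) , resident_ratio : ([0-9.]+)", line)
    if match is None:
        raise ValueError("no resident_ratio found in line")
    return match.group(1), float(match.group(2))


def is_in_dgm(ratio, threshold=1.0):
    """Assumption: the index counts as DGM once the ratio drops below threshold."""
    return ratio < threshold


if __name__ == "__main__":
    name, ratio = parse_resident_ratio(LOG_LINE)
    print(name, ratio, "DGM" if is_in_dgm(ratio) else "fully resident")
```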
        

      • Once the index is in DGM, limit the file size to 10MB for the indexer process on the index node and continue loading documents:

        2021-01-18 09:14:38 | INFO | MainProcess | system_failure_detector_thread | [remote_util.execute_command_raw] running command.raw on 172.23.121.58: prlimit --fsize=20480 --pid $(pgrep indexer)
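
A quick unit check on the value passed to `prlimit`, since the step says 10MB while the command passes 20480. `RLIMIT_FSIZE` is expressed in bytes in `setrlimit(2)`, whereas the shell builtin `ulimit -f` counts blocks (1024 bytes in bash, 512 bytes in POSIX mode). The sketch below only works out what 20480 means under each reading; which one actually applied on the test node is an assumption that should be verified there. The names are illustrative.

```python
# Value passed as --fsize in the command above.
FSIZE_ARG = 20480

# Candidate units for the argument (assumptions to verify on the node).
INTERPRETATIONS = {
    "bytes": 1,
    "512-byte blocks": 512,
    "1024-byte blocks": 1024,
}


def describe(value):
    """Map each candidate unit to the resulting limit in bytes."""
    return {label: value * unit for label, unit in INTERPRETATIONS.items()}


if __name__ == "__main__":
    for label, size in describe(FSIZE_ARG).items():
        print(f"{label}: {size} bytes = {size / (1024 * 1024):g} MiB")
```

Only the 512-byte-block reading gives 10 MiB. If the value is taken as bytes, the effective limit would be 20 KiB.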
        

      • We do see these messages in logs, which is expected:

        Plasma: (/data/@2i/shards/shard2/data) Unable to write - err write /data/@2i/shards/shard2/data/log.00000000000000.data: file too large
        Plasma: (/data/@2i/shards/shard1/data) Unable to write - err write /data/@2i/shards/shard1/data/log.00000000000000.data: file too large
        Plasma: (/data/@2i/shards/shard2/data) Unable to write - err write /data/@2i/shards/shard2/data/log.00000000000000.data: file too large
        Plasma: (/data/@2i/shards/shard1/data) Unable to write - err write /data/@2i/shards/shard1/data/log.00000000000000.data: file too large
        

      • Reset the file size limit of the indexer process to unlimited after about 3 minutes:

        2021-01-18 09:17:31 | INFO | MainProcess | system_failure_detector_thread | [remote_util.execute_command_raw] running command.raw on 172.23.121.58: prlimit --fsize=unlimited --pid $(pgrep indexer)
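
One way to confirm that the reset actually reached the indexer process is to read the "Max file size" row of `/proc/<pid>/limits` on the index node. A minimal sketch follows; `parse_max_file_size` and `is_unlimited` are hypothetical helpers, and `SAMPLE_LIMITS` is an illustrative excerpt in the usual Linux layout, not output captured from this cluster.

```python
import re

# Illustrative excerpt only; not captured from the node in this report.
SAMPLE_LIMITS = """\
Limit                     Soft Limit           Hard Limit           Units
Max cpu time              unlimited            unlimited            seconds
Max file size             unlimited            unlimited            bytes
Max data size             unlimited            unlimited            bytes
"""


def parse_max_file_size(limits_text):
    """Return (soft, hard, units) from the 'Max file size' row."""
    for line in limits_text.splitlines():
        match = re.match(r"Max file size\s+(\S+)\s+(\S+)\s+(\S+)", line)
        if match:
            return match.groups()
    raise ValueError("'Max file size' row not found")


def is_unlimited(limits_text):
    """True when both the soft and the hard file size limits are unlimited."""
    soft, hard, _units = parse_max_file_size(limits_text)
    return soft == "unlimited" and hard == "unlimited"


if __name__ == "__main__":
    print(parse_max_file_size(SAMPLE_LIMITS), is_unlimited(SAMPLE_LIMITS))
```

On the index node, the same functions could be fed the contents of `/proc/<pid>/limits` for the indexer pid.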
        

      • The above messages no longer appear in the logs after the reset.
      • However, the indexer is not processing mutations and is not accepting other requests (for example, the drop index request below fails):

        2021-01-18 09:32:38 | INFO | MainProcess | test_thread | [base_gsi._verify_items_count_collections] Keyspace: default:test_bucket.test_scope_1.test_collection_1
        2021-01-18 09:32:38 | INFO | MainProcess | test_thread | [base_gsi._verify_items_count_collections] Index: idx_test_scope_1_test_collection_1job_title0
        2021-01-18 09:32:38 | INFO | MainProcess | test_thread | [base_gsi._verify_items_count_collections] number of docs pending: 5365370
        2021-01-18 09:32:38 | INFO | MainProcess | test_thread | [base_gsi._verify_items_count_collections] number of docs queued: 241100
        2021-01-18 09:32:38 | INFO | MainProcess | test_thread | [base_gsi._verify_items_count_collections] Keyspace: default:test_bucket.test_scope_1.test_collection_1
        2021-01-18 09:32:38 | INFO | MainProcess | test_thread | [base_gsi._verify_items_count_collections] Index: idx_test_scope_1_test_collection_1job_title0
        2021-01-18 09:32:38 | INFO | MainProcess | test_thread | [base_gsi._verify_items_count_collections] number of docs pending: 5365370
        2021-01-18 09:32:38 | INFO | MainProcess | test_thread | [base_gsi._verify_items_count_collections] number of docs queued: 241100
        2021-01-18 09:32:38 | ERROR | MainProcess | test_thread | [base_gsi.wait_for_mutation_processing] All Items didn't get Indexed...
         
         
        2021-01-18 09:32:38 | INFO | MainProcess | test_thread | [tuq_helper.run_cbq_query] *RUN QUERY drop index idx_test_scope_1_test_collection_1job_title0 ON default:test_bucket.test_scope_1.test_collection_1*
        2021-01-18 09:32:38 | INFO | MainProcess | test_thread | [rest_client.query_tool] query params : statement=drop+index+idx_test_scope_1_test_collection_1job_title0+ON+default%3Atest_bucket.test_scope_1.test_collection_1
        2021-01-18 09:32:38 | ERROR | MainProcess | test_thread | [rest_client._http_request] POST http://172.23.121.45:8093/query?statement=drop+index+idx_test_scope_1_test_collection_1job_title0+ON+default%3Atest_bucket.test_scope_1.test_collection_1 body:  headers: {'Content-Type': 'application/x-www-form-urlencoded', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==', 'Accept': '*/*'} error: 500 reason: unknown b'{\n"requestID": "98fee4e0-1928-42b1-bb46-61c4da69c46b",\n"signature": null,\n"results": [\n],\n"errors": [{"code":5000,"msg":"GSI Drop() - *cause: Fail to drop index on some indexer nodes.  Error=Terminate Request due to server termination\\n.  If cluster or indexer is currently unavailable, the operation will automaticaly retry after cluster is back to normal.*"}],\n"status": "errors",\n"metrics": {"elapsedTime": "11.234594ms","executionTime": "11.142284ms","resultCount": 0,"resultSize": 0,"serviceLoad": 3,"errorCount": 1}\n}\n' auth: Administrator:password
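
The backlog is easier to see with the numbers pulled out of the log lines above: about 15 minutes after the limit was reset, the index still reports pending and queued documents. A rough sketch, assuming that "fully indexed" means both counters are zero (a reading of the check, not confirmed from `base_gsi`); the helper names are hypothetical.

```python
import re
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"
PREFIX = ("2021-01-18 09:32:38 | INFO | MainProcess | test_thread | "
          "[base_gsi._verify_items_count_collections] ")

# Timestamp of the prlimit reset and the counter lines, copied from the logs above.
RESET_AT = "2021-01-18 09:17:31"
CHECK_LINES = [
    PREFIX + "number of docs pending: 5365370",
    PREFIX + "number of docs queued: 241100",
]


def parse_counter(line):
    """Return (timestamp, counter_name, value) from a verification log line."""
    match = re.match(
        r"(\d{4}-\d\d-\d\d \d\d:\d\d:\d\d) \|.*number of docs (pending|queued): (\d+)",
        line,
    )
    if match is None:
        raise ValueError("not a pending/queued counter line")
    return datetime.strptime(match.group(1), FMT), match.group(2), int(match.group(3))


def summarize(reset_at, lines):
    """Return (counters, seconds_since_reset, backlog_remaining)."""
    reset = datetime.strptime(reset_at, FMT)
    counters, last_ts = {}, reset
    for line in lines:
        ts, name, value = parse_counter(line)
        counters[name] = value
        last_ts = max(last_ts, ts)
    backlog_remaining = any(value > 0 for value in counters.values())
    return counters, (last_ts - reset).total_seconds(), backlog_remaining


if __name__ == "__main__":
    print(summarize(RESET_AT, CHECK_LINES))
```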
        

      Logs:

      https://cb-jira.s3.us-east-2.amazonaws.com/logs/test/collectinfo-2021-01-18T180200-ns_1%40172.23.121.41.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/test/collectinfo-2021-01-18T180200-ns_1%40172.23.121.45.zip
      https://cb-jira.s3.us-east-2.amazonaws.com/logs/test/collectinfo-2021-01-18T180200-ns_1%40172.23.121.58.zip

          People

            kevin.cherkauer Kevin Cherkauer (Inactive)
            girish.benakappa Girish Benakappa