Uploaded image for project: 'Couchbase Server'
  1. Couchbase Server
  2. MB-46241

Service 'indexer' exited with status 137

    XMLWordPrintable

Details

    Description

      Was trying to reproduce MB-46206 when I hit into this: (could be the same RCA of MB-46206)
      Indexer seems to have crashed on one or more nodes. For example on indexer node ns_1@172.23.121.78:

      Service 'indexer' exited with status 137. Restarting. Messages:
      2021-05-11T04:56:12.308-07:00 [Info] watcher.processChange(): done -> key = IndexTopology/s9NLzpdOOFpz3X%--30-262000/_default/PSuxyUlPfrsh33N9-30-296000
      2021-05-11T04:56:12.371-07:00 [Info] LeaderSyncProxy.syncReceive(). Receive stream_end. Txid : 4294971431
      2021-05-11T04:56:12.372-07:00 [Info] FollowerSyncProxy.receiveAndUpdateCurrentEpoch()
      2021-05-11T04:56:12.372-07:00 [Info] WatcherServer.syncWithPeer(): Watcher done synchronization with peer (TCP 172.23.106.251:9100)
      2021-05-11T04:56:12.372-07:00 [Info] WatcherServer.runWatcher(): Start Watcher Protocol
      2021-05-11T04:56:12.372-07:00 [Info] WatcherServer.runWatcher(): Watcher is ready to process request
      2021-05-11T04:56:12.397-07:00 [Info] lifecycleMgr.dispatchRequest: op OPCODE_SERVICE_MAP elapsed 126.725088ms len(expediates) 0 len(incomings) 0 len(outgoings) 0 error <nil>
      2021-05-11T04:56:12.398-07:00 [Info] lifecycleMgr.dispatchRequest: op OPCODE_CLIENT_STATS elapsed 11.244µs len(expediates) 0 len(incomings) 0 len(outgoings) 0 error <nil>
      2021-05-11T04:56:12.398-07:00 [Info] Plasma: Adaptive memory quota tuning RSS:23647268864, freePercent:1.7754779891251544, currentQuota=4152360960

      Steps to Reproduce
      1. create 2 kv, 1 n1ql, 14 indexer nodes

      2021-05-11 01:41:19,924 | test  | INFO    | pool-2-thread-21 | [table_view:display:72] Rebalance Overview
      +----------------+-----------+-----------------------+----------------+--------------+
      | Nodes          | Services  | Version               | CPU            | Status       |
      +----------------+-----------+-----------------------+----------------+--------------+
      | 172.23.105.175 | kv        | 7.0.0-5127-enterprise | 0.162948107295 | Cluster node |
      | 172.23.106.233 | ['kv']    |                       |                | <--- IN ---  |
      | 172.23.106.236 | ['n1ql']  |                       |                | <--- IN ---  |
      | 172.23.106.238 | ['index'] |                       |                | <--- IN ---  |
      | 172.23.106.250 | ['index'] |                       |                | <--- IN ---  |
      | 172.23.106.251 | ['index'] |                       |                | <--- IN ---  |
      | 172.23.121.74  | ['index'] |                       |                | <--- IN ---  |
      | 172.23.121.78  | ['index'] |                       |                | <--- IN ---  |
      | 172.23.107.43  | ['index'] |                       |                | <--- IN ---  |
      | 172.23.107.58  | ['index'] |                       |                | <--- IN ---  |
      | 172.23.107.44  | ['index'] |                       |                | <--- IN ---  |
      | 172.23.107.45  | ['index'] |                       |                | <--- IN ---  |
      | 172.23.107.54  | ['index'] |                       |                | <--- IN ---  |
      | 172.23.107.47  | ['index'] |                       |                | <--- IN ---  |
      | 172.23.107.78  | ['index'] |                       |                | <--- IN ---  |
      | 172.23.107.84  | ['index'] |                       |                | <--- IN ---  |
      | 172.23.107.85  | ['index'] |                       |                | <--- IN ---  |
      +----------------+-----------+-----------------------+----------------+--------------+
      

      2. Create 1 bucket with 1k collections, and a total of 10K items

       +----------------------------+-----------+----------+------------+-----+-------+------------+-----------+-----------+ +----------------------------+-----------+----------+------------+-----+-------+------------+-----------+-----------+| Bucket                     | Type      | Replicas | Durability | TTL | Items | RAM Quota  | RAM Used  | Disk Used |+----------------------------+-----------+----------+------------+-----+-------+------------+-----------+-----------+| s9NLzpdOOFpz3X%--30-262000 | couchbase | 1        | none       | 0   | 10000 | 4194304000 | 266983632 | 434381289 |+----------------------------+-----------+----------+------------+-----+-------+------------+-----------+-----------+

      3. Create 10 GSI indexes per collection with no replicas, and with defer build (so a total of 10K deferred indexes)
      Indexer crashed midway into step 3 and create index requests started failing

      2021-05-11 04:56:16,259 | test  | ERROR   | MainThread | [rest_client:_http_request:748] POST http://172.23.106.236:8093/query?statement=CREATE+INDEX+%60gsi6221%60+ON+%60s9NLzpdOOFpz3X%25--30-262000%60.%60y--30-347000%60.%60JaMXMS2HcvZIp%25hXEo-30-354000%60%28%60age%60%29+WITH+%7B+%27defer_build%27%3A+true%2C+%27num_replica%27%3A+0+%7D body:  headers: {'Accept': '*/*', 'Connection': 'close', 'Authorization': 'Basic QWRtaW5pc3RyYXRvcjpwYXNzd29yZA==\n', 'Content-Type': 'application/x-www-form-urlencoded'} error: 500 reason: unknown {
      "requestID": "9e9cf0bf-cd7d-4338-ab65-4c15823440ce",
      "signature": null,
      "results": [
      ],
      "errors": [{"code":5000,"msg":"GSI CreateIndex() - cause: Index creation for index gsi6221, bucket s9NLzpdOOFpz3X%--30-262000, scope y--30-347000, collection JaMXMS2HcvZIp%hXEo-30-354000 cannot start. Reason: Create index or Alter replica cannot proceed due to network partition, node failover or indexer failure.."}],
      "status": "errors",
      "metrics": {"elapsedTime": "3.641195108s","executionTime": "3.641139289s","resultCount": 0,"resultSize": 0,"serviceLoad": 3,"errorCount": 1}
      } auth: Administrator:password

       

      Attachments

        1. consoleText.txt
          104 kB
        2. log1.png
          log1.png
          589 kB
        3. log2.png
          log2.png
          577 kB
        4. servers.png
          servers.png
          454 kB

        Issue Links

          No reviews matched the request. Check your Options in the drop-down menu of this sections header.

          Activity

            jliang John Liang added a comment -

            From 172.23.121.78, I cannot find "exited with status 137" on indexer log or error log. Indexer restarted at 2021-05-11T04:56:18.479-07:00.

            Note that during this time, there is only about 1.77% free memory left.

            2021-05-11T04:56:12.398-07:00 [Info] Plasma: Adaptive memory quota tuning RSS:23647268864, freePercent:1.7754779891251544, currentQuota=4152360960
            

            During this time, there is on mutation, it is only index creation. But we see memory keeps growing steady for jemalloc as more indexes are created.

            2021-05-11T04:51:20.171-07:00
            "num_instances":                 878,
            assigned_quota":                 20761804800,
            "current_quota":                 4152360960,
            "memory_resident":               23535005696,
            "memory_jemalloc":               24931356672,
            "memory_jemalloc_allocated":     24519615632,
            "memory_in_use":                 24931356672,
            "memory_stats_size_total":       28096,
            "memory_stats_size_global":      0,
            "memory_stats_size_page":        28096,
            "memory_stats_size_index":       0,
            "memory_stats_size_bloom":       0,
            "memory_stats_size_delta":       28096,
            "reclaim_pending":               0,
            "buf_memused":                   0,
            "total_ops":                     0,
            

            The RCA could be the same as MB-46206. The difference is that it fails while creating the index.

            jliang John Liang added a comment - From 172.23.121.78, I cannot find "exited with status 137" on indexer log or error log. Indexer restarted at 2021-05-11T04:56:18.479-07:00. Note that during this time, there is only about 1.77% free memory left. 2021-05-11T04:56:12.398-07:00 [Info] Plasma: Adaptive memory quota tuning RSS:23647268864, freePercent:1.7754779891251544, currentQuota=4152360960 During this time, there is on mutation, it is only index creation. But we see memory keeps growing steady for jemalloc as more indexes are created. 2021-05-11T04:51:20.171-07:00 "num_instances": 878, assigned_quota": 20761804800, "current_quota": 4152360960, "memory_resident": 23535005696, "memory_jemalloc": 24931356672, "memory_jemalloc_allocated": 24519615632, "memory_in_use": 24931356672, "memory_stats_size_total": 28096, "memory_stats_size_global": 0, "memory_stats_size_page": 28096, "memory_stats_size_index": 0, "memory_stats_size_bloom": 0, "memory_stats_size_delta": 28096, "reclaim_pending": 0, "buf_memused": 0, "total_ops": 0, The RCA could be the same as MB-46206 . The difference is that it fails while creating the index.
            jliang John Liang added a comment - - edited

            Note that on performance cluster, we don't see the same problem for 10K index build with 7.0.0-5127

            http://perf.jenkins.couchbase.com/job/hemera/1763/

            2021-05-10T02:37:00.098-07:00 [Info] Periodic Aggregated StorageStats:
            {
            "num_shard":                     156,
            "num_instances":                 15594,
            "num_lssctx":                    312,
            "num_sctxs":                     2466,
            "num_active_sctxs":              1248,
            "num_free_sctxs":                1218,
            "num_wctxs":                     33134,
            "num_free_wctxs":                17540,
            "num_active_wctxs":              15594,
            "num_readers":                   0,
            "num_writers":                   0,
            "cpu":                           7.82400,
            "total_core":                    48,
            "threads":                       106,
            "assigned_quota":                165150720000,
            "current_quota":                 165150720000,
            "memory_resident":               15125770240,
            "memory_jemalloc":               53059584,
            "memory_jemalloc_allocated":     34258920,
            "memory_in_use":                 53059584,
            "memory_stats_size_total":       499008,
            "items_count":                   0,
            "total_records":                 0,
            "total_ops":                     0,
            

            jliang John Liang added a comment - - edited Note that on performance cluster, we don't see the same problem for 10K index build with 7.0.0-5127 http://perf.jenkins.couchbase.com/job/hemera/1763/ 2021-05-10T02:37:00.098-07:00 [Info] Periodic Aggregated StorageStats: { "num_shard": 156, "num_instances": 15594, "num_lssctx": 312, "num_sctxs": 2466, "num_active_sctxs": 1248, "num_free_sctxs": 1218, "num_wctxs": 33134, "num_free_wctxs": 17540, "num_active_wctxs": 15594, "num_readers": 0, "num_writers": 0, "cpu": 7.82400, "total_core": 48, "threads": 106, "assigned_quota": 165150720000, "current_quota": 165150720000, "memory_resident": 15125770240, "memory_jemalloc": 53059584, "memory_jemalloc_allocated": 34258920, "memory_in_use": 53059584, "memory_stats_size_total": 499008, "items_count": 0, "total_records": 0, "total_ops": 0,
            jliang John Liang added a comment -

            The issue looks similar to MB-46206. Mark as dup.

            jliang John Liang added a comment - The issue looks similar to MB-46206 . Mark as dup.

            People

              akhil.mundroy Akhil Mundroy
              sumedh.basarkod Sumedh Basarkod (Inactive)
              Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

                Created:
                Updated:
                Resolved:

                Gerrit Reviews

                  There are no open Gerrit changes

                  PagerDuty