  1. Couchbase Server
  2. MB-52203

[Backport MB-51066 to 7.1.2] [CBSE] Fix Logging in PickRandom function for printing stats and needed info on error


Details

    • Untriaged
    • 1
    • Unknown

    Description

      In MB-41688 we moved the printing of these stats to the Verbose log level. When the PickRandom function fails, it is not easy to switch the log level to Verbose and capture the data, because these failures can be rare.

      So we must update the logging to print all the stats and the information needed for debugging when an error occurs. We can also check whether we can log a deduplicated version of these stats, or print them less frequently, to help debugging.
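      The idea of a deduplicated or less frequent print can be sketched roughly as follows. This is only an illustration, not the indexing code from the fix: the dedupLogger type, its fields, and the "emit the first occurrence, then every Nth consecutive repeat" policy are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"sync"
)

// dedupLogger is a hypothetical helper (not part of the Couchbase code base).
// It emits a message the first time it is seen and afterwards only on every
// Nth consecutive repeat of that same message.
type dedupLogger struct {
	mu      sync.Mutex
	every   int          // emit once per this many consecutive repeats; values <= 1 emit every message
	seen    bool         // true once at least one message has been logged
	lastMsg string       // most recent message passed to Log
	repeats int          // consecutive repeats of lastMsg since it was last emitted
	emit    func(string) // output sink, replaceable for testing
}

// newDedupLogger returns a logger that writes to standard output.
func newDedupLogger(every int) *dedupLogger {
	return &dedupLogger{every: every, emit: func(s string) { fmt.Println(s) }}
}

// Log reports whether msg was actually emitted.
func (d *dedupLogger) Log(msg string) bool {
	d.mu.Lock()
	defer d.mu.Unlock()

	if d.seen && msg == d.lastMsg {
		d.repeats++
		if d.repeats < d.every {
			return false // suppressed duplicate
		}
		d.emit(fmt.Sprintf("%s (repeated %d times)", msg, d.repeats))
		d.repeats = 0
		return true
	}

	// A new message is always emitted immediately.
	d.seen = true
	d.lastMsg = msg
	d.repeats = 0
	d.emit(msg)
	return true
}
```

      With newDedupLogger(3), logging the same stats string seven times in a row would emit it on the first, fourth, and seventh calls, while any different message is emitted immediately. A time-based interval would be an equally reasonable policy; the count-based one is used here only because it is easy to reason about.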


        Activity

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.1.2-3331 contains indexing commit 3919194 with commit message:
          MB-52203: [BP to 7.1.2 of MB 51066] Print stats and info on PickRandom Error

          pavan.pb Pavan PB added a comment -

          Hi Sai Krishna Teja, can QE verify this? If so, could you outline the steps? If not, please close this with the request-dev-verify label.

          sai.teja Sai Krishna Teja added a comment - - edited

          Hi Pavan PB, if you have an index with multiple replica instances and one instance lags significantly behind the others, the GSI client should pick the instance with the lower num_docs_pending during a scan and skip the lagging one. With this fix, you should see log messages when such instances are eliminated. You can check that the elimination happens correctly after the fix, that the log messages appear, and that you can infer the reason from them.

          I don't know whether QE already has such tests. If you feel this scenario is difficult to simulate, I can dev-verify it by instrumenting the code.

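          The replica elimination described in the comment above can be sketched as below. This is a simplified illustration and not the actual GSI client logic: the replicaStat type, the pruneLagging and describePruned functions, the maxLag threshold, and the rule "prune a replica whose pending count exceeds the lowest pending count by more than maxLag" are all assumptions. The real client tracks stats per partition and uses its own criteria.

```go
package main

import (
	"errors"
	"fmt"
	"sort"
)

// replicaStat is a hypothetical per-replica record. Pending stands in for
// the num_docs_pending stat mentioned in the ticket.
type replicaStat struct {
	InstID  uint64
	Pending int64
}

// pruneLagging returns the sorted IDs of the replicas to keep and a map of
// pruned replica ID to its pending count. Assumed rule: a replica is pruned
// when its pending count exceeds the lowest pending count by more than maxLag.
func pruneLagging(stats []replicaStat, maxLag int64) ([]uint64, map[uint64]int64, error) {
	if len(stats) == 0 {
		return nil, nil, errors.New("no replicas available")
	}

	// Find the replica that is least behind.
	minPending := stats[0].Pending
	for _, s := range stats[1:] {
		if s.Pending < minPending {
			minPending = s.Pending
		}
	}

	var kept []uint64
	pruned := make(map[uint64]int64)
	for _, s := range stats {
		if s.Pending-minPending > maxLag {
			pruned[s.InstID] = s.Pending
			continue
		}
		kept = append(kept, s.InstID)
	}
	sort.Slice(kept, func(i, j int) bool { return kept[i] < kept[j] })
	return kept, pruned, nil
}

// describePruned builds one log line that records which replicas were kept
// and which were eliminated, so the reason can be inferred from the log.
// The format is illustrative and does not match the real log output.
func describePruned(kept []uint64, pruned map[uint64]int64) string {
	return fmt.Sprintf("kept=%v pruned=%v", kept, pruned)
}
```

          For example, with replicas 1, 2, and 3 at pending counts 10, 50000, and 20 and a maxLag of 1000, replica 2 would be pruned and the other two kept. Logging the result of describePruned at that point is one way to make the elimination visible without enabling Verbose logging.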
          pavan.pb Pavan PB added a comment -

          Hi Sai Krishna Teja As discussed, I've labelled this as request-dev-verify.


          sai.teja Sai Krishna Teja added a comment -

          I instrumented the indexer code to prune all the replicas and found the prints below:

          2022-07-07T02:09:09.543+05:30 [Error] metadataClient:PickRandom: Fail to find indexer for all index partitions. Num partition 3.  Partition with instances 0
          2022-07-07T02:09:09.543+05:30 [Error] metadataClient:PickRandom: Replicas - [18428683653176080248 8987109159927819956 14739116843820171325], PrunedReplica - map[8987109159927819956:map[1:{"pending": 0, "quota": 33333} 2:{"pending": 0, "quota": 33333} 3:{"pending": 0, "quota": 33333}] 14739116843820171325:map[1:{"pending": 0, "quota": 33333} 2:{"pending": 0, "quota": 33333} 3:{"pending": 0, "quota": 33333}] 18428683653176080248:map[1:{"pending": 0, "quota": 33333} 2:{"pending": 0, "quota": 33333} 3:{"pending": 0, "quota": 33333}]], FilteredReplica map[]

          build-team Couchbase Build Team added a comment -

          Build couchbase-server-7.2.0-5000 contains indexing commit 3919194 with commit message:
          MB-52203: [BP to 7.1.2 of MB 51066] Print stats and info on PickRandom Error

          People

            sai.teja Sai Krishna Teja
            amit.kulkarni Amit Kulkarni