Couchbase Server / MB-46215

Automation and Parse-Friendly System Event Feed

Details

    Description

      Monitoring a Couchbase Server cluster for system events can be challenging, as there are a dozen different log formats across the services. We need to provide a structured API that returns notable system-level events, ideally in a JSON format that can be ingested into third-party systems such as Splunk and Logstash.

      We need to capture events such as configuration changes and system disruptions.

      ns_server Design Document: https://docs.google.com/document/d/1dMkRVbJFQbGE0cfJl05lYN6qtv_jDv7YEdUfFzQGMbo/edit

          Activity

            Almost by definition, meeting this ask requires an API change, which means waiting for the next "compat breaking release", 7.1.

            If we want to evaluate anything sooner, we can consider enhancing the existing system user logs, which are accessible through REST and already contain some of the requested events. Adding new events will involve dependencies on other services/facilities. Some of these events are already conditionally reported by audit. Audit is not enabled by default and the user needs to opt in to get these events, but at least the places in the code that trigger such events are already identified.

            We can consider leveraging the audit facility to also report system events, to cover the missing events. As for the parse-friendly requirement, we can append 'hints' to the end of the message that comply with some parsable format (JSON/YAML) so that automation logic can take advantage of them. When operating in a mixed-mode cluster, the UI can hide this extra content on nodes which are already upgraded, but that does leave older-release nodes showing the full string, which may be a bit ugly. Then again, that's the cost of introducing this capability without breaking compatibility. At least some parties (cloud/customers) will be able to take advantage of this sooner, as opposed to waiting for the next compat release (7.1).
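
            A minimal sketch of how automation might split such a message, assuming the hint is a single JSON object appended after a delimiter. The ` ##hint## ` marker, the field names, and the sample message are hypothetical and only illustrate the idea; they are not the ns_server format.

```python
import json

# Hypothetical delimiter separating the human-readable text from the hint.
HINT_MARKER = " ##hint## "


def split_hint(message: str):
    """Return (text, hint) for one user log message; hint is None if absent."""
    text, sep, tail = message.rpartition(HINT_MARKER)
    if not sep:
        # No marker: a plain message, e.g. one written without any hint.
        return message, None
    try:
        hint = json.loads(tail)
    except json.JSONDecodeError:
        # The marker text appeared by coincidence; keep the full string.
        return message, None
    if not isinstance(hint, dict):
        return message, None
    return text, hint


# Illustrative message only; event names and fields are assumptions.
sample = (
    "Rebalance completed successfully."
    ' ##hint## {"event": "rebalance_completed", "severity": "info"}'
)
text, hint = split_hint(sample)
print(text)
print(hint)
```

            A UI could display only `text`, while automation reads `hint`; a consumer unaware of the convention would simply show the full string, which is the compatibility cost described above.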

            In the 7.1 release, we can move the 'hint' portion into new first-class fields that we can introduce in the user log APIs.

            The notion of a 'feed' suggests some sort of continuous delivery of events. We can introduce a streaming API, but again that's a bigger change and affects API compat. We can introduce an additional query arg to specify the timestamp from which events should be returned. Clients will need to manage minimal state to remember their last query. This is more of an efficiency consideration.
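
            A minimal sketch of the client-side state this implies, assuming events carry an ISO 8601 UTC `timestamp` string and the server accepts a `since` argument. Both names, the event shape, and the `fetch` callable (a stand-in for the REST call) are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Event = Dict[str, str]


class EventPoller:
    """Polls for events newer than the last one seen."""

    def __init__(self, fetch: Callable[[Optional[str]], List[Event]]):
        # fetch(since) stands in for a REST call with a `since` query arg.
        self._fetch = fetch
        self.last_seen: Optional[str] = None  # the only state the client keeps

    def poll(self) -> List[Event]:
        events = self._fetch(self.last_seen)
        # Drop anything at or before the remembered timestamp, in case the
        # server treats `since` as inclusive.
        if self.last_seen is not None:
            events = [e for e in events if e["timestamp"] > self.last_seen]
        if events:
            self.last_seen = max(e["timestamp"] for e in events)
        return events


# In-memory stand-in for the server's log, for demonstration only.
LOG: List[Event] = [
    {"timestamp": "2021-05-10T10:00:00Z", "event": "node_added"},
    {"timestamp": "2021-05-10T10:05:00Z", "event": "rebalance_started"},
]


def fake_fetch(since: Optional[str]) -> List[Event]:
    return [e for e in LOG if since is None or e["timestamp"] >= since]


poller = EventPoller(fake_fetch)
first = poller.poll()  # both existing events
LOG.append({"timestamp": "2021-05-10T10:09:00Z", "event": "rebalance_completed"})
second = poller.poll()  # only the newly appended event
```

            Same-format UTC timestamps compare correctly as strings, which keeps the client state to a single value. Note that this filter would drop a new event sharing its timestamp with the last one seen, so a real client would need a tie-breaker.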

            As we add more events, even assuming they remain fairly high-level and infrequent, the log is likely to grow in size. We currently hold 3K entries as part of the config, which is persisted and replicated. We need to evaluate whether we can increase this cap, with full awareness that we cannot carry a major memory cost. Again, we are assuming that once a client retrieves and processes the info, it is OK for entries to be lost when the log gets rotated.
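
            A minimal sketch of a capped log and of what rotation means for a slow client. The `BoundedLog` class and its sequence numbers are hypothetical and do not reflect how ns_server stores entries; only the 3K default mirrors the figure above.

```python
from collections import deque
from typing import Deque, List, Tuple

Entry = Tuple[int, str]  # (sequence number, message)


class BoundedLog:
    """Keeps at most `cap` entries; the oldest are dropped on overflow."""

    def __init__(self, cap: int = 3000):
        self._entries: Deque[Entry] = deque(maxlen=cap)
        self._next_seq = 0

    def append(self, message: str) -> None:
        # deque(maxlen=...) silently discards the oldest entry when full.
        self._entries.append((self._next_seq, message))
        self._next_seq += 1

    def since(self, seq: int) -> List[Entry]:
        return [(s, m) for s, m in self._entries if s >= seq]

    def oldest_seq(self) -> int:
        return self._entries[0][0] if self._entries else self._next_seq


log = BoundedLog(cap=3)  # tiny cap so the rotation is visible
for i in range(5):
    log.append(f"event {i}")

wanted_from = 0
available = log.since(wanted_from)
lost = max(0, log.oldest_seq() - wanted_from)
print([s for s, _ in available], "lost:", lost)
```

            A client that asks for entries older than `oldest_seq()` can at least detect that something was rotated out, even though the entries themselves are gone.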

            meni.hillel Meni Hillel (Inactive) added the comment above.
            hareen.kancharla Hareen Kancharla added a comment (edited):

            Ian McCloy: I have left some comments on the PRD. This may have been acknowledged already, but this will definitely need a phased approach.

            I am OK with using the user_log to begin with and cleaning up what constitutes those logs today. We might also need a uniform approach that Services can use to send us Service-specific system events. Are there specific tickets against the Services to drive this on their end?

            meni.hillel Meni Hillel (Inactive) added a comment:
            Rob Ashcom, can we create a separate ticket for UX-related work? This ticket is meant to capture the NS_SERVER work.

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1261 contains ns_server commit 0e09784 with commit message:
            MB-46215 Service side API to add Event Logs

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1354 contains ns_server commit f57541e with commit message:
            MB-46215 Move replicator code in ns_log to seperate module.

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1364 contains ns_server commit f4e5754 with commit message:
            MB-46215 Enable event logging done via /_event

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1364 contains ns_server commit ff1acfa with commit message:
            MB-46215 Consumer API's for event logs

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1364 contains ns_server commit 635f414 with commit message:
            MB-46215 Event log server

            hareen.kancharla Hareen Kancharla added a comment:
            Gerrit patches attached to this MB also have patches necessary for MB-47025.

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1479 contains ns_server commit 4348680 with commit message:
            MB-46215 Dump event logs in /diag response

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1479 contains ns_server commit 5c89ccd with commit message:
            MB-46215 Event log for master election

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1479 contains ns_server commit 75b0673 with commit message:
            MB-46215 Event logs for rebalance and failover

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1479 contains ns_server commit 4481d15 with commit message:
            MB-46215 Event log for service restarts/starts.

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1479 contains ns_server commit 9955650 with commit message:
            MB-46215 Rename ns_crash_log to ns_babysitter_log

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1479 contains ns_server commit c9d478e with commit message:
            MB-46215 Memcached related event logs

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1479 contains ns_server commit edb028e with commit message:
            MB-46215 Bucket specific Event logs

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1479 contains ns_server commit 830415a with commit message:
            MB-46215 Add cluster compat checks for /_event endpoint.

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1479 contains ns_server commit 7a6437b with commit message:
            MB-46215 Fix seq_num checks in event_log_server

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1479 contains ns_server commit 18dff18 with commit message:
            MB-46215 Normalize timestamp formats in event log

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1479 contains ns_server commit 7ed23dd with commit message:
            MB-46215 Collect event_log file in cbcollect_info

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1479 contains ns_server commit 2dad4dd with commit message:
            MB-46215 Configure max events stored in event_log_server

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1479 contains ns_server commit 2b30aae with commit message:
            MB-46215 Add event logs in ns_audit module.

            build-team Couchbase Build Team added a comment:
            Build couchbase-server-7.1.0-1479 contains ns_server commit e9d06b1 with commit message:
            MB-46215 Add event log when a node is added to the ...

            avsej Sergey Avseyev added a comment (edited):

            This change breaks the configuration streaming protocol for SDKs. See MB-48970. SDKs can no longer bootstrap over HTTP, which they need to do, for instance, when the KV service is not configured for the node.

            People

              hareen.kancharla Hareen Kancharla
              ianmccloy Ian McCloy