Monitoring C++ Applications
Daniel Kneipp
Posted on August 31, 2022
This document describes, in general terms, what is expected from a monitoring solution, together with suggestions of tools that could be used for C++ applications.
OpenTelemetry has a broad and generic terminology for the concepts around the possible types of telemetry data. Here we will focus on traces, metrics, logs, and something that is not covered there: crash reports.
Overall Architecture
Generally speaking, any software application generates the following kinds of data for monitoring purposes:
- Logs: text records with metadata and potentially semantic information about the actions being performed by the software
- Metrics: numeric measurements that describe useful information (e.g. execution counts or timings) about an action or event being triggered
- Traces: metadata that can be correlated between applications that interact with each other (used to create distributed profiles)
- Crash reports: artifacts generated by the application when an unrecoverable failure happens (usually in the form of memory dumps)
Each kind of data usually has a service (or an independent functionality of a service) that supports and stores it for future queries and analyses. Depending on the tool used, the mechanism to obtain the data can vary. However, the services on the market usually follow a similar approach where the data is pushed, with the exception of logs, which rely on a separate piece of software (called an agent) that runs independently and is responsible for getting the logs from a source (e.g. a file in the filesystem) to the destination service.
Note: this is a simple analysis of a single, independent application being monitored. It does not cover the possibilities offered by container orchestration systems like Kubernetes.
The general architecture of what was described is shown in the following diagram:
+-----------------------------------------+
| Company Infrastructure |
| |
+-----------------------------------+ | +-----------------+ |
| Machine | | | | |
| | +--------+-----> Logs Server | |
| +-------------+ | | | | | |
| | | | Push | | +-----------------+ |
| Pull | Log agent +---+-------------+ | |
| +--------> | | | +------------------------------+ |
| | +-------------+ | | | | |
| | | +------+-----> Distributed Tracing System | |
| +----v--+ | Push traces | | | | |
| | +-----------------------+---------------+ | +------------------------------+ |
| | App | | | |
| | +-----------------------+---------------+ | +------------------+ |
| +-----+-+ | Push metrics | | | | |
| | | +------+-----> Metrics Server | |
| | | | | | |
| | | | +------------------+ |
+---------+-------------------------+ | |
| | +----------------------------+ |
| | | | |
+------------------------------------------------+-----> Crash Reporting System | |
Push minidumps | | | |
| +----------------------------+ |
| |
| |
+-----------------------------------------+
For traces, metrics and crash reports, changes in the code are needed to send the data to the related services. For logs, as said before, an agent is responsible for that. So the application only needs to send the logs to a file, for example, and a properly configured agent will do the rest.
With this approach, an application running in a customer's environment can publish telemetry data without having to expose the internal network by allowing external incoming traffic. All traffic is outgoing.
For each component present in the diagram, there are several (and I do mean several 😅) tools and services available on the market that can help accomplish this monitoring architecture. To keep the discussion brief, I'll mention just a few for each kind of data we want to track.
Tools and services available
The focus will be on the OpenTelemetry standard and on tools from the Grafana ecosystem. This will allow the application to be compatible with a variety of tools and services on the market, while allowing easy visualization and management of almost all telemetry data (with the exception of crash reports, which will be discussed later).
Logs
It's recommended to use a logging formatter to add severity, timestamp, correlation ids and other metadata, to allow correlating the logs with other telemetry data. It's also good to emit the logs in a structured format (e.g. JSON) to be able to quickly query them afterwards without too many preprocessing rules.
For that the OpenTelemetry SDK could be used.
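Regardless of the SDK chosen, a structured log line could look like the minimal, hand-rolled sketch below (the field names are just illustrative, and there is no JSON escaping for brevity):

```cpp
#include <chrono>
#include <iostream>
#include <string>

// Emits a JSON log line carrying severity, timestamp and a correlation id,
// so a log agent can ship it and the Logs Server can query it without extra
// parsing rules. Illustrative only: no JSON escaping, no severity filtering.
void log_json(const std::string& severity,
              const std::string& message,
              const std::string& correlation_id)
{
    const auto now = std::chrono::system_clock::now().time_since_epoch();
    const auto ts_ms =
        std::chrono::duration_cast<std::chrono::milliseconds>(now).count();

    std::cout << "{\"ts_ms\":" << ts_ms
              << ",\"severity\":\"" << severity << "\""
              << ",\"correlation_id\":\"" << correlation_id << "\""
              << ",\"message\":\"" << message << "\"}\n";
}

int main()
{
    log_json("INFO", "user clicked the export button", "req-42");
}
```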
As for the Logs Server, Grafana Loki could be used (the ELK stack and Datadog are also well-known options with not only logging support, but traces and metrics as well).
As the agent must be compatible with the Logs Server, we need to follow Grafana's documentation, which shows several options to choose from, like Promtail or Fluent Bit.
Metrics
Staying in Grafana's ecosystem, Prometheus is a widely used metrics server, and OpenTelemetry already has a nice example of how to use it.
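As a rough illustration, incrementing a counter with the OpenTelemetry C++ metrics API could look like the sketch below. The exporter setup that actually exposes the data to Prometheus is omitted, the metric name is made up, and the method names should be double-checked against the SDK version you use:

```cpp
#include "opentelemetry/metrics/provider.h"

namespace metrics_api = opentelemetry::metrics;

void record_frame_rendered()
{
    // The global MeterProvider is assumed to have been configured elsewhere
    // with a Prometheus (or OTLP) exporter.
    auto provider = metrics_api::Provider::GetMeterProvider();
    auto meter = provider->GetMeter("my_app", "1.0.0");

    // Monotonic counter ("frames_rendered_total" is a made-up name). In real
    // code the counter would be created once and reused, not on every call.
    auto counter = meter->CreateUInt64Counter("frames_rendered_total");
    counter->Add(1);
}
```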
Traces
Here again we can leverage a tool from the Grafana ecosystem together with OpenTelemetry. With Tempo as the Distributed Tracing System, we can use OpenTelemetry (example) to send traces to it, since Tempo is compatible with the OpenTelemetry standard.
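Wrapping a unit of work in a span could look roughly like this with the OpenTelemetry C++ tracing API (the OTLP exporter pointing at Tempo is assumed to be configured elsewhere, and the span and attribute names are made up):

```cpp
#include "opentelemetry/trace/provider.h"

namespace trace_api = opentelemetry::trace;

void load_model()
{
    // The global TracerProvider is assumed to have been configured elsewhere
    // with an OTLP exporter pointing at Tempo.
    auto provider = trace_api::Provider::GetTracerProvider();
    auto tracer = provider->GetTracer("my_app", "1.0.0");

    // Start a span and make it the active one for this scope, so nested
    // spans and logs can be correlated to it.
    auto span = tracer->StartSpan("load_model");
    auto scope = tracer->WithActiveSpan(span);

    // ... do the actual work here ...
    span->SetAttribute("model.size_mb", 120);

    span->End();
}
```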
Crash reports
For crash reports it gets more interesting. There is nothing in OpenTelemetry or Grafana that is dedicated to handling this kind of telemetry data. We are still able to publish traces with error information for exceptions that can be handled, but to get information about crashes like segfaults, another tool must be used.
One interesting option is Sentry. It also has an integration with Qt-based applications, Qt being a library widely used for C++ GUIs.
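With the sentry-native SDK, initializing it early in main() is roughly all it takes to have crashes captured and uploaded automatically. A minimal sketch, with a placeholder DSN and release name:

```cpp
#include <sentry.h>

int main(int argc, char** argv)
{
    // The DSN is a placeholder; the real one comes from the project
    // settings in Sentry.
    sentry_options_t* options = sentry_options_new();
    sentry_options_set_dsn(options, "https://<key>@<org>.ingest.sentry.io/<project-id>");
    sentry_options_set_release(options, "my-app@1.0.0");
    sentry_init(options);

    // ... run the application; crashes are captured by the installed handler ...

    sentry_close();
    return 0;
}
```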
Another one is Raygun. Although it doesn't have an SDK itself, it shows how you can integrate your software with Google's Breakpad and send the crash report via an HTTP request.
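A minimal sketch of the Breakpad side (Linux handler; the dump directory and names are illustrative) could look like this, with the callback being the place where the minidump could be queued for upload via that HTTP request:

```cpp
#include "client/linux/handler/exception_handler.h"

// Called after a minidump has been written. descriptor.path() points at the
// dump file, which could be queued here for upload to the crash reporting
// service. Keep it simple: the process is crashing at this point.
static bool DumpCallback(const google_breakpad::MinidumpDescriptor& descriptor,
                         void* context, bool succeeded)
{
    return succeeded;
}

int main()
{
    // Minidumps will be written to this directory on crash.
    google_breakpad::MinidumpDescriptor descriptor("/tmp/crashes");
    google_breakpad::ExceptionHandler handler(descriptor, /*filter=*/nullptr,
                                              DumpCallback, /*context=*/nullptr,
                                              /*install_handler=*/true,
                                              /*server_fd=*/-1);

    // ... run the application ...
    return 0;
}
```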
Both options have their own GUIs, so you won't access them in the same place you would access the rest of the telemetry data.
Visualizing
As a closing point, to visualize all this data (except crash reports), Grafana itself can be used to query and manage it and to create dashboards, alerts, etc. from it. And with everything properly configured, correlation ids could be used to let the developer grab all kinds of telemetry data related to a user interaction, getting a better idea of how the application is being used and how performant it is.
Conclusion
For sure this is not the only way to achieve a good level of observability for your application, but it's one with good synergy between the chosen tools and services, and a succinct tech stack.