Observability or Monitoring: Which one do you need?

As software engineers, we have most likely heard a lot about monitoring, but what about observability? What is the difference between them, and why do we need them in the first place? These are the topics we discuss in this article.

Table of Contents

Monitoring Definition
Observability Definition
Monitoring vs Observability
Observability main components
Visualization and Dashboard
Incident Automation
Challenges and Pitfalls
Benchmarking Observability Tools

You can also watch this article’s video on YouTube

What is Monitoring?

🔖Short Definition

Monitoring involves the systematic collection of data about a system’s health, performance, and other key metrics.

🔖Long Definition
In a broader sense, monitoring means identifying the specific areas of an application that need attention and placing logs at those points so that we can observe and assess the system's behavior. Monitoring is therefore about watching for issues we already anticipate: we know in advance where to look and what behavior to expect there.

🔖Purpose
Detect and alert on deviations from expected behavior.

🔖Examples

Monitoring the CPU usage
Tracking the response time of a web application
Logging errors or exceptions in an application for later analysis

🔖Tools

What Is Observability?

🔖Short Definition
The measure of how well you can understand the internal state of a system based on its external outputs.

🔖Long Definition
Observability encompasses a broader spectrum of application performance, extending beyond internal status. Its primary focus lies in addressing the larger picture of the entire system rather than concentrating on a single service.

Its significance becomes evident when applied to complex systems that require monitoring from various aspects, such as the relationships between services, internal service statuses, communication statuses, and more.

In essence, during debugging and troubleshooting, observability empowers developers or system administrators to comprehensively assess the system’s overall status by scrutinizing each component.

Observability platforms prove especially valuable when dealing with unknown issues, such as an incident reporting slow response times for the order service. In such cases, these platforms make it possible to examine all services and their interrelationships.

It’s not just about reading logs, as system performance issues may not always manifest as errors in the logs, but rather as improper system functioning.

🔖Purpose
Provide insights into the system’s internal workings and facilitate debugging and troubleshooting.

🔖Examples

Instrumenting code to capture detailed traces, logs, and events.
Using distributed tracing to follow the flow of a request.
Collecting and analyzing logs and metrics.
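
To make the distributed tracing example more concrete, here is a minimal Python sketch that uses only the standard library instead of a real tracing SDK. The Span class, the start_span helper, and the service names are hypothetical. The only point is that every span belonging to one request shares a trace ID and points at its parent span, which is what lets a tracing tool rebuild the path of the request.

```python
import secrets
import time
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One unit of work inside a request (hypothetical, simplified model)."""
    name: str
    trace_id: str                 # shared by every span of the same request
    span_id: str
    parent_id: Optional[str]      # None for the root span
    start: float = field(default_factory=time.perf_counter)
    end: Optional[float] = None

    def finish(self) -> None:
        self.end = time.perf_counter()


def start_span(name: str, parent: Optional[Span] = None) -> Span:
    """Start a root span, or a child span that inherits the parent's trace ID."""
    trace_id = parent.trace_id if parent else secrets.token_hex(16)
    parent_id = parent.span_id if parent else None
    return Span(name, trace_id, secrets.token_hex(8), parent_id)


def handle_order_request() -> List[Span]:
    """Simulate one request passing through three hypothetical services."""
    root = start_span("api-gateway")
    order = start_span("order-service", parent=root)
    payment = start_span("payment-service", parent=order)
    for span in (payment, order, root):   # inner spans finish first
        span.finish()
    return [root, order, payment]


spans = handle_order_request()
for s in spans:
    print(s.name, s.trace_id[:8], "parent:", s.parent_id)
```

In a real system the trace ID and parent span ID travel between services in request headers, and an SDK records the spans for you.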

🔖Tools

Dynatrace
Datadog
Splunk
Logz.io
Sumo Logic
Cloud observability tools (AWS, Azure, GCP)

Key Components of Observability

Logs: System and server logs, network logs, and application logs
Metrics: CPU and memory usage and other infrastructure metrics
Traces: The path and timing of requests as they pass through microservices

Monitoring vs Observability

Monitoring and observability can be differentiated from several perspectives, the primary ones being:

Focus, Purpose, Data Collection, Alerting, Use Case, Scope, Tools

Data Collection and Instrumentation

Data collection and instrumentation are integral components of observability, providing the means to gather information about the internal workings and performance of a software system.

Data Collection

🔖Definition
Involves gathering relevant information and metrics from various components within a software system.

🔖Purpose
The collected data helps in monitoring, analyzing, and understanding the system’s behavior, performance, and health.

🔖Example
This image shows a configuration file for the Prometheus monitoring platform. The goal is to scrape metrics from an application hosted at localhost:5000, with the scrape interval set to 15 seconds. In other words, every 15 seconds the Prometheus server retrieves metrics such as CPU usage and memory usage from the specified endpoint.

Later on, we’ll delve into the internal components of the Prometheus server, exploring how metrics are collected and persistently stored.
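
To give a feel for what a single scrape involves, here is a rough Python sketch of the pull side, using only the standard library. It is not how Prometheus itself is implemented. The function names and the sample payload are made up for illustration, the /metrics path is assumed (it is the Prometheus default), and the parser handles only a simplified subset of the text exposition format.

```python
import urllib.request
from typing import Dict

SCRAPE_INTERVAL_SECONDS = 15                   # interval described in the text
TARGET_URL = "http://localhost:5000/metrics"   # path assumed (Prometheus default)


def parse_exposition(text: str) -> Dict[str, float]:
    """Parse a simplified subset of the Prometheus text exposition format."""
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):    # skip blanks, HELP and TYPE lines
            continue
        brace = line.rfind("}")
        if brace != -1:                          # sample with labels
            key, rest = line[: brace + 1], line[brace + 1:].split()
        else:                                    # sample without labels
            parts = line.split()
            key, rest = parts[0], parts[1:]
        samples[key] = float(rest[0])            # a trailing timestamp is ignored
    return samples


def scrape_once(url: str = TARGET_URL) -> Dict[str, float]:
    """Fetch and parse the target's metrics; one call corresponds to one scrape."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return parse_exposition(response.read().decode("utf-8"))


# Example payload such an endpoint might return (values are made up).
SAMPLE = """\
# HELP process_cpu_seconds_total Total user and system CPU time spent in seconds.
# TYPE process_cpu_seconds_total counter
process_cpu_seconds_total 12.5
http_requests_total{method="GET",status="200"} 1027
"""

parsed = parse_exposition(SAMPLE)
print(parsed)
```

A scraper built on this would call scrape_once every SCRAPE_INTERVAL_SECONDS and store each value together with the time it was read.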

Instrumentation

🔖Definition
Involves adding code or agents to a software system to gather specific data and metrics at runtime.

🔖Purpose
Instrumentation provides a way to gather fine-grained insights into the behavior of the application, allowing for detailed analysis and troubleshooting.

🔖Example
Within this code snippet, utilizing the Prometheus SDK, we can explicitly gather specific metrics within our application. For example, in this code, as we receive a request, the request counter increments by one unit. This framework allows you to expose various types of metrics essential for monitoring your application’s performance and health by instrumenting your code.

To expose custom metrics, use the Prometheus SDK (client library) for your programming language to publish them from your code, and configure the Prometheus server to scrape your service.
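
As an illustration of the request counter described above, here is a small Python sketch. It deliberately does not use the Prometheus SDK: RequestCounter is a hand-rolled stand-in, and the metric name, labels, and handle_request function are assumptions made for the example. It only shows the idea of incrementing a counter on every request and rendering it in the text format that Prometheus scrapes.

```python
import threading
from typing import Dict, Tuple


class RequestCounter:
    """Hand-rolled stand-in for a client-library counter (not the real SDK)."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values: Dict[Tuple[str, str], int] = {}
        self._lock = threading.Lock()   # handlers may run on several threads

    def inc(self, method: str, path: str) -> None:
        """Increase the count for one (method, path) pair by one."""
        with self._lock:
            key = (method, path)
            self._values[key] = self._values.get(key, 0) + 1

    def render(self) -> str:
        """Render the counter in the Prometheus text exposition format."""
        lines = [f"# HELP {self.name} {self.help_text}",
                 f"# TYPE {self.name} counter"]
        with self._lock:
            for (method, path), value in sorted(self._values.items()):
                lines.append(
                    f'{self.name}{{method="{method}",path="{path}"}} {value}')
        return "\n".join(lines) + "\n"


REQUESTS = RequestCounter("app_requests_total", "Total HTTP requests received.")


def handle_request(method: str, path: str) -> str:
    """Hypothetical request handler: count the request, then do the work."""
    REQUESTS.inc(method, path)
    return "ok"


for _ in range(3):
    handle_request("GET", "/orders")
handle_request("POST", "/orders")
print(REQUESTS.render())
```

With a real client library you would declare the counter once, call its increment method in the handler, and let the library serve the rendered output over HTTP.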

Data Quality Matters

To maintain the quality of collected data, data sources from various monitoring systems should be standardized to prevent redundancy, reduce clutter, and minimize noise.

In a large and intricate system, we gather diverse metrics such as application heartbeat, application logs (Information, Debug, Exception, etc.), as well as metrics related to application performance and resource usage.

Given the substantial volume of data collected, it becomes crucial to implement a strategy for filtering only the essential metrics.

One effective approach is to sample telemetry through the telemetry framework so that only the necessary data is kept. For example, by keeping only server responses with a status code of 400 or higher (client and server errors), we collect the responses that indicate a failure and exclude the successful ones.
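
The status code filter can be sketched in a few lines of Python. This is only an illustration and not the API of any telemetry framework: ResponseEvent and sample_events are hypothetical names. The threshold is a parameter, and the sketch defaults to 400 so that redirects (3xx) are not counted as failures. An optional success_rate keeps a small random share of successful responses as a baseline.

```python
import random
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class ResponseEvent:
    path: str
    status: int
    duration_ms: float


def sample_events(events: Iterable[ResponseEvent],
                  error_threshold: int = 400,
                  success_rate: float = 0.0,
                  rng: Optional[random.Random] = None) -> List[ResponseEvent]:
    """Keep every event at or above the threshold and a fraction of the rest."""
    rng = rng or random.Random()
    kept: List[ResponseEvent] = []
    for event in events:
        if event.status >= error_threshold:
            kept.append(event)              # failures are always kept
        elif rng.random() < success_rate:
            kept.append(event)              # successes are kept only by chance
    return kept


events = [
    ResponseEvent("/orders", 200, 35.0),
    ResponseEvent("/orders/42", 404, 12.0),
    ResponseEvent("/login", 302, 8.0),
    ResponseEvent("/payments", 500, 950.0),
]
kept = sample_events(events)
print([e.status for e in kept])
```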

Visualization and Dashboards

Providing a user-friendly interface to analyze and interpret complex data from diverse sources.

Once we’ve collected metrics, the next step is to derive insights from them. However, each team may have unique requirements and need its own specialized dashboards and graphs.

Some teams may require real-time data on application health, while others may need the rate of messages received from a Kafka topic, and so forth. It is therefore crucial to choose an observability platform that can accommodate these varied needs.

🔖 Grafana specializes in customizing dashboards and visualizing metrics.

🔖 Kibana, as part of the ELK stack (Elasticsearch, Logstash, Kibana), lets you query and explore collected logs through a user-friendly dashboard.

🔖 Dynatrace, on the other hand, stands out as an enterprise-level observability tool, encompassing all the necessary features for visualization and dashboard creation.

Frameworks for Data Collection

These frameworks are used to instrument code, capture telemetry, and enable monitoring, tracing, and logging.

Having discussed code instrumentation earlier, it’s important to note that Prometheus is not the sole framework available for this purpose. There are several other notable frameworks for data collection and code instrumentation.

Prometheus Overview

The Prometheus server collects metrics, notifies other services, and exposes the collected metrics for querying and visualization. At its core are three main components:

Retrieval: This component collects metrics from various applications, short-lived jobs, and services. It works on a pull model: it obtains its list of targets from service discovery and scrapes each target over HTTP.
TSDB (Time-Series Database): Prometheus stores the scraped samples in its own time-series database on local disk, and samples can also be forwarded to external storage through remote write. A single server scales vertically, and federation covers scenarios that require multiple Prometheus instances.
HTTP Server: Prometheus exposes an HTTP endpoint through which external clients, such as its web UI and Grafana, query the stored metrics.

Additionally, Prometheus incorporates the following components to enhance its functionality:

Push Gateway: Some jobs, typically short-lived ones, cannot expose their metrics for scraping. They push their metrics to the Push Gateway instead, and Prometheus then pulls the metrics from the gateway.
Alert Manager: This component enables Prometheus to send various types of notifications (email, SMS, Slack, Microsoft Teams, etc.) to the relevant teams.
Prometheus Web UI: Prometheus features a web UI that visualizes metrics by querying the HTTP server endpoint with PromQL (Prometheus Query Language). Other tools, such as Grafana, can use the same queries to pull and visualize the metrics.

OpenTelemetry Overview

Similar to Prometheus, the OpenTelemetry framework provides an alternative for collecting telemetry (metrics, as well as traces and logs) from various services. In this diagram, the central component is the OTel Collector, which receives or scrapes telemetry from microservices and from shared infrastructure such as Kubernetes and cloud services. Once the metrics are collected, they can be persisted in a time-series database.

Notably, OpenTelemetry is vendor-agnostic: the structure of the collected data is standardized, so users are free to send it to the observability platforms and databases of their choice. OpenTelemetry is a project of the CNCF (Cloud Native Computing Foundation).
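
The idea of one standardized record reaching several backends can be sketched as follows. This is plain Python and not the OpenTelemetry API: the Record type, the Collector class, the exporter names, and the sample values are all hypothetical. It loosely mirrors how a collector pipeline is organized, with data being received, passed through processors, and handed to one or more exporters.

```python
from typing import Callable, Dict, List, Protocol

Record = Dict[str, object]   # simplified, hypothetical telemetry record


class Exporter(Protocol):
    """Anything that can accept a batch of records (a database, a vendor API)."""

    def export(self, batch: List[Record]) -> None: ...


class InMemoryExporter:
    """Stand-in backend that just remembers what it was sent."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.received: List[Record] = []

    def export(self, batch: List[Record]) -> None:
        self.received.extend(batch)


class Collector:
    """Receive records, run processors over them, fan out to every exporter."""

    def __init__(self, processors: List[Callable[[Record], Record]],
                 exporters: List[Exporter]) -> None:
        self.processors = processors
        self.exporters = exporters

    def receive(self, batch: List[Record]) -> None:
        for process in self.processors:
            batch = [process(record) for record in batch]
        for exporter in self.exporters:
            exporter.export(batch)


def add_environment(record: Record) -> Record:
    """Example processor: enrich each record with a deployment attribute."""
    return {**record, "environment": "staging"}


tsdb = InMemoryExporter("time-series-db")
vendor = InMemoryExporter("vendor-platform")
collector = Collector([add_environment], [tsdb, vendor])
collector.receive([
    {"name": "http.server.duration", "value": 12.3, "service": "order-service"},
    {"name": "http.server.duration", "value": 48.0, "service": "payment-service"},
])
print(len(tsdb.received), len(vendor.received))
```

Because both backends receive the same record shape, swapping one for another only means changing the exporter list, not the instrumented services.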

Incident Automation and Remediation

🔖 Definition
Involves the use of automated processes and workflows to respond to and resolve issues detected through monitoring, logging, and tracing.

🔖 Purpose
To reduce manual intervention, minimize downtime, and enhance the overall reliability of the software system.

🔖Example
There are plenty of tools for incident management, but ServiceNow and PagerDuty are usually near the top of the list.

ServiceNow Incident Automation: ServiceNow automates incident resolution through predefined workflows and scripts, reducing manual effort and accelerating response times.

ServiceNow Remediation: In ServiceNow, remediation involves corrective actions taken to resolve incidents swiftly, leveraging automated solutions and predefined processes.

Incident Automation and Remediation Real-World Example

Incident response in observability refers to the systematic approach of detecting, managing, and resolving unexpected issues or disruptions in a software system.

Let’s consider this real-world scenario:

A sudden surge in website traffic leads to a spike in server CPU usage beyond normal thresholds.

1. Detection: Anomaly detection algorithms identify a sudden spike in server CPU usage, signaling a potential issue.

2. Alerting: The system triggers alerts to the operations team, notifying them of the abnormal server load and providing initial details.

3. Automated Response: Automated scripts kick in to scale up additional server instances, redistributing the load to prevent performance degradation.

4. Verification: The operations team reviews system metrics to confirm if the automated response effectively mitigated the issue, ensuring stability.

5. Notification: Upon successful resolution, notifications are sent to relevant stakeholders, updating them on the incident’s detection, response, and resolution.
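
The five steps above can be sketched as one small Python function. This is a toy illustration under several assumptions: the anomaly check is a simple z-score against recent readings, scaling is represented by adding two to an instance count instead of calling a real cloud API, and the notify and read_cpu_after_scaling callbacks are hypothetical. In the scenario the operations team performs the verification; here an automated re-check stands in for it.

```python
from statistics import mean, pstdev
from typing import Callable, List


def is_cpu_spike(history: List[float], current: float,
                 z_threshold: float = 3.0) -> bool:
    """Flag the current reading if it is far above the recent baseline."""
    baseline, spread = mean(history), pstdev(history)
    if spread == 0:
        return current > baseline
    return (current - baseline) / spread > z_threshold


def handle_reading(history: List[float], current: float, instances: int,
                   notify: Callable[[str], None],
                   read_cpu_after_scaling: Callable[[int], float]) -> int:
    """Run detection, alerting, automated response, verification, notification."""
    if not is_cpu_spike(history, current):                    # 1. detection
        return instances
    notify(f"ALERT: CPU at {current:.0f}%, "
           f"baseline {mean(history):.0f}%")                  # 2. alerting
    instances += 2                                            # 3. automated response
    after = read_cpu_after_scaling(instances)                 # 4. verification
    if is_cpu_spike(history, after):
        notify(f"ESCALATE: CPU still at {after:.0f}% "
               f"with {instances} instances")
    else:
        notify(f"RESOLVED: CPU back to {after:.0f}% "
               f"with {instances} instances")                 # 5. notification
    return instances


messages: List[str] = []
history = [40.0, 42.0, 38.0, 41.0, 39.0]
new_count = handle_reading(history, 95.0, 2, messages.append, lambda n: 42.0)
print(new_count)
for message in messages:
    print(message)
```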

Challenges and Pitfalls in Observability

Like any other practice, observability has its own set of challenges and pitfalls. The items shown in this picture are self-explanatory and should be clear without further elaboration.

Observability Tools

When selecting an observability tool, numerous factors influence the choice, including:

Budget constraints
Scale of services
Data volume
Feature updates
Open-source nature
Visualization
Alerting
Keeping SDKs up to date (code instrumentation)

Below, I have compiled a list of some of the top observability tools and frameworks that can be chosen based on your specific requirements.

Final Thought

To sum up, good monitoring and observability are important to running software successfully. Quick insight into a system helps fix problems fast and keeps the application working well. Because observability tools can be complex, it is crucial to pick a platform that suits your needs, which means knowing how to compare tools against your requirements. A solid grasp of the basic observability concepts makes that comparison easier.

Follow me on Medium or YouTube if you are interested in back-end development subjects 🤞

🎬 Youtube
📖 Medium
💻 Linkedin

Cheers, Matt Ghafouri
