Observability or Monitoring: Which one do you need?

Matt Ghafouri

Posted on February 16, 2024

As software engineers, we have most likely heard a lot about monitoring, but what about observability? What is the difference between them? Why do we need them in the first place? These are the topics we are going to discuss in this article.

You can also watch this article’s video on YouTube

What Is Monitoring?

🔖Short Definition

Monitoring involves the systematic collection of data about a system’s health, performance, and other key metrics.

🔖Long Definition
In a broader sense, monitoring entails identifying specific areas within the application that require attention. By strategically placing logs in these locations, we gain the ability to observe and assess the system’s behavior. The essential aspect of monitoring is that it watches proactively for issues we already anticipate: we know in advance which behaviors to expect and check the system against them.

🔖Purpose
Detect and alert on deviations from expected behavior.

🔖Examples

  • Monitoring the CPU usage

  • Tracking the response time of a web application

  • Logging errors or exceptions in an application for later analysis

🔖Tools

[Image: monitoring tools]

What Is Observability?

🔖Short Definition
The measure of how well you can understand the internal state of a system based on its external outputs.

🔖Long Definition
Observability encompasses a broader spectrum of application performance, extending beyond internal status. Its primary focus lies in addressing the larger picture of the entire system rather than concentrating on a single service.

Its significance becomes evident when applied to complex systems that require monitoring from various aspects, such as the relationships between services, internal service statuses, communication statuses, and more.

In essence, during debugging and troubleshooting, observability empowers developers or system administrators to comprehensively assess the system’s overall status by scrutinizing each component.

Observability platforms prove especially valuable when dealing with unknown issues, such as an incident reporting slow response times for the order service. In such cases, these platforms make it possible to examine all services and their interrelationships.

It’s not just about reading logs, as system performance issues may not always manifest as errors in the logs, but rather as improper system functioning.

🔖Purpose
Provide insights into the system’s internal workings and facilitate debugging and troubleshooting.

🔖Examples

  • Instrumenting code to capture detailed traces, logs, and events.

  • Using distributed tracing to follow the flow of a request.

  • Collecting and analyzing logs and metrics.

🔖Tools

[Image: observability tools]

Key Components of Observability

  • Logs: System and server logs, network system logs, and application logs

  • Metrics: CPU and memory usage and other infrastructure metrics

  • Traces: Tracking the performance of microservices
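
To make the three signals concrete, here is a minimal sketch in plain Python that emits one log line, one metric, and one trace span for a single request. Everything in it is an assumption made for illustration: the `handle_order` function, the `orders_received_total` metric name, and the in-memory lists that stand in for a real logging, metrics, and tracing backend.

```python
import json
import time
import uuid
from collections import Counter
from typing import Dict, List

# In-memory stand-ins for real backends (hypothetical, for illustration only).
logs: List[str] = []          # structured log lines, one JSON document each
metrics: Counter = Counter()  # metric name -> running count
spans: List[Dict] = []        # finished trace spans


def handle_order(order_id: str) -> None:
    """Hypothetical handler that produces a log, a metric, and a span."""
    trace_id = uuid.uuid4().hex  # shared ID that ties the log to the span
    start = time.perf_counter()

    # Log: a discrete event with context.
    logs.append(json.dumps({
        "level": "info",
        "msg": "order received",
        "order_id": order_id,
        "trace_id": trace_id,
    }))

    # Metric: a number aggregated over time.
    metrics["orders_received_total"] += 1

    # Trace: how long this unit of work took, under the same trace ID.
    spans.append({
        "trace_id": trace_id,
        "name": "handle_order",
        "duration_s": time.perf_counter() - start,
    })
```

Sharing the `trace_id` between the log line and the span is what lets you jump from a slow trace to the logs written during that same request.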

Monitoring vs Observability

Monitoring and observability can be differentiated from various perspectives, the primary ones being:

Focus, Purpose, Data Collection, Alerting, Use Case, Scope, Tools

[Image: comparison of monitoring and observability across these perspectives]

Data Collection and Instrumentation

Data collection and instrumentation are integral components of observability, providing the means to gather information about the internal workings and performance of a software system.

Data Collection

🔖Definition
Involves gathering relevant information and metrics from various components within a software system.

🔖Purpose
The collected data helps in monitoring, analyzing, and understanding the system’s behavior, performance, and health.

🔖Example
In this image, depicting a configuration file associated with the Prometheus Monitoring Platform, our objective is to extract metrics from an application hosted at the address localhost:5000. The scraping interval set for this operation is 15 seconds. This signifies that the Prometheus service will, at regular 15-second intervals, retrieve metrics such as CPU usage, Memory Usage, and more from the specified endpoint.

Later on, we’ll delve into the internal components of the Prometheus server, exploring how metrics are collected and persistently stored.
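
The pull model described in this example can be sketched in a few lines of Python. This is not how Prometheus itself is implemented. It is a minimal, standard-library illustration of fetching a metrics page at a fixed interval and parsing it. The `/metrics` path and the function names `fetch_over_http`, `parse_metrics`, and `scrape` are assumptions; only the address localhost:5000 and the 15-second interval come from the text.

```python
import time
import urllib.request
from typing import Callable, Dict, List

TARGET_URL = "http://localhost:5000/metrics"  # "/metrics" path is an assumption
SCRAPE_INTERVAL_SECONDS = 15


def fetch_over_http(url: str = TARGET_URL) -> str:
    """Download the raw metrics text from the target."""
    with urllib.request.urlopen(url, timeout=5) as response:
        return response.read().decode("utf-8")


def parse_metrics(text: str) -> Dict[str, float]:
    """Parse simple 'name value' lines, skipping comments and blank lines.

    Timestamps and other optional parts of the format are not handled.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.rsplit(" ", 1)
        if len(parts) != 2:
            continue
        name, value = parts
        samples[name] = float(value)
    return samples


def scrape(
    fetch: Callable[[], str],
    rounds: int,
    interval: float = SCRAPE_INTERVAL_SECONDS,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Dict[str, float]]:
    """Scrape the target `rounds` times, waiting `interval` seconds in between."""
    results = []
    for i in range(rounds):
        results.append(parse_metrics(fetch()))
        if i < rounds - 1:
            sleep(interval)
    return results
```

Calling `scrape(fetch_over_http, rounds=3)` would poll the endpoint three times, 15 seconds apart. Passing `fetch` and `sleep` as parameters keeps the loop easy to try out without a running server.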

[Image: Prometheus configuration file]

Instrumentation

🔖Definition
Involves adding code or agents to a software system to gather specific data and metrics at runtime.

🔖Purpose
Instrumentation provides a way to gather fine-grained insights into the behavior of the application, allowing for detailed analysis and troubleshooting.

🔖Example
Within this code snippet, utilizing the Prometheus SDK, we can explicitly gather specific metrics within our application. For example, in this code, as we receive a request, the request counter increments by one unit. This framework allows you to expose various types of metrics essential for monitoring your application’s performance and health by instrumenting your code.

To expose your custom metrics, all you need to do is configure the Prometheus service, already hosted on your server, to scrape your application. Using the SDK for your programming language, you can then expose the desired metrics from within your code.
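
As a rough illustration of the request counter idea, here is a standard-library sketch. It does not use the Prometheus SDK: `RequestCounter` is a hand-written stand-in for an SDK-provided counter, and the metric name `http_requests_total` and the `handle_request` function are assumptions. The `render` method prints the counter in the Prometheus text exposition format, which is the kind of output a real SDK would serve for scraping.

```python
import threading


class RequestCounter:
    """A tiny thread-safe counter, standing in for an SDK-provided counter."""

    def __init__(self, name: str, description: str) -> None:
        self.name = name
        self.description = description
        self._value = 0
        self._lock = threading.Lock()

    def inc(self, amount: int = 1) -> None:
        """Increase the counter; counters only ever go up."""
        with self._lock:
            self._value += amount

    @property
    def value(self) -> int:
        return self._value

    def render(self) -> str:
        """Render the counter in the Prometheus text exposition format."""
        return (
            f"# HELP {self.name} {self.description}\n"
            f"# TYPE {self.name} counter\n"
            f"{self.name} {self._value}\n"
        )


REQUESTS = RequestCounter("http_requests_total", "Total HTTP requests received.")


def handle_request(path: str) -> str:
    """Hypothetical request handler: count the request, then do the real work."""
    REQUESTS.inc()
    return f"handled {path}"
```

In a real service you would normally rely on the official client library for your language rather than writing this by hand. The sketch only shows where the instrumentation call sits relative to the request handling code.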

[Image: code snippet instrumenting a request counter with the Prometheus SDK]

Data Quality Matters

To maintain the quality of collected data, data sources from various monitoring systems should be standardized to prevent redundancy, reduce clutter, and minimize noise.

In a large and intricate system, we gather diverse metrics such as application heartbeat, application logs (Information, Debug, Exception, etc.), as well as metrics related to application performance and resource usage.

Given the substantial volume of data collected, it becomes crucial to implement a strategy for filtering only the essential metrics.

One effective approach involves sampling metrics in conjunction with the Telemetry Framework, enabling us to keep only the metrics we need. For example, by sampling only server responses with a status code greater than 300, we gather the responses that were not successful and exclude the successful ones.
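
The status-code rule from this example can be expressed as a small filter. This is a plain Python sketch, not the API of any telemetry framework. The `ResponseRecord` type, its fields, and the `keep_unsuccessful` function are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ResponseRecord:
    """One observed server response (hypothetical shape)."""
    path: str
    status_code: int
    duration_ms: float


def keep_unsuccessful(
    records: Iterable[ResponseRecord], threshold: int = 300
) -> List[ResponseRecord]:
    """Keep only records whose status code is greater than `threshold`."""
    return [r for r in records if r.status_code > threshold]
```

Note that with a threshold of 300 the filter also keeps redirects such as 301. If only client and server errors matter to you, a threshold of 399 would be the stricter choice.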

Visualization and Dashboards

Providing a user-friendly interface to analyze and interpret complex data from diverse sources.

Once we’ve collected metrics, the next step is to derive insights from them. However, each team may have unique requirements, necessitating their specialized dashboards and graphs.

Some teams may require real-time data on application health status, while others may need information on the rate of messages received from a Kafka topic, and so forth. Consequently, it is crucial to select an observability platform capable of accommodating these varied needs.

🔖 Grafana specializes in customizing dashboards and visualizing metrics.

🔖 Kibana, the visualization layer of the ELK stack, facilitates querying the collected logs through a user-friendly dashboard.

🔖 Dynatrace, on the other hand, stands out as an enterprise-level observability tool, encompassing all the necessary features for visualization and dashboard creation.

Framework for data collection

These frameworks are used to instrument code, capture telemetry, and enable monitoring, tracing, and logging.

Having discussed code instrumentation earlier, it’s important to note that Prometheus is not the sole framework available for this purpose. There are several other notable frameworks for data collection and code instrumentation.

[Image: frameworks for data collection and code instrumentation]

Prometheus Overview

[Image: Prometheus architecture diagram]

The Prometheus server collects metrics, notifies other services, and exposes and visualizes the collected metrics. Three main components sit at its core:

  • Retrieval: This component enables the collection of metrics from various applications, short-lived jobs, and services. Operating on a pulling mechanism, it retrieves a list of targets from service discovery.

  • TSDB (Time-Series Database): Prometheus stores samples in a local on-disk time-series database and can optionally integrate with remote storage systems. It scales vertically, and federation covers scenarios that require multiple Prometheus instances.

  • HTTP Server: Prometheus includes an endpoint through which it exposes its metrics to the external environment.

Additionally, Prometheus incorporates the following components to enhance its functionality:

  • Push Gateway: Some jobs and services, such as short-lived jobs, cannot be scraped directly. They push their metrics to the Push Gateway, and Prometheus then pulls the metrics from it.

  • Alert Manager: This component enables Prometheus to send various types of notifications (Email, SMS, Slack, Microsoft Teams, etc.) to relevant teams, enhancing communication and alerting capabilities.

  • Prometheus Web UI: Prometheus features a web UI that visualizes metrics by querying them from the HTTP server endpoint. Using PromQL (Prometheus Query Language), users can also query and visualize metrics in other tools like Grafana.

OpenTelemetry Overview

Similar to Prometheus, the OpenTelemetry framework provides an alternative for collecting metrics from various services. In this diagram, the central component is the OTel Collector, responsible for scraping or collecting metrics from diverse microservices and shared infrastructure services such as Kubernetes, cloud services, and more. Once the metrics are collected, they can be persisted in a time-series database.

Notably, OpenTelemetry is agnostic to the observability framework. The structure of the collected data is standardized, so users have the flexibility to employ any observability framework and databases of their choice. OpenTelemetry is developed under the umbrella of the CNCF (Cloud Native Computing Foundation).

[OTel Document](https://opentelemetry.io/docs/)

Incident Automation and Remediation

🔖 Definition
Involves the use of automated processes and workflows to respond to and resolve issues detected through monitoring, logging, and tracing.

🔖 Purpose
To reduce manual intervention, minimize downtime, and enhance the overall reliability of the software system.

🔖Example
There are plenty of tools for incident management, but ServiceNow and PagerDuty are commonly considered to be at the top of the list.

ServiceNow Incident Automation: ServiceNow automates incident resolution through predefined workflows and scripts, reducing manual effort and accelerating response times.

ServiceNow Remediation: In ServiceNow, remediation involves corrective actions taken to resolve incidents swiftly, leveraging automated solutions and predefined processes.

Incident Automation and Remediation Real-World Example

Incident response in observability refers to the systematic approach of detecting, managing, and resolving unexpected issues or disruptions in a software system.

Let’s consider this real-world scenario

A sudden surge in website traffic leads to a spike in server CPU usage beyond normal thresholds.

1. Detection: Anomaly detection algorithms identify a sudden spike in server CPU usage, signaling a potential issue.

2. Alerting: The system triggers alerts to the operations team, notifying them of the abnormal server load and providing initial details.

3. Automated Response: Automated scripts kick in to scale up additional server instances, redistributing the load to prevent performance degradation.

4. Verification: The operations team reviews system metrics to confirm if the automated response effectively mitigated the issue, ensuring stability.

5. Notification: Upon successful resolution, notifications are sent to relevant stakeholders, updating them on the incident’s detection, response, and resolution.
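
The five steps can be sketched as a single function. This is a simplified illustration under stated assumptions: detection is a fixed threshold check rather than an anomaly detection algorithm, verification is an automated re-check rather than a review by the operations team, and the `Cluster` model, the 80% threshold, and the instance limit are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List

CPU_THRESHOLD_PERCENT = 80.0  # assumed alerting threshold
MAX_INSTANCES = 10            # assumed scaling limit


@dataclass
class Cluster:
    """Toy model: load is spread evenly across all instances."""
    instances: int
    total_load: float  # arbitrary units; CPU % = total_load / instances
    events: List[str] = field(default_factory=list)

    def cpu_percent(self) -> float:
        return self.total_load / self.instances


def handle_cpu_incident(cluster: Cluster) -> bool:
    """Run the five steps; return True if an incident was detected and resolved."""
    # 1. Detection: a simple threshold check.
    if cluster.cpu_percent() <= CPU_THRESHOLD_PERCENT:
        return False

    # 2. Alerting: record the abnormal load with initial details.
    cluster.events.append(f"ALERT: CPU at {cluster.cpu_percent():.0f}%")

    # 3. Automated response: add instances until the load is acceptable.
    while (cluster.cpu_percent() > CPU_THRESHOLD_PERCENT
           and cluster.instances < MAX_INSTANCES):
        cluster.instances += 1
        cluster.events.append(f"SCALE: now {cluster.instances} instances")

    # 4. Verification: confirm the response actually mitigated the issue.
    resolved = cluster.cpu_percent() <= CPU_THRESHOLD_PERCENT

    # 5. Notification: report the outcome to stakeholders.
    cluster.events.append(
        "RESOLVED" if resolved else "ESCALATE: manual intervention needed"
    )
    return resolved
```

The escalation branch matters in practice: if scaling hits its limit without bringing the load down, the automation should hand the incident over to a human rather than report success.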

Challenges and Pitfalls in Observability

Like any other aspect, Observability has its own set of challenges and pitfalls. The aspects mentioned in this picture are self-explanatory and should be clear without further elaboration.

[Image: challenges and pitfalls in observability]

Observability Tools

When selecting an observability tool, numerous factors come into play, influencing our choices. Considerations such as

  • Budget constraints

  • Scale of services

  • Data volume

  • Feature updates

  • Open-source nature

  • Visualization

  • Alerting

  • Keeping SDKs up to date (code instrumentation)

all play pivotal roles. Below, I have compiled a list of some of the top observability tools and frameworks that can be chosen based on your specific requirements.

[Image: list of observability tools and frameworks]

Final Thought

To sum it up, having good observability and monitoring is important for making software successful. Getting quick insights helps fix problems fast and makes sure the application works well. Picking a platform that suits your needs is crucial because dealing with complex observability tools can be tricky. Knowing how to compare tools and find the best one for what you need is important. Understanding the basics of observability concepts can be helpful.

Follow me on Medium or YouTube if you are interested in back-end development subjects 🤞

🎬 Youtube
📖 Medium
💻 Linkedin

Cheers, Matt Ghafouri
