AWS CLOUD/DEVOPS OBSERVABILITY
SIMON MAFANY E.
Posted on November 15, 2024
“What cannot be measured cannot be managed.” - Peter Drucker
Improvements come from observability (monitoring, measurements, troubleshooting and controls). Practically, effective management involves consistent observability.
Observability is a core principle of the DevOps culture.
Observability provides the tools and techniques to measure various aspects of cloud infrastructure and applications. By quantifying performance, availability and user experience, we can effectively manage and optimize our systems.
The truth is, applications are becoming increasingly complex, distributed, and cloud-native. Optimal performance, reliability and the best user experience are among the core needs of organizations. Meeting them requires a comprehensive approach to monitoring and troubleshooting systems.
In this article, I will discuss the concept of Observability, its importance and how AWS can be used to effectively implement observability solutions in your projects.
AGENDA
a. What is Cloud/DevOps Observability?
b. Why is Observability Important?
c. Benefits of Observability
d. Tools and Technologies (Open-source, Commercial and Cloud)
e. Leveraging AWS for Cloud/DevOps Observability
WHAT IS OBSERVABILITY?
In the context of Cloud/DevOps, Observability refers to the practice of collecting, processing, and analyzing telemetry data from various components of a system to gain deep insights into its behavior, health and performance.
Before proceeding, I must first make sure we understand the key concept in that definition: telemetry data.
Telemetry data includes the various metrics, logs and traces generated by a system (software application, network or infrastructure).
- Metrics: Numerical measurements of system performance, such as CPU utilization, memory usage, and network traffic.
- Logs: Textual records of events and errors generated by applications and infrastructure components.
- Traces: Time-stamped records of requests as they propagate through a distributed system.
By combining these three data sources, organizations can identify and resolve issues quickly, optimize system performance, and proactively prevent unwanted scenarios. One of the main goals of Observability is to improve overall reliability.
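To make the three data sources a little more concrete, here is a minimal Python sketch of how a metric, a log event and a trace span might be represented and combined during troubleshooting. The class names, field names and sample values are my own illustrative assumptions, not the data model of any particular tool.

```python
from dataclasses import dataclass


@dataclass
class Metric:
    name: str         # e.g. "CPUUtilization"
    value: float
    unit: str
    timestamp: float  # seconds, illustrative only


@dataclass
class LogEvent:
    timestamp: float
    level: str
    message: str
    trace_id: str     # hypothetical field linking a log line to a request


@dataclass
class Span:
    trace_id: str
    service: str
    start: float
    end: float

    @property
    def duration(self) -> float:
        return self.end - self.start


def logs_for_trace(logs, trace_id):
    """Return the log events that belong to one request."""
    return [e for e in logs if e.trace_id == trace_id]


def slowest_span(spans, trace_id):
    """Return the span that took the longest within one request."""
    candidates = [s for s in spans if s.trace_id == trace_id]
    return max(candidates, key=lambda s: s.duration)


# Made-up sample data for illustration.
metric = Metric("CPUUtilization", 91.5, "Percent", 100.0)
logs = [
    LogEvent(100.2, "ERROR", "payment timeout", "trace-1"),
    LogEvent(100.3, "INFO", "order created", "trace-2"),
]
spans = [
    Span("trace-1", "checkout", 100.0, 100.9),
    Span("trace-1", "payment", 100.1, 100.8),
    Span("trace-2", "checkout", 100.0, 100.2),
]

if __name__ == "__main__":
    print(metric)
    print(slowest_span(spans, "trace-1").service)
    print([e.message for e in logs_for_trace(logs, "trace-1")])
```

The metric tells you that something is wrong, the trace points to where the time went, and the logs attached to the same request explain why.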
IMPORTANCE OF OBSERVABILITY
Honestly, we cannot overstate the power and importance of observability, especially in the cloud space, where business objectives such as system reliability, enhanced security, rapid deployment, cost reduction, resilience, scalability and optimized performance are held in high esteem. Among many, I have highlighted five core reasons why observability matters. (Feel free to extend the list. These are just my personal preferences.)
- Cost Optimization: Observability can help control costs by identifying inefficient or underused resources and optimizing resource utilization.
- Enhanced Reliability and Security: Proactive monitoring can help detect and address potential issues before they escalate into major challenges. Security threats can be detected and addressed, and security compliance can also be monitored and enforced.
- Accelerated Incident Response: Observability tools can help identify the root cause of issues, enabling faster resolution times.
- Faster Feedback Loop: Observability enables DevOps teams to receive immediate feedback on the impact of changes, which helps to ensure faster iterations in the DevOps cycle.
- Data-Driven Decision Making: Observability data can provide valuable insights to inform strategic decisions.
There are many more advantages which I might not be aware of yet, but I am sharing my experience from projects.
TOOLS AND TECHNOLOGIES
I understand that after reading all this theory, a hands-on person like myself will be eager to know what tools are used to implement this amazing concept across an organization’s IT systems. I will break down the tools into 3 groups (Open-source, Commercial and Cloud Provider based tools):
a. Open-Source Tools:
- Prometheus
- Grafana
- OpenTelemetry
b. Commercial Platforms:
- Datadog
- Splunk
c. Cloud Provider Solutions:
- AWS: CloudWatch
- GCP: Cloud Logging and Cloud Monitoring
- Azure: Azure Monitor
Both open-source and commercial solutions are cloud-agnostic, while cloud-provider tools are built primarily for the parent cloud provider.
NOTE: In this article, I will only dive deeper into AWS Observability offerings.
LEVERAGING AWS FOR CLOUD/DEVOPS OBSERVABILITY
While Prometheus and Grafana are among the most popular choices for observability, AWS hosts a suite of observability tools which satisfy different needs/requirements. You already know the very first tool I am going to mention:
1. Amazon CloudWatch:
When you hear monitoring and logging in AWS, CloudWatch should be the very first thing that comes to mind. It comprises a collection of features performing different tasks to ensure a smooth experience. These features include:
CloudWatch Unified Agent: An agent you must run on your EC2 instances or on-premises machines to capture and send logs to CloudWatch Logs. It also captures system-level metrics such as RAM and CPU usage.
CloudWatch Alarms: Used to trigger alarms based on certain metrics when thresholds are breached. An alarm can trigger certain actions on a target, e.g. an SNS notification.
CloudWatch Logs: A perfect place to store logs (logs can be stored in S3 as well). It captures different types of logs, including application logs, OS logs, access logs and AWS managed logs. This feature alone provides plenty of flexibility, with built-in sub-features for various kinds of log manipulation. I will just mention 4 of those. To get a full picture of Amazon CloudWatch Logs, check out this link: aws-cloudwatch-log
- CloudWatch Logs Insights: used to query and analyze logs.
- CloudWatch Logs Subscriptions: used to create real-time exports which can be sent to Amazon Kinesis or Lambda for analysis.
- CloudWatch Logs Metric Filters: used to match expressions in logs and turn the matches into metrics, e.g. counting “ERROR” log messages.
- CloudWatch Logs S3 Export: used to perform batch exports of logs to S3 for long-term storage or analysis.
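As a rough illustration of the metric filter idea above, the sketch below counts “ERROR” lines locally and builds the kind of parameters that the CloudWatch Logs `put_metric_filter` API accepts. The log group name, the metric namespace `MyApp/Logs` and the sample log lines are hypothetical, and the local counting function is only an approximation of how a simple term pattern behaves, not the service's actual matching logic.

```python
def count_matching_events(log_lines, term="ERROR"):
    """Local approximation of a simple term filter: count lines containing the term."""
    return sum(1 for line in log_lines if term in line)


def build_metric_filter_params(log_group, term="ERROR"):
    """Build keyword arguments for the CloudWatch Logs put_metric_filter call."""
    return {
        "logGroupName": log_group,            # hypothetical log group
        "filterName": f"{term.lower()}-count",
        "filterPattern": term,                # simple term pattern
        "metricTransformations": [
            {
                "metricName": f"{term.title()}Count",
                "metricNamespace": "MyApp/Logs",  # assumed custom namespace
                "metricValue": "1",               # emit 1 per matching event
                "defaultValue": 0,
            }
        ],
    }


# Made-up log lines for illustration.
sample_lines = [
    "2024-11-15 INFO started",
    "2024-11-15 ERROR db timeout",
    "2024-11-15 ERROR retry failed",
]

if __name__ == "__main__":
    print(count_matching_events(sample_lines))
    params = build_metric_filter_params("/my-app/production")
    print(params["metricTransformations"][0]["metricName"])
    # Assuming boto3 is installed and AWS credentials are configured,
    # the parameters could then be applied with:
    #   boto3.client("logs").put_metric_filter(**params)
```

Keeping the parameters in a plain dictionary makes it easy to review the filter before anything is created in an AWS account.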
CloudWatch Metrics: These are numeric values captured over time (a variable to monitor). Metrics belong to namespaces (typically one per service), and each metric can have dimensions (attributes describing the metric). From a group of metrics, we can create a dashboard.
CloudWatch Events (now Amazon EventBridge): This is a powerful standalone service which helps in building event-driven applications. It responds to events within the AWS Cloud. It can be used for real-time event handling, event routing and tracking event history.
CloudWatch Dashboards: Interactive dashboards created from metrics.
CloudWatch Synthetics Canaries: A proactive tool used for automated testing. Here, you write scripts (canaries) that simulate real user interactions. Best used for testing web app UX and API behavior.
Truth is, you can actually feel overwhelmed by the many services and tools AWS offers, but with time you will get to know which is best for specific use cases. These are the main tools you are likely to see and use when working with Amazon CloudWatch for monitoring and logging.
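To show how the metrics and alarms described above fit together, here is a minimal sketch that builds the parameters for a CPU alarm in the shape the CloudWatch `put_metric_alarm` API accepts, plus a deliberately simplified local check of when such an alarm would fire. The instance ID, topic ARN, threshold and datapoints are made-up values, and the real service evaluates alarms with more nuance (missing data handling, "M out of N" datapoints) than this check does.

```python
def build_cpu_alarm_params(instance_id, topic_arn, threshold=80.0):
    """Build keyword arguments for the CloudWatch put_metric_alarm call."""
    return {
        "AlarmName": f"high-cpu-{instance_id}",
        "Namespace": "AWS/EC2",
        "MetricName": "CPUUtilization",
        "Dimensions": [{"Name": "InstanceId", "Value": instance_id}],
        "Statistic": "Average",
        "Period": 300,                # seconds per datapoint
        "EvaluationPeriods": 2,       # consecutive periods to evaluate
        "Threshold": threshold,
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [topic_arn],  # e.g. an SNS topic to notify
    }


def would_alarm(datapoints, threshold, evaluation_periods):
    """Simplified rule: alarm when the last N datapoints all exceed the threshold."""
    recent = datapoints[-evaluation_periods:]
    return len(recent) == evaluation_periods and all(v > threshold for v in recent)


if __name__ == "__main__":
    params = build_cpu_alarm_params(
        "i-0123456789abcdef0",                           # hypothetical instance
        "arn:aws:sns:us-east-1:111122223333:ops-alerts",  # hypothetical topic
    )
    print(params["AlarmName"])
    print(would_alarm([50, 85, 90], params["Threshold"], params["EvaluationPeriods"]))
    # Assuming boto3 is installed and AWS credentials are configured:
    #   boto3.client("cloudwatch").put_metric_alarm(**params)
```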
2. AWS X-Ray
Another powerful offering by AWS for tracing distributed applications. Best for Microservices.
3. AWS CloudTrail
Provides governance, compliance and auditing for AWS accounts. It records API calls and events in an AWS account, which makes it a good tool for monitoring and troubleshooting. It records 3 kinds of events:
- Management Events: operations performed on resources in an AWS account.
- Data Events: operations performed on or within a resource, such as S3 object-level activity and Lambda function invocations.
- Insights Events: generated when CloudTrail detects unusual activity in an account, such as abnormal API call volumes or error rates.
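As a small illustration of these event kinds, the sketch below groups a few CloudTrail-style records by their `eventCategory` field. The records are heavily trimmed, made-up examples; real CloudTrail records carry many more fields.

```python
from collections import defaultdict

# Made-up, trimmed records for illustration only.
SAMPLE_RECORDS = [
    {"eventName": "RunInstances", "eventSource": "ec2.amazonaws.com", "eventCategory": "Management"},
    {"eventName": "GetObject", "eventSource": "s3.amazonaws.com", "eventCategory": "Data"},
    {"eventName": "DeleteBucket", "eventSource": "s3.amazonaws.com", "eventCategory": "Management"},
]


def group_by_category(records):
    """Group event names by category, keeping the original order."""
    groups = defaultdict(list)
    for record in records:
        category = record.get("eventCategory", "Unknown")
        groups[category].append(record["eventName"])
    return dict(groups)


if __name__ == "__main__":
    for category, names in group_by_category(SAMPLE_RECORDS).items():
        print(category, names)
```

A grouping like this is a quick first step when auditing: management events show who changed the infrastructure, while data events show who touched the data itself.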
4. AWS Distro for OpenTelemetry
OpenTelemetry is an open-source observability tool, as we have seen above.
AWS Distro for OpenTelemetry is the AWS-supported distribution of the open-source project. This provides flexibility to those who are already using OpenTelemetry, making it easy for them to integrate it into their AWS environment.
5. Amazon Managed Grafana
6. Amazon Managed Service for Prometheus
It is worth noting that there are many other services that connect with these identified services to help build powerful observability solutions. For example, use Amazon Athena to query log data stored in S3, and use Amazon QuickSight to create powerful dashboards from analyzed log data.
Integrating Lambda, EventBridge and SNS services can help you build very powerful serverless monitoring, observability and event responses solutions.
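As one possible shape for such a serverless pipeline, here is a minimal sketch of a Lambda handler that receives a CloudWatch alarm state change event from EventBridge and turns it into a notification. The sample event is trimmed and made up, and the `publish` parameter is a hypothetical hook I added so the sketch runs without AWS access; in a real function it could wrap the SNS `publish` call for a topic of your choice.

```python
def format_alarm_message(event):
    """Turn an alarm state change event into a (subject, body) pair."""
    detail = event.get("detail", {})
    name = detail.get("alarmName", "unknown-alarm")
    state = detail.get("state", {})
    new_state = state.get("value", "UNKNOWN")
    old_state = detail.get("previousState", {}).get("value", "UNKNOWN")
    reason = state.get("reason", "")
    subject = f"[{new_state}] {name}"
    body = f"{name} changed from {old_state} to {new_state}. {reason}".strip()
    return subject, body


def lambda_handler(event, context, publish=None):
    """Lambda entry point; `publish` is an injectable sender (assumption)."""
    subject, body = format_alarm_message(event)
    if publish is not None:
        publish(Subject=subject, Message=body)
    return {"subject": subject, "message": body}


# Trimmed, made-up event for illustration.
SAMPLE_EVENT = {
    "source": "aws.cloudwatch",
    "detail-type": "CloudWatch Alarm State Change",
    "detail": {
        "alarmName": "high-cpu-demo",
        "state": {"value": "ALARM", "reason": "Threshold Crossed"},
        "previousState": {"value": "OK"},
    },
}

if __name__ == "__main__":
    print(lambda_handler(SAMPLE_EVENT, None))
    # With boto3 available, a real sender might look like:
    #   sns = boto3.client("sns")
    #   publish = functools.partial(sns.publish, TopicArn="<your-topic-arn>")
```

Separating the formatting from the sending keeps the handler easy to test locally before it is wired to EventBridge and SNS.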
To this effect, I can say I have tried my very best to demystify the powerful concept of Observability.
Please note that what I have shared here is my personal knowledge and experience. The AWS Documentation provides the best information you need. If you find value in the experience I have shared, kindly hit like and/or drop a comment.
Additional Resources
https://aws.amazon.com/cloudops/monitoring-and-observability/
https://www.servicenow.com/products/observability.html