Back2Basics: Monitoring Workloads on Amazon EKS

Overview

We're down to the last part of this series✨ In this part, we will explore monitoring solutions. Remember the voting app we deployed? We will set up a basic dashboard to monitor each component's CPU and memory utilization. Additionally, we'll test how the application behaves under load.

If you haven't read the second part, you can check it out here:

Back2Basics: Running Workloads on Amazon EKS

Romar Cablao for AWS Community Builders ・ Jun 19

Grafana & Prometheus

To start with, let's briefly discuss the solutions we will be using. Grafana and Prometheus are a common pairing for monitoring metrics, creating dashboards and setting up alerts. Both are open source and can be deployed on a Kubernetes cluster, which is exactly what we will do shortly.

Grafana is open source visualization and analytics software. It allows you to query, visualize, alert on, and explore your metrics, logs, and traces no matter where they are stored. It provides you with tools to turn your time-series database data into insightful graphs and visualizations. Read more: https://grafana.com/docs/grafana/latest/fundamentals/
Prometheus is an open-source systems monitoring and alerting toolkit. It collects and stores its metrics as time series data, i.e. metrics information is stored with the timestamp at which it was recorded, alongside optional key-value pairs called labels. Read more: https://prometheus.io/docs/introduction/overview/

Alternatively, you can use an AWS native service like Amazon CloudWatch, or a managed service like Amazon Managed Service for Prometheus and Amazon Managed Grafana. However, in this part, we will only cover self-hosted Prometheus and Grafana, which we will host on Amazon EKS.

Let's get our hands dirty!

As in the previous activity, we will use the same repository. First, make sure to uncomment all commented lines in 03_eks.tf, 04_karpenter.tf and 05_addons.tf to enable Karpenter and the other addons we used in the previous activity.

Second, enable Grafana and Prometheus by adding these lines to terraform.tfvars:

enable_grafana    = true
enable_prometheus = true

Once updated, we have to run tofu init, tofu plan and tofu apply. When prompted to confirm, type yes to proceed with provisioning the additional resources.

Accessing Grafana

We need credentials to access Grafana. The default username is admin and the auto-generated password is stored in a Kubernetes secret. To retrieve the password, you can use the command below:

kubectl -n grafana get secret grafana -o jsonpath="{.data.admin-password}" | base64 -d
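Why the base64 -d at the end? Kubernetes stores the values under a secret's data field base64-encoded, so the raw output of jsonpath is not the password itself. Here is a minimal Python sketch of that decoding step. The function name is made up for illustration, and the sample value is just the word admin encoded, not a real password.

```python
import base64


def decode_secret_value(encoded: str) -> str:
    """Decode one base64-encoded value taken from a Secret's .data map."""
    return base64.b64decode(encoded).decode("utf-8")


# Illustrative value only: this is the string "admin" in base64.
print(decode_secret_value("YWRtaW4="))
```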

This is what the home or landing page would look like. You have the navigation bar on the left side where you can navigate through different features of Grafana, including but not limited to Dashboards and Alerting.

It's also worth mentioning the Prometheus server that we have deployed. You might be asking: does the Prometheus server have a UI? Yes, it does. You can even run PromQL queries there and check the health of the scrape targets. However, we will use Grafana for visualization instead.

Setting up our first data source

Before we can create dashboards and alerts, we first have to configure the data source.

First, expand the Connections menu and click Data Sources.

Click Add data source. Then select Prometheus.

Set the Prometheus server URL to http://prometheus-server.prometheus.svc.cluster.local. Since Prometheus and Grafana reside on the same cluster, we can use the Kubernetes service as the endpoint.
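The URL above follows the usual in-cluster DNS pattern of service name, then namespace, then svc.cluster.local. As a rough sketch (the helper names are hypothetical), this is how that address is put together and how an instant query against the Prometheus HTTP API endpoint /api/v1/query would be formed. The sketch only builds strings; the address itself resolves only from inside the cluster.

```python
from urllib.parse import urlencode


def service_url(service: str, namespace: str) -> str:
    """Build the in-cluster HTTP address of a Kubernetes service (default port 80)."""
    return f"http://{service}.{namespace}.svc.cluster.local"


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    return f"{base_url}/api/v1/query?{urlencode({'query': promql})}"


base = service_url("prometheus-server", "prometheus")
print(base)
# "up" is a metric Prometheus records for every scrape target.
print(instant_query_url(base, "up"))
```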

Leave the other settings at their defaults. Once done, click Save & test.

Now we have our first data source! We will use this to create dashboards in the next few sections.

Grafana Dashboards

Let’s start by importing an existing dashboard. Dashboards can be searched here: https://grafana.com/grafana/dashboards/

For example, consider this dashboard - 315: Kubernetes Cluster Monitoring via Prometheus

To import this dashboard, either copy the dashboard ID or download the JSON model. In this case, use the dashboard ID 315 and import it into our Grafana instance.

Select the Prometheus data source we configured earlier. Then click Import.

You will then be redirected to the dashboard and it should look like this:

Yey🎉 We now have our first dashboard!

Let's Create a Custom Dashboard for our Voting App

Copy this JSON model and import it into our Grafana instance. This is similar to the steps above, but this time, instead of ID, we'll use the JSON field to paste the copied template.

Once imported, the dashboard should look like this:

Here we have visualizations for basic metrics such as CPU and memory utilization for each component. Replica count and node count are also part of the dashboard, so we can later check how the vote-app component behaves when it autoscales.
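Curious what a panel query for these metrics could look like? The sketch below builds two PromQL expressions from the standard cAdvisor metrics container_cpu_usage_seconds_total and container_memory_working_set_bytes, filtered by namespace. Treat it as an assumption for illustration: the queries inside the imported JSON model may differ, and the function names are made up.

```python
def cpu_usage_query(namespace: str, window: str = "5m") -> str:
    """PromQL: CPU cores used per pod, averaged over the given window."""
    return (
        "sum by (pod) (rate(container_cpu_usage_seconds_total"
        f'{{namespace="{namespace}"}}[{window}]))'
    )


def memory_usage_query(namespace: str) -> str:
    """PromQL: working-set memory in bytes per pod."""
    return (
        "sum by (pod) (container_memory_working_set_bytes"
        f'{{namespace="{namespace}"}})'
    )


print(cpu_usage_query("voting-app"))
print(memory_usage_query("voting-app"))
```

Because the namespace is a label filter in the query, deploying the app under a different namespace means the filter has to change as well.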

Let's Test!

If you haven't deployed the voting-app, please refer to the command below:

helm -n voting-app upgrade --install app -f workloads/helm/values.yaml thecloudspark/vote-app --create-namespace

You can customize the namespace voting-app and the release name app as needed, but you will then have to update the dashboard queries accordingly. I recommend using the command above with the same naming: voting-app for the namespace and app for the release name.

Back to our dashboard: When the vote-app has minimal load, it scales down to a single replica (1), as shown below.

Horizontal Pod Autoscaling in Action

The vote-app deployment has a Horizontal Pod Autoscaler (HPA) configured with a maximum of five replicas. This means the voting app will automatically scale up to five pods to handle increased load. We can observe this behavior when we apply the seeder deployment.

Now, let's test how the vote-app handles increased load using a seeder deployment.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: seeder
  namespace: voting-app
spec:
  replicas: 5
...

The seeder deployment simulates real user load by bombarding the vote-app with vote requests. It has five replicas and allows you to specify the target endpoint using an environment variable. In this example, we'll target the Kubernetes service directly instead of the load balancer.

...
        env:
        - name: VOTE_URL
          value: "http://app-vote.voting-app.svc.cluster.local/"
...
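To give a feel for what a seeder does, here is a small Python sketch of a load generator that reads VOTE_URL and sends POST requests to it. This is not the actual seeder code, which isn't shown here. The form field vote, the options a and b, and all function names are assumptions made for illustration.

```python
import os
import urllib.parse
import urllib.request

DEFAULT_URL = "http://app-vote.voting-app.svc.cluster.local/"


def target_url() -> str:
    """Read the target endpoint from VOTE_URL, falling back to the service URL."""
    return os.environ.get("VOTE_URL", DEFAULT_URL)


def build_vote_request(url: str, choice: str) -> urllib.request.Request:
    """Build a form-encoded POST carrying one vote (field name is assumed)."""
    body = urllib.parse.urlencode({"vote": choice}).encode("ascii")
    return urllib.request.Request(url, data=body, method="POST")


def send_votes(url: str, count: int, send=urllib.request.urlopen) -> int:
    """Send `count` votes, alternating between options a and b."""
    sent = 0
    for i in range(count):
        choice = "a" if i % 2 == 0 else "b"
        send(build_vote_request(url, choice))
        sent += 1
    return sent
```

Calling send_votes(target_url(), 1000) from a pod inside the cluster would fire a thousand requests at the service. The send parameter is there so the sending function can be swapped out, for example when trying the sketch without a cluster.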

To apply, use the command below:

kubectl apply -f workloads/seeder/seeder-app.yaml

After a few seconds, monitor your dashboard. You'll see the vote-app replicas increase to handle the load generated by the seeder.

D:\> kubectl -n voting-app get hpa
NAME                 REFERENCE                        TARGETS         MINPODS   MAXPODS   REPLICAS   AGE
app-vote-hpa         Deployment/app-vote              cpu: 72%/80%    1         5         5          12m

Since the vote-app chart's default maximum for the Horizontal Pod Autoscaler (HPA) is five, the replica count for this deployment stops at five.
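How does the HPA arrive at a replica count? The Kubernetes documentation describes the core calculation as the current replica count multiplied by the ratio of the current metric value to the target value, rounded up, and then kept within the configured minimum and maximum. Below is a simplified Python sketch of that formula. It assumes the 80% CPU target and the 1 to 5 replica range seen in the output above, uses the documented default tolerance of 0.1, and ignores details such as pod readiness and missing metrics.

```python
import math


def desired_replicas(current_replicas: int,
                     current_utilization: float,
                     target_utilization: float = 80.0,
                     min_replicas: int = 1,
                     max_replicas: int = 5,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA calculation for a utilization-based metric."""
    ratio = current_utilization / target_utilization
    # Close enough to the target: keep the current replica count.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    # Never go below the minimum or above the maximum.
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(1, 200))  # one pod at 200% of its CPU request
print(desired_replicas(3, 200))  # three pods, still at 200%
```

In the second call the raw result would be eight replicas, but the maximum caps it at five, which matches what we see on the dashboard.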

Stopping the Load and Scaling Down

Once you've observed the scaling behavior, delete the seeder deployment to stop the simulated load:

kubectl delete -f workloads/seeder/seeder-app.yaml

Give the dashboard a few minutes and observe the vote-app scaling down. With no more load, the HPA will reduce replicas, down to a minimum of one. This may also lead to a node being decommissioned by Karpenter if pod scheduling becomes less demanding.

You'll see that the vote-app eventually scales in now that there is less load. As you might see above, the node count also changed from two to one, which shows the power of Karpenter.

PS D:\> kubectl -n voting-app get hpa
NAME                 REFERENCE                        TARGETS        MINPODS   MAXPODS   REPLICAS   AGE
app-vote-hpa         Deployment/app-vote              cpu: 5%/80%    1         5         2          18m
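Why does scaling down take a few minutes even when CPU is already low? By default, the HPA applies a scale-down stabilization window of 300 seconds: it looks at the replica counts it recommended during that window and uses the highest one. The Python sketch below illustrates the idea. It assumes the default window (the chart may configure a different one), and the function name and sample history are made up for illustration.

```python
def stabilized_replicas(history, current, now, window_seconds=300):
    """Pick the highest recommendation seen within the stabilization window.

    history: list of (timestamp_seconds, recommended_replicas) tuples.
    current: the recommendation computed right now.
    """
    recent = [replicas for ts, replicas in history if now - ts <= window_seconds]
    return max(recent + [current])


# Hypothetical recommendations recorded after the load stopped at t=0.
history = [(0, 5), (120, 3), (240, 2)]
print(stabilized_replicas(history, current=1, now=290))
print(stabilized_replicas(history, current=1, now=310))
print(stabilized_replicas(history, current=1, now=600))
```

With this sample history, the replica count steps down gradually as older, higher recommendations fall out of the window, rather than dropping to one right away.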

Challenge: Scaling Workloads

We've successfully enabled autoscaling for the vote-app component using Horizontal Pod Autoscaler (HPA). This is a powerful technique to manage resource utilization in Kubernetes. But HPA isn't limited to just one component.

Tip: Explore the ArtifactHub: Vote App configuration in more detail. You'll find additional configurations related to HPA that you can leverage for other deployments.

Conclusion

Yey! You've reached the end of the Back2Basics: Amazon EKS Series🌟🚀. This series provided a foundational understanding of deploying and managing containerized applications on Amazon EKS. We covered:

Provisioning an EKS cluster using OpenTofu
Deploying workloads leveraging Karpenter
Monitoring applications using Prometheus and Grafana

While Kubernetes can have a learning curve, hopefully, this series empowered you to take your first steps. Ready to level up? Let me know in the comments what Kubernetes topics you'd like to explore next!