TKGI: Observability challenge
Philippe Bürgisser
Posted on March 3, 2021
Introduction
In this post, we’re going to review the observability options for a multi-cluster Kubernetes setup managed by VMware TKGI (Tanzu Kubernetes Grid Integrated).
When deployed using the TKGI toolset, Kubernetes comes with the concept of metric sinks to collect data from the platform and/or from applications. Based on Telegraf, the metrics are pushed to a destination that has to be set in the ClusterMetricSink CR object.
In our use case, TKGI is used to deploy one Kubernetes cluster per application/environment (dev, qa, prod), from which we need to collect metrics. For this customer, we also operate a Prometheus stack that scrapes data from traditional virtual machines and from containers running on OpenShift, in order to handle alarms and to offer dashboards to end users via Grafana.
We have explored different implementation architectures that fit our current monitoring system and our internal processes.
Architecture 1
In this scenario, we leverage the (Cluster)MetricSink provided by VMware, configured to push the data into a central InfluxDB database. The data Telegraf pushes can either come from metrics pushed to the Telegraf agent or be scraped by Telegraf itself. Telegraf runs as a pod on each node, deployed via a DaemonSet. Grafana has a data source connector able to connect to InfluxDB.
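As a rough illustration, a ClusterMetricSink pushing to a central InfluxDB could look like the sketch below; the URL and database name are placeholders, and the exact options depend on the Telegraf influxdb output plugin used by the sink.

apiVersion: pksapi.io/v1beta1
kind: ClusterMetricSink
metadata:
  name: influxdb-sink
spec:
  inputs:
  - monitor_kubernetes_pods: true
    type: prometheus
  outputs:
  - type: influxdb
    urls: ["http://influxdb.example.com:8086"]   # placeholder central InfluxDB address
    database: tkgi_metrics                       # placeholder database name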
Pros
- Easiest implementation
- No extra software to deploy on Kubernetes
- Multi-tenancy of data
- RBAC for data access
Cons
- InfluxDB cannot scale horizontally and there is no HA in the free version
- Need to rewrite Grafana dashboards to match the InfluxDB query language
- Extra work to integrate with our current alerting flow
Architecture 2
Telegraf is able to expose data using the Prometheus format over an HTTP endpoint. This configuration is done using the MetricSink CR. Prometheus will then scrape the Telegraf service.
When Telegraf is deployed on each node using a DaemonSet, it comes with a Kubernetes service through which the exposed endpoint can be reached. As Prometheus sits outside of the targeted cluster, it cannot directly access each Telegraf endpoint and has to go through that Kubernetes service. The main drawback of this architecture is that we cannot ensure all endpoints are scraped evenly, which may create gaps in the metrics. We have also noticed that when Telegraf is configured to expose Prometheus data over HTTP, the service isn’t updated to match the newly exposed port. One solution would have been to create another service in the namespace where Telegraf resides, but due to RBAC we aren’t allowed to do so.
apiVersion: pksapi.io/v1beta1
kind: ClusterMetricSink
metadata:
  name: my-demo-sink
spec:
  inputs:
  - monitor_kubernetes_pods: true
    type: prometheus
  outputs:
  - type: prometheus_client
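On the Prometheus side, service discovery from outside the cluster goes through the Kubernetes API. A minimal sketch of such a scrape job is shown below; the API server URL, credential paths and the relabeling on the Telegraf service name are assumptions to adapt to the actual environment.

scrape_configs:
  - job_name: tkgi-telegraf
    kubernetes_sd_configs:
      - role: endpoints
        api_server: https://tkgi-cluster.example.com:8443   # placeholder cluster API endpoint
        tls_config:
          ca_file: /etc/prometheus/tkgi-ca.crt              # placeholder CA certificate
        bearer_token_file: /etc/prometheus/tkgi-token       # placeholder service account token
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_name]
        regex: telegraf.*                                   # keep only the Telegraf service endpoints
        action: keep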
Pros
- We can leverage Telegraf and the MetricSinks
- Integration with our existing Prometheus stack
- Prometheus ServiceDiscovery possible through Kubernetes API
Cons
- No direct access to Telegraf endpoints
- Depending on the number of targets to discover for each Kubernetes cluster, service discovery performance can be impacted
Architecture 3
In this architecture, we configure Prometheus to directly scrape exporters running on each cluster. Unfortunately, each replica of a pod running an exporter exposes its endpoint through a Kubernetes service. As mentioned in architecture 2, Prometheus, living outside the cluster, cannot directly scrape each endpoint, so we can’t ensure the scraping is done evenly.
Pros
- Integration with our Prometheus stack
- Prometheus ServiceDiscovery possible through Kubernetes API
Cons
- No direct access to exporter endpoints
- Not good for scaling
Architecture 4
This is a hybrid approach where we leverage the metric tooling provided by VMware. All the metrics are pushed into an InfluxDB exporter acting as a proxy cache, which is then scraped by Prometheus.
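The output side of such a ClusterMetricSink could be sketched as below; the exporter address and the use of Telegraf’s influxdb output plugin against the Prometheus influxdb_exporter are assumptions, not a tested configuration. Prometheus then scrapes the exporter’s /metrics endpoint, and data expiration has to be handled on the exporter side (the last concern listed below).

spec:
  outputs:
  - type: influxdb
    urls: ["http://influxdb-exporter.example.com:9122"]   # placeholder influxdb_exporter address
    skip_database_creation: true                          # the exporter is not a real InfluxDB server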
Pros
- Leveraging VMware tooling
Cons
- The InfluxDB exporter becomes a single point of failure (SPOF)
- Extra components to manage
- No Prometheus ServiceDiscovery available
- Handling of data expiration
Architecture 5
In this architecture we introduce PushProx, composed of a proxy running on the same cluster as Prometheus and agents running on each Kubernetes cluster. These agents initiate a connection towards the proxy to create a tunnel, so Prometheus can directly scrape each endpoint through the tunnel.
Each scrape configuration needs to reference the proxy:
scrape_configs:
  - job_name: node
    proxy_url: http://proxy:8080/
    static_configs:
      - targets: ['client:9100']
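On the cluster side, each node needs a PushProx client pointing back to the proxy. A hypothetical DaemonSet container snippet is sketched below; the image reference and proxy address are placeholders, only the --proxy-url flag comes from the upstream PushProx client.

containers:
  - name: pushprox-client
    image: pushprox-client:latest                              # placeholder image reference
    args:
      - --proxy-url=http://prometheus-proxy.example.com:8080   # placeholder proxy address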
Pros
- Bypass network segmentation
- Integration with our Prometheus stack
Cons
- No Prometheus ServiceDiscovery
- Scaling issue
- Extra component to manage
Architecture 6
In this architecture, a Prometheus instance is deployed on each cluster to scrape the targets residing in that same cluster. With this design, the data is stored on each instance. The major difference in this approach is that only Alertmanager and Grafana are shared across all clusters.
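Each per-cluster Prometheus can, for instance, be pointed at the shared Alertmanager with a configuration along these lines; the Alertmanager address is a placeholder.

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager.central.example.com:9093']   # placeholder shared Alertmanager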
Pros
- Best integration with our Prometheus stack
- Multi-tenancy
- Federation possible
Cons
- Memory and CPU footprint due to repeating the same services on every cluster
- Does not use any TKGI metric component
- Multiple instances to manage
Conclusion
After testing almost all of these architectures, we came to the conclusion that architecture 6 is the best match for our current architecture and needs. We also favored Prometheus because it can easily be deployed using the operator, and features such as HA are managed automatically. We did, however, have to make some compromises, such as not using the TKGI metric components and “reinventing the wheel”, as we believe that monitoring and alerting should be done by pulling data rather than pushing it.
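As an illustration of why the operator simplifies HA, a minimal Prometheus custom resource might look like the sketch below; the name, namespace and selectors are placeholders for whatever the actual deployment uses.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: k8s
  namespace: monitoring          # placeholder namespace
spec:
  replicas: 2                    # the operator runs and manages the HA pair
  serviceAccountName: prometheus # placeholder service account
  serviceMonitorSelector: {}     # select all ServiceMonitors in the namespace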
Disclaimer
This research was conducted on a TKGI environment that was not installed or operated by Camptocamp.