Modern monitoring

jccguimaraes

João Guimarães

Posted on August 20, 2019


Abstract

I want to share my experience monitoring applications such as microservices and lambdas, and how too much monitoring (or too little of it) has an impact that goes beyond analysis or pretty dashboards.

If you have another opinion on the concepts described here, I encourage you to provide constructive feedback so that I, too, can learn from other people's ideas and thoughts.

Opinions are my own and not of my employer.

The concept

Monitoring is the combination of logging and metrics. The two are usually treated as separate areas of our applications, when in fact they can work together, and we can take advantage of this union.

Instead of logging everything (which is unhealthy) and adding metrics for everything (which is also unhealthy), we can complement a metric that has triggered an alert with detailed logs. This is modern monitoring!

Monitoring

Very briefly, logs represent information that can't be grouped but they provide unique details about an event that happened, while metrics represent information that can be grouped by events but don't provide unique details.

Consider that an HTTP request returned a 404 status code. We can use a counter metric called clientError, as an example, which will continue to increment whenever another 4xx error occurs. Detailed information about individual errors can be logged to provide additional information for troubleshooting purposes. You can correlate them by their timestamp.

Consider that the above error was caused by the following HTTP/1.1 request to the my-service application:

GET /my-path?id=my-resource HTTP/1.1
Host: www.my-host.com
Content-Type: application/json

and its corresponding response:

HTTP/1.1 404 Not Found
Date: Wed, 17 Jul 2019 10:36:20 GMT
Server: Apache/2.2.14 (Win32)

If your application, after processing the request, logs something like:

{
  "app"         : "my-service",
  "xcid"        : "uuid",
  "time"        : "2019-07-17T10:36:19.000Z",
  "host"        : "www.my-host.com",
  "method"      : "get",
  "path"        : "my-path",
  "statusCode"  : 404,
  "msg"         : "client error"
}

it would be a waste of resources, as the entry provides nothing that helps explain why the error happened!

We SHOULD only log information that will help identify why a certain event occurs without exposing sensitive data.

Although the next log may give us data to understand why the event occurred, it leaks sensitive data and should be avoided:

{
  "app"         : "my-service",
  "xcid"        : "uuid",
  "time"        : "2019-07-17T10:36:20.000Z",
  "host"        : "www.my-host.com",
  "method"      : "get",
  "path"        : "my-path",
  "statusCode"  : 404,
  "msg"         : "client error",
  "data": {
    "status"        : "deleted",
    "sensitiveKey1" : "sensitiveValue1",
    "sensitiveKey2" : "sensitiveValue2"
  }
}

The third log entry example provides valuable information for understanding why this event occurred:

{
  "app"         : "my-service",
  "xcid"        : "uuid",
  "time"        : "2019-07-17T10:36:20.000Z",
  "host"        : "www.my-host.com",
  "method"      : "get",
  "path"        : "my-path",
  "statusCode"  : 404,
  "msg"         : "'my-resource' does not exist"
}

This subtle difference allows you to investigate why that resource was deleted without exposing sensitive data.
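
One way to enforce this in code is an allow-list applied just before an entry is written. The following is only a sketch: the field names come from the log examples above, while `ALLOWED_FIELDS` and `safe_log_entry` are hypothetical names of my own, not part of any logging library.

```python
# Fields taken from the log examples above; anything else is dropped.
# ALLOWED_FIELDS and safe_log_entry are hypothetical names for this sketch.
ALLOWED_FIELDS = frozenset(
    {"app", "xcid", "time", "host", "method", "path", "statusCode", "msg"}
)


def safe_log_entry(entry):
    """Return a copy of entry containing only allow-listed top-level fields."""
    return {key: value for key, value in entry.items() if key in ALLOWED_FIELDS}


raw = {
    "app": "my-service",
    "statusCode": 404,
    "msg": "'my-resource' does not exist",
    "data": {"status": "deleted", "secret": "value"},  # must never be logged
}
print(safe_log_entry(raw))
```

An allow-list is used here rather than a deny-list because a new, unreviewed field is then dropped by default instead of leaking by default.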

Bottom line:

  • The metric clientError triggered an alarm for the event;
  • The log entry provided a reason why it happened.

We now have all the information to troubleshoot this event outside the monitoring scope.

This is a pretty simple example, but it shows how we need to weigh metrics and logs according to the situation and business value.
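
To make the pairing of a metric and a log entry concrete, here is a minimal Python sketch. It assumes an in-memory `Counter` as a stand-in for a real metrics client, and `record_response` is a hypothetical helper, not an API of any particular library.

```python
import json
from collections import Counter
from datetime import datetime, timezone

# Hypothetical in-memory stand-in for a real metrics client.
metrics = Counter()


def record_response(status_code, method, path, reason, now=None):
    """Count 4xx responses and return a JSON log line explaining the error."""
    if not 400 <= status_code < 500:
        return None  # nothing to count or log in this sketch
    metrics["clientError"] += 1  # cheap, groupable signal used for alerting
    now = now or datetime.now(timezone.utc)
    entry = {
        "app": "my-service",
        "time": now.isoformat(timespec="milliseconds"),
        "method": method,
        "path": path,
        "statusCode": status_code,
        "msg": reason,  # the "why", without any payload data
    }
    return json.dumps(entry)


line = record_response(404, "get", "my-path", "'my-resource' does not exist")
print(metrics["clientError"], line)
```

The counter is what an alert would watch; the returned line is what you would read once the alert fires, matching the two by timestamp.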

Logs should tell you why an event occurred, but without spelling out every detail behind it, or you'll risk exposing, once again, sensitive data.

There is no need to have logs that won't serve any purpose; they'll just cost you or your company money.

If your application is part of a high-availability setup, it is most likely backed by a load balancer or Auto Scaling Group of some sort, or you are simply spinning up some containers yourself. In either case, your application SHOULD log only exit codes that aren't expected.

When your services are under heavy load, they will spin up more containers, and when that load drops, containers are going to be spun down. Logging those predictable shutdowns will, again, provide no meaningful information.

AWS and Kubernetes notify a container when it has been ordered to shut down (typically with a SIGTERM signal, which conventionally results in exit code 143), allowing the application to recognise a planned shutdown and log only meaningful information.
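
As a rough sketch of that filter, assume containers are stopped with SIGTERM, so that an exit code of 0 or 143 (the conventional 128 + 15) indicates a planned shutdown. Both that assumption and the names `EXPECTED_EXIT_CODES` and `should_log_exit` are illustrative; check what your platform actually reports.

```python
import signal

# Assumption: the orchestrator stops containers with SIGTERM, so a clean exit (0)
# or the conventional 128 + SIGTERM code (143) is treated as expected here.
EXPECTED_EXIT_CODES = frozenset({0, 128 + int(signal.SIGTERM)})


def should_log_exit(exit_code):
    """Return True only for exit codes that are not part of normal scaling."""
    return exit_code not in EXPECTED_EXIT_CODES


for code in (0, 143, 1):
    print(code, "log" if should_log_exit(code) else "skip")
```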

Having predictable log objects can also help you manage and estimate the daily capacity you need from your Log Management Service (LMS) of choice. This is not the same as the one described here.
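
A back-of-the-envelope estimate could look like the sketch below. The rate and entry size are made-up numbers, and `daily_log_bytes` is a hypothetical helper; plug in your own measurements.

```python
SECONDS_PER_DAY = 86_400


def daily_log_bytes(entries_per_second, average_entry_bytes):
    """Estimate the bytes of log data produced per day at a steady rate."""
    return entries_per_second * average_entry_bytes * SECONDS_PER_DAY


# Hypothetical load: 50 entries per second, 300 bytes each.
print(daily_log_bytes(50, 300))  # 1296000000 bytes, roughly 1.3 GB per day
```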

What should a metric represent?

  • Business value to dashboards;
  • Information for triggering alerts (on its own or aggregated with other metrics).

Some LMS can add dimensions/tags/labels to metrics, which is great but can turn into a nightmare in terms of costs.

A bad example of a dimension is the hostname; instead, it SHOULD be part of the logs (if applicable).

The region where your application is running is a good dimension (if applicable), as it can provide insight into the regions in which some services carry more load.

In the end, it's a trade-off between costs and business value.

Any unique combination of a metric with its dimensions represents a new time series, which increases the amount of data stored by your provider. Once again, this also increases the overall costs.

A dimension MUST have a low cardinality. High cardinality means that the dimension will have many different values.
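
The effect on the number of time series can be sketched with simple multiplication. The cardinalities below are hypothetical, and the product is an upper bound, since not every combination of values necessarily occurs.

```python
from math import prod


def time_series_count(dimension_cardinalities):
    """Upper bound on time series for one metric: product of value counts."""
    return prod(dimension_cardinalities.values())


# Hypothetical cardinalities, chosen only to show the growth.
low = {"region": 4, "statusClass": 2}
high = {**low, "hostname": 500}

print(time_series_count(low))   # 8
print(time_series_count(high))  # 4000
```

Adding a single high-cardinality dimension such as the hostname multiplies the series count, which is why it belongs in the logs instead.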

Conclusion

We should not jump into adding logs and metrics for everything. We are tempted to do this while developing, to find issues and bugs, but we will almost certainly leave them lying around in production as well.

This is not a topic to fire and forget. Keep your logs and metrics sane and, most importantly, secure, and make sure they provide the minimum information needed for proper troubleshooting outside the monitoring scope.
