Using Anomaly Alerts and Triggers Is Good For Predictive Troubleshooting

I've been working with many companies that do logging or APM (Application Performance Monitoring). To name a few, I think I might actually cover all of the top ones, or at least the ones that you might've heard of.

SaaS: Sumo Logic, New Relic, Splunk, Datadog, Loggly, Logz.io, etc.
Self-hosted: ELK Stack (Elasticsearch, Logstash, Kibana), Graylog, Nagios, Zabbix, PRTG, etc.

What I liked about most of them is that you could do some really nice predictive and anomaly alerts.

Example (A): A company sells products and sees spikes in traffic. However, it cannot afford to run 500 EC2 instances or Kubernetes cluster(s) with thousands of pods around the clock, because during quiet periods it wants to scale down accordingly.

There are many ways to go about this. One of my favorites is to use an anomaly in traffic. In this case, the site has been hit harder over the past 20 minutes than it was 40 minutes ago, and harder still compared with the week as a whole. That could mean a potential DDoS attack, that someone shared the website on Reddit and it's on the front page, or simply that it's that time of the year. Whatever the cause of the spike might be, let's treat it as an anomaly.

What you can do is set an alert for such a spike and programmatically scale up your xyz infrastructure.
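To make Example (A) a bit more concrete, here is a minimal Python sketch of the idea. It is not how any of the tools above do it internally. The function names, the 2x threshold, the scaling step and the instance cap are all my own assumptions for illustration, and it only compares the last 20 minutes against the 40 minutes before them (it skips the weekly baseline to stay short).

```python
from statistics import mean


def is_traffic_spike(requests_per_minute, recent_window=20,
                     baseline_window=40, threshold=2.0):
    """Return True when the recent window is much busier than the baseline.

    requests_per_minute: list of per-minute request counts, oldest first.
    threshold: hypothetical ratio (recent / baseline) that counts as a spike.
    """
    needed = recent_window + baseline_window
    if len(requests_per_minute) < needed:
        return False  # not enough history to judge

    recent = requests_per_minute[-recent_window:]
    baseline = requests_per_minute[-needed:-recent_window]

    baseline_avg = mean(baseline)
    if baseline_avg == 0:
        # Any traffic after total silence counts as a spike.
        return mean(recent) > 0
    return mean(recent) / baseline_avg >= threshold


def desired_instances(current, spike, step=2, max_instances=20):
    """Hypothetical scaling rule: add `step` instances on a spike, up to a cap."""
    if not spike:
        return current
    return min(current + step, max_instances)


# Example: 40 quiet minutes followed by 20 busy minutes.
history = [100] * 40 + [300] * 20
spike = is_traffic_spike(history)
print(spike, desired_instances(4, spike))
```

In a real setup the alert from your monitoring tool would call something like desired_instances and pass the result to whatever scaling mechanism you use. That part is left out here on purpose, since it depends on your xyz infrastructure.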

Example (B): A company is shipping logs to S3 or Splunk for all of its infrastructure, including Dev, QA and Prod. Some newbie at the company accidentally turned all apps in Dev up to DEBUG level, and now instead of sending 150 GB of logs a day, the company is sending 50 GB of logs an hour. To me that is an anomaly that should be foreseen.
Take a few weeks of log data to work out the usual volume, then compare it with how many GB of logs were sent in the last hour and in the past 20 minutes. If the recent volume is many times the usual rate, that is the anomaly.

In a case like the one above, someone could programmatically send an alert to whatever paging system is in use (VictorOps, PagerDuty, email or whatever) to notify the appropriate on-call person that a large amount of logs was sent within the past 20 minutes.
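Here is a rough Python sketch of the Example (B) check, using the numbers from above (150 GB a day as normal, 50 GB in an hour as the incident). The function names, the 3x threshold and the shape of the alert dictionary are assumptions of mine, not the format of any particular pager system. Actually sending the alert is left to whichever integration your pager provides.

```python
def expected_gb(daily_gb_history, window_minutes):
    """Usual log volume for a window, derived from average daily volume."""
    avg_daily = sum(daily_gb_history) / len(daily_gb_history)
    return avg_daily * window_minutes / (24 * 60)


def log_volume_ratio(daily_gb_history, observed_gb, window_minutes):
    """How many times the usual rate the observed window volume is."""
    expected = expected_gb(daily_gb_history, window_minutes)
    if expected == 0:
        return float("inf") if observed_gb > 0 else 0.0
    return observed_gb / expected


def build_alert(ratio, threshold=3.0):
    """Return a hypothetical alert payload, or None when volume looks normal."""
    if ratio < threshold:
        return None
    return {
        "severity": "critical",
        "summary": f"Log volume is {ratio:.1f}x the usual rate",
    }


# Two weeks at 150 GB a day, then 50 GB shipped in the last 60 minutes.
history = [150] * 14
ratio = log_volume_ratio(history, observed_gb=50, window_minutes=60)
print(ratio, build_alert(ratio))
```

With 150 GB a day the usual hourly volume is 150 / 24 = 6.25 GB, so 50 GB in an hour is 8 times the normal rate. The same functions work for the 20 minute window by passing window_minutes=20.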

I am writing this because, over my decades in IT, there has always been some anomaly that could have been easily prevented, as long as you know how to implement detection correctly with the easy-to-use tools that are on offer nowadays. Plus, it is really nice to hear from coworkers that Example (B) saved $10K.

So think of some anomaly detection that you can implement today. The reward of getting that one alert or self-fix is going to make your day.

Comment below about what kinds of anomaly detection you have worked on. I would love to hear Examples (C, D, E) 👨‍💻👩‍💻

Joe Hobot