Amazon DevOps Guru for the Serverless applications - Part 11 Anomaly detection on SNS (kind of)

vkazulkin

Vadym Kazulkin

Posted on June 17, 2024


Introduction

In the 1st part of the series we introduced the Amazon DevOps Guru service, described its value proposition and the benefits of using it, and explained how to configure it. You also need to go through all the steps in the 2nd part of the series to set everything up. In the subsequent parts we saw DevOps Guru in action detecting anomalies on DynamoDB, Aurora Serverless v2, API Gateway and Lambda, both alone and in conjunction with other AWS Serverless services like SQS, Kinesis, Step Functions and Aurora Serverless v2.
In this part of the series I'd like to explore whether DevOps Guru recognizes anomalies with Amazon Simple Notification Service (SNS).

Detecting anomalies with SNS

Let's enhance our architecture so that whenever a new product is created, we send a notification to an SNS topic, which then delivers this notification to another (external) HTTP(S) endpoint.

Image description
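To make the new step in the architecture more concrete, here is a minimal sketch of how the create product code path could publish the notification. It is written in Python with boto3 purely for illustration and is not the code used in this series; the function names, the "Product created" subject and the topic ARN passed in by the caller are all assumptions.

```python
import json


def build_product_created_message(product: dict) -> dict:
    """Build the Subject/Message arguments for the SNS publish call."""
    return {
        "Subject": "Product created",  # hypothetical subject line
        "Message": json.dumps(product),
    }


def notify_product_created(product: dict, topic_arn: str, sns_client=None) -> dict:
    """Publish a 'product created' notification to the given SNS topic."""
    if sns_client is None:
        # boto3 is imported lazily so the helper above stays usable without it.
        import boto3

        sns_client = boto3.client("sns")
    return sns_client.publish(
        TopicArn=topic_arn, **build_product_created_message(product)
    )
```

SNS then fans the message out to every confirmed subscription of the topic, including the HTTP(S) endpoint used in the experiment.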

Now let's imagine that this HTTP(S) endpoint was moved or answers with the 500 error code, so that SNS considers the notification as not delivered.

I was able to reproduce this scenario on AWS by deploying a temporary API Gateway endpoint and configuring it as an SNS subscription. To confirm the subscription, I put a Lambda function behind the temporary API Gateway endpoint, triggered by POST requests (this is what SNS sends to the configured HTTP(S) endpoint as the confirmation request). I logged the whole HTTP body of the POST request in the Lambda function, copied the subscription URL (which is part of the HTTP body) and opened it in the browser. With the SNS subscription confirmed, I then deleted the temporary API Gateway endpoint, so that SNS still sent notifications for the HTTP(S) subscription but could no longer deliver them to the endpoint.
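The confirmation step can be sketched as a small Lambda handler. This is an illustrative Python version, not the function I actually deployed; it assumes an API Gateway proxy integration that passes the raw request body as a string in event["body"], and it relies on the SNS confirmation request being a JSON document with the fields Type and SubscribeURL.

```python
import json


def extract_subscribe_url(event: dict):
    """Return the SubscribeURL of an SNS confirmation request, or None."""
    body = event.get("body") or "{}"
    message = json.loads(body)
    if message.get("Type") != "SubscriptionConfirmation":
        return None
    return message.get("SubscribeURL")


def handler(event, context):
    # Log the whole body so the confirmation URL shows up in CloudWatch Logs.
    print(event.get("body"))
    url = extract_subscribe_url(event)
    if url:
        print(f"SubscribeURL: {url}")
    # Answer with 200 so SNS treats the request as delivered.
    return {"statusCode": 200, "body": ""}
```

Opening the logged SubscribeURL in the browser (or sending a GET request to it) confirms the subscription.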

Then I sent several hundred create product requests via the hey tool, like this:

hey -q 1 -z 15m -c 1 -m PUT -d '{"id": 1, "name": "Print 10x13", "price": 0.15}' -H "X-API-Key: XXXa6XXXX" https://XXX.execute-api.eu-central-1.amazonaws.com/prod/products


The resulting notifications all failed to be delivered and were retried (without success) 3 times, which is the default; see Amazon SNS message delivery retries.
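To illustrate what the default of 3 retries means for the number of delivery attempts, here is a hedged Python sketch. The policy values reflect my understanding of the documented HTTP(S) defaults and should be verified against the SNS documentation; the helper names and the subscription ARN passed in by the caller are hypothetical.

```python
import json

# Assumed default retry policy for HTTP(S) subscriptions (please verify).
DEFAULT_HTTP_RETRY_POLICY = {
    "numRetries": 3,
    "numNoDelayRetries": 0,
    "minDelayTarget": 20,
    "maxDelayTarget": 20,
    "numMinDelayRetries": 0,
    "numMaxDelayRetries": 0,
    "backoffFunction": "linear",
}


def subscription_delivery_policy(retry_policy: dict) -> str:
    """Serialize a retry policy as a subscription-level DeliveryPolicy."""
    return json.dumps({"healthyRetryPolicy": retry_policy})


def total_delivery_attempts(messages: int, retry_policy: dict) -> int:
    """Initial attempt plus retries for messages that never get delivered."""
    return messages * (1 + retry_policy["numRetries"])


def apply_policy(subscription_arn: str, retry_policy: dict, sns_client=None):
    """Set the DeliveryPolicy attribute on an existing subscription."""
    if sns_client is None:
        import boto3  # imported lazily; only needed for the real call

        sns_client = boto3.client("sns")
    return sns_client.set_subscription_attributes(
        SubscriptionArn=subscription_arn,
        AttributeName="DeliveryPolicy",
        AttributeValue=subscription_delivery_policy(retry_policy),
    )
```

Under these assumptions every undeliverable message costs SNS four delivery attempts, so a few hundred messages already produce a clearly visible failure pattern.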

Although NumberOfNotificationsFailed was clearly visible in the CloudWatch metrics (see the blue line), DevOps Guru created no insight, even after I repeated the experiment several times.

Image description
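The same metric can also be read programmatically instead of through the CloudWatch console. The following Python sketch assumes boto3 and a topic name supplied by the caller; the function names, the 30 minute window and the one minute period are illustrative choices, not settings from the experiment.

```python
from datetime import datetime, timedelta, timezone


def failed_notifications_query(topic_name: str, minutes: int = 30, now=None) -> dict:
    """Build the get_metric_statistics arguments for the failed-delivery metric."""
    end = now or datetime.now(timezone.utc)
    return {
        "Namespace": "AWS/SNS",
        "MetricName": "NumberOfNotificationsFailed",
        "Dimensions": [{"Name": "TopicName", "Value": topic_name}],
        "StartTime": end - timedelta(minutes=minutes),
        "EndTime": end,
        "Period": 60,  # one datapoint per minute
        "Statistics": ["Sum"],
    }


def fetch_failed_notifications(topic_name: str, cloudwatch=None) -> list:
    """Return the datapoints sorted by timestamp."""
    if cloudwatch is None:
        import boto3  # imported lazily; only needed for the real call

        cloudwatch = boto3.client("cloudwatch")
    response = cloudwatch.get_metric_statistics(
        **failed_notifications_query(topic_name)
    )
    return sorted(response["Datapoints"], key=lambda point: point["Timestamp"])
```

A non-empty result with Sum values above zero corresponds to the failed deliveries shown in the graph.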

Directly after this experiment I started another one: fetching a non-existent product from the database, which caused HTTP 404 (Not Found) errors on API Gateway. I was surprised that DevOps Guru then created the following insight right away:

Image description

with the following anomalous metrics: NumberOfNotificationsFailed Average (for the anomaly with SNS) and 4XXError Average (for the anomaly with API Gateway):

Image description

and the following graphed anomalies:

Image description

Conclusion

In this article we explored whether DevOps Guru recognizes anomalies with Amazon Simple Notification Service (SNS), such as an HTTP(S) subscription whose endpoint no longer exists (no connection can be established) or answers with HTTP 500. We saw that DevOps Guru did not seem to react to the anomalous metric NumberOfNotificationsFailed Average alone; it apparently does not consider this an anomaly on its own (which is wrong in my opinion). It only seems to create an insight when at least one other anomalous metric is detected. I will approach the DevOps Guru team with my findings so that they can verify the experiment, look at what is happening behind the scenes, and hopefully improve the DevOps Guru service so that it handles this SNS anomaly correctly as well.
