Why you need observability in notification systems

Your notifications are a critical bridge between your product and your customers. A broken notification system hurts your customer experience and leads to unsubscribes. When a critical notification fails to send, the result is a missed growth opportunity or customer churn.

Observability matters in notifications. Companies don’t think twice about adding observability to their core product. They know outages have a direct impact on their business. But when it comes to notifications, teams often let their service run in silence with nothing past basic logging.

This is a mistake. Active observability in notifications ensures your customers don’t get spammed by your product and don’t miss the important events that help them find value and help your product grow.

In this post we’ll cover:

An overview of observability: logs, traces, and metrics
How observability informs product direction
Observability in notification systems

Observability helps engineering and customer success understand “the now”

To understand observability, it helps to know a little about where the idea comes from: control theory. Control theory is about modeling dynamic systems to achieve a desired output. A system is observable if you can infer its internal state from measurements of its outputs, and you can then use those measurements to stabilize the system.

This is what you are doing with observability in the software engineering sense—you are measuring the state of your system (your notification service), so that you can correct any destabilization (errors, bugs, or outages).

Observability establishes a feedback loop that lets your engineering team identify and fix issues. It allows your developers to quickly see what is going wrong and why. Observability tools surface anomalies, inconsistencies, and errors through three kinds of output, often called the three pillars of observability, each providing a different angle of insight into system behavior:

Logs: Logs are discrete events or records generated by a service. They provide context, detail, and the sequence of activities.
Traces: Tracing provides insights into the flow of a request across various services. It gives a view of the end-to-end journey of a request.
Metrics: Metrics are aggregate values that provide a high-level view of system health.

Let’s say your notification service has stopped sending messages. We can use each of these to investigate the problem from different angles to diagnose the root cause.

Logs for detail

The logs provide detailed event-based information. An engineer can check the error logs for any error messages or exceptions thrown by the notification service. Errors might indicate problems like failed database connections, third-party API failures (e.g., if using a service like Twilio or Firebase), or internal logic issues.
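As a rough illustration of what makes error logs searchable during an incident, here is a minimal sketch of structured (JSON) logging using only Python's standard library. The logger name and the `notification_id`, `channel`, and `provider` fields are assumptions chosen for this example, not a schema from any particular notification service.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "message": record.getMessage(),
            # Values passed via `extra=` become attributes on the record.
            # These three field names are hypothetical.
            "notification_id": getattr(record, "notification_id", None),
            "channel": getattr(record, "channel", None),
            "provider": getattr(record, "provider", None),
        }
        return json.dumps(payload)


def build_logger(stream=None) -> logging.Logger:
    """Return a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger("notifications")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # avoid duplicate handlers on repeated calls
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.propagate = False
    return logger
```

A call such as `build_logger().error("send failed", extra={"notification_id": "n_123", "channel": "sms", "provider": "twilio"})` would then emit one JSON line that an engineer can filter by channel or provider.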

Traces for paths

Traces are useful if the notification service interacts with multiple components or services. You may have a wrapper notification service that calls services for email, SMS, or in-app messaging, as well as message templating. Traces can show the path a notification request takes to help identify if a specific service or component in the flow is causing the delay or failure. Traces are also good for finding:

Latency Issues: If the traces indicate that notifications are being processed but with significant delays, it can point to performance issues. For instance, an external API taking too long to respond can be a bottleneck.
Dependencies: Traces can highlight dependencies, such as databases or third-party services that might be problematic.
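To make the idea of a trace concrete, the following is a simplified, hand-rolled sketch of timing each step of a notification request. It is not a real tracing library: the `Trace` and `Span` classes and the step names are hypothetical, and a production system would more likely use a dedicated tracing tool.

```python
import time
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Span:
    """One timed step in the journey of a notification request."""
    name: str
    duration_ms: float = 0.0
    error: Optional[str] = None


@dataclass
class Trace:
    """Collects the spans recorded for a single request."""
    trace_id: str
    spans: List[Span] = field(default_factory=list)

    @contextmanager
    def span(self, name: str):
        record = Span(name=name)
        start = time.perf_counter()
        try:
            yield record
        except Exception as exc:
            # Keep the failure on the span, then let the error propagate.
            record.error = repr(exc)
            raise
        finally:
            record.duration_ms = (time.perf_counter() - start) * 1000
            self.spans.append(record)

    def slowest(self) -> Span:
        """Return the span that took longest, a first hint at the bottleneck."""
        return max(self.spans, key=lambda s: s.duration_ms)
```

Wrapping each step, for example `with trace.span("sms_provider_call"): ...`, records how long that step took and whether it raised, so a slow or failing dependency stands out when the spans are compared.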

Metrics for trends

Metrics provide a high-level view and are good for identifying trends and performance anomalies. If you are instrumenting different components of your notification service, you might see:

A spike in error rates that correlates with the time the issue started.
An increase in queue length, indicating that notifications are being added faster than they're being processed.
Increases in CPU, memory, or network utilization, which can indicate logic errors.
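The first two signals can be sketched as a small in-memory metrics recorder. This is only an illustration under simple assumptions: the `NotificationMetrics` class and its method names are hypothetical, and a real service would typically export these values to a metrics backend rather than hold them in process memory.

```python
from collections import Counter


class NotificationMetrics:
    """Track queue depth and per-channel send outcomes in memory."""

    def __init__(self) -> None:
        self.counts = Counter()
        self.queue_depth = 0

    def enqueued(self) -> None:
        """Call when a notification is added to the send queue."""
        self.queue_depth += 1

    def processed(self, channel: str, ok: bool) -> None:
        """Call when a queued notification has been sent or has failed."""
        self.queue_depth -= 1
        self.counts[(channel, "sent" if ok else "failed")] += 1

    def error_rate(self, channel: str) -> float:
        """Fraction of processed notifications on `channel` that failed."""
        sent = self.counts[(channel, "sent")]
        failed = self.counts[(channel, "failed")]
        total = sent + failed
        return failed / total if total else 0.0
```

If `queue_depth` keeps growing while `error_rate("sms")` climbs, that matches the pattern described above: notifications arrive faster than they are processed, and one channel is failing.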

So if users aren’t receiving notifications:

Logs might show that notifications are being created, but no send confirmation exists.
Traces show that the notification service is making a request to your SMS service API, but there's no response.
Metrics show that the response time for the SMS API has spiked, and there's a high error rate.

Together, these signals tell your team that the SMS service is either down or degraded, and that this is causing the notification service to fail.

Engineering can then do two things. First, escalate the issue with the SMS provider to get the fault fixed. Second, and just as important, escalate the issue to the customer service team. With observability, the CS team knows what the problem is and can pass this information on to affected customers. Observability also lets you add two further enhancements:

You can give your customer service team access to system insights from observability tools, so they can proactively address potential customer challenges.
You can add observability data to status pages so customers can find this information themselves and be continually informed.

Bringing them together for system performance

Performance optimization is where the observability feedback loop extends beyond immediate incident response to the long-term improvement of the notification service.

Here, metrics give teams an insight into any existing or potential bottlenecks:

Latency metrics, specifically the time it takes to process and send notifications, help teams spot inefficiencies. Whether the cause is a slow external API or delays in internal logic, these metrics help developers tune the system.
Resource utilization metrics, whether for CPU, memory, or network consumption, inform scalability decisions. They show how the service operates under current peak loads and therefore what will need to be optimized or upgraded for future growth.

Finally, these metrics are a leading indicator of costs. If notifications need multiple retries before successful delivery, this doesn't just hamper the user experience but also increases operational costs. Metrics can expose such inefficiencies, prompting deeper investigation into root causes and potential remedies.
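The link between retries and cost is simple arithmetic, sketched below. The function name, the per-attempt price, and the assumption that every attempt is billed at the same flat rate are illustrative only; real provider pricing may differ.

```python
from typing import Dict, List


def retry_cost_summary(attempts: List[int], cost_per_attempt: float) -> Dict[str, float]:
    """Summarize what retries add to delivery cost.

    `attempts` holds the number of send attempts for each notification,
    where 1 means it was delivered on the first try. Assumes a flat
    price per attempt (an assumption for this sketch).
    """
    if any(a < 1 for a in attempts):
        raise ValueError("each notification needs at least one attempt")
    total_attempts = sum(attempts)
    retries = total_attempts - len(attempts)
    return {
        "notifications": len(attempts),
        "retries": retries,
        "total_cost": total_attempts * cost_per_attempt,
        "retry_overhead": retries * cost_per_attempt,
    }
```

For five notifications with attempt counts `[1, 1, 3, 2, 1]`, there are eight attempts in total, so three of the eight billed sends are pure retry overhead.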

The same goes for external services. It's essential to have a metrics-driven perspective on the cost and performance of these providers so the team can make well-informed decisions, whether that means staying with existing providers or switching to alternatives.

There's also a subtler side to observability. To understand when things are amiss, you first need to understand what 'normal' looks like. Engineering teams need to have a clear picture of what smooth operations look like for their notification service. This means understanding average request times, typical server loads, normal error rates, and other metrics. This baseline observability ensures that anomalies don't go unnoticed.
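One simple way to turn a baseline into an alert is to flag values that sit far from the historical mean. The sketch below uses a z-score with a threshold of three standard deviations; both the function name and the threshold are assumptions for illustration, and real traffic with daily or weekly cycles usually needs a more careful model.

```python
from statistics import mean, stdev
from typing import Sequence


def is_anomalous(baseline: Sequence[float], value: float, threshold: float = 3.0) -> bool:
    """Return True if `value` is more than `threshold` standard deviations
    away from the mean of the `baseline` samples."""
    if len(baseline) < 2:
        raise ValueError("need at least two baseline samples")
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any deviation at all is unusual.
        return value != mu
    return abs(value - mu) / sigma > threshold
```

With a baseline of send latencies clustered around 120 ms, a new reading of 122 ms would pass unnoticed, while a reading of 400 ms would be flagged.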

Observability helps your product team understand the future

Observability isn't just about monitoring and troubleshooting. It also offers insights that can drive service and product enhancement. When it comes to a notification service, metrics play a pivotal role in understanding user behavior and potential areas for improvement.

Observability in notifications is crucial in understanding how users interact with a product. Metrics, in this context, act as the eyes and ears of a product team, giving them valuable insights into user behaviors, preferences, and engagement levels.

When introducing a new notification feature, it's essential to track its performance closely. Metrics can show if the latest feature has been welcomed by the users or if there's a disconnect. For instance, a recent roll-out of a notification feature that garners little engagement may indicate it doesn't resonate with users or perhaps lacks clarity, prompting the product team to revisit its design or utility.

Moreover, metrics can dive deeper into users' notification preferences. Are they more inclined towards SMS, email, or push notifications? How frequently do they want to be notified? Such insights can empower teams to tailor their notification strategies, aligning them closer to what users genuinely want.

But where do these metrics come from? And how do we ensure we're capturing the right data?

Email Notifications: Pixels embedded within emails can report if the email was opened. Moreover, tracked links can reveal which parts of the email content users found engaging enough to click on.
In-app Notifications: Event tracking tools can capture how users interact with in-app notifications—whether they click on them, dismiss them, or adjust settings related to them.
SMS and Push Notifications: Delivery success rates can be essential metrics here. A high rate of undelivered messages might hint at issues with the service provider or the content itself.

Another crucial aspect is the content and relevance of notifications. Metrics like open rates or click-through rates for notifications can shed light on this. If users frequently open a notification but seldom click through, it may suggest the content is catchy but not compelling enough to drive action. Conversely, high click-through rates could indicate content that's both engaging and relevant.
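These engagement rates can be computed from three counts per channel. The sketch below assumes one common set of definitions (rates measured against delivered notifications, plus a click-to-open rate); the `ChannelStats` class is hypothetical, and teams sometimes define these ratios differently.

```python
from dataclasses import dataclass


@dataclass
class ChannelStats:
    """Engagement counts for one notification channel."""
    delivered: int
    opened: int
    clicked: int

    def open_rate(self) -> float:
        """Share of delivered notifications that were opened."""
        return self.opened / self.delivered if self.delivered else 0.0

    def click_through_rate(self) -> float:
        """Share of delivered notifications that were clicked."""
        return self.clicked / self.delivered if self.delivered else 0.0

    def click_to_open_rate(self) -> float:
        """Share of opened notifications that led to a click."""
        return self.clicked / self.opened if self.opened else 0.0
```

A high `open_rate()` paired with a low `click_to_open_rate()` is the "catchy but not compelling" pattern described above.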

Finally, while metrics offer a quantitative view, qualitative feedback is just as critical. Pairing metrics with user surveys or feedback sessions can provide richer insights, combining the 'what' with the 'why'. In essence, observability in notifications goes beyond just numbers; it's about understanding user behavior and iterating to enhance the overall user experience.

Adding observability to your notifications

If your current notification service doesn't have observability, you need to add it. It can be the difference between a service that succeeds and one that fails, and between keeping your customers and losing them.

Knock has observability built in through:

End-to-end debugging capabilities from API request issued to workflow run logs.
Webhooks for notification status changes.
A Datadog extension to stream key metrics into your Datadog dashboard in real time.
Streaming normalized, cross-channel notification engagement data to your data warehouse.

The best place to start is to sign up for an account and read our docs. If you’d like to learn more, you can book a personalized demo today.
