High Performance Notification System Practices

junjie_qin_512245a2eac9a4

Leo Dev Blog

Posted on November 21, 2024


Building a High-Performance Notification System

1. Service Segmentation

2. System Design

2.1 Initial Message Sending

2.2 Retrying Message Sending

3. Stability Assurance

3.1 Sudden Traffic Spikes

3.2 Resource Isolation for Problematic Services

3.3 Protection of Third-Party Services

3.4 Middleware Fault Tolerance

3.5 Comprehensive Monitoring System

3.6 Active-Active Deployment and Elastic Scaling

4. Conclusion



In any company, a notification system is an indispensable component. Each team may build its own notification modules, but as the company grows, problems such as maintenance overhead, difficult troubleshooting, and high development costs begin to emerge. For example, in our enterprise WeChat notification system, variations in message templates alone mean that a single project may use three different components, not even counting other notification features.

Given this context, there is an urgent need to develop a universal notification system. The key challenge lies in efficiently handling a large volume of message requests while ensuring system stability. This article explores how to build a high-performance notification system.

1. Service Segmentation

The system is segmented into the following layers:

  • Configuration Layer: This layer consists of a backend management system for configuring sending options, including request methods, URLs, expected responses, channel binding and selection, retry policies, and result queries.

  • Interface Layer: Provides external services, supporting both RPC and MQ. Additional protocols like HTTP can be added later as needed.

  • Core Service Layer: The business logic layer manages initial and retry message sending, message channel routing, and service invocation encapsulation. This design isolates normal and abnormal service execution to prevent faulty services from affecting normal operations. For instance, if a particular message channel has a high latency, it could monopolize resources, impacting normal service requests. Executors are selected via routing strategies, including both configured routing policies and dynamic fault discovery.

  • Common Component Layer: Encapsulates reusable components for broader use.

  • Storage Layer: Includes a caching layer for storing sending strategies, retry policies, and other transient data, as well as a persistence layer (ES and MySQL). MySQL stores message records and configurations, while ES is used for storing message records for user queries.
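To make the configuration layer concrete, the sketch below models what a single channel configuration entry might look like; the class and field names are illustrative assumptions, not the actual schema.

```java
/**
 * Hypothetical model of one channel configuration entry managed by the
 * configuration layer. Field names are illustrative only.
 */
public class ChannelConfig {
    private String channelCode;        // e.g. "wechat-work", "sms"
    private String requestMethod;      // HTTP verb used to call the channel, e.g. "POST"
    private String url;                // endpoint of the downstream channel
    private String expectedResponse;   // marker that identifies a successful response body
    private int maxRetries;            // how many times a failed send may be retried
    private long retryIntervalMs;      // delay between retries

    // getters/setters omitted for brevity
}
```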


2. System Design

2.1 Initial Message Sending

When handling message-sending requests, two common approaches are RPC service calls and MQ message consumption, each with its own trade-offs. RPC gives the caller an immediate result, so failed sends are detected right away and no message is silently lost, while MQ supports asynchronous decoupling and load balancing across consumers.

2.1.1 Idempotency Handling

To prevent processing duplicate message content, idempotency designs are implemented. Common approaches include locking followed by querying or using unique database keys. However, querying the database can become slow with high message volumes. Since duplicate messages usually occur within short intervals, Redis is a practical solution. By checking for duplicate Redis keys and verifying message content, idempotency can be achieved. Note: identical Redis keys with different message content may be allowed, depending on business needs.
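A minimal sketch of this Redis-based check, assuming Jedis as the client and a key derived from the business message ID (both assumptions, not the actual implementation):

```java
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class IdempotencyChecker {

    private static final int DEDUP_TTL_SECONDS = 300; // duplicates usually arrive within a short interval

    private final Jedis jedis;

    public IdempotencyChecker(Jedis jedis) {
        this.jedis = jedis;
    }

    /**
     * Returns true if this message should be processed, false if it is a duplicate.
     * The key is derived from the business message ID; the value stores a hash of
     * the message content so that identical keys with different content can still
     * be told apart if the business allows them.
     */
    public boolean tryAccept(String messageId, String contentHash) {
        String key = "notify:dedup:" + messageId;
        // SET key value NX EX ttl — an atomic "set if absent with expiry"
        String result = jedis.set(key, contentHash, SetParams.setParams().nx().ex(DEDUP_TTL_SECONDS));
        if ("OK".equals(result)) {
            return true; // first time this message has been seen within the TTL
        }
        // Key already exists: same ID with different content may be allowed, depending on business rules.
        return !contentHash.equals(jedis.get(key));
    }
}
```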

2.1.2 Faulty Service Dynamic Detector

Routing strategies include both configured routes and dynamic service fault-detection routing. The latter relies on a dynamic service detector to identify faulty channels and reroute execution via a fault-notification executor.

This functionality uses Sentinel APIs within JVM nodes to track total and failed requests within a time window. If thresholds are exceeded, the service is flagged as faulty. Key methods include loadExecuteHandlerRules (setting flow control rules, dynamically adjustable via Apollo/Nacos) and judge (intercepting failed requests to mark services as faulty).

Faulty services are not permanently flagged. An automatic recovery mechanism includes:

  1. Silent Period: Requests during this time are handled by the fault executor.
  2. Half-Open Period: If sufficient successful requests occur, the service is restored to normal.
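The loadExecuteHandlerRules and judge methods mentioned above are internal, so the following is only a hedged sketch of how similar behavior could be wired with Sentinel's public API: a DegradeRule supplies the failure-ratio threshold and the silent-period length, while SphU.entry and Tracer.trace feed the per-channel statistics. The resource name and thresholds are assumptions.

```java
import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.Tracer;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRule;
import com.alibaba.csp.sentinel.slots.block.degrade.DegradeRuleManager;

import java.util.Collections;

public class ChannelFaultDetector {

    /** Roughly what loadExecuteHandlerRules might do: register a circuit-breaking rule
     *  for one channel. In practice the thresholds would come from Apollo/Nacos. */
    public static void loadRules(String channelResource) {
        DegradeRule rule = new DegradeRule(channelResource);
        rule.setGrade(RuleConstant.DEGRADE_GRADE_EXCEPTION_RATIO);
        rule.setCount(0.5);              // flag the channel when more than half of requests fail...
        rule.setMinRequestAmount(20);    // ...and at least 20 requests were observed in the window
        rule.setStatIntervalMs(1000);    // 1-second statistics window
        rule.setTimeWindow(30);          // silent period (seconds) before half-open probing
        DegradeRuleManager.loadRules(Collections.singletonList(rule));
    }

    /** Roughly what judge might do: wrap a channel call, record failures, and report
     *  whether the channel is currently considered faulty (circuit open). */
    public static boolean sendThroughChannel(String channelResource, Runnable send) {
        Entry entry = null;
        try {
            entry = SphU.entry(channelResource);
            send.run();
            return true;
        } catch (BlockException blocked) {
            // Circuit is open (silent period): route this request to the fault-notification executor instead.
            return false;
        } catch (RuntimeException ex) {
            Tracer.trace(ex); // count the failure toward the exception-ratio statistics
            return false;
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }
}
```

Sentinel's circuit breaker enters a half-open state once timeWindow expires and closes again if the probe requests succeed, which matches the silent-period/half-open recovery described above.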

2.1.3 Sentinel Sliding Window Implementation (Circular Array)

Sliding windows are implemented using a circular array. The array size and indices are calculated based on the time window.

Example:

For a 1-second window with two sub-windows (500ms each):

  • Window IDs: 0, 1
  • Time ranges: 0–500ms (ID 0), 500–1000ms (ID 1)
  • At 700ms, window ID = (700 / 500) % 2 = 1 and windowStart = 700 - (700 % 500) = 500.
  • At 1200ms, window ID = (1200 / 500) % 2 = 0 and windowStart = 1200 - (1200 % 500) = 1000, which no longer matches the start time stored in bucket 0, so bucket 0 is reset to reflect the new window.
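A standalone sketch of that bucket arithmetic (mirroring the idea behind Sentinel's LeapArray, but with simplified, hypothetical names and no concurrency handling):

```java
/** Minimal illustration of mapping a timestamp onto a circular array of sub-windows. */
public class SlidingWindow {

    private final int sampleCount;        // number of sub-windows, e.g. 2
    private final int windowLengthInMs;   // length of one sub-window, e.g. 500
    private final long[] windowStarts;    // start time currently stored in each bucket
    private final long[] counters;        // per-bucket request counters

    public SlidingWindow(int sampleCount, int intervalInMs) {
        this.sampleCount = sampleCount;
        this.windowLengthInMs = intervalInMs / sampleCount;
        this.windowStarts = new long[sampleCount];
        this.counters = new long[sampleCount];
    }

    public void add(long timeMillis) {
        int idx = (int) ((timeMillis / windowLengthInMs) % sampleCount);
        long windowStart = timeMillis - timeMillis % windowLengthInMs;
        if (windowStarts[idx] != windowStart) {
            // The bucket still holds data from an older period (e.g. t=1200ms reusing bucket 0):
            // reset it before counting against the new window.
            windowStarts[idx] = windowStart;
            counters[idx] = 0;
        }
        counters[idx]++;
    }
}
```

Real implementations additionally guard the reset against concurrent access, which is omitted here for clarity.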

2.1.4 Dynamic Thread Pool Adjustment

After message processing, a thread pool is used for asynchronous sending. Separate pools exist for normal and faulty services, configured based on task type and CPU cores, with dynamic adjustments informed by performance testing.

A dynamically adjustable thread pool design leverages tools like Apollo or Nacos to modify pool parameters at runtime. The work queue's capacity, however, is fixed once the pool is created unless a custom resizable queue is implemented. Instead, the pool is configured with equal core and maximum thread counts and a bounded queue, together with a rejection policy: when the pool is saturated, rejected tasks are persisted to MQ for later retry, preventing both memory overflow and message loss.
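A hedged sketch of such a pool; RetryMessageProducer is a hypothetical interface standing in for the real MQ client, and resize is what an Apollo/Nacos configuration-change listener might call:

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.ThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

public class NotificationPools {

    /** Builds a pool with equal core and max sizes and a bounded queue; saturated
     *  tasks are handed to MQ instead of piling up in memory or being lost. */
    public static ThreadPoolExecutor buildSendPool(int threads, int queueCapacity,
                                                   RetryMessageProducer mqProducer) {
        return new ThreadPoolExecutor(
                threads, threads,                       // core == max, as described above
                0L, TimeUnit.MILLISECONDS,
                new LinkedBlockingQueue<>(queueCapacity),
                (task, executor) -> {
                    // Rejection handler: persist the overflow task to MQ for a later retry.
                    mqProducer.sendForRetry(task);
                });
    }

    /** Called from an Apollo/Nacos config-change listener to resize the pool at runtime. */
    public static void resize(ThreadPoolExecutor pool, int newSize) {
        if (newSize >= pool.getMaximumPoolSize()) {
            pool.setMaximumPoolSize(newSize); // grow: raise max first so core never exceeds max
            pool.setCorePoolSize(newSize);
        } else {
            pool.setCorePoolSize(newSize);    // shrink: lower core first for the same reason
            pool.setMaximumPoolSize(newSize);
        }
    }

    /** Hypothetical producer interface; the real system would wrap its MQ client here. */
    public interface RetryMessageProducer {
        void sendForRetry(Runnable task);
    }
}
```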

2.2 Retrying Message Sending

Messages failing due to bottlenecks or errors are retried via distributed task scheduling frameworks. Techniques like sharding and broadcasting optimize retry efficiency. Duplicate message control is achieved using locks.

Retry Mechanism:

  1. Check if the handler’s resources are sufficient. If not, tasks wait in a queue.
  2. Lock control prevents duplicate processing across nodes.
  3. Task volume is based on handler settings.
  4. Retrieved tasks are sent to MQ, then processed via thread pools.
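A rough sketch of that retry loop, with a Redis lock per shard and hypothetical RetryTaskStore/RetryMqProducer helpers standing in for the real DAO and MQ client:

```java
import java.util.List;
import redis.clients.jedis.Jedis;
import redis.clients.jedis.params.SetParams;

public class RetryJob {

    private final Jedis jedis;
    private final RetryTaskStore taskStore;   // hypothetical DAO over the failed-message table
    private final RetryMqProducer mqProducer; // hypothetical wrapper over the MQ client

    public RetryJob(Jedis jedis, RetryTaskStore taskStore, RetryMqProducer mqProducer) {
        this.jedis = jedis;
        this.taskStore = taskStore;
        this.mqProducer = mqProducer;
    }

    /** Invoked by the distributed scheduler for one shard of failed messages. */
    public void runShard(int shardIndex, int batchSize) {
        String lockKey = "notify:retry:lock:" + shardIndex;
        // Step 2: a short-lived lock so two scheduler nodes never pull the same shard at once.
        String locked = jedis.set(lockKey, "1", SetParams.setParams().nx().ex(60));
        if (!"OK".equals(locked)) {
            return; // another node is already processing this shard
        }
        try {
            // Steps 1 and 3: fetch no more than the handler is configured to absorb.
            List<String> messageIds = taskStore.fetchPending(shardIndex, batchSize);
            // Step 4: hand the batch to MQ; consumers push it through the sending thread pools.
            messageIds.forEach(mqProducer::publishRetry);
        } finally {
            jedis.del(lockKey); // production code would verify lock ownership before deleting
        }
    }

    public interface RetryTaskStore {
        List<String> fetchPending(int shardIndex, int batchSize);
    }

    public interface RetryMqProducer {
        void publishRetry(String messageId);
    }
}
```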

2.2.1 ES and MySQL Data Synchronization

For large datasets, Elasticsearch (ES) is used for queries. Data consistency between ES and the database must be maintained.

Synchronization Flow:

  1. Update ES first, then change the database state to "updated."
  2. If synchronization fails or is interrupted partway, the state is reset to "init" so the record is picked up again in the next round.
  3. Synchronization includes the database update_time to ensure updates only occur for the latest data.
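A hedged sketch of step 3's check using plain JDBC; the table, column, and state names are assumptions:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.Timestamp;

public class EsSyncStep {

    /**
     * Marks a message record as synchronized only if it has not changed since we read it:
     * the WHERE clause carries the update_time observed at read time, so a concurrent
     * change leaves the state untouched and the record is synced again in the next round.
     */
    public static boolean markSynced(Connection conn, long messageId,
                                     Timestamp observedUpdateTime) throws Exception {
        String sql = "UPDATE message_record SET sync_state = 'updated' "
                   + "WHERE id = ? AND sync_state = 'init' AND update_time = ?";
        try (PreparedStatement ps = conn.prepareStatement(sql)) {
            ps.setLong(1, messageId);
            ps.setTimestamp(2, observedUpdateTime);
            return ps.executeUpdate() == 1; // 0 rows means newer data exists; retry later
        }
    }
}
```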

ES Index Management:

  • Monthly rolling indices are created.
  • New indices are tagged as "hot," storing new data on high-performance nodes.
  • A scheduled task marks previous indices as "cold," moving them to lower-performance nodes.
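One hedged way to implement the hot-to-cold move is shard allocation filtering: a scheduled task rewrites the index's allocation setting so Elasticsearch migrates its shards to the lower-spec nodes. The sketch below uses the low-level REST client and assumes data nodes are tagged with a custom node attribute (here called box_type):

```java
import org.elasticsearch.client.Request;
import org.elasticsearch.client.RestClient;

public class IndexLifecycleTask {

    /** Scheduled monthly: relocate last month's index onto the lower-performance "cold" nodes. */
    public static void markIndexCold(RestClient client, String indexName) throws Exception {
        Request request = new Request("PUT", "/" + indexName + "/_settings");
        // Shard allocation filtering: only nodes whose node.attr.box_type is "cold"
        // may hold this index's shards, so ES migrates them off the hot nodes.
        request.setJsonEntity("{\"index.routing.allocation.require.box_type\": \"cold\"}");
        client.performRequest(request);
    }
}
```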

3. Stability Assurance

The designs outlined above focus on high performance, but stability must also be considered. Below are several aspects of stability assurance.

3.1 Sudden Traffic Spikes

A two-layer degradation approach is implemented to handle sudden traffic spikes:

  • Gradual Traffic Increase: When traffic grows steadily, and the thread pool becomes busy, MQ is used for traffic shaping. Data is asynchronously persisted, and subsequent tasks are scheduled with a 0s delay for processing.
  • Rapid Traffic Surge: In the case of abrupt spikes, Sentinel directly routes traffic to MQ for shaping and persistence without additional checks. Subsequent processing is delayed until resources become available.
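A hedged sketch of the second branch, with a QPS flow rule on the send entry point and blocked requests persisted straight to MQ; the resource name, threshold, and FlowControlMq interface are assumptions:

```java
import com.alibaba.csp.sentinel.Entry;
import com.alibaba.csp.sentinel.SphU;
import com.alibaba.csp.sentinel.slots.block.BlockException;
import com.alibaba.csp.sentinel.slots.block.RuleConstant;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRule;
import com.alibaba.csp.sentinel.slots.block.flow.FlowRuleManager;

import java.util.Collections;

public class SurgeProtection {

    private static final String RESOURCE = "notify:send";

    /** Register a QPS limit on the send entry point; the threshold would normally live in Apollo/Nacos. */
    public static void initFlowRule(int maxQps) {
        FlowRule rule = new FlowRule();
        rule.setResource(RESOURCE);
        rule.setGrade(RuleConstant.FLOW_GRADE_QPS);
        rule.setCount(maxQps);
        FlowRuleManager.loadRules(Collections.singletonList(rule));
    }

    /** Fast path under the limit; anything above it is shaped into MQ and processed later. */
    public static void accept(NotificationRequest req, FlowControlMq flowControlMq,
                              NotificationService service) {
        Entry entry = null;
        try {
            entry = SphU.entry(RESOURCE);
            service.sendNow(req);                 // within the limit: process immediately
        } catch (BlockException blocked) {
            flowControlMq.persistForLater(req);   // above the limit: persist and delay processing
        } finally {
            if (entry != null) {
                entry.exit();
            }
        }
    }

    public interface NotificationRequest {}
    public interface FlowControlMq { void persistForLater(NotificationRequest req); }
    public interface NotificationService { void sendNow(NotificationRequest req); }
}
```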

3.2 Resource Isolation for Problematic Services

Why isolate problematic services? Without isolation, problematic services share thread pool resources with normal services. If problematic services experience long processing times, thread releases are delayed, preventing timely processing of normal service requests. Resource isolation creates a separation to ensure problematic services do not impact normal operations.
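In practice the isolation can be as simple as two separate executors with routing driven by the fault detector's verdict; a minimal sketch (pool sizes and the ChannelHealth interface are hypothetical, and the queue/rejection handling from section 2.1.4 is omitted):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class IsolatedDispatch {

    // A slow channel can only exhaust its own pool, never the one serving healthy channels.
    private final ExecutorService normalPool = Executors.newFixedThreadPool(32);
    private final ExecutorService faultyPool = Executors.newFixedThreadPool(4);

    private final ChannelHealth faultDetector; // hypothetical view over the dynamic fault detector

    public IsolatedDispatch(ChannelHealth faultDetector) {
        this.faultDetector = faultDetector;
    }

    public void dispatch(String channel, Runnable sendTask) {
        if (faultDetector.isFaulty(channel)) {
            faultyPool.execute(sendTask);   // degraded channels compete only with each other
        } else {
            normalPool.execute(sendTask);
        }
    }

    public interface ChannelHealth {
        boolean isFaulty(String channel);
    }
}
```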

3.3 Protection of Third-Party Services

Third-party services often implement rate-limiting and degradation to prevent overload. For those that lack such mechanisms, the following should be considered:

  • Avoid overwhelming third-party services due to high request volume.
  • Ensure our services are resilient to third-party service failures by using circuit breakers and graceful degradation.

3.4 Middleware Fault Tolerance

Fault tolerance for middleware is essential. For example, during a scaling operation or upgrade, MQ might experience a few seconds of downtime. The system design must account for such transient failures to ensure service continuity.
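One hedged way to ride out a few seconds of broker downtime is a bounded retry with a local fallback that a compensation job can re-publish later; MqClient and LocalFallbackStore below are hypothetical placeholders for the real clients:

```java
public class ResilientPublisher {

    private final MqClient mq;            // hypothetical wrapper over the real MQ producer
    private final LocalFallbackStore db;  // e.g. a table scanned later by a compensation job

    public ResilientPublisher(MqClient mq, LocalFallbackStore db) {
        this.mq = mq;
        this.db = db;
    }

    /** Retries briefly to ride out transient broker restarts, then falls back to local persistence. */
    public void publish(String topic, String payload) {
        for (int attempt = 1; attempt <= 3; attempt++) {
            try {
                mq.send(topic, payload);
                return;
            } catch (Exception e) {
                sleepQuietly(200L * attempt); // simple backoff between attempts
            }
        }
        // MQ still unreachable: persist locally so a scheduled job can re-publish later.
        db.saveForRepublish(topic, payload);
    }

    private static void sleepQuietly(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException ie) {
            Thread.currentThread().interrupt();
        }
    }

    public interface MqClient { void send(String topic, String payload) throws Exception; }
    public interface LocalFallbackStore { void saveForRepublish(String topic, String payload); }
}
```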

3.5 Comprehensive Monitoring System

A robust monitoring system should be established to:

  1. Detect and mitigate issues before they escalate.
  2. Provide rapid incident resolution.
  3. Offer actionable insights for post-incident analysis and optimization.

3.6 Active-Active Deployment and Elastic Scaling

Operationally, active-active deployment across multiple data centers ensures service availability. Elastic scaling, based on comprehensive service metrics, accommodates traffic variations while optimizing costs.


4. Conclusion

System design should address service architecture, functionality, and stability assurance comprehensively. Achieving scalability, fault tolerance, and adaptability to dynamic scenarios is an ongoing challenge. There is no universal "silver bullet"; technical designs must be tailored to specific business needs through thoughtful planning and iteration.
