How to Ensure High Priority for Main Service Traffic - Learning from Netflix's Established Practices
Bochao Li
Posted on July 5, 2024
Scenario
In the tech blog post below, Netflix describes a scenario and solution in which traffic is allocated based on service priority. By shedding traffic for non-core functionality, they ensure that resources are dedicated to the main flow.
Enhancing Netflix Reliability with Service-Level Prioritized Load Shedding
Let's take an online shopping mall as an example. Placing an order is the core functionality. Preloading product information while users browse, so that product pages respond faster, is comparatively less important. If the preload interface is unavailable, the only consequence is a slightly longer wait when clicking into product details (say, from 20ms to 150ms); the main shopping flow is unaffected.
In a typical microservices architecture, a common solution is to split the services: deploy separate instances for the order service and for the product-information preloading service. However, this increases overall system complexity and raises operational costs, and splitting traffic at the gateway level incurs some performance overhead.
Take product-information loading as an example: preloading is a non-core function, but loading product information itself is a core part of shopping. To split along microservice lines, we would have to divide the product-information service into two sets of instances, and by the same logic every module could end up split into core and non-core variants. Lumping all non-core functions into one service does not align with microservices principles either.
Solution
The Netflix article implements a partitioning mechanism within a single instance using the open-source Netflix/concurrency-limits Java library. Using this component is very simple. The core idea is that requests of different priorities carry different header values, and the library's servlet filter partitions capacity according to those values. The core code is as follows:
Filter filter = new ConcurrencyLimitServletFilter(
        new ServletLimiterBuilder()
                .named("playapi")                            // name identifying this limiter (used for metrics)
                .partitionByHeader("X-Netflix.Request-Name") // partition requests by this header's value
                .partition("user-initiated", 1.0)            // user-facing traffic may use 100% of the limit
                .partition("pre-fetch", 0.0)                 // prefetch traffic gets no guaranteed share
                .build());
This effectively implements the partitioning functionality, prioritizing high-priority requests. When resources are constrained, low-priority requests are rejected first.
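If your service happens to run on Spring Boot 2.x (which, like this library's servlet module, uses javax.servlet), registering the filter can be as simple as the following sketch. The FilterRegistrationBean wiring here is my own illustration, not part of the Netflix article:

import javax.servlet.Filter;

import org.springframework.boot.web.servlet.FilterRegistrationBean;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;

import com.netflix.concurrency.limits.servlet.ConcurrencyLimitServletFilter;
import com.netflix.concurrency.limits.servlet.ServletLimiterBuilder;

@Configuration
public class LoadSheddingConfig {

    @Bean
    public FilterRegistrationBean<Filter> concurrencyLimitFilter() {
        Filter filter = new ConcurrencyLimitServletFilter(
                new ServletLimiterBuilder()
                        .named("playapi")
                        .partitionByHeader("X-Netflix.Request-Name")
                        .partition("user-initiated", 1.0)
                        .partition("pre-fetch", 0.0)
                        .build());
        FilterRegistrationBean<Filter> registration = new FilterRegistrationBean<>(filter);
        registration.addUrlPatterns("/*"); // apply the limiter to all endpoints
        return registration;
    }
}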
Of course, to write more elegant code, you can also use the predefined priority levels. The component already enumerates several, such as CRITICAL, DEGRADED, BEST_EFFORT, and BULK. We just need to make sure the corresponding headers are set on each request.
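On the caller side, nothing special is required beyond tagging each request with the right header value. Here is a minimal sketch using the JDK's built-in HttpClient; the URL is a placeholder, and the header name and values come from the example above:

import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class PrefetchClient {
    public static void main(String[] args) throws IOException, InterruptedException {
        HttpClient client = HttpClient.newHttpClient();

        // Tag this request as low-priority prefetch traffic; under pressure it is shed first.
        HttpRequest prefetch = HttpRequest.newBuilder()
                .uri(URI.create("https://example.com/product/123")) // placeholder URL
                .header("X-Netflix.Request-Name", "pre-fetch")
                .build();

        HttpResponse<String> resp = client.send(prefetch, HttpResponse.BodyHandlers.ofString());
        if (resp.statusCode() == 429) {
            // Request was shed by the limiter: skip the preload and load on demand later.
        }
    }
}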
The Netflix post also discusses resource usage in detail: which exhaustion metrics to watch and what thresholds to set. For CPU-intensive applications, CPU utilization deserves the most attention, whereas for IO-intensive applications, watching IO response times ensures that core IO operations are not affected by lower-priority IO.
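As a small illustration of the CPU-bound case, the JVM already exposes CPU load through the platform MXBean. A minimal sketch for sampling it follows; the 0.9 threshold is an arbitrary illustrative value, and getCpuLoad() requires JDK 14+ (older JDKs use getSystemCpuLoad()):

import java.lang.management.ManagementFactory;

import com.sun.management.OperatingSystemMXBean;

public class CpuPressureCheck {
    private static final OperatingSystemMXBean OS =
            (OperatingSystemMXBean) ManagementFactory.getOperatingSystemMXBean();

    // Returns true when the system CPU load crosses an illustrative 90% threshold.
    public static boolean cpuSaturated() {
        double load = OS.getCpuLoad(); // 0.0-1.0, or a negative value if unavailable
        return load >= 0.9;
    }
}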
Source Code Analysis
In practice, introducing a component, getting it running, meeting requirements, and passing tests is not particularly difficult. But along the way, many edge cases demand a deep understanding of the source code. Let's briefly analyze why this Filter achieves this functionality and protects the main service.
The ConcurrencyLimitServletFilter class implements javax.servlet.Filter, the most general servlet filter interface, which keeps the learning curve low. It then uses the builder pattern: ServletLimiterBuilder assembles a fully configured ServletLimiter, and during this process we set the required keys and values.
In the filter's overridden method, we can see the following code: limiter.acquire((HttpServletRequest) request); This is the core of the decision-making process, determining whether the request can be accepted. If it cannot, the filter returns a 429 status code.
@Override
public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain) throws IOException, ServletException {
    // Ask the limiter for a permit; an empty Optional means the request is shed
    Optional<Limiter.Listener> listener = limiter.acquire((HttpServletRequest) request);
    if (listener.isPresent()) {
        try {
            chain.doFilter(request, response);
            listener.get().onSuccess();  // count the request as a successful sample
        } catch (Exception e) {
            listener.get().onIgnore();   // exclude failed requests from limit measurements
            throw e;
        }
    } else {
        outputThrottleError((HttpServletResponse) response); // respond with HTTP 429
    }
}
The acquire method is defined in the Limiter interface. Its job is to decide whether the request can obtain a permit under the current concurrency limit. Because we configured partitions in the builder, an AbstractPartitionedLimiter is created to handle the partitioning.
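For reference, the library's Limiter interface is very small; simplified, it looks roughly like this:

import java.util.Optional;

public interface Limiter<ContextT> {
    interface Listener {
        void onSuccess(); // request completed successfully; sample counts toward the limit
        void onIgnore();  // result should be ignored by the limit algorithm
        void onDropped(); // request was dropped due to an external constraint (e.g. a timeout)
    }

    // Returns a Listener if a permit was acquired, or Optional.empty() if the request is rejected.
    Optional<Listener> acquire(ContextT context);
}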
Next, let's analyze the source code of the core acquire method.
@Override
public Optional<Listener> acquire(ContextT context) {
    // Resolve the partition based on the context
    final Partition partition = resolvePartition(context);

    try {
        // Acquire the reentrant lock to ensure thread-safe access to shared resources
        lock.lock();

        // Check if the request should bypass the rate limiting
        if (shouldBypass(context)) {
            return createBypassListener(); // Return a bypass listener
        }

        // Check if the current inflight requests have reached the global limit and if the partition's limit is exceeded
        if (getInflight() >= getLimit() && partition.isLimitExceeded()) {
            lock.unlock(); // Release the lock before handling rejection

            // If a backoff delay is configured and the number of delayed threads is below the maximum allowed, introduce a delay
            if (partition.backoffMillis > 0 && delayedThreads.get() < maxDelayedThreads) {
                try {
                    delayedThreads.incrementAndGet(); // Increment the count of delayed threads
                    TimeUnit.MILLISECONDS.sleep(partition.backoffMillis); // Introduce the delay
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // Restore the interrupted status
                } finally {
                    delayedThreads.decrementAndGet(); // Decrement the count of delayed threads
                }
            }

            return createRejectedListener(); // Return a rejected listener
        }

        // Record the request in the partition's busy count
        partition.acquire();
        return Optional.of(createListener());
    } finally {
        if (lock.isHeldByCurrentThread())
            lock.unlock(); // Ensure the lock is released if held by the current thread
    }
}
The source code boils down to a few steps.
- It retrieves the partition information for the request.
- To ensure concurrency safety, it uses a reentrant lock in Java.
- The core logic is to check if getInflight() >= getLimit() && partition.isLimitExceeded().
getInflight() < getLimit() means the current number of in-flight requests is below the limit, so more requests can be admitted. getLimit() returns the core adaptive limit of the entire component: it is continuously updated by sampling the system's behavior, request latencies in particular. In other words, under heavy load, these samples reveal the system's condition and tell requesting threads that capacity is exhausted and no more requests can be supported.
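To give a feel for how such adaptive limits behave, here is a deliberately simplified update rule in the spirit of the library's gradient-based algorithms. This is my own sketch with hypothetical variable names, not the library's actual implementation:

public class GradientLimitSketch {
    // Simplified gradient-style update in the spirit of the library's Gradient2Limit
    // (illustrative only; the real algorithm smooths the baseline and clamps differently).
    static double updateLimit(double currentLimit, double baselineRtt, double sampleRtt, int queueSize) {
        // Latency above the baseline pushes the gradient below 1.0 and shrinks the limit;
        // recovered latency lets the limit grow back toward its previous value.
        double gradient = Math.max(0.5, Math.min(1.0, baselineRtt / sampleRtt));
        return currentLimit * gradient + queueSize; // queueSize grants a little burst headroom
    }
}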
So how are the different request priorities enforced? That is determined by partition.isLimitExceeded(). Each partition has its own busy and limit fields, and the check is simply:
boolean isLimitExceeded() {
    return busy >= limit;
}
The key lies in how the limit is calculated:
this.limit = (int)Math.max(1, Math.ceil(totalLimit * percent));
In this way, based on the percentages we set, less important requests hit the isLimitExceeded() condition much sooner and are restricted first. Note the Math.max(1, ...): even a partition configured with 0.0 keeps a minimum limit of 1, so at most one such request can be in flight under pressure.
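A quick worked example with a hypothetical global limit of 100 makes the math concrete:

public class PartitionLimitExample {
    public static void main(String[] args) {
        int totalLimit = 100; // hypothetical current adaptive global limit

        // user-initiated partition at 100%: may occupy the full limit
        int userInitiated = (int) Math.max(1, Math.ceil(totalLimit * 1.0)); // = 100

        // pre-fetch partition at 0%: Math.max keeps a floor of 1, not 0
        int preFetch = (int) Math.max(1, Math.ceil(totalLimit * 0.0));      // = 1

        System.out.println(userInitiated + " / " + preFetch); // prints "100 / 1"
    }
}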
So the overall logic is: as long as total traffic stays under the global limit, all requests are admitted. When resources run short, because we set the share for important requests to 100% and for non-important requests to 0%, essentially all admitted requests are important ones. In this way, we achieve our goal.
Summary
By understanding Netflix's open-source concurrency-limiting component, we now have the ability, within a single instance, to prioritize high-priority requests and restrict low-priority ones when resources are tight.
The configuration is very simple. Of course, we can also tune the percentages so that a small share of low-priority traffic still gets through, for example by setting a partition's share to 0.1.
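For example, here is a variant of the earlier builder; the 0.9/0.1 split is my own illustrative choice:

Filter filter = new ConcurrencyLimitServletFilter(
        new ServletLimiterBuilder()
                .named("playapi")
                .partitionByHeader("X-Netflix.Request-Name")
                .partition("user-initiated", 0.9) // important traffic keeps 90% of capacity
                .partition("pre-fetch", 0.1)      // low-priority traffic retains a 10% share under pressure
                .build());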
Going forward, I will regularly post interpretations of technical articles from major companies, combined with hands-on experience or source-code analysis. Please follow me, and thank you.