As traffic volumes delivered by our CDN continue to increase, we continue to expand the footprint of our CDN both in terms of the number and size of Points of Presence (PoP). As the sizes of PoPs grow, it is important to distribute load across their servers to maintain their health and stability and to guarantee resilience to ephemeral traffic spikes.
Current load-balancing system
Requests arriving at a PoP are load-balanced to different servers based on the request URI.
When a request arrives at a PoP, a front-end server processes the request and maps the request to a cache server based on the requested URI. Consistent hashing (or, to be precise, Rendezvous Hashing achieved using CARP -Cache Array Routing Protocol) guarantees that the same resource will always be mapped to the same cache server as long as that server is healthy. This consistent mapping is crucial for performance. Directing a request to a server that does not have the requested resource in its cache will require fetching the resource from another server, increasing response latency. Importantly, consistent hashing also achieves a fairly uniform distribution of resources across all cache servers in the PoP, helping distribute the workload.
However, different resources vary in popularity, and a server that serves a popular resource can end up serving much more traffic than others in the PoP. For that reason, a mechanism called Hot Filing kicks in when a resource rapidly becomes very popular and starts using up a considerable fraction of a server’s network or CPU resources. Hot Filing quickly replicates this resource on one or more additional servers. It notifies the front ends about which servers the replicas are on, resulting in a wider distribution of the resource-generated load.
Potential room for improvement
The logic that triggers Hot Filing today is based on fixed thresholds which guarantee that no single resource is responsible for more than a predetermined rate of requests or bytes served by a server. If a resource load exceeds that threshold, it is distributed further. However, a challenge with fixed thresholds is that if a considerable fraction of a server’s load on a server is generated by resources just below that threshold, those resources will not be Hot Filed. So their respective request load will not be distributed.
Because of this, we have observed that even with the combination of resource-based load balancing and Hot Filing, the load distribution across servers in a PoP can often be uneven, with some servers delivering more than 2-3x the median load.
Uneven network load distribution is common in many PoPs at any given time, with some servers delivering up to 2x more traffic than the median.
This unevenness is not always a problem because servers can remain performant as long as they are not reaching their capacity. However, in extreme cases, the impact can be very apparent. The following figure illustrates a real example of this effect.
Two servers reach or exceed their network capacity, while the rest of the PoP delivers lower traffic volumes. The outlier servers suffer obvious inflation on health checks (response times).
In this snapshot of network load distribution at a particular PoP, while most of the servers are sending low volumes of traffic, two servers are serving more than 2x the median load, reaching their capacity due to one or a few very popular resources. This has an observable impact on their health check metric, a proxy of their minimum response time. Healthy values for this metric are typically below 1 ms, and alerts are raised when they exceed 10 ms. In this example, the health check increased to values above 100ms and persisted for at least one hour, during which the overloaded servers likely performed poorly.
We have also observed cases where a few servers are persistently more loaded than the rest of the PoP for up to several days. During such periods, these loaded servers are generally less resilient to incoming traffic spikes than other servers in the PoP. This is exacerbated during peak hours, as their load can reach or exceed their capacity, although there is available capacity in the rest of the PoP.
Adaptive request load balancing system
Based on these observations, we have been researching the idea of Hot Filing with dynamic thresholds. This approach considers load distribution across servers at any given time and where each server lies in that distribution. Based on these conditions, a server-specific threshold is calculated as a function of the server’s place in the load distribution: a server with a load higher than the median is assigned a lower threshold than a server with a lower load, favoring more offloading for servers in the tail of the distribution.
Server thresholds are generated based on each server’s current load distribution location. A server with a load higher than the median is assigned a threshold lower than a server with a lower load.
More specifically, we define two parameters that control the level of Hot Filing:
BaseThresh controls the baseline value for each server’s threshold. A server-specific threshold is derived from this value and adjusted for the server according to its current load.
α ∈ (0, 2) controls how aggressively the algorithm adjust weights for servers needing offloading.
Then, for each server in the PoP, we generate a weight W(s) ∈ (0, 2), which is inversely proportional to the server’s current load, using the formula:
Where: α ∈ (0, 1) BW(s) is the current server load BWmin is the lowest load server’s load BWmin is the highest load server’s load Then, each server’s threshold T(s) is calculated as:
In our implementation, we configure BaseThresh to a value fitting our workloads. We let the algorithm dynamically choose the value for α, such that more aggressive offloading is enforced on outlier (more loaded) servers if those servers are very far from the median in terms of load.
Evaluation using CDN production workloads
We evaluate our approach in simulation, using snapshots of production workloads. To measure the (change of) skewness in the distribution of load across the servers in a PoP, we define the “skewness factor”:
In other words, S measures how far away the most loaded server is from the median. For example, S=2 means that the most loaded server delivers 2x the median load. Ideally, we want all servers to be close to the median, so lower values for S are better (with S=1 being the theoretical optimal). The figure below shows how S changes over multiple iterations of the Hot Filing process based on dynamic thresholds, the number of new Hot Files generated in every iteration and the value for α picked in each iteration.
Load distribution after several back-to-back iterations of the algorithm. Load on the most loaded servers is reduced from 2.73x the median value 1.42x the median value, and traffic within the PoP becomes more evenly distributed.
The blue line (“start”) shows the starting state, representing a real snapshot of the load distribution at a PoP. We note that S is decreased after each iteration until a point is reached where no more Hot Files (HF) are generated. As the tail is trimmed in each iteration, the distribution becomes more even, with more servers closer to the median load.
Next, we repeat the same experiment 10 times for 6 different PoPs:
Skewness factor (S) change for 6 PoPs. S converges in 5 iterations, decreasing by 92% on average.
Each group of bars in the figure represents a different PoP, and each bar within a group represents a subsequent iteration. The first bar in each group represents the starting state, drawn from production work-load snapshots. Each bar represents the values for 10 runs. We observe that in all cases, the decrease in S is much more dramatic in the first iteration than in any of the following iterations in which S is reaching more acceptable values. Importantly, the spread of the distribution (illustrated by the whiskers), as well as the outliers (diamonds), are reduced similarly. We observe that the load balancing mechanism converges after a small number of iterations and would only need to be triggered when S becomes high due to new unbalanced traffic.
Distributing load across servers in a PoP is important for performance. While some degree of unevenness is expected due to rapidly popular resources, persistently overloaded servers, despite available capacity in the rest of the PoP, could impact their performance and resilience to more incoming traffic spikes. In this work, we explored an enhancement to our existing load-balancing mechanism, which can help mitigate such load unevenness.
Simulation results have been promising, so we are baking this change into our existing mechanism and monitoring the results in production. Testing this production method will allow us to get more realistic results and evaluate additional factors that are harder to quantify in simulation. Such factors include resilience to highly dynamic workloads, time-of-day effects, resource replication changes, and the associated overhead.
Explore our website to learn more about how our content delivery network offers better performance, security, and reliability.