When operating a large global network, good connectivity and performance between systems that communicate over the public Internet are key to a positive user experience. Given the complex, best-effort nature of the Internet, even well-provisioned links on the most reliable providers sometimes have issues.
There are a number of strategies for monitoring such links, including active measurement, which generates traffic specifically for measuring, and passive measurement, which monitors existing traffic. In this article, we describe a passive approach that uses our xTCP socket sampling system to monitor many of the links our network depends on.
To make the most of xTCP’s data, we’ve developed a processing approach that addresses some of the challenges of Internet monitoring. In particular, we introduce a concept we call the retransmit ratio, which provides a relative measure of the severity of retransmits observed between content delivery network (CDN) sites. We demonstrate that retransmit ratios above certain levels correspond to degradations in throughput, which directly affect user-perceived performance. This makes the metric an excellent basis for network automation that allows us to work around network performance degradations.
A common workflow in the CDN involves one point-of-presence (PoP) reaching out to another, for example, to fetch a piece of content to cache. Frequently these interactions are made in direct response to a client request, meaning that the downstream request may be waiting for this transfer to complete. The requests themselves are generally quite small, on the order of a few kilobytes. The responses vary widely in size, from kilobytes to many megabytes.
Fig 1: The request flow sends small requests (on the order of kilobytes) and receives potentially large responses (up to many megabytes).
To monitor the health of connections between points of presence, we use our socket monitoring tool, xTCP, to sample the current state of all open sockets on our edge servers. While this provides critical visibility into our client-facing sockets, the same socket data also gives us a view of the traffic between PoPs.
However, measuring this data is not without a few challenges. First, xTCP provides a point-in-time sample of different connections. That means we might catch many connections at many different points in their transmission. Therefore, any assessment we do has to consider the wider distribution of behaviors rather than any single value.
Next, we need to ensure we are monitoring the correct direction. Both the PoP that generated the request (PoP A in the above diagram) and the PoP that received the request and must respond to it (PoP B above) have socket information, but their asymmetric workloads mean we expect to see different behavior. The majority of packets sent by the client (PoP A) will be control packets (the initial request and subsequent acknowledgment packets), while the majority of packets sent by the server (PoP B) will be data packets, which carry meaningful volumes of data.
As a result, if there is congestion or another issue along the path, the packets carrying data, and therefore occupying more queue space, are more likely to encounter packet loss and suffer retransmits, for example as the result of queue drops on a busy router. To demonstrate this, we consider the distribution of packet retransmit rates (computed as the number of retransmitted packets divided by the number of sent data segments, excluding retransmits) seen in the request and response flows between a pair of PoPs over a 10-minute period.
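As a minimal sketch of the per-socket retransmit rate described above, the computation might look like the following. The field names `data_segs_out` and `total_retrans` are assumptions loosely modeled on Linux `tcp_info` counters; the names xTCP actually exposes may differ. The sketch also assumes the sent-segment counter includes retransmissions, which is why they are subtracted from the denominator.

```python
from dataclasses import dataclass


@dataclass
class SocketSample:
    """One point-in-time socket sample (hypothetical field names)."""
    data_segs_out: int  # data segments sent so far, assumed to include retransmits
    total_retrans: int  # segments retransmitted so far


def retransmit_rate(sample: SocketSample) -> float:
    """Retransmitted segments divided by sent data segments, less retransmits."""
    original_segments = sample.data_segs_out - sample.total_retrans
    if original_segments <= 0:
        # Nothing (or nothing original) has been sent yet: treat as no retransmits.
        return 0.0
    return sample.total_retrans / original_segments
```

For example, a socket that has sent 101 data segments, one of which was a retransmit, would have a rate of 1/100, or 1%.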
Fig 2: Response traffic encounters more retransmits, likely due to its larger size.
Here, we see the client request sockets experience nearly no retransmits during this time period. On the other hand, nearly 85% of response sockets show non-zero retransmits, although the retransmit rate is well below 1% for the vast majority of connections. Unsurprisingly, we observe similar behavior across nearly every pair of PoPs with non-zero retransmits during the test period. We therefore focus on the data-laden response flows. Since we are concerned with servicing the requests to the original requesting PoPs, we refer to these as “inbound” flows.
Our final challenge comes from some general complexities around retransmits, and the difficulty of using them as a signal for performance degradation. Indeed, retransmits may occur regularly without indicating a particular issue, as they simply reflect the sender state and the number of times a packet was resent. They may ultimately be the result of complex protocol behavior other than loss (e.g., tail loss probing). Adding to the complexity, many sockets never experience retransmits. This means that naïve summaries (e.g., taking the median) may give a very conservative view of the retransmit rate, while skewed summaries (e.g., the 95th or 99th percentile) may largely capture behavior that isn’t harmful to the population in general.
To mitigate these challenges, we consider a composite metric which we call the retransmit ratio. Inspired by Meta’s HD Ratio, which aims to quantify the fraction of clients who are able to stream HD video, this measure endeavors to describe the fraction of sockets that are experiencing an unhealthy level of retransmits. Since non-zero retransmits are sometimes expected, we define the retransmit ratio as follows:
Critically, this value is particularly easy to compute with data made available via xTCP. In our operational experience, we’ve found that the retransmit ratio is generally small on healthy links, while the metric avoids the almost-always-zero problem present in the raw retransmit rate measurements.
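To illustrate the idea, the following is a minimal sketch that reads the retransmit ratio as the fraction of sampled sockets whose retransmit rate exceeds some "unhealthy" threshold. Both that reading and the 1% default threshold are assumptions made for illustration, not the values used in production.

```python
from statistics import median
from typing import Iterable


def retransmit_ratio(rates: Iterable[float], threshold: float = 0.01) -> float:
    """Fraction of sockets whose retransmit rate exceeds `threshold`.

    `threshold` is a hypothetical cutoff for an "unhealthy" retransmit rate.
    """
    rates = list(rates)
    if not rates:
        return 0.0
    unhealthy = sum(1 for rate in rates if rate > threshold)
    return unhealthy / len(rates)


# Illustrative (synthetic) per-socket retransmit rates for one PoP pair.
sample_rates = [0.0] * 6 + [0.002, 0.004, 0.03, 0.08]

ratio = retransmit_ratio(sample_rates)   # 2 of 10 sockets exceed 1%
naive_summary = median(sample_rates)     # 0.0: the median hides the retransmits
```

In this synthetic sample the median retransmit rate is zero even though two of the ten sockets are retransmitting heavily, which is the kind of case a fraction-based summary is meant to surface.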
We’ve also found the measurement to be sensitive, often generating alerts prior to other performance monitoring systems. This makes it especially valuable when rapidly diagnosing network degradations, which often begin with small issues that eventually cascade into larger problems.
Validating the Metric
To demonstrate that the retransmit ratio is directly correlated with application performance, we consider two measurements. First, we show that during times of high retransmit ratio, the client application (the requester) experiences lower throughput than during times with low or zero retransmit ratio. Second, we show that a high retransmit ratio often corresponds with degradations in other network signals, in particular ICMP probes between PoPs.
To explore the impacts on application performance, we turn to measurements taken explicitly from the application layer. In particular, we consider throughputs measured at the client PoP, as these represent the functional delivery rate achieved in the data sending process. To understand the impact of retransmit events, we conducted the following study on a pair of PoPs over the course of a week, during which a significant provider issue occurred.
First, we consider all periods in which a retransmit event occurred. We define a retransmit event as any time period in which the retransmit ratio between a pair of PoPs lies within a certain range for at least ten consecutive minutes. While this excludes short-lived events, it provides insight into the behavior of longer events. For each retransmit event, we then collect the corresponding throughput values during the event. As a control, we collect data for the same duration of time as the event but three hours prior. This gives us two sets of throughput measurements: “during”, taken during retransmit events, and “normal”, taken during times of no retransmits. We then normalize the “during” measurements by the median throughput achieved between the PoPs during normal time. For our thresholds, we consider four ranges of retransmit ratio: greater than 0 but less than 25%, greater than 25% but less than 50%, greater than 50% but less than 75%, and finally greater than 75%.
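The event extraction and normalization described above can be sketched roughly as follows. This assumes one retransmit ratio sample per minute for the PoP pair, and treats each range as exclusive at the lower bound and inclusive at the upper bound; both are assumptions, as are all function names. The sketch also does not verify that the control window is itself free of retransmits.

```python
from statistics import median
from typing import List, Optional, Sequence, Tuple

MIN_EVENT_MINUTES = 10        # at least ten consecutive minutes in range
CONTROL_OFFSET_MINUTES = 180  # control window starts three hours earlier


def find_events(ratios: Sequence[float], low: float, high: float,
                min_len: int = MIN_EVENT_MINUTES) -> List[Tuple[int, int]]:
    """Return [start, end) index ranges where low < ratio <= high
    holds for at least `min_len` consecutive samples."""
    events = []
    start: Optional[int] = None
    for i, ratio in enumerate(ratios):
        in_range = low < ratio <= high
        if in_range and start is None:
            start = i                      # a candidate event begins
        elif not in_range and start is not None:
            if i - start >= min_len:       # keep only long-enough runs
                events.append((start, i))
            start = None
    if start is not None and len(ratios) - start >= min_len:
        events.append((start, len(ratios)))  # run extends to end of series
    return events


def control_window(event: Tuple[int, int],
                   offset: int = CONTROL_OFFSET_MINUTES) -> Optional[Tuple[int, int]]:
    """Same-length window `offset` minutes before the event, if available."""
    start, end = event
    if start - offset < 0:
        return None
    return (start - offset, end - offset)


def relative_throughput(during: Sequence[float],
                        normal: Sequence[float]) -> List[float]:
    """Normalize event throughputs by the median of the control throughputs."""
    baseline = median(normal)
    return [value / baseline for value in during]
```

For instance, `find_events(ratios, 0.0, 0.25)` would return the events in the lowest range, and a relative throughput below 1.0 indicates a transfer slower than the control median.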
Fig 3: The relative throughput compared to non-retransmit periods, observed during each retransmit event.
The above figure shows the distribution of relative throughputs observed during the measurement period. First, we see that even in the lowest range, 60% of transactions achieved throughput lower than the normal median. As we consider higher retransmit ratios, the throughput continues to drop, with the worst case resulting in a relative median decrease of over an order of magnitude. These measurements make it clear that the retransmit ratio successfully captures the poor performance of the impacted flows.
Next, we turn to how these events correlate with our active ICMP measurements between PoPs. Our active monitoring performs regular ICMP probing between PoPs to detect any loss or change in delay patterns. For this analysis, we again use the events extracted from our throughput comparison. This time, however, we look at the ICMP-measured loss during the time periods for each threshold, where the normal periods in this case observed no loss. We note that limitations in our ICMP probing result in a 2% loss granularity for these particular measurements.
Fig 4: The observed ICMP loss during each time period. Greater retransmit ratio corresponds with greater loss.
Here, we see that the lower thresholds rarely show any loss, with 90% of the measurements failing to detect any. In contrast, at the 0.75 threshold, 80% of measurements observed loss, with a relatively high median loss of 4%. Critically, we note that levels where the retransmit ratio corresponded with significant throughput impacts (e.g., 0.25) result in little loss in the ICMP metrics. These findings reiterate the importance of measuring path performance beyond simple ICMP probes, and highlight the ability of the retransmit ratio to offer a nuanced view of the performance of actual flows across the Internet.
Conclusions and Beyond
In this post, we demonstrated the value of the retransmit ratio, a convenient summary metric that can be readily computed with data available from the xTCP socket data tool. We further showed that it provides clear insight into cases in which application performance is impacted and network interventions are necessary.
The retransmit ratio has become a key part of our monitoring process, providing clear insight into system performance without the need to process larger and more unwieldy application logs or to rely on ICMP probes, which fail to capture some impacts.
Our ongoing work is exploring how the metric can be made as sensitive as possible to provide early warning of degradations, while also providing a suitable input to more complex automation systems.
Special thanks to the Architecture and Network Reliability Engineering teams for their support of this work!
For researchers interested in learning more about Edgio Labs & Advanced Projects, or interested in exploring collaborative works on any of the topics described above, please reach out to the team at email@example.com.