Time is So Everything Doesn’t Happen at Once

I’ve been looking for a reason to use the above quote in a blog title. It is apocryphally attributed to Albert Einstein, although there seems to be no valid citation for it. In any case…

I really enjoyed this post by LucidChart engineering on some recent troubles they encountered when enabling HTTP/2 on their service. The problems they ran into were basically “everything happening at once” which is one kind of problem you can run into in building asynchronous systems. This is another good example where having a strong mental framework for the types of challenges you encounter when building these systems can help you diagnose (or ideally anticipate) problems more quickly.

Some quick background (more detail in the post above). HTTP is the main networking protocol used for the web. It was originally designed to allow for a single client request followed by a single server response. The browser establishes a connection with the server, makes a single request, receives the response and then closes the connection. This was unusual for Internet protocols at the time it was designed (other Internet protocols in wide use then, like Telnet or FTP, typically involved a longer-lived connection carrying multiple requests and responses). In the original web usage model, a response would be a big chunk of information (e.g. a web page representing a technical paper) which the user would presumably spend considerable time reading. Only allowing a single request and response simplified the design of the overall system on the client, the server and any intermediate caches, proxies or gateways. Additionally, the very next thing the user would often do was follow a link to some completely new website (who remembers that some of the most popular early web pages were just big lists of links to other interesting places on the nascent web?).

Over time, pages became more complex, with additional assets required to render them like images, scripts and CSS files. The inherent disadvantage of a single request-response protocol is that there is non-trivial overhead required to set up the underlying TCP connection — in particular, the latency incurred by a setup handshake before any payload can be transferred. This gets worse with HTTPS, and the startup delay only grows more impactful over time as bandwidth improves faster than latency.

From the early days, browsers addressed some of this latency concern by opening multiple connections to the same web server in parallel. The browser would self-throttle by limiting the number of connections to a single server, initially to 2 and now generally 6. From a pure raw-bandwidth perspective, using multiple connections serves no purpose — the aggregate bandwidth between the endpoints does not change. In practice, however, it allows both the service and the client to interleave requests and responses independently. Additionally, depending on service design, those independent requests might actually be handled by multiple servers sitting behind a single load balancer.

HTTP/1.1 was the first attempt at improving the core protocol. It allows a single connection to remain open for multiple requests and responses, saving that initial setup time. HTTP/1.1 also allows for pipelining multiple requests — essentially firing off multiple requests before receiving a response. A significant constraint, however, is that responses need to be received in the order the requests were sent in order to ensure a consistent semantic model. This makes pipelining less flexible than just using independent connections, and led to limited practical implementation of it. So even when HTTP/1.1 was in use, the actual communication over the connection was usually just one request at a time, with other requests locally throttled until the response was received.
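A minimal sketch of why that in-order constraint hurts (toy numbers, not real HTTP): if responses must be delivered in request order, one slow response at the head of the line delays every response behind it.

```python
# Head-of-line blocking sketch. service_times[i] is the (assumed, toy)
# time at which response i is ready; responses must go out in request order.
def delivery_times(service_times):
    """Each response is delivered once it is ready AND all earlier ones are out."""
    delivered = []
    latest = 0.0
    for ready in service_times:
        latest = max(latest, ready)  # can't ship before everything ahead of it
        delivered.append(latest)
    return delivered

# One slow response at the front delays every fast response behind it.
print(delivery_times([5.0, 0.1, 0.1]))  # [5.0, 5.0, 5.0]
```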

HTTP/2 is a major enhancement. There are some technical enhancements to reduce some of the boilerplate overhead in each request and response but the biggest change is to allow multiple requests and multiple responses to be interleaved or multiplexed independently over a single connection. This prevents a single big or slow request or response from delaying all communication on the channel.

When LucidChart enabled HTTP/2, they found that while their overall request load did not change, the shape of the load was much spikier and many more requests were timing out. What was going on?

Any asynchronous system can have a problem with congestion. Requests come more quickly than they can be satisfied. You have five waiters at the front of the restaurant taking orders but one cook in the back trying to fill all the orders. So your order gets taken quickly but then you wait forever for your food — or “time out” and get up and leave. This problem arises whenever there are differences in the maximum processing or throughput rates of the independent communicating processes in a system.

One approach to handling this mismatch in rates is to put a queue or buffer in place. You keep track of the requests and fill them as you have time by picking them off the queue. This works if the real problem is bursty request traffic: the request rate might be too high for a short period, but over time it evens out, and the queue is just there to smooth things over time.

So you have a restaurant that can produce 1 hamburger a minute. A car arrives with four people who all order a hamburger. You can’t satisfy their requests immediately, so you queue their requests and after four minutes you have produced 4 hamburgers and the queue is empty again. Five minutes later another car arrives and the same thing happens. Everybody’s reasonably happy.

But if a car arrives every minute, you are taking requests for 4 hamburgers every minute while only producing 1 hamburger a minute, and a queue is not going to help. The queue can grow without bound, but it is not going to make hamburgers any faster.
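The arithmetic above can be sketched as a toy simulation (the numbers are just the hamburger figures from the example):

```python
# Sketch: queue length when arrivals outpace service. Assumes the
# hamburger numbers above: 4 orders arrive per minute, 1 cooked per minute.
def queue_length_over_time(arrivals_per_min, served_per_min, minutes):
    """Return the backlog at the end of each minute."""
    backlog = 0
    history = []
    for _ in range(minutes):
        backlog += arrivals_per_min              # new orders join the queue
        backlog -= min(backlog, served_per_min)  # cook works off what it can
        history.append(backlog)
    return history

print(queue_length_over_time(4, 1, 5))  # [3, 6, 9, 12, 15] — grows 3/minute
```

The backlog grows linearly and forever; no queue depth fixes a sustained rate mismatch.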

Another approach is to throttle. You slow down the requests you make of the system. So the car pulls up to the drive-in window and stays there till their order is fulfilled. You don’t take another request until that car is gone.

In a real restaurant drive-in, there might be space for multiple cars waiting in line to place their order. A car drives by, sees the long line and just continues on without stopping. The total number of requests made over time is throttled because new customers never even get in line to make their request. In any system where throttling is happening, you want to push that throttling (or “back-pressure”) all the way to the ends of the system. Otherwise you just end up moving around where the queuing happens.
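A minimal sketch of this kind of back-pressure, using Python’s bounded `queue.Queue` (the restaurant framing is just illustrative): `put()` blocks the producer whenever the line is full, so the request rate is throttled at the source rather than queued without limit.

```python
# Back-pressure sketch: a bounded queue blocks the producer when full.
import queue
import threading
import time

orders = queue.Queue(maxsize=3)  # only 3 cars fit in the drive-in line

def cook():
    while True:
        order = orders.get()
        if order is None:        # sentinel: no more orders coming
            break
        time.sleep(0.01)         # one "hamburger" takes a while to make
        orders.task_done()

t = threading.Thread(target=cook)
t.start()
for i in range(10):
    orders.put(i)  # blocks when the line is full: back-pressure on the caller
orders.put(None)
t.join()
```

Because `put()` blocks, the producer can never get more than three orders ahead of the cook; the waiting has been pushed back to the end of the system.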

An important variant of throttling is to “shed load”, especially in the Internet context. The request arrives but you just throw it on the floor immediately. Or queue it but then realize you are over-committed and discard it. The benefit of explicitly shedding load is that the requestor can get quicker notification that some part of the pipeline is overcommitted and make adjustments rather than having to wait for a timeout.
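A minimal sketch of shedding load with the same bounded-queue idea: instead of blocking, the system rejects immediately when over capacity (the `submit` helper here is hypothetical, not from the post):

```python
# Load-shedding sketch: reject over-capacity requests immediately.
import queue

pending = queue.Queue(maxsize=2)  # assumed tiny capacity for illustration

def submit(request):
    """Accept the request if there is room, otherwise shed it at once."""
    try:
        pending.put_nowait(request)
        return "accepted"
    except queue.Full:
        return "rejected"  # caller learns immediately, no timeout needed

results = [submit(i) for i in range(4)]
print(results)  # ['accepted', 'accepted', 'rejected', 'rejected']
```

The rejected caller gets its answer right away and can back off or retry, instead of discovering the overload only when its request times out.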

The final approach is over-provisioning. If you can guarantee a component can handle any amount of traffic it is handed, it will not bottleneck the system. Over-provisioning is common but inherently wasteful — you are allocating resources for your maximum load rather than for your typical load.

Any system uses some combination of queuing, throttling and over-provisioning. Surprisingly enough, in real systems throttling is often implicit in the way the system is coded or in the underlying technology chosen for some layer of the system. The developers of the system might not even realize that throttling is happening. That appears to be the case here for LucidChart — the use of HTTP/1.1 was essentially throttling the requests from the client. Switching to HTTP/2 released this throttle, so the server started to get hit with rapid bursts of requests more frequently. It could queue the requests at the server but could not speed up the rate at which it could satisfy them. So more requests were timing out before they could be satisfied.

The LucidChart post discusses possible solutions to the problem they encountered. One approach was to throttle at the load balancer so their application servers would still see the same type of traffic flow they saw before. Exactly what “throttle at the load balancer” means is unclear — if the client is still sending all those requests at once and they are getting queued and slowly released at the load balancer, you could still see the same kinds of client timeouts as before; they are just being generated at a different point in the system.

They also said “perhaps the best solution” is to re-architect their server infrastructure to better handle the bursty traffic. That could be true if their server capacity is already big enough and the problem is just that the new traffic pattern is leading to too many requests being directed at one server while other servers are underloaded. It wasn’t clear that this was what was going on.

As an application guy, before I look for a solution, I usually start with the question: what is really going on at the application endpoint? In any system where you are looking at congestion and traffic-control problems, a major advantage is being able to analyze the system all the way from end to end.

So the first question I would ask is, why is this bursty traffic being generated at all? This actually smells a lot like the story I told in my post Asynchronous Issues in the Word Web App. That congestion problem arose due to bursty traffic because of wide variation in user documents — in that case the number of images in a document. Document authoring applications like Word or LucidChart often have this problem that there are significant outliers from what is a “typical” document. That can generate unexpected traffic patterns. This is the same problem a grocery store has managing its checkout queue when most people arrive with 1 or 2 items and then one person arrives with an entire cart full of groceries. You typically don’t want to design your server infrastructure based on outlying usage patterns.

There was no indication in the post that there was a problem with actual application behavior. The application had been behaving just fine with the implicit throttling behavior inherent in the use of HTTP/1.1. Often the simplest solution is to just change from an implicit throttle to an explicit one. That is the approach that was taken in the Word Web App case described above. In LucidChart’s case, their application was launching a burst of requests all at once that were being implicitly throttled at the client by the browser’s underlying use of HTTP/1.1. Their client could be modified to explicitly limit the number of outstanding requests until a response is received.
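One way such an explicit client-side throttle might look, sketched with an asyncio semaphore. This is only an illustration of the idea, not LucidChart’s code: `fetch_asset` is a hypothetical stand-in for the real request call, and the limit of 6 simply mirrors the old per-host browser connection limit.

```python
# Explicit client-side throttle sketch: cap the number of outstanding
# requests instead of relying on the transport layer to do it implicitly.
import asyncio

async def fetch_asset(url):
    # Hypothetical stand-in for the real network request.
    await asyncio.sleep(0.01)
    return url

async def fetch_all(urls, max_in_flight=6):
    gate = asyncio.Semaphore(max_in_flight)  # at most 6 requests outstanding

    async def throttled(url):
        async with gate:                     # waits if the limit is reached
            return await fetch_asset(url)

    return await asyncio.gather(*(throttled(u) for u in urls))

results = asyncio.run(fetch_all([f"/asset/{i}" for i in range(20)]))
print(len(results))  # 20
```

All 20 requests are issued, but never more than 6 are in flight at once, reproducing the old throttled traffic shape explicitly and tunably at the client.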

There are pros and cons to this approach. The advantage is that you now have an explicit mechanism to tune and control the behavior of your application, rather than having major characteristics of that behavior depend on the dynamics of layers you did not really understand in the first place. This can also be quite simple to implement in the client (your mileage may vary). The disadvantage is that you are placing a constraint at one point in the system that might become unnecessary in the future due to a change elsewhere — perhaps the service infrastructure improves in a way such that all those requests could be handled rapidly rather than throttled. Trying to allow for this by putting a dynamic throttling mechanism in place at the client (rather than just statically limiting to a fixed number) can be surprisingly challenging — there are numerous PhD theses on the topic. The best approach is to keep it simple. If there is a desire to improve application behavior in the future, treat it as an end-to-end feature that requires a consistent set of changes through the overall pipeline.
