What would Cloud Congestion Control look like?
Do you know why Van Jacobson invented the famous Additive Increase, Multiplicative Decrease (AIMD) algorithm in 1988? It was the result of an unexpected Internet collapse. Initially, the Internet had been designed without any congestion control. In fact, the Internet was not supposed to become so popular! Nobody was worrying about congestion before 1984 (RFC 896). But then a congestion collapse happened in 1986, when the throughput of the NSFnet phase-I backbone dropped three orders of magnitude, from its capacity of 32 kbit/s to 40 bit/s.
AIMD was integrated into TCP in 1988 to prevent, control and eliminate congestion in the Internet. Having investigated this protocol in depth for more than 10 years, I am still fascinated by its genius. It is a completely distributed algorithm, designed and implemented by a community via a test-and-try approach, and it has evolved to adapt to new networking technologies such as WiFi, ultra-fast wide area networks and datacenter networks. This highly collaborative congestion control algorithm, embedded in each Internet end node, is just brilliant. The rocket science behind this artistic invention was only recognized recently, when it was modeled for the first time in the last decade by a famous control theory team at Caltech. The key to the stability of the Internet resides in the feedback loop of the TCP algorithm. TCP is really difficult to tune for high performance, but it transports data in all situations and is really robust!
In 2006, my team and I tried to reproduce an Internet collapse at the TCP train wreck workshop organized by the Clean Slate lab at Stanford. All the big names of TCP/IP were there. We generated thousands of flows from hundreds of independent sources in our academic cloud (Grid5000). We were able to simulate, in real time and in front of all of the fathers of TCP, an Internet collapse. In less than two minutes we were able to block myriads of sources. But only for a couple of minutes. After that we progressively saw all the sources come back to life, and after some time our “Internet” was in a normal state again.
TCP is working very well, and even though rates and burstiness are increasing tremendously, an Internet collapse will probably never occur again. However, over time we observe that we suffer very bad performance, unbearable latencies, abundant useless retransmissions and bandwidth wastage, because TCP was not designed for low delay, differentiated service, or very high bandwidth. TCP is a solution designed for transporting homogeneous, small files, and it barely fits any other scenario.
TCP is all about the automation of TEST & TRY. You start slow, then you increase your sending rate linearly until you reach the capacity limit of your path. Then you back off and start probing the path up to the limit again, in a continuous feedback loop.
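This feedback loop can be sketched in a few lines. The simulation below is a minimal illustration, not real TCP: the capacity, the increase step and the 0.5 decrease factor are illustrative parameters, and real TCP works on a congestion window measured in segments per round-trip time, with packet loss as the congestion signal.

```python
def aimd(capacity, increase=1.0, decrease=0.5, rounds=20):
    """Simulate the AIMD feedback loop on a single path.

    Additive increase: probe the path by adding `increase` each round.
    Multiplicative decrease: on a congestion signal (rate above
    capacity, standing in for packet loss), multiply by `decrease`.
    Returns the per-round sending rates, which form a sawtooth.
    """
    rate = 1.0  # start slow
    history = []
    for _ in range(rounds):
        if rate > capacity:
            rate *= decrease   # congestion detected: back off multiplicatively
        else:
            rate += increase   # no congestion: increase linearly
        history.append(rate)
    return history

rates = aimd(capacity=10.0)
```

Plotting `rates` shows the characteristic sawtooth: the rate climbs linearly past the capacity, halves, and climbs again, oscillating around the path limit instead of collapsing.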
Let’s be very pragmatic and come back to the cloud. I assume that if you choose to adopt the cloud, it is because you can develop your great ideas on a small budget. Indeed, the majority of businesses are in the cloud for this reason – to save time and money! It’s not just for fun or pleasure! So you will want to rapidly get the best return on your investment. Getting the maximum leverage for your money is a key part of the game!
We are no longer in the context of a free and unlimited Internet where all flows cooperate very well. In fact, for 20 years we have been struggling with this design and with the need for more predictability in the Internet. This is exacerbated in the cloud. We are in a very competitive world and we pay for the resources we need. The new trend is to cost-optimize each individual infrastructure! In this scenario, we do not expect much collaboration, nor do we expect users to back off automatically to be nice to their neighbors. This is not just happening for bandwidth sharing, but for all resources shared in the cloud.
It’s a strange game! And we certainly need to (re)invent a sort of automated “congestion control” for the cloud, to stop critical bottlenecks and resource starvation from interfering with successful service delivery: a feedback loop that constantly adapts the networks of resources to the activity of their respective heterogeneous applications.