Do your Apps tolerate latency?
Last Saturday I attended a wonderful event at the Computer History Museum in Mountain View. The Cloud Tech III day was organized by the Silicon Valley Cloud Computing group and attended by more than 250 geeks. I especially enjoyed the deep technical keynotes by Jeff Dean, the legendary designer of BigTable and MapReduce, Andy Bechtolsheim, founder of Sun and Arista, and Raymie Stata, former Yahoo CTO. Jeff Dean presented the latest research at Google; Andy Bechtolsheim laid out his vision of the evolution of data-center networks and outlined the issues TCP faces in clouds; and Raymie advocated for an orchestrator playing the auditor role between all the different sources of truth in the byzantine empire of Infrastructure as a Service!
The most important lesson I took away from the day is that the pain DevOps professionals feel when deploying and efficiently operating distributed applications in cloud environments arises mainly because:
- The end-to-end network is not flexible enough to isolate, differentiate and optimize traffic flows according to their specific needs.
- The network is dumb and hidden, always presented as a black box.
An overview of Jeff's presentation: slide 1 mapped Google's distributed worldwide services relying on the private Google network; slide 2 told us that, not surprisingly, the network is their first-class issue; slide 3 taught us that on a shared machine you can always expect network congestion; slide 4, that in a shared environment a developer has to deal with many challenges. It is well known that many sharing issues are solved well in fully controlled low-latency clusters. But achieving the same for cluster-level services in a wide-area, virtualized context is not as easy, and it leaves the resolution of many cross-cluster issues to human operators: naming, locality, consistency, migration…
To cope with uncertainty and an ever-changing environment, Google's experts are working on sophisticated "tolerance" and "approximation" techniques. For example, one of the approaches Google engineers adopted for dealing with unpredictable queuing delays is based on latency-tolerating techniques. They also rely heavily on machine-learning solutions.
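One well-known latency-tolerating technique of this family is the hedged (or "backup") request: send the query to one replica, and if the reply does not arrive within a short deadline, fire the same query at a second replica and take whichever answer comes back first. A minimal sketch in Python, where the replica callables stand in for real RPCs and the delay values are purely illustrative:

```python
import queue
import threading
import time

def hedged_request(replicas, hedge_delay):
    """Call replicas[0]; if no reply arrives within hedge_delay seconds,
    also call replicas[1] and return whichever answer arrives first."""
    results = queue.Queue()

    def call(replica):
        results.put(replica())  # replica() stands in for a network RPC

    threading.Thread(target=call, args=(replicas[0],), daemon=True).start()
    try:
        return results.get(timeout=hedge_delay)   # fast path: first replica answered
    except queue.Empty:
        # First replica is slow: hedge with a backup request.
        threading.Thread(target=call, args=(replicas[1],), daemon=True).start()
        return results.get()                      # first of the two answers wins

# Toy replicas: one suffering a queuing delay, one responsive.
slow = lambda: (time.sleep(0.5), "slow")[1]
fast = lambda: "fast"

print(hedged_request([slow, fast], hedge_delay=0.05))  # → fast
```

The hedge delay is typically set near the tail of the expected latency distribution, so backup requests stay rare and the extra load stays small.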
We hear the same echo from companies such as Netflix, Facebook and Zynga: their smart engineers are inventing sophisticated solutions to run efficiently at very large scale.
My concern is that the majority of small and medium enterprises that write applications and deploy them in the cloud do not have such armies of smart engineers. This is why new tools to ease and reduce the cost of cloud deployment have to be built, and they are already on the way! Network virtualization, which pushes all the complex algorithms out of the network boxes and into a consolidated software layer, is also very good news for app developers: it makes it possible to automate the exploration and control of latency and bandwidth, and to develop flow-based, policy-driven networking solutions that meet the specific needs of each application.
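To make "flow-based and policy-driven" concrete, once the control logic lives in software, matching a flow to an application policy becomes ordinary code rather than box-by-box configuration. A deliberately simplified sketch, where every application name, port number and threshold is a hypothetical placeholder:

```python
# Hypothetical per-application policies: a latency-sensitive class,
# a throughput-oriented class, and a catch-all default.
POLICIES = {
    "voip":      {"max_latency_ms": 30,   "priority": 0},
    "analytics": {"max_latency_ms": 5000, "priority": 2},
    "default":   {"max_latency_ms": 500,  "priority": 1},
}

# Toy flow classifier: map a destination port to an application.
PORT_TO_APP = {5060: "voip", 9042: "analytics"}

def classify(flow):
    """Return the application name and policy for a flow descriptor."""
    app = PORT_TO_APP.get(flow["dst_port"], "default")
    return app, POLICIES[app]

app, policy = classify({"dst_port": 5060})
print(app, policy["priority"])  # → voip 0
```

A real controller would of course match on richer flow keys and program the forwarding layer accordingly, but the point stands: the policy is expressed once, in software, per application need.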