Elizabeth Nelson

Cloud Computing for Agility, Complexity, and Speed

by J. Edward Anthony

Cloud computing has become synonymous with the internet itself. If you watch Netflix, call an Uber, or meet on Zoom, a cloud data center is delivering the processing power and data storage to make your experience possible. “But cloud systems have a serious problem,” says Christina Delimitrou, Electrical and Computer Engineering—“unpredictability.”

As a business service, cloud computing has provided the infrastructure for a veritable tech explosion in the past decade. An internet startup can purchase a small plot in the cloud and then expand as business grows, all without buying or maintaining a single server. For the typical computer user, the cloud brings the power of giant data centers to a device that fits in a hip pocket. With the cloud’s mobility and power, it is no surprise that even the Central Intelligence Agency has adopted the technology.

But new developments in hardware and software have outstripped the ability of systems experts to keep data centers operating at optimal performance and efficiency. “Applications change quite frequently, and their resource requirements change too,” says Delimitrou. “And you don’t have only a single application. You might have thousands all linked together.”

A Next Generation of Software and Hardware

In the early days of cloud computing, web-based applications were written as single, monolithic programs. Today, software designers build applications from smaller interdependent components called microservices. An end user who visits an internet retailer or social media platform might engage with hundreds of microservices. Each microservice is self-contained, so a programmer can redesign one microservice for improved performance and functionality without tinkering with the rest of the application. The shift to microservices means that many interdependent applications are running simultaneously. Microservices also tax the network, because they depend on each other for data and must communicate constantly. A problem with one can have a cascading effect on the entire system.
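
The cascading effect can be illustrated with a toy model. The sketch below is not taken from Delimitrou's work; the service names, the call graph, and the latency figures are all invented, and it assumes each service calls its dependencies in parallel and waits for the slowest one.

```python
# Hypothetical call graph: each service lists the services it calls.
CALLS = {
    "frontend": ["cart", "recommend"],
    "cart": ["inventory"],
    "recommend": ["inventory"],
    "inventory": [],
}


def end_to_end_latency(service, own_ms, calls=CALLS):
    """Return a service's response time in milliseconds.

    Assumption: a service does its own work, then waits for the slowest
    of its downstream calls, which are issued in parallel.
    """
    downstream = [end_to_end_latency(dep, own_ms, calls) for dep in calls[service]]
    return own_ms[service] + max(downstream, default=0.0)


# Invented per-service processing times (ms) for a healthy system.
healthy = {"frontend": 2.0, "cart": 3.0, "recommend": 4.0, "inventory": 1.0}

# The same system with a single slow microservice at the bottom of the graph.
degraded = dict(healthy, inventory=20.0)

for label, timings in (("healthy", healthy), ("degraded", degraded)):
    print(label, {name: end_to_end_latency(name, timings) for name in CALLS})
```

In this toy setup, slowing only the inventory service raises the response time of every service that depends on it, directly or indirectly, which is the kind of ripple the paragraph above describes.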

Specialized hardware platforms also add complications. When the cloud was first introduced, servers were basically the same and could host any application. “Now people are switching to heterogeneous hardware designs,” says Delimitrou. “We aren’t packing more power into smaller chips at the rate we used to. New hardware technologies—like special chips for specific applications and reconfigurable fabrics—are a way to move forward. These platforms have performance, power, and cost benefits. But it takes what used to be a perfectly homogeneous system and makes it far more complex.”

Experience—a Limited Guide

The added complexity puts an incredible burden on cloud operators, who have to allocate compute, storage, and network resources among thousands of applications. When applications were simpler and servers basically interchangeable, systems experts could rely on past experience—what Delimitrou calls an empirical, heuristic approach. “Traditionally, an application like web search needs so much compute, so much memory, so much network bandwidth,” Delimitrou says. “Knowing that means you can provision your servers with the resources you’ve seen the application needing in the past. Or you might do some trial and error: ‘Let’s adjust that resource a little bit and see how it reacts.’ The problem is that standard empirical and heuristic-based approaches don’t scale as you add more complexity to the system.”

Cloud services hold themselves to high standards, and serious failures are rare. But current levels of unpredictability make it too risky to put real-time applications, such as driverless cars, in the cloud. “If it takes a search engine 11 milliseconds instead of 10 to return results, it’s probably okay,” says Delimitrou. “But if an autonomous vehicle doesn’t respond to an obstacle in time, it’s going to crash. There are a lot of applications that are not possible now, that would be possible if we had more predictable performance.”

A Data-Driven Approach

As a PhD student, Delimitrou discovered that the data center operator where she interned was recording massive amounts of data about the system’s behavior—how long every network request took to process, when and where delays occurred, every tiny error. The data exceeded what an army of experts could sift through, and it was constantly changing. “That’s when the idea clicked for me,” she says. “Don’t rely on people to solve this problem. Rely on the data. Rely on the system to tell you how to optimize the system.”

Her first attempt, focused on resource allocation, met with skepticism. “We wanted to see whether there’s some useful information in all that data to improve resource management decisions,” says Delimitrou. “We were trying to show that you can take a new application, run it a few seconds, and then based on that, make an almost perfect decision about the resources it needs.” With the support of her adviser, Delimitrou persisted. When she presented her first paper at a conference, people were impressed by the scale of her experiments. Now several large cloud providers have introduced data-driven design to their resource allocation systems.
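
One way to picture the idea of profiling a new application briefly and then choosing its resources is a lookup against previously seen applications. The sketch below is only an illustration under stated assumptions: the article does not say which technique Delimitrou used, so a plain nearest-neighbor match stands in for it, and the application names, profile features, and allocations are hypothetical.

```python
import math

# Hypothetical history. Each profile is a short measurement of
# (CPU utilization, memory pressure, network utilization), each in 0..1,
# paired with an allocation that worked well for that application.
KNOWN_APPS = [
    {"name": "web-search", "profile": (0.80, 0.30, 0.20),
     "allocation": {"cores": 16, "memory_gb": 32}},
    {"name": "key-value-cache", "profile": (0.20, 0.90, 0.60),
     "allocation": {"cores": 4, "memory_gb": 128}},
    {"name": "video-transcode", "profile": (0.95, 0.50, 0.10),
     "allocation": {"cores": 32, "memory_gb": 64}},
]


def recommend_allocation(profile, history=KNOWN_APPS):
    """Return (name, allocation) of the known app whose profile is closest.

    Closeness is plain Euclidean distance between profile vectors,
    an assumption chosen for simplicity.
    """
    nearest = min(history, key=lambda app: math.dist(profile, app["profile"]))
    return nearest["name"], nearest["allocation"]


# A new application observed for a few seconds (invented numbers).
new_app_profile = (0.75, 0.35, 0.25)
print(recommend_allocation(new_app_profile))
```

The point of the sketch is the workflow, not the matching rule: a few seconds of measurements are compared with data the system already holds, and the decision comes from that data instead of from an operator's memory.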

Delimitrou envisions a cloud system that can constantly adjust itself based on its past behavior. She is currently working on machine learning to anticipate performance issues in microservices before they happen. “Instead of waiting until performance becomes bad before you react, when you see that a pattern will evolve into something that might be problematic, adjust the resources and avoid the unpredictable performance altogether,” she says. “If you wait too long, it also takes very long for the system to recover and for performance to go back to normal.” In effect, she is endowing cloud systems with the intelligence to foresee issues and learn from past mistakes.
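
The difference between reacting and anticipating can be shown in a few lines. The sketch below is a deliberately simple stand-in, not the machine learning the article refers to: it fits a straight line to recent latency samples and extrapolates a few steps ahead. The function names, the latency limit, and the sample values are assumptions made for illustration.

```python
def predict_next(samples, horizon=3):
    """Extrapolate a least-squares line through `samples` by `horizon` steps."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    var_x = sum((x - mean_x) ** 2 for x in range(n))
    slope = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(samples)) / var_x
    # Evaluate the fitted line `horizon` steps past the last sample.
    return mean_y + slope * ((n - 1 + horizon) - mean_x)


def should_scale_up(latencies_ms, limit_ms, horizon=3):
    """Proactive rule: act when the predicted latency crosses the limit."""
    return predict_next(latencies_ms, horizon) > limit_ms


def reactive_rule(latencies_ms, limit_ms):
    """Reactive rule for comparison: act only once the limit is already exceeded."""
    return latencies_ms[-1] > limit_ms


# Invented latency samples (ms) that are rising but still under a 20 ms limit.
rising = [10.0, 12.0, 14.0, 16.0]
print("reactive:", reactive_rule(rising, limit_ms=20.0))
print("proactive:", should_scale_up(rising, limit_ms=20.0))
```

With these samples the reactive rule does nothing because the latest reading is still under the limit, while the trend-based rule flags the problem early, leaving time to add resources before users notice.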

The Full Potential of Heterogeneity

Delimitrou hopes her data-driven approach will enable cloud operators to take full advantage of specialized hardware. “Programmability is a major challenge with heterogeneous hardware, especially when relying on empirical approaches. To some extent, that’s limited a lot of companies to introducing hardware accelerators to only specific applications they know well, instead of exploiting their full potential—exposing more applications to it, designing custom hardware for different applications, and deploying it at scale.”

The Computer Systems Laboratory at Cornell provides the perfect environment for the collaborations that are important to Delimitrou. “A lot of the optimization opportunities come from looking across the system stack,” she says. “It’s very rare to have so many people working together with such diverse expertise—from low levels of architecture, even circuits, to traditional architecture, operating systems, distributed systems, and programming languages. There’s basically an expert for each layer of the system stack. I am collaborating on projects that would be impossible in most places.”