Making developers responsible for how an application runs once it's in production may be a central tenet of DevOps. However, it doesn't necessarily make a lot of sense, according to Joe Beda, entrepreneur in residence at Accel Partners and a former Google software engineer who cofounded the Google Compute Engine project.
Beda spoke on Tuesday, Nov. 10, to attendees at KubeCon, a conference in San Francisco for users and developers of the Kubernetes container cluster management system.
Unlike some places, Google doesn't hold developers directly accountable for how their code runs. Rather, the company would rather have someone with development experience who wants to specialize in operations become a site reliability engineer.
Site reliability engineers, or SREs, are responsible for keeping Google Search, Maps, and other production systems running. They're responsible for continuous integration of new code, and other common production tasks. But they're also still programmers.
Instead of working on applications, they work on automating the processes and procedures of the data center to make them more efficient, Beda said.
One of the systems to come out of SRE efforts was the Borg cluster management system, still in use and running Bigtable, the GFS and CFS storage systems, and other components of Google's operations. As the cofounder of Compute Engine -- Google's Infrastructure-as-a-Service -- Beda said he sometimes ended up in "a few places where I had to run Compute Engine and I did a bad job of it. I worked to get our stuff on Borg clusters. The SREs on Borg had the expertise" to run clusters efficiently.
Another Google engineer, Brendan Burns, who cofounded the Kubernetes project, told attendees on Monday that the system was designed to match newly generated containers with the right resources on a server cluster. Kubernetes uses the concept of pods to put related containers that need to share resources on a single host within a cluster.
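To illustrate the pod concept Burns described, here is a minimal sketch of a pod manifest grouping two related containers on one host. The names, images, and commands are hypothetical, chosen only to show co-scheduled containers sharing a volume; they are not from Burns' talk.

```yaml
# Hypothetical example: two related containers co-scheduled in one pod.
# They share the pod's network namespace and an emptyDir volume.
apiVersion: v1
kind: Pod
metadata:
  name: web-with-sync        # hypothetical pod name
spec:
  containers:
  - name: web
    image: nginx             # assumed image, for illustration only
    volumeMounts:
    - name: shared-data
      mountPath: /usr/share/nginx/html
  - name: content-sync
    image: busybox           # assumed image, for illustration only
    command: ["sh", "-c", "while true; do date > /data/index.html; sleep 60; done"]
    volumeMounts:
    - name: shared-data
      mountPath: /data
  volumes:
  - name: shared-data
    emptyDir: {}             # scratch volume shared by both containers
```

Because both containers run in the same pod, Kubernetes schedules them onto the same node, where they can share the volume and reach each other over localhost.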
In his talk, "The Operations Dividend," Beda described effective DevOps as the state in which "the people operating the application are in great communication with the people writing the application." But there's a payoff when effective operations people are given the chance to automate more of their tasks, as SREs do.
In Beda's view, the operations dividend materializes when developers and operations people understand the relationship between the right degree of simplicity and operational costs.
"As things get more complex, costs tend to go up," Beda warned. Costs don't rise linearly as services are added; rather, they escalate as complexity begins to outstrip the operations staff's ability to understand it.
Breaking Complex Systems Down
Very complex systems need to be broken down into smaller units that are easier to manage, update, and maintain. Too many microservices, however, can also add to costs.
"There's a sweet spot," Beda said, where a set of microservices will run well together and should be treated as a unit, sometimes as a Kubernetes pod, or a set of containers on a single host. In other cases, complex, interrelated services need to be broken apart into more discrete units in order to further the understanding of what they're doing, how they're running, and how they can be fixed when something goes wrong.
When an organization finds the sweet spot for its production systems, it gains a dividend where it needs less physical capacity and sometimes fewer people to keep the data center running. That's the operational dividend that Kubernetes, Docker containers, and Borg clusters yield at Google, and the gain yields more hardware capacity for developers and more software engineers with the time to do development.
More Work Ahead
Bob Wise, chief technologist for cloud infrastructure at Samsung SDS America and head of its Kubernetes consulting practice, told KubeCon attendees Tuesday that Kubernetes helps container managers scale up their resources today, but more work needs to be done to make it better at scaling in the future.
Wise has led the scaling-oriented K8Scale Special Interest Group of the Kubernetes project since it was formed in August, following the release of Kubernetes 1.0 last July. The group has met weekly since its formation, said Wise.
Samsung wants to use Kubernetes at a scale that's still difficult to achieve. "We want really large clusters and lots of sharing [of resources within the cluster]," said Wise. "We want to be the Google infrastructure that's for everybody else."
Tuning Kubernetes clusters, however, "is not going to get us to the goal." Tuning is too piecemeal to overcome the barriers that large implementers encounter as they try to get Kubernetes to scale further. Google has never disclosed the scale at which it operates container clusters inside its data centers, but it is widely believed to be the foremost practitioner of running containers at scale.
[Want to learn more about the Kubernetes 1.1 release? See Kubernetes Augments Container Management.]
When Google's Kelsey Hightower, master of ceremonies for KubeCon, asked at one point during the event who in the audience was running the largest Kubernetes cluster in production, the nod went to Jack Foy, engineering manager for Whitepages in Seattle. Foy said he was operating two Kubernetes clusters in production, one of 10 nodes and a second of 25 nodes.
There are larger Kubernetes clusters in research and lab settings, sponsored by the Cloud Native Computing Foundation, but the modest size of Foy's clusters illustrated how far Kubernetes has to go before it's used as a container orchestration system in large-scale production settings.
Google donated its core Kubernetes code to the foundation last July.
"End-to-end optimizations are where we find the biggest gains," said Wise, but such results mean re-engineering the Kubernetes system to achieve those optimizations, a process that will continue in the K8Scale SIG and the project as a whole.