There may be no going back: ecosystems keep getting peppered with options meant to help developers move faster, even as those options increase the complexity developers must manage. Whether it is microservices, multi-cloud, or another technology introduced as part of a modernization plan, each new layer adds potential for confusion. The complexity of these so-called “deep systems” can make production issues harder to address when they emerge.
Providers of solutions meant to increase observability and improve efficiency in these ecosystems have been in the spotlight of late as concerns about complexity become more widespread.
Daniel Spoonhower, CTO and cofounder of LightStep, has been elbow-deep in this issue and has seen it from both sides. He previously worked at Google on developer tools for its internal infrastructure and cloud platform. Before taking the stage at the recent ESCAPE/19 multi-cloud conference, he spared some time to discuss the rise of large-scale, deep systems, how they can slow code deployment, and why boosting observability matters.
Where and how did deep systems start to emerge?
“People have been talking about scale for a long time, but until recently scale meant scaling horizontally with more VMs [virtual machines]. What’s happened with things like microservices is that systems are beginning to scale deeply. The important thing about calling out deep systems is that depth introduces complexity in a different way. If you’ve broken things up so that different parts of your application can be managed independently, you’re left in a spot where teams at the top of the stack are beholden to those below them for good performance. They have a responsibility, even if it’s indirect, for the performance of those systems, but they don’t have control over them. That tension between a large scope of responsibility and a small scope of control creates a lot of stress. What we are trying to do is give those folks more information, so they can better understand what’s happening below them in the stack.”
What are the initial signs that organizations should be cognizant of where they might run into trouble as layers of complexity are introduced?
“For a lot of organizations, automation and orchestration tools get a lot of thought because they have to have them. They say, ‘Are we doing Kubernetes or not? Is there a service mesh or not?’ A lot of things on the observability side are left until later in the process. If there is one thing I would encourage people to do, it is to think about observability from the beginning. Think about what data you’re going to need, how you’ll get it from the application, what you’ll do with it, how you’ll manage it, and how it will drive feedback into the system so you can regain control.”
Can you describe a potential worst-case scenario if deep systems get out of hand and how the situation could be addressed?
“One path a lot of organizations take is to start from a more monolithic architecture and break pieces off of it to move toward a microservice architecture. As they do that, they’re probably thinking about frameworks for how they’ll manage remote requests. At the same time, they should think about building standards into their platform and organization for how they gather telemetry. We’ve partnered with a number of others to build the OpenTelemetry standard for getting data out of architectures. That gives you flexibility down the road as you choose different solutions for what you do with that data, because there is a lot of cost in making changes to the application itself. Choosing something upfront to get that data and creating some standards is a good way to start.”
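The flexibility Spoonhower describes comes from separating the instrumentation code, which stays in the application, from the exporters that ship the data off, which can be swapped later. The toy tracer below is a hypothetical sketch of that decoupling in Python; the names (`Tracer`, `Span`, `add_exporter`) are illustrative, not the real OpenTelemetry API.

```python
import time
from typing import Callable, List

class Span:
    """A single timed operation in a request."""
    def __init__(self, name: str):
        self.name = name
        self.start = time.time()
        self.end = None

class Tracer:
    """Toy vendor-neutral tracer: instrument once, swap exporters freely."""
    def __init__(self):
        self.exporters: List[Callable[[Span], None]] = []
        self.finished: List[Span] = []

    def add_exporter(self, exporter: Callable[[Span], None]) -> None:
        # An exporter could print to the console or send to any vendor;
        # the application code never changes.
        self.exporters.append(exporter)

    def span(self, name: str):
        tracer = self
        class _Ctx:
            def __enter__(self):
                self.s = Span(name)
                return self.s
            def __exit__(self, exc_type, exc_val, exc_tb):
                self.s.end = time.time()
                tracer.finished.append(self.s)
                for export in tracer.exporters:
                    export(self.s)
        return _Ctx()

tracer = Tracer()
exported = []
tracer.add_exporter(lambda s: exported.append(s.name))  # stand-in backend

# Application code is instrumented once; nested spans finish inner-first.
with tracer.span("checkout"):
    with tracer.span("charge-card"):
        pass
```

Replacing the lambda with a different exporter changes where the data goes without touching the `checkout` code, which is the cost-saving Spoonhower points to.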
How much of this is a balance of how the technology is evolving versus the people and the team’s culture?
“People are definitely a big part of the solution, and maybe the problem. We saw things like this at Google, where we had very deep stacks with shared infrastructure at the bottom. Outside of Google, that role is played by a third-party vendor or a cloud provider. There would often be disagreements between those lower-level systems and the applications built on top of them about whose fault a particular problem was. We saw tracing and other kinds of telemetry being used to cut those discussions short. Otherwise it was a situation where I think ‘this’ and you think ‘that.’ That meant a lot of time, a lot of meetings, and a lot of human effort, when we could go look at the data and get answers much more quickly.”
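The "go look at the data" move often comes down to simple arithmetic over trace spans: subtract the time spent in downstream calls from each layer's total to see where latency actually originated. The snippet below is a hypothetical illustration with made-up service names and timings, not data from any real trace.

```python
# Made-up spans from one slow request: each layer's total duration
# and the downstream services it called.
spans = {
    "api-gateway":  {"duration_ms": 480, "children": ["user-service"]},
    "user-service": {"duration_ms": 440, "children": ["database"]},
    "database":     {"duration_ms": 400, "children": []},
}

def self_time(name: str) -> int:
    """Time a layer spent itself: its total minus time waiting on children."""
    span = spans[name]
    child_total = sum(spans[c]["duration_ms"] for c in span["children"])
    return span["duration_ms"] - child_total

attribution = {name: self_time(name) for name in spans}
# Each service above the database contributed only 40 ms of its own;
# the data, not a meeting, points at the bottom of the stack.
```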
What kind of trajectory are we seeing in terms of complexity? Is this going to be a mounting issue that is pervasive with teams looking for help?
“It seems to be moving more quickly. Looking over the last 20 years, with the amount of open source that’s available, I can pull frameworks off the shelf to do a lot of things we once would have built in-house. That’s a kind of complexity we didn’t have before. There are a lot of good things about microservices: they allow your teams to work more independently and treat each other like customers. Thinking about the service levels those services offer one another is a great way to hold those teams accountable.”
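One common way to make the "teams as customers" accountability concrete is an error budget: a service publishes a target, and its consumers track how much of the allowed failure has been spent. A minimal sketch, assuming a simple availability target and request counts (the function name and numbers are illustrative):

```python
def error_budget_remaining(slo_target: float, total: int, failed: int) -> float:
    """Fraction of the error budget left for the period.

    slo_target: e.g. 0.999 for a 99.9% availability target.
    Returns a negative number once the SLO has been breached.
    """
    allowed_failures = (1.0 - slo_target) * total
    return (allowed_failures - failed) / allowed_failures

# A 99.9% target over 1,000,000 requests allows 1,000 failures;
# 250 failures so far leaves three quarters of the budget.
remaining = error_budget_remaining(0.999, 1_000_000, 250)
```

A team at the top of the stack can watch this number for each service it depends on, which turns the responsibility-without-control tension into a concrete, shared metric.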