DuneOps Part 2: Crossing bloodlines

Paul Atreides with blue Fremen eyes in Dune Part Two

In last week's Dune as an operational model post, I combined two of my obsessions — Dune and Ops — into one allegory. This week, I talk about how we can evolve beyond the existing operational models, thanks to AI. Which of course is ironic because in Dune, society rejected AI through the Butlerian Jihad. But anyway...

As a short recap of what became somewhat of a manifesto, I talked about how harsh environments like the planet Arrakis lead to hardened fighters. In ops, a hands-on environment leads to better operators. And operational models where those same people are empowered to automate and solve their problems, and given the space to be altruistic and expand their tools for others to use, yield the best operational outcome for customers and the most efficiency for a company.

And yes, as I write this I'm listening to the soundtrack from Dune Part 2. Thank you Hans Zimmer for this beautiful gift to the world.

Platform Engineering: The Next Generation

This post looks to the future of how AI transforms Platform Engineering teams in exciting ways.

Of course, AI helps all of the models. The latest service I've been working on, AWS DevOps Agent, is being used by teams of all configurations, from customer support teams to frontline ops, SRE, DevOps, and even Platform Engineering. It handles the typical alarm-driven "find the root cause" troubleshooting, but it also helps with ad hoc ops, from researching a specific customer's reported issue to scanning across teams' stuff for best practices, like finding missing or misconfigured alarms.

But in Platform Engineering, the thing that AI can help with is particularly interesting to me. A Platform Engineering team is one that builds tools that help other teams build, test, deploy, secure, and operate their systems with lower effort and with better outcomes. These teams try to push the ownership boundary as far as possible to simplify things for other teams. But it's hard to do, and it's a balancing act. Each team's environment varies just enough that building software to automate the act of applying, configuring, and using a tool in every bespoke environment is simply impractical.

I've worked on a team like this before. We built the web service framework that does all of the request/reply serialization, authentication, admission control, and that kind of stuff for teams across Amazon and AWS. By defining the API model, we could expand that out to take care of all that stuff, and even generate SDKs in a bunch of languages, all automatically.

Trading off solving the problem for customization

"I'm limited by the technology of my time" — a meme that sums up the Platform Engineering dilemma

But there was only so far we could go with the abstraction. Sure, we had features like caching that people could configure, but we couldn't decide to add caching automatically without knowing more about the guarantees behind their APIs and the expectations of their clients. If we wanted to pursue that, we could provide a sort of "configuration wizard" but it would ultimately be a tedious 20 questions exercise. Everyone would have to decide to bother using the wizard, and then the hard part of testing and verification would be an exercise left to the reader.

In other cases, Platform Engineering teams have been able to push the decisions and ownership quite far. I've seen a couple of solutions that automatically set up certain types of system alarms (CPU, file descriptors, etc.) and try to tune their thresholds, but that only works for customers who have specific tool setups and specific architectures. Customization on top of these would have limited knobs that had to be coded up, but more knobs lead to more complexity, and push customers to just do the whole thing themselves. And it leaves everyone to set up their own application health alarms, so adding infra alarms isn't that hard to do once you're in there setting up alarms anyway.

It's a tough problem. On one extreme, a Platform Engineering team can stop short of solving the complete problem and give out checklists and tons of configuration knobs that everyone has to sort through, understand, and apply. On the other extreme, a Platform Engineering team can solve the whole problem, but only if the customer is using very specific technologies, tools, and architectures.

Flexibility without configuration knob hell

The missing ingredient in both extremes is adaptability. Adaptability means that the tool can handle and reason about differences that it wasn't coded to handle. An agentic solution can say, "Oh, your service has 2 load balancers behind one DNS record? Great, let's set up alarms for all of them." But adaptability also needs customizability, and agents solve that too, through steering, skills, and deterministic tools that gate the agent. If the agent didn't realize that one of those load balancers behind the DNS record is there for blue/green deployments, you can write a couple of bullet points in a markdown file to guide it to understand that it's normal for one to be weighted to 0% traffic, and to adjust the alarms accordingly.
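As a rough illustration of the "deterministic tools that gate the agent" idea, here is a minimal sketch of a check that takes an agent's proposed alarms and drops the ones that would misfire on an idle blue/green load balancer. Everything in it is hypothetical: the `WeightedTarget` type, the alarm names like `low_traffic`, and the `gate_alarm_plan` function are made up for this example and are not part of AWS DevOps Agent or any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WeightedTarget:
    """One load balancer behind a weighted DNS record (hypothetical shape)."""
    load_balancer: str
    weight: int  # 0 means DNS routes no traffic to this load balancer


def gate_alarm_plan(targets, proposed, blue_green):
    """Filter an agent's proposed alarms with one deterministic rule.

    If the service uses blue/green deployments, a load balancer weighted
    to 0 is expected to be idle, so a "low_traffic" alarm on it would
    only produce noise and is removed. All other alarms pass through.
    """
    gated = {}
    for target in targets:
        alarms = list(proposed.get(target.load_balancer, []))
        if blue_green and target.weight == 0:
            alarms = [a for a in alarms if a != "low_traffic"]
        gated[target.load_balancer] = alarms
    return gated


if __name__ == "__main__":
    targets = [WeightedTarget("lb-blue", 100), WeightedTarget("lb-green", 0)]
    proposed = {
        "lb-blue": ["5xx_rate", "low_traffic"],
        "lb-green": ["5xx_rate", "low_traffic"],
    }
    print(gate_alarm_plan(targets, proposed, blue_green=True))
```

The `blue_green` flag stands in for the couple of bullet points in the markdown file: the agent still does the adaptive part (discovering the load balancers and proposing alarms), and the gate only enforces the one rule you care about.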

This really works. Last week I was looking at a retrospective where a system had a misconfigured alarm that, configured correctly, would have helped the team get an earlier start on the problem. Some types of alarms are unfortunately easy to misconfigure, either pointing to the wrong underlying resource, or using the wrong statistic like "avg" where it was supposed to be "sum", or that kind of thing. Well, to test my "agents are great" hypothesis, I tweaked my own test service to have the same alarm misconfiguration, and prompted AWS DevOps Agent to audit my alarms. And sure enough, it found the misconfiguration, and even recommended some other alarms to add. Since I configure alarms as IaC, I fed its report to Kiro and had it fix the broken alarm and add the missing ones. To organize the alarms the way I like, I described my alarm strategy of rollups with composite alarms (something that I hadn't written down before because it's complicated and varies a little bit with every application) and it rearranged the alarms into a better setup than I had in the first place.
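For a sense of the two misconfigurations mentioned above (wrong statistic, wrong underlying resource), here is a minimal sketch of a hand-coded audit. It is not how AWS DevOps Agent works; the `Alarm` type, the `EXPECTED_STATISTIC` table, and the `audit_alarms` function are assumptions made up for this example, and the metric names are only illustrative.

```python
from dataclasses import dataclass

# Hypothetical rule table: the statistic each metric is expected to use.
# Count-style metrics usually want a sum, latency-style ones an average.
EXPECTED_STATISTIC = {
    "RequestCount": "Sum",
    "ErrorCount": "Sum",
    "ResponseTime": "Average",
}


@dataclass(frozen=True)
class Alarm:
    """A simplified alarm definition, as it might be parsed from IaC."""
    name: str
    metric: str
    statistic: str
    resource: str


def audit_alarms(alarms, live_resources):
    """Return human-readable findings for two common misconfigurations."""
    findings = []
    for alarm in alarms:
        expected = EXPECTED_STATISTIC.get(alarm.metric)
        if expected is not None and alarm.statistic != expected:
            findings.append(
                f"{alarm.name}: statistic is {alarm.statistic}, expected {expected}"
            )
        if alarm.resource not in live_resources:
            findings.append(
                f"{alarm.name}: points at unknown resource {alarm.resource}"
            )
    return findings


if __name__ == "__main__":
    alarms = [
        Alarm("errors-high", "ErrorCount", "Average", "lb-1"),
        Alarm("latency-high", "ResponseTime", "Average", "lb-retired"),
    ]
    for finding in audit_alarms(alarms, live_resources={"lb-1"}):
        print(finding)
```

A rule table like this only catches what someone thought to encode ahead of time, which is the limit the agent gets past: it can reason about a metric or an architecture that nobody wrote a rule for.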

AI pushes the boundary on "how much of the full problem can you own" so much further than before. Instead of just dumping pull requests of code upgrades on every team, you can also trigger their load test suite. Sure, this was possible before, but it was impractical because there was no "run my bespoke load test suite and interpret the results in an isolated test environment" API. At least that's not a W3C standard that I'm familiar with. But with agents that can adapt, and retry until they get what they need, it's practical at scale.

Crossing bloodlines

Ops remains a cultural problem, and for the longest time people have tried to treat the cultural problem as a technology problem and failed to solve it. So while this new agentic AI stuff is a very powerful new technology, Platform Engineering teams still need to consider the DuneOps philosophy: strength comes from experience. When I was on a Platform Engineering team, the longer I was on it, the less intuition I had for where the real problems were for my customers. Sure, I'd talk to as many customers as I could, attend ops meetings, read retrospectives of where things went wrong, and all that. But my personal hands-on experience of the pain of ops was a snapshot in time that kept moving further into the past.

So maybe there's a new DuneOps model for the next generation of Platform Engineering. AI makes it so those teams can reach even further into the end solution — at scale — than ever before. Maybe this gives those teams just enough time to take on the ops for a production system that looks more like what their customers run. Sure, Platform Engineering teams operate the systems that they build, but those tend to have different operational requirements and be built differently. This way they still have the charter to build for everyone, but they gain the firsthand understanding that they are (or aren't) building the right thing for everyone. Wait, but this sounds like it has properties of an SRE team or a DevOps team or a Frontline Ops team! A sort of combination of parts of each!

In Dune, the Bene Gesserit worked in the shadows over 90 generations on a eugenics program, crossing bloodlines of the major houses with the end goal of producing a human with a mind so powerful that it could see past space and time — a being they called the Kwisatz Haderach.

DuneOps is exactly this. With AI, you can now cherry-pick genes from each operational model and combine them, borrowing the strengths of other operational models to offset any inherent weaknesses in the core model you choose. (And if you think your operational model lacks weaknesses, read last week's Dune as an operational model and then "@" me on twitter or whatever.)

And so, just like how Paul Atreides came out of generations of crossed bloodlines, AI is the leverage we needed to borrow from and combine the operational models, and to take them further than we ever could before.