Dune as an operational model


Dune as an allegory for ops

Fremen fighters in the desert of Arrakis

I am obsessed with the new Dune movies by Denis Villeneuve. The filmmaking, the music by Hans Zimmer, the restraint and pacing, the sweeping visuals, the use of space - it's a masterpiece. The subsequent books get a little... interesting for me, but I made it through them. Internet person NatureBased publishes a steady stream of Instagram reels to fill in the gaps in my understanding of the lore.

One recurring part of the lore of the Dune universe is how some groups of people become so much more powerful fighters than others. There are two groups in particular: the Sardaukar, who are the Emperor Shaddam's personal army, and the Fedaykin, who are Fremen fighters. Frank Herbert, the author of Dune, writes that both factions of fighters are stronger because of the environment from which they are forged. The Sardaukar are raised on the planet Salusa Secundus, which has a harsh environment that trains them to reject weakness. The Fremen live on the planet Dune, which is an even harsher environment, leading the fighters to be even more powerful and fanatical.

This idea about being "forged by harsh environments" has amusing parallels to my experience being a software developer and operator. I've learned to be a better operator by being in an intense operational environment. This in turn has made me a better software developer. It's taught me to avoid making the same mistake twice, but it's also taught me a certain operational paranoia that makes me question everything in terms of its operational safety. So I'll give this idea - becoming a better operator and a better developer through a harsh ops experience - a name: DuneOps.

What are the different operational models?

I've seen and participated in different operational models, and I've talked with people from engineers to CTOs about their pros and cons and their applicability to a given company or environment. There's no consensus around which is universally "the best" - different situations call for different approaches.

To start with some bookkeeping, I'll give my personal definition of SRE, DevOps, Frontline, and Platform Engineering. These definitions aren't going to be nuanced and they'll be imprecise, but at least they'll serve as ragebait for engagement on social media. Consider these to be caricatures of each, so if I draw the forehead of your operational model to be comically large, just know it's all in good fun.

  • On one extreme, you have Frontline Ops. This is just having a separate ops team of people whose job it is to do pure ops. Think carrying the pager and taking all of the alarms. They unblock CI/CD pipelines, patch hosts, you name it. The dev team throws everything over the fence to the frontline team.
  • On the opposite end, you have Platform Engineering. This is a central team who writes tools and systems that everyone can use, but they don't own any aspect of the operations of others' systems. They might build CI/CD tools or best practice scanners or golden path abstractions that hopefully solve real problems that people have day to day.
  • Somewhere in between is Site Reliability Engineering (SRE). I've heard some interpretations where these teams do take the first pass at every alarm. But they build their way out of pain, are experts in best practices and how to set up everything from alarms to pipelines, and they have strict contracts with the dev teams so that if the systems become unwieldy, the dev team agrees to pause everything and burn down the ops backlog.
  • And then there's DevOps. Here the dev team is on the hook for everything, and they all carry the first-responder pager. If their pipeline is flakey, they own cleaning it up. If their alarms are either too sensitive or not sensitive enough, they're the ones who need to figure out the right way to deal with it. If their ops are too manual, they need to automate it.

Of course you don't have to pick one model. Platform Engineering supplements all of these approaches. Some shared systems might require specific ownership that lends itself to SRE, like a Kubernetes cluster, database service, web server, or API gateway.

DevOps

I've spent the vast majority of my career on teams that do DevOps - where developers do all of their own ops. As developers, we can build our way out of any problem. We're all at least a little bit prima donnas, so when ops gets too painful, we bring pitchforks to our backlog prioritization meetings and fix it or we walk. We can see the pager pain of our team compared to other teams, so that transparency is a helpful incentive even for managers. Fewer people will transfer to a team that has excessive ops, and more people will leave such teams.

I swear by the DevOps model, and I think in the age of agentic development tools, developers will become more generalist and wear more hats, not fewer. I'm part developer, working on code. I'm part people manager - especially the more senior I get - because plenty of getting things done means involving and coaching the right people and getting everyone marching in the same (or compatible) direction. I'm part customer support, because by hearing straight from customers I can tell where the product I'm working on is bad or what it's missing. I'm part sales and marketing, so I can make sure the product sounds good and has an easy and intuitive getting-started experience and features that are attractive to buyers. I'm part product manager, since I tend to work on products where I'm also the customer. And of course I'm part operator, since I can architect and build the most amazing-sounding system in the world, but if I miss some important detail that matters a ton in practice, then the system will fail and nobody will care about how it works in theory. Agentic coding gives us both more time and more need to wear these multiple hats. So I think where we're headed is more of this DevOps model, not more specialization.

At least in the environments I've been in, DevOps has led to the best long-term quality of the services we build, and the best operational outcomes for customers. When operational pain creeps up, the team naturally trades off other work to resolve the pain. Since the team is in touch directly with customers, they hear about things that are bad about the operational or customer experience, even if the metrics don't measure or represent the pain well. And most importantly, ops teaches developers what actually matters. You can have the most amazing theoretical architecture in the world, but if your database replication can't catch up during a load spike with a node down, then your customers are going to see latency or outages.

But of course there are multiple sides to a coin (typically two). DevOps' downside is you never have enough time. The ops backlog is infinite. There's always some alarm that you can tune to "cry wolf" less often, or another alarm to be made more sensitive to detect problems earlier. In DevOps, when you build things to automate away the pain, you build it for yourself. It's the Ayn Rand model of operations. Actually I never read any of her stuff so I could be totally off base. What I mean is that teams are responsible for solving all of their problems themselves. They're on the hook for everything, from security to availability to customer service. Hopefully someone helps provide tools (like a platform engineering team), but each team is left cobbling the tools together and deals with the consequences of the gaps in the tooling.

Within DevOps teams, sometimes an altruistic engineer comes along and makes their bespoke automation tool reusable for other teams, but that requires vision from the engineer, and an understanding manager who gives them room to force multiply instead of over-rotating on that one team's goals. I've been that altruistic engineer and have spent a ton of time cheerleading the others around me, providing covering fire when needed. Sometimes these people find themselves rotating into a central team to spend all of their time on their side project with the goal of helping everyone in the company. And since AWS' business is all about making ops easier for customers, sometimes these people rotate onto a product team who aims to solve that problem for everyone.

DevOps' other downside is that you can miss things. Yes, even an experienced DevOps engineer like myself has somewhat recently shipped a pipeline without an auto-rollback alarm - the kind that kicks off a rollback of a bad deployment before anyone even gets paged - so the oncall got paged and had to trigger the rollback themselves. It wasn't the end of the world, but it left me facepalming, wondering how I missed something so obvious.

Platform Engineering

To complement the DevOps model, you need a great platform engineering team. These teams get ahead of the problems faced by all DevOps teams and build abstractions that make everyone's lives easier. They also try to help teams see what things they might be missing. To figure out what to build, they can perform "tool harvesting": go around and look at all of the bespoke tools that people have made, look for patterns, and make the general version. To make it so DevOps teams don't miss things, platform engineering teams also make "scanners" or "auditors" or "recommenders". For example, if a pipeline is missing an auto-rollback alarm, the pipeline system will reach out and let you know you're missing it. For certain types of misconfigurations, the pipeline can even refuse to deploy until you fix it.
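To make the scanner idea concrete, here's a minimal sketch of what such a check might look like. The pipeline config shape, the rule names, and the functions are all invented for illustration - not any real pipeline system's API:

```python
# Hypothetical pipeline scanner. The config keys and messages are made up
# to illustrate the pattern; a real system would pull these from its own model.

REQUIRED_KEYS = {
    "auto_rollback_alarm": "pipeline has no alarm wired to auto-rollback",
    "bake_time_minutes": "pipeline deploys without a bake period",
}

def scan_pipeline(config: dict) -> list[str]:
    """Return human-readable findings for anything the pipeline is missing."""
    findings = []
    for key, message in REQUIRED_KEYS.items():
        if not config.get(key):
            findings.append(f"{config['name']}: {message}")
    return findings

def can_deploy(config: dict, blocking: bool = False) -> bool:
    """In 'blocking' mode, refuse to deploy until all findings are fixed."""
    return not (blocking and scan_pipeline(config))

# A pipeline with a bake period but no auto-rollback alarm gets flagged.
pipeline = {"name": "checkout-service", "bake_time_minutes": 60}
findings = scan_pipeline(pipeline)
```

The `blocking` flag captures the escalation path described above: a recommender starts out just reaching out with findings, and for the worst misconfigurations it graduates to refusing the deployment.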

I've also worked on a Platform Engineering team, building a web service framework called Coral. This framework handles request and response serialization, validation, protocols, authentication, rate limiting, SDK generation, and whatever else we can make more convenient for service owners and consumers. I joined it because I was fed up with service teams providing REST APIs and leaving actually calling the service from a typed language like Java as an exercise for the reader. When I heard we were making a framework that would let service owners have their cake and let consumers eat it too by auto-generating those clients, I said, "sign me up".

This web service framework team was interesting. In some ways, we had zero oncall burden. After all, we were making a framework, not running a service. On the other hand, our ops burden was the highest of any team I'd been on. Why? Because we fielded every question from every Amazon developer who got stuck and suspected the framework. Sure, we had rough edges in the framework, and gaps in our documentation, but any time someone saw a stack trace with our framework name somewhere in it (every stack trace), there was a chance that they would do the low-effort thing of just asking us to help. We'd be as patient and helpful as humanly possible in helping people figure out their own problems in their own code, but we also had plenty of sharp edges that we had to smooth out and documentation to improve.

In a sense, we were a DevOps team operating this framework. When ops spiked, we'd find the common patterns in the questions we got, and would deprioritize new feature development until we got it under control. Every customer question we answered without that answer being available and discoverable in our internal documentation was a defect, and we worked to drive that down to zero. We also spent time on general education on how to help oneself and "how to help us help you", frequently citing the brilliant manifesto "How to Ask Questions The Smart Way" by Eric Steven Raymond and Rick Moen. (Eric and Rick, THANK YOU.)

The biggest downside is that you lack a first-hand signal about whether you're solving the real problems for customers. Each year I spent on the platform engineering team, I grew less confident that I was solving the biggest sources of customer pain feature-wise. One solution is to rotate in and out of a team like this. When peers were super passionate about an idea or abstraction to solve a problem broadly, I encouraged them to rotate into our platform engineering team for a while (long enough to see it through and iterate on it). And when they're done, some stay but others rotate back out into the fray. I know plenty of folks who have stayed on a platform engineering team working on problems that take a really, really long time to solve and adapt to changing times, and they're doing great at it. So it isn't necessary to rotate out, but it's very necessary to rotate in.

The second downside to platform engineering is scaling ops. At the end of the day, if it costs someone nothing to ask a question and costs a great deal to answer that question, that's an asymmetric scaling problem that turns into a DDoS. And scaling up support to absorb those questions means losing the signal about what's bad in your stuff, reducing overall quality.

Site Reliability Engineering (SRE)

While I haven't been on a "true" SRE team before, I've been on a team that seemed a lot like one. This was my first job out of college at Amazon, on a team called Website Hosting. We ran the web server environment for the fleets that rendered HTML for amazon.com and its regional variants. Unlike the rest of Amazon, we weren't exactly DevOps. We didn't run the code we wrote. I didn't write the website rendering framework. I didn't write the Perl/Mason code that decided what to actually show on the page. I ran the servers. So I needed to make sure we forecasted how many we'd need, estimating traffic and the efficiency of each page render. I ran the CI/CD pipeline that everyone fence-threw their releases onto. And I dealt with single-server failures and crashes, and OS upgrades and performance tuning. After all, someone had to do this stuff. It couldn't be done by a shared oncall rotation of hundreds of developers who knew nothing about the server environment.

The upsides were great in this model. Our only job was to automate the operations of this fleet, so we weren't distracted by any features to build. We were connected directly to the pain of what mattered, so we were the perfect product managers to decide what to build. And while we ran server ops for a countable number of fleets, we weren't on the hook for everything in the company. But we owned just enough fleets that everything we built needed to be configurable and extensible - making it an easy extension to make our tools usable by other teams too. Essentially we could be a platform engineering team that was slowed down by operating a system unrelated to the tools we built, but it sharpened our focus into the pragmatic. In fact, one of the alarm aggregation systems we built on this team got adopted across all of Amazon, and eventually became (essentially) the CloudWatch composite alarms feature to roll up many alarms into one page. We realized we needed this feature when every web server paged us individually, resulting in a loop of some 500 back-to-back pages, to which the oncall responded by ripping the battery out of the pager and hurling it against the wall. (In retrospect either of those actions would have likely succeeded, but why not both?)
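The core of that alarm-rollup idea fits in a few lines. Here's a toy sketch - the class and names are invented for illustration, not the actual CloudWatch composite alarms API - showing how many child alarms can collapse into a single page:

```python
# Toy alarm rollup: many per-server alarms feed one composite alarm that
# pages once on the transition from "all OK" to "something firing",
# instead of paging once per server.

class CompositeAlarm:
    def __init__(self, name: str):
        self.name = name
        self.firing_children: set[str] = set()
        self.pages_sent = 0

    def child_state_change(self, child: str, in_alarm: bool) -> None:
        was_firing = bool(self.firing_children)
        if in_alarm:
            self.firing_children.add(child)
        else:
            self.firing_children.discard(child)
        # Page only when the composite goes from quiet to firing.
        if not was_firing and self.firing_children:
            self.pages_sent += 1

# 500 web servers going into alarm back-to-back produce one page, not 500.
fleet = CompositeAlarm("webserver-fleet-health")
for host in range(500):
    fleet.child_state_change(f"web-{host:03d}", in_alarm=True)
```

The oncall still sees which children are firing when they investigate; the rollup only changes how many times the pager goes off.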

There were certainly downsides to this model. We were hired as software developers, but we were doing the shit work writing and running scripts while our peers did the glamorous work building features and frameworks. We spent our days writing Greasemonkey browser scripts for provisioning servers, Perl scripts for killing runaway processes, and VBScript to automate Excel-driven forecasting. And of course someone had to run those scripts and processes, so that was us too. Don't get me wrong, I love solving real practical problems of any shape or form, but we were doing this while we watched our peers build distributed systems, protocols, and frameworks in C++. Of course we thought big and worked on the long-term solutions, like a replacement for the Excel sheet involving some cool managed R environment with a time series database. But it wasn't glamorous.

We also got paged all the time. Whenever there was an "order drop" (the best metric for "is there a problem with the website right now" is "have people stopped buying things"), our team was on the default engagement list because sometimes the problem was with our web servers or our load balancers in front of them. Even if it wasn't our fault, we could help because we were really good at reading the tea leaves (the logs and metrics and traces, even though they weren't called that yet) to figure out which microservice in the ball of wax architecture was the culprit. Engineers tended not to stay super-long on this team before they rotated out to other teams. But it was a hell of a Dune-like environment for us Fedaykin Fremen fighters to grow stronger in ops.

Frontline Ops

The only model that I can't say I've personally participated in is frontline ops. I suppose being a bank teller in high school was essentially this, since we were doing manual transactions for people. My attempt at reducing ops was to teach customers how to get ATM cards. But we were actually attached to sales to try to bring in more money to the bank, so that part of the job was not my jam.

But I've experienced frontline ops plenty and have studied it somewhat. Often frontline ops ends up in the unenviable position of customer support. One day I tried to order a replacement physical pager so I could go on call, but the reordering system was broken. Long story short, I found that there was a reordering edge case that some people hit, and the resolution to the problem was to email Brenda at a 3p pager vendor directly. The support team for this kind of corporate system had gotten over a hundred of these tickets in a couple of years, but none of the awareness of the problem had bubbled up to the team who owned pager reordering. Once I pointed out the problem, the team who owned paging fixed its integration with the 3p vendor quite quickly. Everyone involved was helpful and owned their piece well; it was the structure that removed the signal from the team who could stop the pain. Sure, frontline ops teams produce reports and data and ticket trends so central teams can know what to fix. This is even described in the SRE books. But I find it's too easy for those best-intentions mechanisms to break down, or for nuance to be lost in translation.

The place I've seen frontline ops used the most effectively is when scaling and automation is not yet possible, from certain physical or network infrastructure ops, to customer support. Every time I shadow customer support I get hit with a wave of learning about everything that sucks. But it's energizing. I feed off of that with optimism that we can fix things. I just need to go back to the well often so that I fill up on the next problem to carry the banner for. And every time I talk with physical or network infrastructure folks, I run into people who both fearlessly save the day and who put in place the most careful operational processes I have seen.

Frontline ops engineers are forged from the harshest environment of anyone, creating the strongest operators. I've worked with many support engineers who have converted to software developer roles or solution architects, and they bring a level of customer obsession and operational excellence that nobody else on the team has.

Keeping the best while avoiding the worst

DuneOps is all about understanding the environment that you're forged from, so you can take the parts that naturally improve you, and compensate for the parts of the environment that make some skills atrophy. Here are some things to consider to compensate for each operational model:

  • DevOps works best when you incentivize organizational altruism, making tools that you actively share with other teams to make their lives better too. Cross-team altruism needs to be part of the job description so that managers know that they can still put together a high performance rating or promotion case for someone who spends a lot of their time on it. No two software developers are the same; some spend time on ops automation, others focus on testing, and others focus on micro-optimization. But it's easy to have a role guideline talk about "shipping features" and forget to encourage altruism.
  • Platform Engineering works best when it is connected directly to what's real. That means actively "tool harvesting". If a team spent their precious cycles to build some automation, that means they've solved a real problem that could apply broadly, and should be amplified or brought into the fold. It also means encouraging mobility so that people who lean toward doing that force multiplying can do some kind of rotation into the platform engineering team to bring in direct understanding of what it's like to build using the platform engineering team's stuff.
  • SRE works best when you aren't simply carrying the pager for other teams' crappy code - that means writing contracts around how much pain is acceptable before service teams have to fix their shit code. And it works best when the team has license to see things through, not spend all their time writing Greasemonkey bandaids and one-off scripts. They need to be able to be a Platform Engineering team and tool provider for other teams too, or else they're stuck doing only the crap work.
  • Frontline ops works best when their extreme expertise is sought out regularly and issues are addressed urgently. These engineers have the best understanding of how things actually work and fail, and how customers are disappointed. It's easy for teams to settle in and ignore this signal, but actively watching for it is essential.

DuneOps

DuneOps can emerge from any of these operational models - DevOps or SRE or Frontline ops. In Dune, the most powerful fighters are forged by harsh environments. And so in ops, the best operators are born from harsh, real-world operations. They need to be able to make mistakes and learn, see what works and what doesn't, and find out what actually matters by being accountable for the outcomes.

No matter what operational model you go with, I find it's important that dev and ops teams:

  1. Stay grounded in reality. The closer you can put the responsibility for building central solutions to the place where the real ops is happening, the more complete the outcome.
  2. Put operational ownership with service ownership. The people whose software is broken should have their pagers go off first, not second.
  3. Incentivize organizational altruism. Build a culture where people share their solutions to improve others' ops as well as their own.
  4. Keep blame focused on structure and not people. We didn't talk about it in this post, but it's super important and is covered in this talk if you're interested.

Epilogue

I'm sure folks will disagree with parts of this. I'm happy to chat about it (see social media links) and keep updating this as I collect more perspectives. And please know and believe that I'm not trying to offend anyone with any of these takes or colorful language. All of this stuff is extremely hard, and I respect everyone who has been involved with any of it. After all, we are all brothers and sisters of the Fremen, forged out of the harsh desert of Dune.