In my last article, I argued that internal platforms should be treated as products - thoughtfully designed, user-centred, and built to empower engineers to move quickly and confidently. And once you adopt this mindset, a natural question follows:
How do you build a platform that earns trust, not just usage? The answer lies in how you define and manage reliability.
It’s not enough to say “we’re up” - modern teams need a shared understanding of what “good enough” looks like, how much failure is acceptable, and how operational pain is addressed. That’s where service level indicators (SLIs), service level objectives (SLOs), error budgets, and a deliberate approach to toil come in.
These aren’t just operational metrics - they’re the foundations of dependable, developer-friendly platforms.
Start with clarity: Key terms that shape reliability
Before we can manage reliability, we need to define it in clear, measurable terms:
- SLI (Service Level Indicator): A specific metric that tells you how a system is performing from a user’s perspective. Examples might include request success rate, build time, or environment provisioning latency.
- SLO (Service Level Objective): A target or threshold for that metric, typically expressed as a percentage over a rolling window (e.g. “95% of builds complete in under 10 minutes over the last 30 days”). This is your agreed definition of “good enough”.
- SLA (Service Level Agreement): A formal, often contractual commitment to customers or stakeholders, typically accompanied by penalties if not met. Unlike SLOs, which are internal alignment tools, SLAs are external guarantees.
- Error Budget: The difference between perfect performance (100%) and your SLO target. If your SLO is 99.9%, your error budget is 0.1%. It’s a buffer for experimentation and change - it can be considered a safe space for failure.
- Toil: Manual, repetitive work that adds no lasting value, is automatable, and tends to scale linearly with system growth. It distracts engineers from higher-impact work and erodes team energy over time.
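To make the relationship between these terms concrete, here is a minimal Python sketch. The `Slo` class, the function names, and the build counts are illustrative assumptions rather than part of any particular monitoring tool; the point is only that an SLI is a measured ratio, an SLO is a target for that ratio, and the error budget is whatever the target leaves over.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical SLO: a target ratio of good events over a window."""
    name: str
    target: float  # e.g. 0.95 means 95% of events must be good


def sli(good_events: int, total_events: int) -> float:
    """SLI as the fraction of events that met the user's expectation."""
    if total_events == 0:
        return 1.0  # assumption: no traffic is treated as no failures
    return good_events / total_events


def error_budget(slo: Slo) -> float:
    """Error budget is the gap between perfection (100%) and the target."""
    return 1.0 - slo.target


def budget_remaining(slo: Slo, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative when overspent)."""
    allowed_bad = error_budget(slo) * total_events
    actual_bad = total_events - good_events
    if allowed_bad == 0:
        return 0.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


builds = Slo(name="builds complete in under 10 minutes", target=0.95)
# Made-up numbers: 1,000 builds in the window, 970 finished in time.
print(f"SLI: {sli(970, 1000):.1%}")
print(f"Budget remaining: {budget_remaining(builds, 970, 1000):.0%}")
```

With these made-up numbers the SLI sits above the target, so only part of the budget has been spent. An SLA would sit outside this code entirely: it is the contractual layer, usually set looser than the internal SLO.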
Reliability isn’t perfection - it’s predictability
Let’s step out of the tech world for a moment. Think about your daily train commute.
You take the same 7:43 service every morning. Most days, it arrives on time, and your routine flows without friction. You grab a coffee. You arrive at work just as planned.
But imagine if - without warning - that same train starts arriving unpredictably. Some days it’s early. Others, 10–15 minutes late. And no one tells you why.
Naturally, your trust in the service erodes. You start adjusting:
- Arriving earlier, “just in case”
- Cancelling plans you’d otherwise keep
- Feeling anxious, not efficient
Here, the SLI is simple: on-time arrival.
The SLO might be: “the train should arrive within five minutes of schedule, 97% of the time.”
And that 3% wiggle room? That’s your error budget - a margin for signal failures, bad weather, or the occasional unavoidable delay. More on this later.
The difference between the dependable train and the erratic one isn’t just performance. It’s confidence. And confidence is what lets users plan, trust, and build on top of your platform. This is what SLOs do - they turn performance into a promise. Not one of perfection, but of predictability.
Because users don’t need your platform to be flawless. They need to know what to expect - and that when it drifts, someone’s paying attention.
Error Budgets: Where innovation meets risk
When we talk about SLOs, we’re not just setting a target - we’re defining a tolerance for imperfection. That tolerance is known as the error budget, and it represents the small amount of failure we’re willing to accept in exchange for faster delivery, innovation, or operational flexibility.
Calculating it is simple: Error Budget = 100% – SLO target. Then apply that percentage to the total time window you’re measuring against.
Let’s look at what that means in a typical 30-day month (720 hours):
- 99.9% SLO ("three nines") gives you an error budget of 0.1%, or 43.2 minutes of allowable downtime.
- 99.99% SLO ("four nines") tightens that to just 4.32 minutes.
- 99.999% SLO ("five nines") leaves only 25.9 seconds.
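The arithmetic behind those figures can be reproduced with a few lines of Python. This is a small sketch using only the standard library; the function name `allowed_downtime` is an illustrative choice, and it assumes the SLO is measured purely as uptime over the window.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Error budget expressed as time: (100% - SLO) applied to the window."""
    budget_fraction = (100.0 - slo_percent) / 100.0
    return window * budget_fraction


window = timedelta(days=30)  # the 720-hour month used above
for slo in (99.9, 99.99, 99.999):
    seconds = allowed_downtime(slo, window).total_seconds()
    print(f"{slo}% -> {seconds / 60:.2f} minutes ({seconds:.1f} seconds)")
```

Swapping in a different window, such as `timedelta(days=7)`, shows how quickly the budget shrinks when the measurement period is shorter.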
The difference is staggering. Moving from three nines to five nines doesn’t just mean more reliability - it means 100x less room for failure. That’s a major strategic decision. Every extra nine you commit to requires more investment in automation, redundancy, observability, and incident response.
Error budgets help teams balance resilience and velocity. They give you the breathing room to release frequently, experiment safely, and build trust - without over-engineering for unreachable perfection.
Toil: The silent killer of engineering progress
Imagine you're in a small rowboat, headed for a clear destination. You have a map, a plan, and a crew. But there’s a leak - not a crisis, just a slow, steady trickle.
So every day, before you can row, you have to bail. Bucket after bucket. It’s mindless, repetitive, and thankless. At first, you talk about fixing the hole. But over time, bailing becomes normal. You optimise for it. You assign people to it. You build rituals around it.
You’ve accepted the leak - and slowed the journey. This is what toil feels like in digital engineering. Toil is the manual, repetitive work that:
- Doesn’t scale
- Doesn’t teach
- Doesn’t improve outcomes
But it does consume time, burn out good people, and quietly slow teams down. The worst part? The longer you tolerate it, the more invisible it becomes - absorbed into culture, accepted as "just the way things work."
Examples of toil include:
- Manually re-running flaky CI pipelines
- Restarting services that crash silently
- Repeatedly fixing the same environment issue
- Triaging alerts you know will self-resolve
Toil doesn’t show up on roadmaps. It’s not demoed. But it accumulates. Fix the leak, not just the symptoms. That’s how you reclaim velocity, morale, and meaningful progress.
Why reliability tools belong in every platform team’s toolbox
As platform engineers, our job is to provide reliable infrastructure, tools, and workflows that enable delivery teams to move faster, more safely, and with less friction. But reliability, like usability, is experienced - not just measured.
- Without clear SLOs, platform reliability becomes a vague aspiration, not a managed outcome.
- Without error budgets, teams can struggle to balance speed and stability.
- Without addressing toil, reliability becomes brittle - and team morale suffers.
These mechanisms bring structure to how we think about service health, how we respond to incidents, and how we prioritise work.
Applying This in Practice
Here’s how to make this real:
- Define Meaningful SLIs
Start by identifying where reliability matters most to your users. For example:
  - Time to provision a developer environment
  - Success rate of CI/CD jobs
  - Time to respond to an access request
Choose indicators that reflect what engineers feel when things go wrong.
- Agree on SLOs That Reflect Reality
SLOs shouldn’t be aspirational - they should be achievable and useful. A good SLO gives teams confidence to act and sets the right expectations with users.
For instance:
  - “95% of builds succeed on first attempt within 10 minutes”
  - “99.5% of test environments are ready within 3 minutes of request”
Start with just one or two per service. Make them visible. Review them regularly.
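As a rough illustration of checking a latency-style SLO like the test environment example, the sketch below computes the SLI from a list of provisioning times and compares it with a target. The function names and the sample durations are invented for the example; in practice the numbers would come from whatever telemetry your platform already emits.

```python
from typing import Sequence


def fraction_within(durations_s: Sequence[float], threshold_s: float) -> float:
    """SLI: share of requests that completed within the threshold."""
    if not durations_s:
        raise ValueError("need at least one sample to compute an SLI")
    within = sum(1 for d in durations_s if d <= threshold_s)
    return within / len(durations_s)


def meets_slo(durations_s: Sequence[float], threshold_s: float, target: float) -> bool:
    """True when the measured SLI is at or above the SLO target."""
    return fraction_within(durations_s, threshold_s) >= target


# Made-up provisioning times, in seconds, for eight environment requests.
samples = [42.0, 95.5, 130.0, 171.2, 88.0, 240.3, 64.9, 179.9]

observed = fraction_within(samples, threshold_s=180.0)
print(f"Ready within 3 minutes: {observed:.1%}")
print(f"Meets 99.5% SLO: {meets_slo(samples, 180.0, target=0.995)}")
```

A single slow request out of eight is enough to miss a 99.5% target, which is a useful reminder to check that a proposed SLO is achievable at your real request volumes before publishing it.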
- Use Error Budgets to Guide Decision-Making
If your error budget is intact, keep shipping. If you’re burning through it, slow down and focus on stability. This encourages healthy tension between innovation and resilience, without relying on vague instincts or subjective judgment.
Error budgets also help reset the conversation during retrospectives. Instead of asking “did we break anything?”, try asking “did we operate within our agreed risk tolerance?”
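One way to turn this into an explicit policy is to compare how much of the budget has been spent with how much of the window has elapsed. The sketch below is one possible rule of thumb, not a standard: the `Action` names, the thresholds, and the traffic figures are all assumptions chosen for illustration, and a real team would agree its own.

```python
from enum import Enum


class Action(Enum):
    KEEP_SHIPPING = "keep shipping"
    SLOW_DOWN = "slow down and review risky changes"
    FOCUS_ON_STABILITY = "pause feature work and focus on stability"


def budget_consumed(slo_target: float, bad_events: int,
                    expected_events_in_window: int) -> float:
    """Fraction of the window's error budget spent so far (1.0 = fully spent)."""
    allowed_bad = (1.0 - slo_target) * expected_events_in_window
    if allowed_bad <= 0:
        raise ValueError("target must be below 100% and expected events positive")
    return bad_events / allowed_bad


def recommend(consumed: float, window_elapsed: float) -> Action:
    """Both arguments are fractions between 0 and 1 (consumed may exceed 1)."""
    if consumed >= 1.0:
        return Action.FOCUS_ON_STABILITY  # budget exhausted
    if consumed > window_elapsed:
        return Action.SLOW_DOWN  # burning faster than time is passing
    return Action.KEEP_SHIPPING


# Illustrative: 99.9% target, 100,000 requests expected over 30 days,
# 40 failures so far, and we are on day 15.
spent = budget_consumed(0.999, bad_events=40, expected_events_in_window=100_000)
print(f"{spent:.0%} of budget spent -> {recommend(spent, 15 / 30).value}")
```

The same two numbers, budget spent and window elapsed, also give a retrospective something concrete to discuss when asking whether the team stayed within its agreed risk tolerance.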
- Track and Reduce Toil Proactively
Toil creeps in silently. A small manual step here, a repeated workaround there - over time, they steal capacity from your roadmap.
Make toil visible by asking:
  - What tasks are repeated frequently?
  - What causes unnecessary handoffs or delays?
  - What could be automated if someone had time?
Allocate time every sprint or cycle to remove or reduce toil. Treat it as you would technical debt: it’s not always urgent, but it’s always important.
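A lightweight way to decide which toil to tackle first is to estimate what each recurring task costs per month and rank by that. The sketch below assumes you can roughly count how often a task happens and how long it takes; the `ToilTask` class and the figures are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class ToilTask:
    name: str
    runs_per_month: int
    minutes_per_run: float

    @property
    def hours_per_month(self) -> float:
        """Rough monthly cost of doing this task by hand."""
        return self.runs_per_month * self.minutes_per_run / 60


def rank_by_cost(tasks: Iterable[ToilTask]) -> List[ToilTask]:
    """Most expensive toil first, so the biggest leak gets fixed first."""
    return sorted(tasks, key=lambda t: t.hours_per_month, reverse=True)


# Made-up estimates for three of the examples mentioned earlier.
tasks = [
    ToilTask("re-run flaky CI pipeline", runs_per_month=120, minutes_per_run=6),
    ToilTask("restart silently crashed service", runs_per_month=20, minutes_per_run=15),
    ToilTask("triage self-resolving alert", runs_per_month=90, minutes_per_run=4),
]

for task in rank_by_cost(tasks):
    print(f"{task.hours_per_month:5.1f} h/month  {task.name}")
```

Even coarse estimates like these make toil visible enough to justify time for it in sprint planning.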
Rethinking What “Reliable” Means
One of the biggest mindset shifts when adopting these practices is this:
Perfection is not the goal. Predictability is.
Your platform doesn’t have to be flawless. But it should behave in ways that are understandable, recoverable, and fair. When something goes wrong, users should feel confident that it will be resolved quickly, that the team is already working to prevent a recurrence, and that they will be kept informed proactively and transparently.
That’s what builds trust. And trust is the true UX of any internal platform.
In Summary
If you're serious about treating your platform as a product, reliability can’t be an afterthought - it must be designed in.
- SLIs and SLOs give teams a shared definition of “good enough”
- Error budgets create guardrails for fast, safe iteration
- Toil reduction protects energy, creativity, and long-term sustainability
These aren’t just tools - they’re how great teams balance reliability and agility, trust and speed, autonomy and alignment.