FinOps and Cloud Cost Management

Cloud costs are engineering decisions. Engineers need to own them.

Cloud infrastructure costs are no longer a finance problem. They are an engineering problem that requires engineering solutions. FinOps is the practice of bringing financial accountability to cloud spending without slowing teams down. What follows covers the principles, practices, and cultural change required.

Why Cloud Costs Became an Engineering Problem

On-premise infrastructure had a natural brake on spending: procurement. Buying servers required a purchase order, a delivery lead time, and physical installation. Finance saw every pound before it was spent.

Cloud removed that brake. Any engineer with console access can spin up resources. Autoscaling groups expand in response to traffic. Orphaned environments accumulate. Data transfer costs appear without anyone choosing to incur them. The bill arrives monthly and it is rarely what anyone expected.

FinOps is the practice of bringing financial discipline back into this environment - not by recreating the friction of on-premise procurement, but by building cost awareness into how engineering decisions are made. The goal is not to slow engineers down. It is to make sure that when they make architectural decisions, they understand the financial implications.

The FinOps Operating Model

The FinOps Foundation describes three phases of maturity: Inform, Optimise, and Operate. Most organisations are stuck in the first.

Inform: Making Costs Visible

Before you can manage cloud costs, you need to be able to see them - broken down in ways that are meaningful to the people who made the spending decisions.

A cloud bill that arrives as a single number tells you nothing useful. A bill broken down by team, by product, by environment, and by service type gives you a basis for conversation. This requires tagging - the practice of attaching metadata to cloud resources so costs can be attributed.

Tagging sounds simple. It is not. Engineers create resources without tags. Tags are applied inconsistently. Naming conventions drift. Shared infrastructure (networking, security services, monitoring) does not map cleanly to a single team. Getting to 85% tagging coverage across a large estate is a meaningful achievement.

The platform team is usually responsible for establishing tagging standards and enforcing them. This means: documented tag taxonomy, infrastructure-as-code templates that include required tags by default, and alerts or automated remediation when untagged resources appear.
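Automated remediation starts with being able to decide, per resource, whether it meets the standard. A minimal sketch of that check, assuming a hypothetical four-tag taxonomy (real enforcement would hook into the cloud provider's APIs or a policy engine):

```python
# Hypothetical required-tag taxonomy; a real one comes from the
# platform team's documented standard.
REQUIRED_TAGS = {"team", "product", "environment", "cost-centre"}

def missing_tags(resource_tags: dict) -> set:
    """Return the required tags a resource is missing.

    Empty or whitespace-only values count as missing, since they
    defeat cost attribution just as thoroughly as absent tags.
    """
    present = {k for k, v in resource_tags.items() if v and v.strip()}
    return REQUIRED_TAGS - present

def compliance_rate(resources: list) -> float:
    """Fraction of resources carrying every required tag."""
    if not resources:
        return 1.0
    compliant = sum(1 for tags in resources if not missing_tags(tags))
    return compliant / len(resources)
```

The same predicate can drive both sides of the enforcement loop: alerting on non-compliant resources as they appear, and reporting estate-wide coverage as a trend over time.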

Cost allocation reports - ideally surfaced weekly, not monthly - give teams the data they need. Dashboards that show each team their spend versus a budget or trend line, broken down by service type, are more effective than spreadsheets delivered after the fact.
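The core of such a report is a roll-up of billing line items by tag. A sketch, assuming line items carry a team tag and a service type (field names here are illustrative, not any provider's billing schema):

```python
from collections import defaultdict

def allocate_costs(line_items: list) -> dict:
    """Roll billing line items up into team -> service -> total cost.

    Each line item is a dict like {"team": ..., "service": ..., "cost": ...}.
    Untagged items are bucketed under 'unallocated' rather than dropped,
    so the attribution gap stays visible in every report.
    """
    totals = defaultdict(lambda: defaultdict(float))
    for item in line_items:
        team = item.get("team") or "unallocated"
        totals[team][item["service"]] += item["cost"]
    return {team: dict(services) for team, services in totals.items()}
```

Keeping the 'unallocated' bucket explicit matters: a report that silently discards untagged spend overstates every team's efficiency and hides the tagging problem it depends on.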

Optimise: Reducing Waste Without Reducing Capability

Once costs are visible, you can act on them. The optimisation opportunities in most cloud environments are significant - industry benchmarks suggest 20-35% of cloud spend is waste.

Common waste categories:

  • Idle or underutilised resources: compute instances running at low utilisation, databases with no traffic, load balancers attached to nothing.
  • Oversized resources: instances provisioned for peak load that never materialises, memory allocations set conservatively and never reviewed.
  • Orphaned resources: snapshots, volumes, and IP addresses left behind when workloads were decommissioned.
  • Development and test environments running 24/7: most do not need to be. Automated scheduling to shut them down outside working hours typically saves 60-70% of those environment costs.
  • Data transfer costs: often invisible until large, often reducible by placing services in the same region or availability zone, or by using direct connectivity rather than traversing the public internet.
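The 60-70% figure for scheduled dev/test environments is just arithmetic on hours. A sketch, assuming a working window of 10 hours a day, five days a week (50 of the week's 168 hours):

```python
def scheduled_savings(hourly_cost: float, instance_count: int,
                      on_hours_per_week: float = 50.0):
    """Estimate the monthly saving from shutting environments down
    outside working hours.

    Assumes a 50-hour working week by default (10h x 5 days) and an
    average 730-hour month. Returns (monthly_saving, saved_fraction).
    """
    hours_per_month = 730  # average hours in a month
    always_on = hourly_cost * instance_count * hours_per_month
    scheduled = always_on * (on_hours_per_week / 168.0)
    return always_on - scheduled, 1.0 - on_hours_per_week / 168.0
```

At a 50-hour week the saved fraction is about 70%, consistent with the range above; a longer on-window or weekend usage pulls it down accordingly.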

The reserved instance decision is one of the most impactful optimisation levers. On-demand pricing is convenient but expensive. Committing to one-year or three-year reservations for stable workloads can reduce those costs by 30-60%. Savings plans (AWS) and committed use discounts (GCP) offer similar mechanisms.

The trade-off is flexibility. Reserved capacity you no longer need is not easily unwound. The platform team or a dedicated FinOps function should own the reservation strategy, treating it as a portfolio management exercise - what can be committed to, for how long, and what should remain on-demand because of uncertainty.
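The underlying decision is break-even arithmetic: a reservation is billed for the full term whether the workload runs or not, so it pays off only above a certain utilisation. A sketch of that calculation (illustrative rates, not any provider's actual pricing):

```python
def break_even_utilisation(on_demand_hourly: float,
                           reserved_effective_hourly: float) -> float:
    """Fraction of the term a workload must run for the reservation
    to beat on-demand pricing. A 40% discount implies a 60% break-even."""
    return reserved_effective_hourly / on_demand_hourly

def term_saving(on_demand_hourly: float, reserved_effective_hourly: float,
                utilisation: float, term_hours: int = 8760) -> float:
    """Saving over a one-year term at a given utilisation.

    Negative when utilisation falls below break-even -- the reservation
    is paid for in full regardless of use.
    """
    on_demand_cost = on_demand_hourly * term_hours * utilisation
    reserved_cost = reserved_effective_hourly * term_hours
    return on_demand_cost - reserved_cost
```

Run across the estate, this is the portfolio exercise described above: commit the workloads whose expected utilisation sits comfortably above break-even, and leave the uncertain ones on-demand.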

Spot instances and preemptible VMs are appropriate for fault-tolerant, interruptible workloads - batch processing, CI/CD agents, ML training. They are not appropriate for anything requiring continuous availability. The cost saving (60-90% versus on-demand) is significant enough to justify building the architectural support for spot usage in appropriate contexts.

Operate: Embedding Cost as an Engineering Consideration

The mature state is one where cost is a normal part of engineering decision-making - not a separate concern managed by a FinOps team after the fact, but something engineers think about when designing systems and choosing services.

This requires culture change more than process change. Engineers need to know that cost awareness is valued and expected. They need the tools to check cost implications before committing to an architecture. And they need to see leadership acting on cost data rather than just collecting it.

Unit Economics for Cloud

Aggregate cloud spend figures are hard to act on. Unit economics - cost expressed relative to something meaningful - are actionable.

Cost per active user: total infrastructure cost divided by monthly active users. This tells you whether your cost base is growing faster or slower than your user base. If it is growing faster, you have an efficiency problem. If it is growing slower, your investment in optimisation and architecture is paying off.

Cost per transaction: useful for transactional systems. If processing a payment costs you $0.003 today and was $0.005 twelve months ago, you can demonstrate the value of engineering work in financial terms.

Cost per deployment: total infrastructure cost divided by number of deployments. Controversial because it can create perverse incentives (teams can flatter the ratio by deploying more often simply to inflate the denominator), but useful as one signal among many about deployment efficiency.
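Computed together, these ratios are a few lines of division. A sketch with illustrative numbers:

```python
def unit_economics(total_cost: float, active_users: int,
                   transactions: int, deployments: int) -> dict:
    """Express aggregate monthly cloud spend as per-unit ratios.

    These are deliberately rough -- the denominators come from product
    analytics and delivery tooling, not from the billing data itself,
    and the value is in the trend rather than the absolute figure.
    """
    return {
        "cost_per_active_user": total_cost / active_users,
        "cost_per_transaction": total_cost / transactions,
        "cost_per_deployment": total_cost / deployments,
    }
```

Tracked month over month, the interesting question is directional: is cost per user falling while users grow, or rising faster than the business it supports?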

These ratios are imprecise. The point is not actuarial accuracy - it is to create a shared language between engineering and finance that goes beyond raw spend numbers.

The Role of the Platform Team

In most organisations, the platform team (or SRE team, or infrastructure team) is the natural home for cloud cost governance. They own the infrastructure, they set the standards, and they have the technical depth to evaluate trade-offs.

This does not mean the platform team is responsible for driving down costs by themselves. It means they are responsible for:

  • Establishing and maintaining the tagging taxonomy
  • Building cost-visibility tooling and dashboards
  • Creating the standards and guardrails that prevent avoidable waste
  • Providing guidance to product teams on cost-efficient architectural choices
  • Running the reservation and savings plan portfolio
  • Flagging anomalies and unexpected cost increases

Product engineering teams are responsible for their own cost envelopes within these standards. The platform team provides the tooling and the guardrails; product teams make decisions within them.

Building Cost Awareness Without Creating Fear

The biggest risk in a FinOps programme is creating an environment where engineers are afraid to provision resources. That kills velocity and causes engineers to make poor trade-offs - under-provisioning production systems to avoid scrutiny, for example.

Cost awareness should be normalised, not weaponised. If a team's cloud bill increases by 40% in a month, the first question should be "what drove this and was it expected?" not "who is responsible and what are the consequences?"

Ideally, teams set their own cost budgets as part of planning. They receive weekly spend reports. Anomalies are surfaced as conversations, not escalations. Engineers who find and eliminate waste are recognised for it.
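Surfacing an anomaly as a conversation still requires detecting it. A deliberately simple rule, assuming weekly spend totals per team (real tooling would account for seasonality and planned changes):

```python
from statistics import mean

def flag_anomaly(weekly_spend: list, threshold: float = 0.25):
    """Flag the latest week if it exceeds the trailing average by more
    than the threshold (25% by default).

    Returns (flagged, deviation). The output is a prompt for the
    question "what drove this and was it expected?", not a verdict.
    """
    *history, latest = weekly_spend
    baseline = mean(history)
    deviation = (latest - baseline) / baseline
    return deviation > threshold, deviation
```

The threshold is a tuning choice: too tight and every autoscaling event becomes a conversation, too loose and genuine regressions ride unnoticed until the monthly bill.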

Cost optimisation should appear in the engineering backlog as legitimate work. A story to rightsize a fleet of compute instances, or to schedule development environments to shut down overnight, is engineering work with a quantifiable return. Treating it as such - with appropriate prioritisation - is how cost awareness becomes embedded rather than aspirational.