Site Reliability Engineering (SRE)

Also known as: reliability engineering

Site Reliability Engineering (SRE) is a discipline that applies software engineering practices to operations — using error budgets, SLOs, automation, and blameless incident response to balance reliability and feature velocity.

Detailed explanation

SRE originated at Google and is built around a few core practices: defining Service Level Objectives (SLOs) for what users actually experience, using error budgets to negotiate between reliability work and feature work, automating toil out of the on-call rotation, and learning from incidents through blameless postmortems.

A mature SRE practice owns reliability as a product feature: it has metrics, owners, and tradeoffs. SREs typically split time between system improvements (capacity planning, automation, chaos engineering) and embedded work with product teams.

For organizations that cannot staff a dedicated SRE team, the principles still apply: pick a few SLOs, instrument them, and use the data to drive prioritization. SRE is a culture and a set of practices, not a job title.

← Back to glossary