Site Reliability Engineering (SRE)
Also known as: reliability engineering
Site Reliability Engineering (SRE) is a discipline that applies software engineering practices to operations — using error budgets, SLOs, automation, and blameless incident response to balance reliability and feature velocity.
Detailed explanation
SRE originated at Google and is built around a few core practices: defining Service Level Objectives (SLOs) for what users actually experience, using error budgets to negotiate between reliability work and feature work, automating toil out of the on-call rotation, and learning from incidents through blameless postmortems.
A mature SRE practice owns reliability as a product feature: it has metrics, owners, and tradeoffs. SREs typically split time between system improvements (capacity planning, automation, chaos engineering) and embedded work with product teams.
For organizations that cannot staff a dedicated SRE team, the principles still apply: pick a few SLOs, instrument them, and use the data to drive prioritization. SRE is a culture and a set of practices, not a job title.