Observability

Also known as: o11y, system observability

Observability is the ability to understand a system’s internal state by examining its external outputs — metrics, logs, traces, and (increasingly) profiles — without needing to ship new code to investigate a problem.

Detailed explanation

Observability extends traditional monitoring by emphasizing high-cardinality, high-dimensional telemetry that lets operators ask new questions about a system after it is running, rather than only the questions anticipated when dashboards and alerts were built. The three traditional pillars are metrics (numerical time series), logs (discrete, timestamped event records), and traces (causal chains of operations across services); continuous profiles are increasingly treated as a fourth signal.
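
To make "asking new questions after the fact" concrete, here is a minimal sketch, assuming each request is recorded as one wide event with many fields. The event fields, the sample data, and the helper `error_rate_by` are all hypothetical and exist only for illustration. The point is that the same stored events can be sliced by any dimension, including ones nobody planned a dashboard for.

```python
from collections import defaultdict

# Hypothetical wide events: one record per request, with many dimensions attached.
EVENTS = [
    {"route": "/checkout", "status": 500, "duration_ms": 910, "customer_id": "c-17", "build": "b2"},
    {"route": "/checkout", "status": 200, "duration_ms": 120, "customer_id": "c-17", "build": "a1"},
    {"route": "/search", "status": 200, "duration_ms": 45, "customer_id": "c-42", "build": "a1"},
    {"route": "/checkout", "status": 500, "duration_ms": 870, "customer_id": "c-42", "build": "b2"},
]


def error_rate_by(events, field):
    """Group events by an arbitrary field and return the share of 5xx responses per group."""
    totals = defaultdict(int)
    errors = defaultdict(int)
    for event in events:
        key = event.get(field, "<missing>")
        totals[key] += 1
        if event["status"] >= 500:
            errors[key] += 1
    return {key: errors[key] / totals[key] for key in totals}


if __name__ == "__main__":
    # The same data answers three different questions without new instrumentation.
    for field in ("route", "build", "customer_id"):
        print(field, error_rate_by(EVENTS, field))
```

In this made-up data, grouping by `route` or `customer_id` is inconclusive, while grouping by `build` isolates every failure to one build. A pre-aggregated metric keyed only by route could not have answered that question.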

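Many modern observability stacks standardize on OpenTelemetry for instrumentation and telemetry export, with backends and platforms such as Prometheus, Grafana, Datadog, New Relic, Honeycomb, or Splunk for storage, visualization, and analysis. SLOs (Service Level Objectives) and error budgets (the amount of unreliability an SLO permits over a given window) turn observability data into prioritization signals: when the budget is nearly spent, reliability work takes precedence over new features.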
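
The error-budget arithmetic can be sketched in a few lines. This is an illustrative calculation, assuming a request-based SLO; the function name `error_budget_report` and the example numbers are hypothetical and not taken from any particular tool.

```python
def error_budget_report(slo_target, total_requests, failed_requests):
    """Compute how much of a request-based error budget has been used.

    slo_target is the fraction of requests that must succeed, e.g. 0.999.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")

    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = (1.0 - slo_target) * total_requests
    budget_consumed = failed_requests / allowed_failures
    return {
        "allowed_failures": allowed_failures,
        "budget_consumed": budget_consumed,
        "budget_remaining": 1.0 - budget_consumed,
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 250 failures, 99.9% target.
    print(error_budget_report(0.999, 1_000_000, 250))
```

With a 99.9% target over one million requests, roughly 1,000 failures are tolerated, so 250 failures consume about a quarter of the budget. A negative `budget_remaining` means the SLO has been breached for the window.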

For AI systems, observability extends to model and prompt versions, retrieval traces, tool calls, token costs, and eval outcomes — often called LLM observability. The goal is the same: when something is wrong, you can find out why without redeploying.
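
As a rough sketch of what one LLM observability record might contain, the dataclass below captures the fields listed above for a single model call. The class `LLMCallRecord`, its field names, the sample values, and the per-1,000-token prices are all assumptions made for illustration; real tools define their own schemas and real providers set their own pricing.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class LLMCallRecord:
    """One hypothetical telemetry record for a single model call."""

    model_version: str
    prompt_version: str
    retrieved_doc_ids: List[str] = field(default_factory=list)
    tool_calls: List[str] = field(default_factory=list)
    input_tokens: int = 0
    output_tokens: int = 0
    eval_passed: Optional[bool] = None  # None means no eval was run

    def cost_usd(self, input_price_per_1k, output_price_per_1k):
        """Estimate cost from token counts and caller-supplied prices per 1,000 tokens."""
        return (
            self.input_tokens / 1000 * input_price_per_1k
            + self.output_tokens / 1000 * output_price_per_1k
        )

    def to_json(self):
        """Serialize the record so it can be shipped as a structured log line."""
        return json.dumps(asdict(self), sort_keys=True)


if __name__ == "__main__":
    record = LLMCallRecord(
        model_version="model-2024-06",   # placeholder identifier
        prompt_version="support-v3",     # placeholder identifier
        retrieved_doc_ids=["kb-101", "kb-207"],
        tool_calls=["lookup_order"],
        input_tokens=2000,
        output_tokens=500,
        eval_passed=True,
    )
    print(record.to_json())
    print(record.cost_usd(input_price_per_1k=0.5, output_price_per_1k=1.5))  # made-up prices
```

Recording the prompt and model versions alongside retrieval and tool-call details is what lets a regression be traced to a specific change (a new prompt, a different retrieved document) from stored data alone.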
