Prometheus & Monitoring Terminology

Key terms for communicating with SREs


Core Concepts

Metric

A measured value being monitored. Examples: cpu_usage, http_requests_total.

Label

Key-value pairs attached to a metric to distinguish different dimensions:

http_requests_total{method="GET", status="200", pod="app-a"}

Time Series

One metric name + one fixed set of labels = one time series.
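The identity rule above can be sketched in a few lines. This is only an illustration, not how Prometheus stores series internally; the helper name series_key is hypothetical.

```python
def series_key(name: str, labels: dict) -> tuple:
    """Identity of a time series: metric name plus its sorted label pairs."""
    return (name, tuple(sorted(labels.items())))


a = series_key("http_requests_total", {"method": "GET", "status": "200", "pod": "app-a"})
b = series_key("http_requests_total", {"pod": "app-a", "status": "200", "method": "GET"})
c = series_key("http_requests_total", {"method": "GET", "status": "500", "pod": "app-a"})

same_series = a == b        # True: label order does not matter
different_series = a != c   # True: one changed label value is a new series
```

Sorting the label pairs makes the key independent of the order in which labels are written.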

Sample

One timestamp + one value = one sample. Each scrape of one time series produces 1 sample.

Active Time Series

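The number of time series Prometheus currently holds in its head block, that is, series that have received samples recently. When a Pod is deleted, its series are marked stale and drop out of query results within about 5 minutes, but they can keep counting as active for a couple of hours, until the head block is truncated.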

Cardinality

The number of distinct label value combinations for a metric. High cardinality means many time series, which means high memory use and cost. A cardinality explosion happens when a label takes unbounded values (e.g., user_id, request_id), so the time series count grows without limit.
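A rough way to see why one unbounded label is so expensive: the worst-case series count for a metric is the product of the number of distinct values of each label. The sketch below uses made-up counts, and the function name max_series is hypothetical; real metrics rarely hit every combination, so treat the result as an upper bound.

```python
from math import prod


def max_series(label_value_counts: dict) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_value_counts.values())


# Illustrative numbers only.
bounded = max_series({"method": 5, "status": 10, "pod": 20})
exploded = max_series({"method": 5, "status": 10, "pod": 20, "user_id": 100_000})

print(bounded)   # 1000
print(exploded)  # 100000000
```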


Data Collection

Scrape

Prometheus actively pulls data from a target's /metrics endpoint (pull model).
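To make the scrape concrete, here is a simplified sketch of reading the text a /metrics endpoint returns. The example body and the parse_metrics helper are assumptions for illustration; the regexes ignore escaped quotes inside label values and other details of the real exposition format, so use an official client library for anything serious.

```python
import re

# Simplified patterns: metric name, optional {labels}, then the value.
SAMPLE_LINE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(?P<labels>.*)\})?\s+(?P<value>\S+)'
)
LABEL_PAIR = re.compile(r'(\w+)="([^"]*)"')

EXAMPLE_BODY = """\
# HELP http_requests_total Total HTTP requests.
# TYPE http_requests_total counter
http_requests_total{method="GET",status="200"} 1027
up 1
"""


def parse_metrics(text: str) -> list:
    """Return (name, labels, value) tuples, skipping comments and blank lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        match = SAMPLE_LINE.match(line)
        if match is None:
            continue  # not a sample line this sketch understands
        labels = dict(LABEL_PAIR.findall(match.group("labels") or ""))
        samples.append((match.group("name"), labels, float(match.group("value"))))
    return samples


samples = parse_metrics(EXAMPLE_BODY)
```

Each returned tuple corresponds to one sample of one time series at scrape time.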

Scrape Interval

How often to scrape. Common values: 15s / 30s / 60s.

Scrape Target

The endpoint to scrape, typically a pod's IP:port.

Job

A named group of scrape targets in the Prometheus config.

Service Discovery

Automatically discovering scrape targets (in K8s, via kubernetes_sd_configs).

Exporter

A helper process that translates metrics from a system that does not expose Prometheus format into a /metrics endpoint Prometheus can scrape:

  • node-exporter → machine-level metrics
  • kube-state-metrics → K8s object state
  • blackbox-exporter → external endpoint probes

Storage

TSDB

Time Series Database — optimized for time-series data, built into Prometheus and stored on local disk.

Head Block

Roughly the most recent 2 hours of data, held in memory for fast queries before being compacted into a block on disk.

WAL

Write-Ahead Log — ensures data isn't lost after a crash.

Retention

How long data is kept before automatic deletion.
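Retention drives disk usage. A common back-of-the-envelope estimate multiplies retention time by ingestion rate (samples written per second) and bytes per sample. The sketch below assumes about 2 bytes per sample after compression, which is a rule of thumb rather than a guarantee; the function name and the input numbers are illustrative.

```python
SECONDS_PER_DAY = 86_400


def estimate_disk_bytes(retention_days: float,
                        samples_per_second: float,
                        bytes_per_sample: float = 2.0) -> float:
    """Rough local TSDB size: retention seconds * ingestion rate * bytes per sample."""
    return retention_days * SECONDS_PER_DAY * samples_per_second * bytes_per_sample


# Illustrative: 15 days of retention at 100,000 samples/s.
estimate = estimate_disk_bytes(15, 100_000)
print(estimate / 1e9)  # 259.2 (GB)
```

Halving retention or ingestion rate halves the estimate, which is why dropping unused series is often the cheapest fix.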

Remote Write

The mechanism Prometheus uses to push samples to an external storage backend; it underpins most centralized monitoring setups.

Ingestion

The process of writing data into the TSDB. Ingestion Rate = samples written per second.
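Since each scrape of one time series yields one sample, a steady-state ingestion rate can be approximated as active series divided by scrape interval. This sketch assumes every series is scraped at the same interval, which real setups often violate; the numbers are illustrative.

```python
def ingestion_rate(active_series: int, scrape_interval_s: float) -> float:
    """Approximate samples per second when all series share one scrape interval."""
    return active_series / scrape_interval_s


rate_15s = ingestion_rate(300_000, 15)
rate_60s = ingestion_rate(300_000, 60)

print(rate_15s)  # 20000.0
print(rate_60s)  # 5000.0
```

Moving from a 15s to a 60s interval cuts ingestion to a quarter for the same number of series.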


Querying

PromQL Common Functions

Function / clause   Purpose                                    Example
rate()              Per-second rate of increase of a counter   rate(http_requests_total[5m])
sum()               Sum                                        sum(rate(http_requests_total[5m]))
avg()               Average                                    avg(cpu_usage)
count()             Count                                      count({__name__=~".+"})
sort_desc()         Sort descending                            sort_desc(sum by (job) (...))
avg_over_time()     Average over a time range                  avg_over_time(metric[7d])
by                  Group by label (aggregation clause)        sum by (namespace) (...)
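The idea behind rate() can be sketched as "total increase over the window divided by elapsed time, treating any drop as a counter reset". This is a simplification: the real rate() also extrapolates toward the window boundaries, so its results differ slightly. The function name simple_rate and the sample data are hypothetical.

```python
def simple_rate(samples: list):
    """Per-second increase of a counter.

    samples: (timestamp_seconds, counter_value) pairs sorted by time.
    Returns None when there are fewer than two samples or no elapsed time.
    """
    if len(samples) < 2:
        return None
    elapsed = samples[-1][0] - samples[0][0]
    if elapsed <= 0:
        return None
    increase = 0.0
    previous = samples[0][1]
    for _, value in samples[1:]:
        if value < previous:
            # Counter reset: it restarted from zero, so the new value is all increase.
            increase += value
        else:
            increase += value - previous
        previous = value
    return increase / elapsed


# The counter resets between t=30 and t=45 (160 -> 30).
window = [(0, 100), (15, 130), (30, 160), (45, 30), (60, 60)]
print(simple_rate(window))  # 2.0
```

Without the reset handling, the drop from 160 to 30 would show up as a large negative rate.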

Recording Rule

A pre-computed PromQL result stored as a new time series.

Alert Rule

A PromQL expression evaluated on a schedule. When the expression keeps returning a result for the configured `for` duration, the alert fires.
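The "continuously satisfies" part is what the `for` field of a rule controls: an alert is pending while the condition holds but the duration has not yet elapsed, and it resets if the condition clears. The state machine below is a simplified sketch of that behavior, not Prometheus code; the function name and the evaluation data are hypothetical.

```python
def alert_states(condition_by_eval: list, eval_interval_s: int, for_s: int) -> list:
    """Return the alert state after each rule evaluation."""
    states = []
    active_since = None  # time at which the condition first became true
    for i, condition_true in enumerate(condition_by_eval):
        now = i * eval_interval_s
        if not condition_true:
            active_since = None  # any gap resets the timer
            states.append("inactive")
            continue
        if active_since is None:
            active_since = now
        held_for = now - active_since
        states.append("firing" if held_for >= for_s else "pending")
    return states


# Evaluated every 60s with a 120s hold: the blip at the third evaluation resets the timer.
states = alert_states([True, True, False, True, True, True, True], 60, 120)
print(states)
# ['pending', 'pending', 'inactive', 'pending', 'pending', 'firing', 'firing']
```

A longer hold duration filters out short spikes at the cost of slower notification.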


Architecture Components

  • Prometheus Server: the core; scrapes, stores, queries, and evaluates alert rules
  • Alertmanager: receives alerts and handles deduplication, grouping, routing, and notifications
  • Grafana: visualization tool that connects to Prometheus for dashboards
  • VictoriaMetrics: high-performance TSDB alternative; its query language, MetricsQL, is largely backwards-compatible with PromQL, and the project reports better compression and memory efficiency
    • vmagent / vminsert / vmselect / vmstorage
  • Thanos: extension layer on top of Prometheus for cross-cluster queries and long-term storage in object storage such as S3
    • Sidecar / Querier / Store Gateway / Compactor

Common Abbreviations

Abbreviation   Full Name
TSDB           Time Series Database
WAL            Write-Ahead Log
HA             High Availability
QSP            Query Samples Processed
IRSA           IAM Roles for Service Accounts
OTel           OpenTelemetry
CRD            Custom Resource Definition
AMP            Amazon Managed Service for Prometheus
AMG            Amazon Managed Grafana