DevOps Monitoring
Monitoring & Observability
Observability with Prometheus, Grafana, and the ELK Stack. Metrics, logs, traces, dashboards, and alerting.
Prometheus
Time-series monitoring with pull-based metric collection, the PromQL query language, and built-in alerting.
Grafana
Open-source dashboard and visualization platform. Supports multiple data sources with rich panel types.
ELK Stack
Centralized logging: Elasticsearch stores logs, Logstash processes, Kibana visualizes. Add Filebeat for shipping.
Prometheus
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'kubernetes'
    kubernetes_sd_configs:
      - role: pod

PromQL Examples
node_cpu_seconds_total{mode="idle"} - CPU idle time in seconds (a counter)
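Because `node_cpu_seconds_total` is a monotonically increasing counter, its raw value is rarely useful on its own; queries typically wrap it in `rate()`. A few common patterns (the `http_*` metric names are assumed example instrumentation, not metrics from this document):

```promql
# Per-instance CPU usage % over the last 5 minutes
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Request rate per second (assumes an http_requests_total counter)
rate(http_requests_total[5m])

# 95th-percentile latency from a histogram (assumes http_request_duration_seconds)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```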
Grafana Dashboards
Supported Data Sources
Prometheus: time-series metrics
Elasticsearch: logs and search analytics
InfluxDB: time-series database
CloudWatch: AWS metrics
Azure Monitor: Azure metrics
Stackdriver (now Google Cloud Monitoring): Google Cloud metrics
Loki: log aggregation (Grafana-native)
Tempo: tracing backend
Panels: Time series, Bar chart, Stat, Gauge, Table, Heatmap, Logs, Traces
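Data sources can also be provisioned from files rather than the UI. A minimal sketch, assuming a default Grafana install and a Prometheus server on localhost:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml (default provisioning path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana backend proxies queries to the source
    url: http://localhost:9090     # assumed Prometheus address
    isDefault: true
```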
ELK Stack
Flow: Beats → Logstash → Elasticsearch → Kibana
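The flow above can be sketched as a minimal Logstash pipeline; hosts, port, and index name are illustrative defaults, not values from this document:

```conf
# logstash.conf - minimal Beats-to-Elasticsearch pipeline
input {
  beats { port => 5044 }           # Filebeat ships log lines here
}
filter {
  grok {                           # parse a standard Apache/Nginx access-log line
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}" # one index per day
  }
}
```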
Alerting
# alerts.yml
groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        # node_cpu_seconds_total is a counter, so compute idle % via rate()
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPU usage above 80% for 5 minutes"

Alertmanager
Handles deduplication, grouping, silencing, and routing alerts to receivers (Slack, PagerDuty, email).
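Routing and grouping are configured in alertmanager.yml. A minimal sketch routing everything to Slack; the webhook URL is a placeholder and the timings are common defaults, not values from this document:

```yaml
# alertmanager.yml - route alerts to a Slack receiver
route:
  group_by: ['alertname']
  group_wait: 30s        # wait briefly to batch related alerts
  group_interval: 5m
  repeat_interval: 4h    # re-notify if still firing
  receiver: 'slack'
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'  # placeholder webhook
        channel: '#alerts'
```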
Alert Rules
PromQL expressions evaluated at each evaluation_interval. An alert transitions from pending to firing once its expression has stayed true for the `for` duration.
Interview Questions
Q1: What is the difference between white-box and black-box monitoring?
White-box monitoring monitors internal application metrics (CPU, memory, request rate, error rate, latency) exposed by the application itself. Black-box monitoring tests the system from the outside (synthetic checks, uptime probes, API endpoint validation). White-box tells you what's happening inside; black-box tells you what users experience.
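In Prometheus, black-box checks are typically done with the blackbox_exporter: Prometheus scrapes the exporter, which probes the real target. A sketch of the standard relabeling pattern (the target URL and exporter address are assumptions):

```yaml
# Black-box HTTP probe via blackbox_exporter
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]             # expect an HTTP 2xx response
    static_configs:
      - targets: ['https://example.com']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target # pass the target as ?target=
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115  # scrape the exporter, not the target
```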
Q2: Explain the Prometheus pull model vs push-based monitoring.
Prometheus pulls metrics by scraping targets at configured intervals. Targets expose an HTTP endpoint (/metrics). The pull model is simpler for discovery, self-monitoring (alert if target is down), and centralized control. Push-based systems (Graphite, InfluxDB) require agents to send data — better for batch jobs or firewalled environments.
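To make the pull model concrete, here is a hand-rolled `/metrics` endpoint in the Prometheus text exposition format, using only the Python standard library. It is a minimal sketch; real services would use an official client library, and the metric names are illustrative:

```python
# Minimal /metrics endpoint illustrating the Prometheus pull model.
from http.server import BaseHTTPRequestHandler, HTTPServer
import time

START = time.time()
REQUEST_COUNT = 0  # a real service would increment this in its request path

def render_metrics() -> str:
    """Render metrics in the Prometheus text exposition format."""
    return (
        "# HELP app_requests_total Total requests served.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {REQUEST_COUNT}\n"
        "# HELP app_uptime_seconds Seconds since process start.\n"
        "# TYPE app_uptime_seconds gauge\n"
        f"app_uptime_seconds {time.time() - START:.0f}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Prometheus would scrape http://<host>:9100/metrics at each scrape_interval.
    HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

Prometheus only needs the target's address; the target itself stays passive, which is what makes down-target detection (the `up` metric) trivial in the pull model.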
Q3: What are the four golden signals of monitoring?
1) Latency — time to service a request. 2) Traffic — demand on the system (requests/sec). 3) Errors — rate of failed requests (explicit HTTP 500s, implicit slow responses). 4) Saturation — how full the service is (CPU, memory, queue depth). These signals, from Google SRE, provide a comprehensive view of system health.
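The four signals map naturally onto PromQL over standard instrumentation; the `http_*` metric names below are assumed examples:

```promql
# Latency: 99th-percentile request duration
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: memory in use as a fraction of total (node_exporter metrics)
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
```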
Q4: How does the ELK stack work together?
Beats ship data from sources (Filebeat for logs, Metricbeat for metrics). Logstash optionally processes and transforms data with filters (parsing, enrichment). Elasticsearch stores and indexes data for fast search and aggregation. Kibana provides visualization, dashboards, and alerting. This forms the Elastic Stack (formerly ELK) for centralized logging.