DevOps Monitoring
Monitoring & Observability
Observability with Prometheus, Grafana, and the ELK Stack. Metrics, logs, traces, dashboards, and alerting.
Prometheus
Time-series monitoring with pull-based metric collection, the PromQL query language, and built-in alerting.
Grafana
Open-source dashboard and visualization platform. Supports multiple data sources with rich panel types.
ELK Stack
Centralized logging: Elasticsearch stores logs, Logstash processes, Kibana visualizes. Add Filebeat for shipping.
Prometheus
# prometheus.yml
global:
  scrape_interval: 15s
  evaluation_interval: 15s

alerting:
  alertmanagers:
    - static_configs:
        - targets: ['localhost:9093']

rule_files:
  - "alerts.yml"

scrape_configs:
  - job_name: 'node'
    static_configs:
      - targets: ['localhost:9100']
  - job_name: 'kubernetes'
    kubernetes_sd_configs:
      - role: pod

PromQL Examples
node_cpu_seconds_total{mode="idle"} - CPU idle time in seconds (a counter)
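Because `node_cpu_seconds_total` is a monotonically increasing counter, its raw value is rarely useful on its own; queries typically wrap it in `rate()`. A few common patterns (the `http_*` metric names are assumed example instrumentation, not metrics from this document):

```promql
# Per-instance CPU usage % over the last 5 minutes
100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

# Request rate per second (assumes an http_requests_total counter)
rate(http_requests_total[5m])

# 95th-percentile latency from a histogram (assumes http_request_duration_seconds)
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
```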
Grafana Dashboards
Supported Data Sources
Prometheus: time-series metrics
Elasticsearch: logs and search analytics
InfluxDB: time-series database
CloudWatch: AWS metrics
Azure Monitor: Azure metrics
Stackdriver (now Google Cloud Monitoring): Google Cloud metrics
Loki: log aggregation (Grafana-native)
Tempo: tracing backend
Panels: Time series, Bar chart, Stat, Gauge, Table, Heatmap, Logs, Traces
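Data sources can also be provisioned from files rather than the UI. A minimal sketch, assuming a default Grafana install and a Prometheus server on localhost:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml (default provisioning path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                  # Grafana backend proxies queries to the source
    url: http://localhost:9090     # assumed Prometheus address
    isDefault: true
```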
ELK Stack
Flow: Beats → Logstash → Elasticsearch → Kibana
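The flow above can be sketched as a minimal Logstash pipeline; hosts, port, and index name are illustrative defaults, not values from this document:

```conf
# logstash.conf - minimal Beats-to-Elasticsearch pipeline
input {
  beats { port => 5044 }           # Filebeat ships log lines here
}
filter {
  grok {                           # parse a standard Apache/Nginx access-log line
    match => { "message" => "%{COMBINEDAPACHELOG}" }
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "logs-%{+YYYY.MM.dd}" # one index per day
  }
}
```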
Alerting
# alerts.yml
groups:
  - name: node_alerts
    rules:
      - alert: HighCPUUsage
        # node_cpu_seconds_total is a counter, so compute idle % via rate()
        expr: 100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CPU usage above 80% for 5 minutes"

Alertmanager
Handles deduplication, grouping, silencing, and routing alerts to receivers (Slack, PagerDuty, email).
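Routing and grouping are configured in alertmanager.yml. A minimal sketch routing everything to Slack; the webhook URL is a placeholder and the timings are common defaults, not values from this document:

```yaml
# alertmanager.yml - route alerts to a Slack receiver
route:
  group_by: ['alertname']
  group_wait: 30s        # wait briefly to batch related alerts
  group_interval: 5m
  repeat_interval: 4h    # re-notify if still firing
  receiver: 'slack'
receivers:
  - name: 'slack'
    slack_configs:
      - api_url: 'https://hooks.slack.com/services/XXX'  # placeholder webhook
        channel: '#alerts'
```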
Alert Rules
PromQL expressions evaluated at each evaluation_interval. An alert transitions from pending to firing once its expression has stayed true for the `for` duration.
Interview Questions
Q1: What is the difference between white-box and black-box monitoring?
White-box monitoring monitors internal application metrics (CPU, memory, request rate, error rate, latency) exposed by the application itself. Black-box monitoring tests the system from the outside (synthetic checks, uptime probes, API endpoint validation). White-box tells you what's happening inside; black-box tells you what users experience.
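In Prometheus, black-box checks are typically done with the blackbox_exporter: Prometheus scrapes the exporter, which probes the real target. A sketch of the standard relabeling pattern (the target URL and exporter address are assumptions):

```yaml
# Black-box HTTP probe via blackbox_exporter
scrape_configs:
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]             # expect an HTTP 2xx response
    static_configs:
      - targets: ['https://example.com']
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target # pass the target as ?target=
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: localhost:9115  # scrape the exporter, not the target
```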
Q2: Explain the Prometheus pull model vs push-based monitoring.
Prometheus pulls metrics by scraping targets at configured intervals. Targets expose an HTTP endpoint (/metrics). The pull model is simpler for discovery, self-monitoring (alert if target is down), and centralized control. Push-based systems (Graphite, InfluxDB) require agents to send data — better for batch jobs or firewalled environments.
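To make the pull model concrete, here is a hand-rolled `/metrics` endpoint in the Prometheus text exposition format, using only the Python standard library. It is a minimal sketch; real services would use an official client library, and the metric names are illustrative:

```python
# Minimal /metrics endpoint illustrating the Prometheus pull model.
from http.server import BaseHTTPRequestHandler, HTTPServer
import time

START = time.time()
REQUEST_COUNT = 0  # a real service would increment this in its request path

def render_metrics() -> str:
    """Render metrics in the Prometheus text exposition format."""
    return (
        "# HELP app_requests_total Total requests served.\n"
        "# TYPE app_requests_total counter\n"
        f"app_requests_total {REQUEST_COUNT}\n"
        "# HELP app_uptime_seconds Seconds since process start.\n"
        "# TYPE app_uptime_seconds gauge\n"
        f"app_uptime_seconds {time.time() - START:.0f}\n"
    )

class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics().encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    # Prometheus would scrape http://<host>:9100/metrics at each scrape_interval.
    HTTPServer(("", 9100), MetricsHandler).serve_forever()
```

Prometheus only needs the target's address; the target itself stays passive, which is what makes down-target detection (the `up` metric) trivial in the pull model.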
Q3: What are the four golden signals of monitoring?
1) Latency — time to service a request. 2) Traffic — demand on the system (requests/sec). 3) Errors — rate of failed requests (explicit HTTP 500s, implicit slow responses). 4) Saturation — how full the service is (CPU, memory, queue depth). These signals, from Google SRE, provide a comprehensive view of system health.
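The four signals map naturally onto PromQL over standard instrumentation; the `http_*` metric names below are assumed examples:

```promql
# Latency: 99th-percentile request duration
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Traffic: requests per second
sum(rate(http_requests_total[5m]))

# Errors: fraction of requests returning 5xx
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m]))

# Saturation: memory in use as a fraction of total (node_exporter metrics)
1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes
```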
Q4: How does the ELK stack work together?
Beats ship data from sources (Filebeat for logs, Metricbeat for metrics). Logstash optionally processes and transforms data with filters (parsing, enrichment). Elasticsearch stores and indexes data for fast search and aggregation. Kibana provides visualization, dashboards, and alerting. This forms the Elastic Stack (formerly ELK) for centralized logging.