## What
A monitoring stack modeled after what we run at Datanet, but built from scratch so I understand every piece. Prometheus scrapes metrics from exporters on each VM plus SNMP data from network devices. Grafana visualizes everything. Alertmanager fires alerts to a Telegram bot.
Components:
| Tool | Role |
|---|---|
| Prometheus | Metrics collection & storage |
| Node Exporter | Per-VM CPU, RAM, disk, network |
| SNMP Exporter | Network device metrics (pfSense, switch) |
| Grafana | Dashboards |
| Alertmanager | Alert routing |
| Telegram Bot | Alert delivery |
## Why
At work I use commercial SNMP polling tools (LibreNMS, PRTG). Building a Prometheus stack from scratch forces me to understand the underlying mechanics — scrape intervals, cardinality, PromQL. It also means I can monitor my own infrastructure the way we monitor the ISP backbone.
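To make "understanding PromQL" concrete: Node Exporter publishes CPU time as cumulative counters, so utilisation has to be derived with `rate()`. A typical query (the metric and labels are standard Node Exporter; the 5-minute window is illustrative):

```promql
# Per-VM CPU utilisation (%): 100 minus the idle fraction over 5 minutes
100 - avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100
```

The window must span at least two scrape intervals for `rate()` to return data, which is exactly the kind of mechanic the commercial tools hide.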
## How
All components run as Docker Compose services on a dedicated Debian VM (2 vCPU, 4 GB RAM, VLAN 20).
```yaml
# docker-compose.yml (excerpt)
services:
  prometheus:
    image: prom/prometheus:v2.50.1
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:10.3.1
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASS}
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"

  alertmanager:
    image: prom/alertmanager:v0.26.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  prometheus_data:
  grafana_data:
```
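The exporters run in the same Compose file. A sketch of those two services, with the versions from the Tech Used list (the exact patch versions, flags, and mounts are assumptions, not verbatim from my setup):

```yaml
  # Translates Prometheus HTTP scrapes into SNMP walks against network gear
  snmp-exporter:
    image: prom/snmp-exporter:v0.24.1   # patch version assumed
    volumes:
      - ./snmp.yml:/etc/snmp_exporter/snmp.yml

  # Host metrics for the monitoring VM itself; other VMs run it natively
  node-exporter:
    image: prom/node-exporter:v1.7.0    # patch version assumed
    network_mode: host
    pid: host
    volumes:
      - /:/host:ro,rslave
    command:
      - --path.rootfs=/host
```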
### SNMP Exporter for pfSense
pfSense exposes standard MIBs. The SNMP exporter uses the `if_mib` module to scrape interface counters:
```yaml
# prometheus.yml scrape config
- job_name: 'snmp_pfsense'
  static_configs:
    - targets: ['10.10.99.1']
  metrics_path: /snmp
  params:
    module: [if_mib]
  relabel_configs:
    # Pass the device address to the exporter as the ?target= parameter
    - source_labels: [__address__]
      target_label: __param_target
    # Then scrape the exporter itself, not the device
    - target_label: __address__
      replacement: snmp-exporter:9116
```
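For comparison, the per-VM Node Exporter job needs no relabeling at all; it is a plain static scrape (the hostnames here are placeholders):

```yaml
- job_name: 'node'
  static_configs:
    - targets: ['vm1.lan:9100', 'vm2.lan:9100']
```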
## Key Alerts
- VM down (node exporter unreachable for 2 min)
- CPU > 90% for 5 min
- Disk > 85% used
- Interface error rate > 1% over 5 min
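The list above maps to Prometheus alerting rules. A sketch of the rules file (the `job` matcher, filesystem filters, and SNMP packet-counter names are assumptions about my label scheme, not copied from the live config):

```yaml
groups:
  - name: homelab
    rules:
      - alert: VMDown
        expr: up{job="node"} == 0        # scrape failing = exporter unreachable
        for: 2m
      - alert: HighCPU
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 5m
      - alert: DiskFilling
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
                     / node_filesystem_size_bytes) > 85
      - alert: IfErrorRate
        expr: rate(ifInErrors[5m]) / rate(ifInUcastPkts[5m]) > 0.01
        for: 5m
```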
## Challenges
- SNMP MIB cardinality — pulling all OIDs from the HP switch generated 50k+ time series. Filtered down to interface counters and CPU/memory OIDs only.
- Grafana provisioning — dashboards built by hand in the UI don’t survive a container rebuild. Solved by using Grafana’s provisioning YAML to load dashboard JSON from a mounted volume.
- Telegram rate limiting — Alertmanager was flooding the bot during a cascading alert event. Added `group_wait: 30s` and `repeat_interval: 4h` to throttle notifications.
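The throttling settings from that last bullet live in the Alertmanager route. A sketch of the relevant part of alertmanager.yml, using Alertmanager's built-in `telegram_configs` receiver (the grouping labels are assumptions; token and chat ID are placeholders):

```yaml
route:
  receiver: telegram
  group_by: [alertname, instance]
  group_wait: 30s        # batch alerts that arrive close together
  repeat_interval: 4h    # re-notify for a still-firing alert at most every 4h
receivers:
  - name: telegram
    telegram_configs:
      - bot_token: <redacted>   # placeholder
        chat_id: 123456789      # placeholder
```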
## Tech Used
- Prometheus 2.50 — metrics database
- Grafana 10.3 — visualization
- Alertmanager 0.26 — alert routing
- Node Exporter 1.7 — host metrics
- SNMP Exporter 0.24 — network device metrics
- Docker Compose — orchestration
- Telegram Bot API — alert delivery