## What
A monitoring stack modeled after what we run at Datanet, but built from scratch so I understand every piece. Prometheus scrapes metrics from exporters on each VM plus SNMP data from network devices. Grafana visualizes everything. Alertmanager fires alerts to a Telegram bot.
Components:
| Tool | Role |
|---|---|
| Prometheus | Metrics collection & storage |
| Node Exporter | Per-VM CPU, RAM, disk, network |
| SNMP Exporter | Network device metrics (pfSense, switch) |
| Grafana | Dashboards |
| Alertmanager | Alert routing |
| Telegram Bot | Alert delivery |
## Why
At work I use commercial SNMP polling tools (LibreNMS, PRTG). Building a Prometheus stack from scratch forces me to understand the underlying mechanics — scrape intervals, cardinality, PromQL. It also means I can monitor my own infrastructure the way we monitor the ISP backbone.
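To make "understanding PromQL" concrete: Node Exporter publishes CPU time as cumulative counters, so utilisation has to be derived with `rate()`. A typical query (the metric and labels are standard Node Exporter; the 5-minute window is illustrative):

```promql
# Per-VM CPU utilisation (%): 100 minus the idle fraction over 5 minutes
100 - avg by (instance) (
  rate(node_cpu_seconds_total{mode="idle"}[5m])
) * 100
```

The window must span at least two scrape intervals for `rate()` to return data, which is exactly the kind of mechanic the commercial tools hide.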
## How
All components run as Docker Compose services on a dedicated Debian VM (2 vCPU, 4 GB RAM, VLAN 20).
```yaml
# docker-compose.yml (excerpt)
services:
  prometheus:
    image: prom/prometheus:v2.50.1
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus_data:/prometheus
    ports:
      - "9090:9090"

  grafana:
    image: grafana/grafana:10.3.1
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=${GRAFANA_PASS}
    volumes:
      - grafana_data:/var/lib/grafana
    ports:
      - "3000:3000"

  alertmanager:
    image: prom/alertmanager:v0.26.0
    volumes:
      - ./alertmanager.yml:/etc/alertmanager/alertmanager.yml

volumes:
  prometheus_data:
  grafana_data:
```
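The exporters run in the same Compose file. A sketch of those two services, with the versions from the Tech Used list (the exact patch versions, flags, and mounts are assumptions, not verbatim from my setup):

```yaml
  # Translates Prometheus HTTP scrapes into SNMP walks against network gear
  snmp-exporter:
    image: prom/snmp-exporter:v0.24.1   # patch version assumed
    volumes:
      - ./snmp.yml:/etc/snmp_exporter/snmp.yml

  # Host metrics for the monitoring VM itself; other VMs run it natively
  node-exporter:
    image: prom/node-exporter:v1.7.0    # patch version assumed
    network_mode: host
    pid: host
    volumes:
      - /:/host:ro,rslave
    command:
      - --path.rootfs=/host
```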
### SNMP Exporter for pfSense
pfSense exposes standard MIBs. The SNMP exporter uses the `if_mib` module to scrape interface counters:
```yaml
# prometheus.yml scrape config
- job_name: 'snmp_pfsense'
  static_configs:
    - targets: ['10.10.99.1']
  metrics_path: /snmp
  params:
    module: [if_mib]
  relabel_configs:
    # Pass the device address to the exporter as the ?target= parameter
    - source_labels: [__address__]
      target_label: __param_target
    # Then scrape the exporter itself, not the device
    - target_label: __address__
      replacement: snmp-exporter:9116
```
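For comparison, the per-VM Node Exporter job needs no relabeling at all; it is a plain static scrape (the hostnames here are placeholders):

```yaml
- job_name: 'node'
  static_configs:
    - targets: ['vm1.lan:9100', 'vm2.lan:9100']
```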
## Key Alerts
- VM down (node exporter unreachable for 2 min)
- CPU > 90% for 5 min
- Disk > 85% used
- Interface error rate > 1% over 5 min
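The list above maps to Prometheus alerting rules. A sketch of the rules file (the `job` matcher, filesystem filters, and SNMP packet-counter names are assumptions about my label scheme, not copied from the live config):

```yaml
groups:
  - name: homelab
    rules:
      - alert: VMDown
        expr: up{job="node"} == 0        # scrape failing = exporter unreachable
        for: 2m
      - alert: HighCPU
        expr: 100 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100 > 90
        for: 5m
      - alert: DiskFilling
        expr: |
          100 * (1 - node_filesystem_avail_bytes{fstype!~"tmpfs|overlay"}
                     / node_filesystem_size_bytes) > 85
      - alert: IfErrorRate
        expr: rate(ifInErrors[5m]) / rate(ifInUcastPkts[5m]) > 0.01
        for: 5m
```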
## Challenges
- SNMP MIB cardinality — pulling all OIDs from the HP switch generated 50k+ time series. Filtered down to interface counters and CPU/memory OIDs only.
- Grafana provisioning — dashboards built by hand in the UI don’t survive a container rebuild. Solved by using Grafana’s provisioning YAML to load dashboard JSON from a mounted volume.
- Telegram rate limiting — Alertmanager was flooding the bot during a cascading alert event. Added `group_wait: 30s` and `repeat_interval: 4h` to throttle notifications.
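The throttling settings from that last bullet live in the Alertmanager route. A sketch of the relevant part of alertmanager.yml, using Alertmanager's built-in `telegram_configs` receiver (the grouping labels are assumptions; token and chat ID are placeholders):

```yaml
route:
  receiver: telegram
  group_by: [alertname, instance]
  group_wait: 30s        # batch alerts that arrive close together
  repeat_interval: 4h    # re-notify for a still-firing alert at most every 4h
receivers:
  - name: telegram
    telegram_configs:
      - bot_token: <redacted>   # placeholder
        chat_id: 123456789      # placeholder
```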
## Tech Used
- Prometheus 2.50 — metrics database
- Grafana 10.3 — visualization
- Alertmanager 0.26 — alert routing
- Node Exporter 1.7 — host metrics
- SNMP Exporter 0.24 — network device metrics
- Docker Compose — orchestration
- Telegram Bot API — alert delivery