Production-grade observability solution with Prometheus, Grafana, and custom exporters
This project implements a monitoring and observability stack for infrastructure and applications. Prometheus collects and stores metrics and Grafana visualizes them, giving near-real-time insight into system performance, application health, and resource utilization (with the configuration shown here, metrics are scraped every 15 seconds).
The stack includes custom exporters for specialized metrics, alerting rules that flag problems before they become outages, and pre-built dashboards for common monitoring scenarios. Every component runs in a container and is deployed with Docker Compose, so the whole stack can be reproduced on another host from the same files.
┌─────────────────────────────────────────────────────────────────┐
│                  Monitoring Stack Architecture                  │
│                                                                 │
│  ┌──────────────┐     ┌────────────┐     ┌────────────────┐     │
│  │   Grafana    │────▶│ Prometheus │◀────│   Exporters    │     │
│  │ (Dashboards) │     │ (Metrics)  │     │  - Node        │     │
│  │  Port 3000   │     │ Port 9090  │     │  - Blackbox    │     │
│  └──────┬───────┘     └─────┬──────┘     │  - Custom      │     │
│         │                   │            └────────────────┘     │
│         │                   │                                   │
│         │                   ▼                                   │
│         │           ┌────────────────┐                          │
│         │           │  Alertmanager  │                          │
│         │           │    (Alerts)    │                          │
│         │           │   Port 9093    │                          │
│         │           └───────┬────────┘                          │
│         │                   │                                   │
│         │                   ▼                                   │
│         │           ┌────────────────┐                          │
│         └──────────▶│  Notification  │                          │
│                     │    Channels    │                          │
│                     │ (Slack, Email) │                          │
│                     └────────────────┘                          │
│                                                                 │
│             ▲                             ▲                     │
│             │                             │                     │
│      Scrape Metrics                Scrape Targets               │
│             │                             │                     │
│  ┌──────────┴─────────┐        ┌──────────┴─────────┐           │
│  │    Applications    │        │   Infrastructure   │           │
│  │  - Web Servers     │        │  - Servers         │           │
│  │  - Databases       │        │  - Containers      │           │
│  │  - APIs            │        │  - Networks        │           │
│  └────────────────────┘        └────────────────────┘           │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
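The diagram lists custom exporters next to the Node and Blackbox exporters. As a rough illustration of what such an exporter does, here is a minimal sketch that serves gauges in the Prometheus text exposition format using only the Python standard library. The metric names (`app_queue_depth`, `app_cache_entries`), their values, and port 9200 are hypothetical and chosen only for this example; a real exporter would more likely be built on the official `prometheus_client` package.

```python
"""Minimal custom exporter sketch using only the Python standard library."""
from http.server import BaseHTTPRequestHandler, HTTPServer


def collect_metrics():
    """Return (name, help text, value) tuples. Values are hard-coded stand-ins."""
    return [
        ("app_queue_depth", "Jobs waiting in the (hypothetical) work queue.", 7),
        ("app_cache_entries", "Entries in the (hypothetical) cache.", 128),
    ]


def render_metrics(metrics):
    """Render gauges in the Prometheus text exposition format."""
    lines = []
    for name, help_text, value in metrics:
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} gauge")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serve the rendered metrics on /metrics and 404 on every other path."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(collect_metrics()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    # Port 9200 is an arbitrary choice for this sketch.
    HTTPServer(("0.0.0.0", 9200), MetricsHandler).serve_forever()
```

Prometheus would pick this up through an extra `static_configs` target pointing at the exporter's host and port.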
Prometheus is the time-series database and monitoring system that collects and stores metrics.
# prometheus.yml configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    environment: 'prod'
# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  # Node Exporter - System metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+)(:[0-9]+)?'
        replacement: '${1}'
  # Application metrics
  - job_name: 'app-metrics'
    static_configs:
      - targets:
          - 'app-1:8080'
          - 'app-2:8080'
    metrics_path: '/metrics'
  # Blackbox exporter - Endpoint monitoring
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com
    relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter:9115
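The `relabel_configs` entries above are easy to misread, so the sketch below simulates the `replace` action in Python. It is a simplified model for illustration, not Prometheus code: it covers only the default `replace` action, joins source labels with `;`, and anchors the regex at both ends as Prometheus does. The helper names (`relabel_replace`, `node_exporter_labels`, `blackbox_labels`) are made up for this example.

```python
import re


def relabel_replace(labels, source_labels, target_label, regex="(.*)", replacement="$1"):
    """Apply one `replace` relabel step and return a new label dict."""
    value = ";".join(labels.get(name, "") for name in source_labels)
    match = re.fullmatch(regex, value)  # the regex must match the whole value
    if match is None:
        return dict(labels)  # no match: labels are left unchanged
    # Convert $1 / ${1} references into Python's \g<1> form.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    result = dict(labels)
    result[target_label] = match.expand(template)
    return result


def node_exporter_labels(address):
    """Mirror the node-exporter job: strip the port to form the instance label."""
    return relabel_replace(
        {"__address__": address}, ["__address__"], "instance",
        regex="([^:]+)(:[0-9]+)?", replacement="${1}",
    )


def blackbox_labels(target_url):
    """Mirror the blackbox job: probe URL becomes a parameter, exporter becomes the address."""
    labels = {"__address__": target_url}
    labels = relabel_replace(labels, ["__address__"], "__param_target")
    labels = relabel_replace(labels, ["__param_target"], "instance")
    return relabel_replace(labels, [], "__address__", replacement="blackbox-exporter:9115")


if __name__ == "__main__":
    print(node_exporter_labels("node-exporter:9100"))
    print(blackbox_labels("https://example.com"))
```

For the blackbox job, the net effect is that Prometheus scrapes `blackbox-exporter:9115/probe` with `module=http_2xx` and `target=https://example.com`, while the resulting series keep the probed URL as their `instance` label.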
Grafana is the visualization and analytics platform used to build interactive dashboards.
# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"
# grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1
providers:
  - name: 'Default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
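The file provider above loads dashboard JSON files from the configured path. The following sketch generates one such file from a list of panel titles and PromQL expressions. It uses only a small subset of Grafana's dashboard JSON model, and the field selection, the `uid` value, and the output directory are assumptions for illustration; exporting a dashboard from the Grafana UI shows the authoritative structure for a given Grafana version.

```python
import json
from pathlib import Path


def build_dashboard(title, uid, panels):
    """Build a minimal dashboard dict; `panels` is a list of (title, PromQL) pairs."""
    return {
        "uid": uid,
        "title": title,
        "refresh": "15s",
        "time": {"from": "now-1h", "to": "now"},
        "panels": [
            {
                "id": index + 1,
                "type": "timeseries",
                "title": panel_title,
                # Two panels per row on Grafana's 24-column grid.
                "gridPos": {"h": 8, "w": 12, "x": 12 * (index % 2), "y": 8 * (index // 2)},
                # No datasource is set, so the default (Prometheus) is used.
                "targets": [{"refId": "A", "expr": expr}],
            }
            for index, (panel_title, expr) in enumerate(panels)
        ],
    }


def write_dashboard(dashboard, directory):
    """Write the dashboard as <uid>.json inside `directory` and return the path."""
    path = Path(directory) / f"{dashboard['uid']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(dashboard, indent=2), encoding="utf-8")
    return path


if __name__ == "__main__":
    dashboard = build_dashboard(
        "Infrastructure Overview (example)",
        "infra-overview-example",  # hypothetical uid
        [
            ("Targets up", "up"),
            ("Memory usage %",
             "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes)"
             " / node_memory_MemTotal_bytes * 100"),
        ],
    )
    print(write_dashboard(dashboard, "grafana/dashboards"))
```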
Alertmanager receives alerts from Prometheus and routes them to notification channels.
# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'
receivers:
  - name: 'default'
    slack_configs:
      - channel: '#monitoring'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
  - name: 'critical-alerts'
    slack_configs:
      - channel: '#incidents'
        title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.gmail.com:587'
  - name: 'warning-alerts'
    slack_configs:
      - channel: '#monitoring'
        title: '⚠️ Warning: {{ .GroupLabels.alertname }}'
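To make the routing tree above concrete, the sketch below models how an alert's labels select a receiver: child routes are checked in order, the first one whose matchers all equal the alert's labels wins, and the top-level receiver is the fallback. This is a simplified model that ignores `continue`, nested routes, and regex matchers; the function names are invented for this example.

```python
# Child routes from alertmanager.yml, in the order they are declared.
ROUTES = [
    ({"severity": "critical"}, "critical-alerts"),
    ({"severity": "warning"}, "warning-alerts"),
]
DEFAULT_RECEIVER = "default"
GROUP_BY = ("alertname", "cluster", "service")


def pick_receiver(alert_labels, routes=ROUTES, default=DEFAULT_RECEIVER):
    """Return the receiver of the first child route whose matchers all match."""
    for matchers, receiver in routes:
        if all(alert_labels.get(key) == value for key, value in matchers.items()):
            return receiver
    return default


def group_key(alert_labels, group_by=GROUP_BY):
    """Alerts that share this key are batched into one notification."""
    return tuple(alert_labels.get(name, "") for name in group_by)


if __name__ == "__main__":
    alert = {"alertname": "ServiceDown", "severity": "critical", "cluster": "production"}
    print(pick_receiver(alert))  # critical-alerts
    print(group_key(alert))      # ('ServiceDown', 'production', '')
```

An alert with no `severity` label, or with a value other than `critical` or `warning`, falls through to the `default` receiver.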
# docker-compose.yml
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring
    restart: unless-stopped
  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://localhost:3000
    ports:
      - "3000:3000"
    networks:
      - monitoring
    restart: unless-stopped
    depends_on:
      - prometheus
  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager:/etc/alertmanager
      - alertmanager-data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - monitoring
    restart: unless-stopped
  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    ports:
      - "9100:9100"
    networks:
      - monitoring
    restart: unless-stopped
  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    volumes:
      - ./blackbox:/etc/blackbox
    command:
      - '--config.file=/etc/blackbox/blackbox.yml'
    ports:
      - "9115:9115"
    networks:
      - monitoring
    restart: unless-stopped
volumes:
  prometheus-data:
  grafana-data:
  alertmanager-data:
networks:
  monitoring:
    driver: bridge
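Once the containers are up, each service can be probed over HTTP from the host through the published ports. The sketch below is one possible smoke test, assuming the port mappings in the Compose file and the standard health endpoints (`/-/healthy` for Prometheus and Alertmanager, `/api/health` for Grafana); the two exporters are checked through their own `/metrics` pages. The function names and the output format are illustrative only.

```python
import urllib.error
import urllib.request

# Host-side URLs, based on the published ports in docker-compose.yml.
HEALTH_ENDPOINTS = {
    "prometheus": "http://localhost:9090/-/healthy",
    "alertmanager": "http://localhost:9093/-/healthy",
    "grafana": "http://localhost:3000/api/health",
    "node-exporter": "http://localhost:9100/metrics",
    "blackbox-exporter": "http://localhost:9115/metrics",
}


def http_status(url, timeout=3.0):
    """Return the HTTP status code for `url`, or None if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # the server answered, but with an error status
    except (urllib.error.URLError, OSError):
        return None  # connection refused, DNS failure, timeout, ...


def check_stack(endpoints=HEALTH_ENDPOINTS, fetch=http_status):
    """Map each service name to True when its endpoint answers with HTTP 200."""
    return {name: fetch(url) == 200 for name, url in endpoints.items()}


if __name__ == "__main__":
    for name, healthy in check_stack().items():
        print(f"{name:18} {'ok' if healthy else 'UNREACHABLE'}")
```

Passing `fetch` as a parameter keeps the check testable without a running stack.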
# alerts/rules.yml (Prometheus only loads this file if it is listed under rule_files in prometheus.yml)
groups:
  - name: infrastructure
    interval: 30s
    rules:
      # rate() is used instead of irate() because irate() only looks at the last two samples, which makes alerts flap
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"
      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current value: {{ $value }}%)"
      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 15% (current value: {{ $value }}%)"
      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes"
      - alert: HighResponseTime
        expr: http_request_duration_seconds{quantile="0.99"} > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "99th percentile response time is above 1s (current: {{ $value }}s)"
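The arithmetic behind the first three alert expressions can be checked by hand. The sketch below reproduces it in plain Python with made-up sample values; it models only the math, not PromQL evaluation or the `for:` hold period. The CPU function takes per-core idle counters at the start and end of a window, which corresponds to averaging a per-core rate over all cores of an instance.

```python
def cpu_usage_percent(idle_seconds_start, idle_seconds_end, window_seconds):
    """CPU usage from per-core idle counters sampled at both ends of a window."""
    per_core_idle_rate = [
        (end - start) / window_seconds  # idle seconds per second, between 0 and 1
        for start, end in zip(idle_seconds_start, idle_seconds_end)
    ]
    avg_idle = sum(per_core_idle_rate) / len(per_core_idle_rate)
    return 100 - avg_idle * 100


def memory_usage_percent(total_bytes, available_bytes):
    """Share of memory in use, as in the HighMemoryUsage expression."""
    return (total_bytes - available_bytes) / total_bytes * 100


def disk_free_percent(avail_bytes, size_bytes):
    """Share of the filesystem still available, as in the DiskSpaceLow expression."""
    return avail_bytes / size_bytes * 100


if __name__ == "__main__":
    # Two cores that were idle for 30 s and 60 s of a 300 s window.
    cpu = cpu_usage_percent([0.0, 0.0], [30.0, 60.0], 300)
    mem = memory_usage_percent(16 * 2**30, 2 * 2**30)
    disk = disk_free_percent(10 * 2**30, 100 * 2**30)
    print(f"cpu {cpu:.1f}% (alert if > 80)")
    print(f"memory {mem:.1f}% (alert if > 85)")
    print(f"disk free {disk:.1f}% (alert if < 15)")
```

With these sample values all three thresholds are crossed, so each alert would become pending and then fire once the condition has held for its `for:` duration.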
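The pre-built dashboards cover four areas:
- Infrastructure: CPU, memory, disk, and network metrics across all hosts
- Applications: request rates, latency, error rates, and throughput
- Kubernetes: pod status, resource usage, and cluster health
- Databases: query performance, connections, and replication lag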
# Clone repository
git clone https://github.com/yourusername/monitoring-stack.git
cd monitoring-stack
# Start the monitoring stack
docker-compose up -d
# Verify all services are running
docker-compose ps
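# Access Grafana (open is the macOS launcher; on Linux use xdg-open or paste the URL into a browser)
open http://localhost:3000
# Default credentials: admin / admin (set via GF_SECURITY_ADMIN_PASSWORD; change it before exposing Grafana)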
# Access Prometheus
open http://localhost:9090
# Access Alertmanager
open http://localhost:9093
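Beyond opening the web UIs, the Prometheus HTTP API can confirm that every scrape target is reachable. The sketch below builds an instant-query URL for the `up` metric and lists the targets that report 0. It assumes Prometheus is reachable at `http://localhost:9090` as in the commands above; the helper names are illustrative.

```python
import json
import urllib.parse
import urllib.request


def instant_query_url(base_url, promql):
    """Build the /api/v1/query URL for an instant PromQL query."""
    query = urllib.parse.urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{query}"


def down_targets(response_body):
    """From the JSON body of an instant query for `up`, list (job, instance) pairs at 0."""
    payload = json.loads(response_body)
    if payload.get("status") != "success":
        raise ValueError(f"query failed: {payload.get('error', 'unknown error')}")
    down = []
    for series in payload["data"]["result"]:
        _timestamp, value = series["value"]  # value arrives as a string
        if float(value) == 0:
            labels = series["metric"]
            down.append((labels.get("job", ""), labels.get("instance", "")))
    return down


if __name__ == "__main__":
    url = instant_query_url("http://localhost:9090", "up")
    with urllib.request.urlopen(url, timeout=5) as response:
        print(down_targets(response.read()) or "all targets up")
```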
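# Backup Prometheus data
docker run --rm -v monitoring-stack_prometheus-data:/data -v "$(pwd)":/backup \
  alpine tar czf /backup/prometheus-backup.tar.gz -C /data .
# Backup Grafana data (grafana-cli has no export subcommand, so archive the grafana-data volume instead)
docker run --rm -v monitoring-stack_grafana-data:/data -v "$(pwd)":/backup \
  alpine tar czf /backup/grafana-backup.tar.gz -C /data .
# Restore Prometheus data (stop Prometheus first so it does not write during the restore)
docker-compose stop prometheus
docker run --rm -v monitoring-stack_prometheus-data:/data -v "$(pwd)":/backup \
  alpine tar xzf /backup/prometheus-backup.tar.gz -C /data
# Restart services
docker-compose restart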
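A backup is only useful if the archive can be read back. As a small sanity check, the sketch below opens a `.tar.gz` archive with Python's `tarfile` module and reports how many regular files it holds and their combined size. The function name is illustrative, and the check confirms only that the archive is readable, not that Prometheus can load the data.

```python
import sys
import tarfile


def summarize_backup(archive_path):
    """Return (file_count, total_bytes) for a .tar.gz archive; raises if unreadable."""
    file_count = 0
    total_bytes = 0
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive:
            if member.isfile():  # skip directories and links
                file_count += 1
                total_bytes += member.size
    return file_count, total_bytes


if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "prometheus-backup.tar.gz"
    files, size = summarize_backup(path)
    print(f"{path}: {files} files, {size} bytes")
```

An empty or truncated archive shows up as a count of zero or as a `tarfile` read error, either of which is a signal to rerun the backup.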
This monitoring stack provides comprehensive observability into infrastructure and application performance. By centralizing metrics collection, visualization, and alerting, teams gain the insights needed to maintain system reliability and quickly respond to incidents.
The containerized deployment approach ensures the monitoring solution can be easily replicated across environments, while the flexible configuration allows for customization to meet specific organizational needs.