
Infrastructure Monitoring Stack

Production-grade observability solution with Prometheus, Grafana, and custom exporters

📊 Prometheus + Grafana
🔔 Alertmanager
🐳 Containerized Stack

Overview

This project implements a comprehensive monitoring and observability stack for infrastructure and applications. Built on Prometheus for metrics collection and Grafana for visualization, the solution provides real-time insights into system performance, application health, and resource utilization.

The stack includes custom exporters for specialized metrics, alerting rules for proactive incident response, and pre-built dashboards for common monitoring scenarios. Every service runs as a container defined in a single Docker Compose file, so the whole stack can be reproduced on another host with one command.
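The custom exporters themselves are not shown on this page. As a minimal sketch of what one could look like, the following uses only the Python standard library to serve a single gauge in the Prometheus text exposition format. The metric name `app_queue_depth`, the port 9200, and the `read_queue_depth` helper are hypothetical placeholders, not part of the project.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def read_queue_depth() -> int:
    """Hypothetical data source; replace with a real measurement."""
    return 7


def render_metrics(queue_depth: int) -> str:
    """Render one gauge in the Prometheus text exposition format."""
    return (
        "# HELP app_queue_depth Number of jobs waiting in the queue.\n"
        "# TYPE app_queue_depth gauge\n"
        f"app_queue_depth {queue_depth}\n"
    )


class MetricsHandler(BaseHTTPRequestHandler):
    """Answer GET /metrics with the current metric values."""

    def do_GET(self) -> None:
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(read_queue_depth()).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 9200) -> None:
    """Serve /metrics until interrupted (the port is an assumed value)."""
    HTTPServer(("0.0.0.0", port), MetricsHandler).serve_forever()
```

Calling `serve()` starts the exporter. Prometheus could then scrape it with a job shaped like the `app-metrics` job in the Prometheus configuration.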

Architecture

┌──────────────────────────────────────────────────────────────────┐
│                   Monitoring Stack Architecture                  │
│                                                                  │
│  ┌──────────────┐    ┌────────────┐     ┌────────────────┐       │
│  │   Grafana    │───▶│ Prometheus │◀────│   Exporters    │       │
│  │ (Dashboards) │    │  (Metrics) │     │  - Node        │       │
│  │  Port 3000   │    │  Port 9090 │     │  - Blackbox    │       │
│  └──────┬───────┘    └──────┬─────┘     │  - Custom      │       │
│         │                   │           └────────────────┘       │
│         │                   │                                    │
│         │                   ▼                                    │
│         │           ┌────────────────┐                           │
│         │           │  Alertmanager  │                           │
│         │           │   (Alerts)     │                           │
│         │           │   Port 9093    │                           │
│         │           └───────┬────────┘                           │
│         │                   │                                    │
│         │                   ▼                                    │
│         │           ┌────────────────┐                           │
│         └──────────▶│  Notification  │                           │
│                     │   Channels     │                           │
│                     │ (Slack, Email) │                           │
│                     └────────────────┘                           │
│                                                                  │
│                ▲                                 ▲               │
│                │                                 │               │
│          Scrape Metrics                    Scrape Targets        │
│                │                                 │               │
│     ┌──────────┴─────────┐            ┌──────────┴─────────┐     │
│     │   Applications     │            │   Infrastructure   │     │
│     │  - Web Servers     │            │  - Servers         │     │
│     │  - Databases       │            │  - Containers      │     │
│     │  - APIs            │            │  - Networks        │     │
│     └────────────────────┘            └────────────────────┘     │
│                                                                  │
└──────────────────────────────────────────────────────────────────┘

Stack Components

1. Prometheus

Time-series database and monitoring system for collecting and storing metrics.

# prometheus.yml configuration
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    environment: 'prod'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']

# Alert rules. Without rule_files, Prometheus loads no alerting rules.
# Assumption: the rules file is stored at ./prometheus/alerts/rules.yml,
# which the Compose file mounts under /etc/prometheus.
rule_files:
  - 'alerts/*.yml'

# Scrape configurations
scrape_configs:
  # Prometheus self-monitoring
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node Exporter - System metrics
  - job_name: 'node-exporter'
    static_configs:
      - targets: ['node-exporter:9100']
    relabel_configs:
      # Use the host name without the port as the instance label
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+)(:[0-9]+)?'
        replacement: '${1}'

  # Application metrics
  - job_name: 'app-metrics'
    static_configs:
      - targets:
          - 'app-1:8080'
          - 'app-2:8080'
    metrics_path: '/metrics'

  # Blackbox exporter - Endpoint monitoring
  - job_name: 'blackbox'
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://example.com
          - https://api.example.com
    relabel_configs:
      # Pass the listed URL to the exporter as the ?target= parameter
      - source_labels: [__address__]
        target_label: __param_target
      # Keep the probed URL as the instance label
      - source_labels: [__param_target]
        target_label: instance
      # Send the scrape itself to the blackbox exporter
      - target_label: __address__
        replacement: blackbox-exporter:9115
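The two sets of relabel rules in this configuration are easy to misread, so here is a small sketch of their effect in plain Python. It does not call Prometheus; it only mimics the rewriting, assuming (as Prometheus does) that relabel regexes are anchored at both ends. The function names are illustrative.

```python
import re
from urllib.parse import urlencode


def strip_port(address: str) -> str:
    """Mimic the node-exporter rule: keep the host part, drop the port.

    fullmatch is used because Prometheus anchors relabel regexes.
    """
    match = re.fullmatch(r"([^:]+)(:[0-9]+)?", address)
    # If the regex does not match, the rule is skipped and the instance
    # label later defaults to the full address.
    return match.group(1) if match else address


def blackbox_probe_url(target: str, module: str = "http_2xx",
                       exporter: str = "blackbox-exporter:9115") -> str:
    """Build the URL that is actually scraped for one blackbox target."""
    query = urlencode({"module": module, "target": target})
    return f"http://{exporter}/probe?{query}"


print(strip_port("node-exporter:9100"))
print(blackbox_probe_url("https://example.com"))
```

The first call shows that the `instance` label becomes `node-exporter` rather than `node-exporter:9100`. The second shows that the listed URL is never scraped directly: it travels as a query parameter to the blackbox exporter, which performs the probe.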

2. Grafana

Visualization and analytics platform for creating interactive dashboards.

# grafana/provisioning/datasources/prometheus.yml
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus:9090
    isDefault: true
    editable: true
    jsonData:
      timeInterval: "15s"

# grafana/provisioning/dashboards/dashboard.yml
apiVersion: 1

providers:
  - name: 'Default'
    orgId: 1
    folder: ''
    type: file
    disableDeletion: false
    updateIntervalSeconds: 10
    allowUiUpdates: true
    options:
      path: /var/lib/grafana/dashboards
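The `timeInterval: "15s"` setting matches the Prometheus scrape interval, so panels do not request points more finely spaced than the data that exists. The sketch below illustrates the idea with the Prometheus `query_range` HTTP endpoint. It is a simplification under stated assumptions: Grafana's own step calculation also rounds to convenient intervals, and the `max_data_points` default here is an arbitrary illustrative value.

```python
import math
from urllib.parse import urlencode


def query_step(range_seconds: int, max_data_points: int, min_interval: int = 15) -> int:
    """Pick a step of range / points, but never finer than the scrape interval."""
    return max(math.ceil(range_seconds / max_data_points), min_interval)


def query_range_url(expr: str, start: int, end: int, max_data_points: int = 1000,
                    base_url: str = "http://prometheus:9090") -> str:
    """Build a range query URL for the Prometheus HTTP API."""
    step = query_step(end - start, max_data_points)
    params = urlencode({"query": expr, "start": start, "end": end, "step": step})
    return f"{base_url}/api/v1/query_range?{params}"


print(query_range_url("up", 0, 3600))
```

For a one-hour window the raw step would be under 4 seconds, so the 15-second floor applies. For a 24-hour window the computed step is larger than 15 seconds and the floor has no effect.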

3. Alertmanager

Handles alerts from Prometheus and routes them to notification channels.

# alertmanager.yml
global:
  resolve_timeout: 5m
  slack_api_url: 'https://hooks.slack.com/services/YOUR/WEBHOOK/URL'

route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 12h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    slack_configs:
      - channel: '#monitoring'
        title: 'Alert: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'

  - name: 'critical-alerts'
    slack_configs:
      - channel: '#incidents'
        title: '🚨 CRITICAL: {{ .GroupLabels.alertname }}'
        text: '{{ range .Alerts }}{{ .Annotations.description }}{{ end }}'
    email_configs:
      - to: '[email protected]'
        from: '[email protected]'
        smarthost: 'smtp.gmail.com:587'

  - name: 'warning-alerts'
    slack_configs:
      - channel: '#monitoring'
        title: '⚠️ Warning: {{ .GroupLabels.alertname }}'
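The routing tree in this file can be read as "first matching child route wins, otherwise fall back to the top-level receiver". The following sketch models that behavior and the `group_by` batching in plain Python. It is an illustration of the configuration shown here, not Alertmanager's implementation, and it assumes no route sets `continue: true`.

```python
from typing import Dict, List, Tuple

# Child routes from alertmanager.yml as (required labels, receiver), in order.
ROUTES: List[Tuple[Dict[str, str], str]] = [
    ({"severity": "critical"}, "critical-alerts"),
    ({"severity": "warning"}, "warning-alerts"),
]
DEFAULT_RECEIVER = "default"
GROUP_BY = ("alertname", "cluster", "service")


def pick_receiver(labels: Dict[str, str]) -> str:
    """Return the receiver of the first child route whose labels all match."""
    for required, receiver in ROUTES:
        if all(labels.get(name) == value for name, value in required.items()):
            return receiver
    return DEFAULT_RECEIVER


def group_key(labels: Dict[str, str]) -> Tuple[str, ...]:
    """Alerts that share these label values are sent as one notification."""
    return tuple(labels.get(name, "") for name in GROUP_BY)


alert = {"alertname": "ServiceDown", "cluster": "production", "severity": "critical"}
print(pick_receiver(alert), group_key(alert))
```

An alert without a `severity` label, or with a value other than `critical` or `warning`, lands on the `default` receiver and is posted to `#monitoring`.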

Docker Compose Deployment

version: '3.8'

services:
  prometheus:
    image: prom/prometheus:latest
    container_name: prometheus
    volumes:
      - ./prometheus:/etc/prometheus
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'
      - '--web.enable-lifecycle'
    ports:
      - "9090:9090"
    networks:
      - monitoring
    restart: unless-stopped

  grafana:
    image: grafana/grafana:latest
    container_name: grafana
    volumes:
      - grafana-data:/var/lib/grafana
      - ./grafana/provisioning:/etc/grafana/provisioning
      - ./grafana/dashboards:/var/lib/grafana/dashboards
    environment:
      - GF_SECURITY_ADMIN_PASSWORD=admin
      - GF_USERS_ALLOW_SIGN_UP=false
      - GF_SERVER_ROOT_URL=http://localhost:3000
    ports:
      - "3000:3000"
    networks:
      - monitoring
    restart: unless-stopped
    depends_on:
      - prometheus

  alertmanager:
    image: prom/alertmanager:latest
    container_name: alertmanager
    volumes:
      - ./alertmanager:/etc/alertmanager
      - alertmanager-data:/alertmanager
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/alertmanager'
    ports:
      - "9093:9093"
    networks:
      - monitoring
    restart: unless-stopped

  node-exporter:
    image: prom/node-exporter:latest
    container_name: node-exporter
    command:
      - '--path.procfs=/host/proc'
      - '--path.sysfs=/host/sys'
      - '--path.rootfs=/rootfs'
      - '--collector.filesystem.mount-points-exclude=^/(sys|proc|dev|host|etc)($$|/)'
    volumes:
      - /proc:/host/proc:ro
      - /sys:/host/sys:ro
      - /:/rootfs:ro
    ports:
      - "9100:9100"
    networks:
      - monitoring
    restart: unless-stopped

  blackbox-exporter:
    image: prom/blackbox-exporter:latest
    container_name: blackbox-exporter
    volumes:
      - ./blackbox:/etc/blackbox
    command:
      - '--config.file=/etc/blackbox/blackbox.yml'
    ports:
      - "9115:9115"
    networks:
      - monitoring
    restart: unless-stopped

volumes:
  prometheus-data:
  grafana-data:
  alertmanager-data:

networks:
  monitoring:
    driver: bridge

Alert Rules

# alerts/rules.yml
groups:
  - name: infrastructure
    interval: 30s
    rules:
      - alert: HighCPUUsage
        expr: 100 - (avg by(instance) (irate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected on {{ $labels.instance }}"
          description: "CPU usage is above 80% (current value: {{ $value }}%)"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is above 85% (current value: {{ $value }}%)"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 15
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk space is below 15% (current value: {{ $value }}%)"

      - alert: ServiceDown
        expr: up == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.job }} is down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 2 minutes"

      - alert: HighResponseTime
        expr: http_request_duration_seconds{quantile="0.99"} > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High response time detected"
          description: "99th percentile response time is above 1s (current: {{ $value }}s)"
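To make the arithmetic behind these rules concrete, the sketch below recomputes three of the expressions from sample values and models the `for:` clause. It is a plain-Python illustration rather than PromQL. The input numbers are made up, and the `for:` model assumes one evaluation per 30-second group interval, with an alert moving from pending to firing once the condition has held for the full duration.

```python
from typing import Sequence


def cpu_usage_percent(idle_fraction: float) -> float:
    """HighCPUUsage: 100 minus the average idle share (0..1) per instance."""
    return 100 - idle_fraction * 100


def memory_usage_percent(total_bytes: int, available_bytes: int) -> float:
    """HighMemoryUsage: used memory as a percentage of total memory."""
    return (total_bytes - available_bytes) / total_bytes * 100


def disk_free_percent(avail_bytes: int, size_bytes: int) -> float:
    """DiskSpaceLow: available space as a percentage of filesystem size."""
    return avail_bytes / size_bytes * 100


def is_firing(results: Sequence[bool], interval_s: int, for_s: int) -> bool:
    """Model `for:`: the condition must hold continuously for for_s seconds.

    results holds one boolean per rule evaluation, oldest first.
    """
    streak = 0
    for held in reversed(results):
        if not held:
            break
        streak += 1
    # The first true evaluation starts the pending timer at zero seconds.
    return streak >= 1 and (streak - 1) * interval_s >= for_s


gib = 1024 ** 3
print(cpu_usage_percent(0.25) > 80)               # idle 25% means 75% busy
print(memory_usage_percent(16 * gib, 2 * gib) > 85)
print(disk_free_percent(1 * gib, 8 * gib) < 15)
print(is_firing([True] * 11, interval_s=30, for_s=300))
```

With a 30-second interval and `for: 5m`, a single spike never fires. The condition has to be true on eleven consecutive evaluations, which is the point of the clause: it filters out short bursts.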

Custom Grafana Dashboards

System Overview Dashboard

CPU, Memory, Disk, and Network metrics across all hosts

Application Performance

Request rates, latency, error rates, and throughput

Kubernetes Monitoring

Pod status, resource usage, cluster health

Database Metrics

Query performance, connections, replication lag

Deployment and Management

Initial Setup

# Clone repository
git clone https://github.com/yourusername/monitoring-stack.git
cd monitoring-stack

# Start the monitoring stack
docker-compose up -d

# Verify all services are running
docker-compose ps

# Access Grafana
open http://localhost:3000
# Default credentials: admin / admin

# Access Prometheus
open http://localhost:9090

# Access Alertmanager
open http://localhost:9093
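`docker-compose ps` only shows that the containers are running, not that the services inside them answer requests. A small readiness check along the following lines could complement it. This is a sketch that assumes the ports published in the Compose file and the standard health endpoints of each tool. The `check_stack` helper and its injectable `fetch` argument are illustrative, not part of the project.

```python
from typing import Callable, Dict
from urllib.request import urlopen

# Health endpoints for the ports published in the Compose file.
HEALTH_URLS: Dict[str, str] = {
    "prometheus": "http://localhost:9090/-/ready",
    "grafana": "http://localhost:3000/api/health",
    "alertmanager": "http://localhost:9093/-/ready",
    "node-exporter": "http://localhost:9100/metrics",
    "blackbox-exporter": "http://localhost:9115/metrics",
}


def http_status(url: str) -> int:
    """Return the HTTP status, or 0 on any error.

    urlopen raises for refused connections, timeouts, and 4xx/5xx
    responses; all of those are OSError subclasses.
    """
    try:
        with urlopen(url, timeout=3) as response:
            return response.status
    except OSError:
        return 0


def check_stack(fetch: Callable[[str], int] = http_status) -> Dict[str, bool]:
    """Map each service name to True when its endpoint answers 200."""
    return {name: fetch(url) == 200 for name, url in HEALTH_URLS.items()}
```

Calling `check_stack()` after `docker-compose up -d` returns a dictionary such as `{"prometheus": True, ...}`. Passing a custom `fetch` function makes the check testable without a running stack.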

Backup and Restore

# Backup Prometheus data
docker run --rm -v monitoring-stack_prometheus-data:/data -v "$(pwd)":/backup \
  alpine tar czf /backup/prometheus-backup.tar.gz -C /data .

# Backup Grafana data volume (dashboards, users, and settings)
docker run --rm -v monitoring-stack_grafana-data:/data -v "$(pwd)":/backup \
  alpine tar czf /backup/grafana-backup.tar.gz -C /data .

# Stop Prometheus before restoring so it does not write to the volume
docker-compose stop prometheus

# Restore Prometheus data
docker run --rm -v monitoring-stack_prometheus-data:/data -v "$(pwd)":/backup \
  alpine tar xzf /backup/prometheus-backup.tar.gz -C /data

# Restart services
docker-compose restart

Key Achievements

Technologies Used

Prometheus, Grafana, Alertmanager, Node Exporter, Blackbox Exporter, Docker, Docker Compose, PromQL

Conclusion

This monitoring stack provides comprehensive observability into infrastructure and application performance. By centralizing metrics collection, visualization, and alerting, teams gain the insights needed to maintain system reliability and quickly respond to incidents.

The containerized deployment approach ensures the monitoring solution can be easily replicated across environments, while the flexible configuration allows for customization to meet specific organizational needs.
