Prometheus High Availability: Federation and Remote Storage in Practice
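Preface: why do so many ops engineers get burned by Prometheus high availability?
Your monitoring system suddenly goes down, and your boss @-mentions everyone in the group chat asking "What happened to the system?" In that moment, have you ever wondered: if Prometheus itself is a single point of failure, what do we use to monitor Prometheus?
This is not a joke; it is a lesson learned the hard way. Today I'm sharing a production-tested Prometheus high-availability solution that covers federation architecture design and remote storage best practices.
TL;DR: This article walks you through building an enterprise-grade Prometheus high-availability architecture from scratch, with complete configuration files and a failover drill. Estimated reading time is 15 minutes; bookmark it and read it carefully.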
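1. High-availability architecture design
1.1 The fatal flaws of a single Prometheus instance
    # Problems with a traditional single-instance deployment
    problems:
      - Single point of failure: when the server goes down, monitoring goes completely blind
      - Storage limits: local disk space is finite
      - Query performance: queries over large amounts of historical data are slow
      - Hard to scale: no horizontal scale-out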
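1.2 Core principles of a high-availability architecture
No data loss + no service interruption + fast queries
Our solution is built on the following three layers:
• Application-layer HA: multiple Prometheus instances + load balancing
• Data-layer HA: remote storage + data replication
• Query-layer HA: federation + query sharding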
2. Federation architecture in practice
2.1 Architecture diagram
                         ┌─────────────────────┐
                         │  Global Prometheus  │
                         │ (federation layer)  │
                         └──────────┬──────────┘
                                    │
             ┌──────────────────────┼──────────────────────┐
             │                      │                      │
    ┌────────▼─────────┐   ┌────────▼─────────┐   ┌────────▼─────────┐
    │   Prometheus-1   │   │   Prometheus-2   │   │   Prometheus-N   │
    │   (business A)   │   │   (business B)   │   │ (infrastructure) │
    └──────────────────┘   └──────────────────┘   └──────────────────┘
2.2 Federation configuration in detail
Global Prometheus configuration (prometheus-global.yml)
    global:
      scrape_interval: 30s
      evaluation_interval: 30s
      external_labels:
        cluster: 'global'
        replica: '1'

    rule_files:
      - "global_rules.yml"

    scrape_configs:
      # Federation scrape jobs
      - job_name: 'federate-business-a'
        scrape_interval: 15s
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            # Pull only the key business metrics. A catch-all selector such as
            # '{job=~"business-a-.*"}' would federate every series from these jobs.
            - 'up{job=~"business-a-.*"}'
            - 'http_requests_total{job=~"business-a-.*"}'
            - 'mysql_up{job=~"business-a-.*"}'
        static_configs:
          - targets:
              - 'prometheus-business-a:9090'

      - job_name: 'federate-business-b'
        scrape_interval: 15s
        honor_labels: true
        metrics_path: '/federate'
        params:
          'match[]':
            - 'up{job=~"business-b-.*"}'
            - 'redis_up{job=~"business-b-.*"}'
        static_configs:
          - targets:
              - 'prometheus-business-b:9090'

    # Remote write. The thanos-receive endpoint is assumed to be deployed
    # separately; it is not part of the Compose files in this article.
    remote_write:
      - url: "http://thanos-receive:19291/api/v1/receive"
        queue_config:
          max_samples_per_send: 1000
          capacity: 10000
          max_shards: 200
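To see by hand what a federation job will pull, it can help to request the /federate endpoint directly with the same match[] selectors. The sketch below builds such a URL in plain bash. The helper names urlencode and federate_url are made up for this example, and it assumes ASCII-only selectors.

```shell
#!/usr/bin/env bash

# Percent-encode a string for use in a URL query (ASCII input assumed).
urlencode() {
  local s=$1 out="" c i
  for ((i = 0; i < ${#s}; i++)); do
    c=${s:i:1}
    case $c in
      [a-zA-Z0-9.~_-]) out+=$c ;;
      *) printf -v c '%%%02X' "'$c"; out+=$c ;;
    esac
  done
  printf '%s\n' "$out"
}

# Build a /federate URL from a base URL and one or more series selectors.
# Each selector becomes one match[] query parameter.
federate_url() {
  local base=$1 sep='?' url sel
  shift
  url="${base}/federate"
  for sel in "$@"; do
    url+="${sep}match%5B%5D=$(urlencode "$sel")"
    sep='&'
  done
  printf '%s\n' "$url"
}
```

With the Compose port mapping used in this article, something like `curl -s "$(federate_url http://localhost:9091 'up{job=~"business-a-.*"}')"` should return the series the global instance would scrape from business A for that selector.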
Example business Prometheus configuration
    # prometheus-business-a.yml
    global:
      scrape_interval: 15s
      external_labels:
        cluster: 'business-a'
        replica: 'a1'

    scrape_configs:
      - job_name: 'business-a-web'
        static_configs:
          - targets: ['web1:8080', 'web2:8080']
      - job_name: 'business-a-mysql'
        static_configs:
          - targets: ['mysql-exporter:9104']

    # Local storage: keep only 7 days / 50GB on local disk. In Prometheus v2.45
    # retention is set with command-line flags, not in this file:
    #   --storage.tsdb.retention.time=7d
    #   --storage.tsdb.retention.size=50GB

    # Remote write
    remote_write:
      - url: "http://thanos-receive:19291/api/v1/receive"
2.3 Docker Compose deployment file
    version: '3.8'

    services:
      # Global Prometheus
      prometheus-global:
        image: prom/prometheus:v2.45.0
        container_name: prometheus-global
        ports:
          - "9090:9090"
        volumes:
          - ./config/prometheus-global.yml:/etc/prometheus/prometheus.yml
          - prometheus-global-data:/prometheus
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--storage.tsdb.retention.time=30d'
          # Equal min/max block durations disable local compaction, which the
          # Thanos sidecar requires before it uploads blocks
          - '--storage.tsdb.min-block-duration=2h'
          - '--storage.tsdb.max-block-duration=2h'
          - '--web.console.libraries=/etc/prometheus/console_libraries'
          - '--web.console.templates=/etc/prometheus/consoles'
          - '--web.enable-lifecycle'
          - '--web.enable-admin-api'

      # Business A Prometheus
      prometheus-business-a:
        image: prom/prometheus:v2.45.0
        container_name: prometheus-business-a
        ports:
          - "9091:9090"
        volumes:
          - ./config/prometheus-business-a.yml:/etc/prometheus/prometheus.yml
          - prometheus-a-data:/prometheus
        command:
          - '--config.file=/etc/prometheus/prometheus.yml'
          - '--storage.tsdb.path=/prometheus'
          - '--storage.tsdb.retention.time=7d'
          - '--storage.tsdb.retention.size=50GB'

      # Business B Prometheus
      prometheus-business-b:
        image: prom/prometheus:v2.45.0
        container_name: prometheus-business-b
        ports:
          - "9092:9090"
        volumes:
          - ./config/prometheus-business-b.yml:/etc/prometheus/prometheus.yml
          - prometheus-b-data:/prometheus

    volumes:
      prometheus-global-data:
      prometheus-a-data:
      prometheus-b-data:
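3. Remote storage: selection and practice
3.1 Storage options compared
[Comparison table of Thanos, VictoriaMetrics, and Cortex; the column headers and cell contents are missing from the source.]
Recommended: VictoriaMetrics (single-node version) + Thanos (for large-scale deployments)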
3.2 Deploying VictoriaMetrics
    # docker-compose-victoria.yml
    version: '3.8'

    services:
      victoria-metrics:
        image: victoriametrics/victoria-metrics:v1.93.4
        container_name: victoria-metrics
        ports:
          - "8428:8428"
        volumes:
          - victoria-data:/victoria-metrics-data
        command:
          - '--storageDataPath=/victoria-metrics-data'
          - '--retentionPeriod=1y'
          - '--memory.allowedPercent=80'
          - '--search.maxQueryDuration=60s'
          - '--search.maxQueryLen=16384'

      # Optional: vmagent as a scrape-and-forward agent
      vmagent:
        image: victoriametrics/vmagent:v1.93.4
        ports:
          - "8429:8429"
        volumes:
          - ./vmagent.yml:/etc/vmagent/vmagent.yml
        command:
          - '-promscrape.config=/etc/vmagent/vmagent.yml'
          - '-remoteWrite.url=http://victoria-metrics:8428/api/v1/write'

    volumes:
      victoria-data:
3.3 A complete Thanos deployment
    # docker-compose-thanos.yml
    # Assumes these services run in the same Compose project as the Prometheus
    # services above (for example, by passing both files with -f), so that the
    # prometheus-global hostname and the prometheus-global-data volume resolve.
    version: '3.8'

    services:
      # Thanos Sidecar
      thanos-sidecar-global:
        image: thanosio/thanos:v0.32.2
        container_name: thanos-sidecar-global
        command:
          - 'sidecar'
          - '--tsdb.path=/prometheus'
          - '--prometheus.url=http://prometheus-global:9090'
          - '--grpc-address=0.0.0.0:10901'
          - '--http-address=0.0.0.0:10902'
          - '--objstore.config-file=/etc/thanos/bucket.yml'
        volumes:
          - prometheus-global-data:/prometheus
          - ./thanos/bucket.yml:/etc/thanos/bucket.yml
        ports:
          - "10901:10901"
          - "10902:10902"

      # Thanos Store Gateway
      thanos-store:
        image: thanosio/thanos:v0.32.2
        container_name: thanos-store
        command:
          - 'store'
          - '--data-dir=/var/thanos/store'
          - '--objstore.config-file=/etc/thanos/bucket.yml'
          - '--grpc-address=0.0.0.0:10901'
          - '--http-address=0.0.0.0:10902'
        volumes:
          - ./thanos/bucket.yml:/etc/thanos/bucket.yml
          - thanos-store-data:/var/thanos/store

      # Thanos Querier
      thanos-query:
        image: thanosio/thanos:v0.32.2
        container_name: thanos-query
        command:
          - 'query'
          - '--grpc-address=0.0.0.0:10901'
          - '--http-address=0.0.0.0:9090'
          - '--store=thanos-sidecar-global:10901'
          - '--store=thanos-store:10901'
          # Series that differ only in the replica label are deduplicated
          - '--query.replica-label=replica'
        ports:
          - "9099:9090"

      # Thanos Compactor
      thanos-compact:
        image: thanosio/thanos:v0.32.2
        container_name: thanos-compact
        command:
          - 'compact'
          # Keep running and compact continuously instead of exiting after one pass
          - '--wait'
          - '--data-dir=/var/thanos/compact'
          - '--objstore.config-file=/etc/thanos/bucket.yml'
          - '--retention.resolution-raw=7d'
          - '--retention.resolution-5m=30d'
          - '--retention.resolution-1h=1y'
        volumes:
          - ./thanos/bucket.yml:/etc/thanos/bucket.yml
          - thanos-compact-data:/var/thanos/compact

    volumes:
      prometheus-global-data:
      thanos-store-data:
      thanos-compact-data:
3.4 S3 storage configuration
    # thanos/bucket.yml
    type: S3
    config:
      bucket: "prometheus-thanos"
      endpoint: "s3.amazonaws.com"  # or, with MinIO: "minio:9000"
      region: "us-east-1"
      access_key: "YOUR_ACCESS_KEY"
      secret_key: "YOUR_SECRET_KEY"
      insecure: false
      signature_version2: false
      put_user_metadata:
        "X-Amz-Acl": "bucket-owner-full-control"
      http_config:
        idle_conn_timeout: 90s
        response_header_timeout: 2m
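The bucket.yml above embeds access_key and secret_key in plain text. One way to keep them out of version control is to render the file from environment variables at deploy time. The following is a minimal sketch of that idea; the THANOS_S3_* variable names and the render_bucket_config function are placeholders invented for this example, and it assumes the values contain no double quotes.

```shell
#!/usr/bin/env bash

# Print a Thanos S3 bucket config, reading credentials from the environment.
# Fails if either credential variable is unset or empty.
render_bucket_config() {
  : "${THANOS_S3_ACCESS_KEY:?must be set}"
  : "${THANOS_S3_SECRET_KEY:?must be set}"
  cat <<EOF
type: S3
config:
  bucket: "${THANOS_S3_BUCKET:-prometheus-thanos}"
  endpoint: "${THANOS_S3_ENDPOINT:-s3.amazonaws.com}"
  region: "${THANOS_S3_REGION:-us-east-1}"
  access_key: "${THANOS_S3_ACCESS_KEY}"
  secret_key: "${THANOS_S3_SECRET_KEY}"
  insecure: false
EOF
}
```

A deploy step could then run `render_bucket_config > thanos/bucket.yml && chmod 600 thanos/bucket.yml` with the two credential variables exported by whatever secret store is in use.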
4. Verifying high availability and failure drills
4.1 Service health check script
    #!/bin/bash
    # health_check.sh

    check_prometheus() {
        local name=$1
        local url=$2

        if curl -s "${url}/api/v1/query?query=up" | grep -q "success"; then
            echo "✅ ${name} is healthy"
            return 0
        else
            echo "❌ ${name} is down"
            return 1
        fi
    }

    echo "=== Prometheus cluster health check ==="
    check_prometheus "Global Prometheus" "http://localhost:9090"
    check_prometheus "Business-A Prometheus" "http://localhost:9091"
    check_prometheus "Business-B Prometheus" "http://localhost:9092"
    check_prometheus "Thanos Query" "http://localhost:9099"

    echo "=== Storage backend check ==="
    if curl -s "http://localhost:8428/metrics" | grep -q "vm_"; then
        echo "✅ VictoriaMetrics is healthy"
    else
        echo "❌ VictoriaMetrics is down"
    fi
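The script above prints a status per target but always exits with status 0, so a cron job or CI step cannot tell that something failed. A possible variant, sketched below, returns the number of failed targets as its exit status. The function names probe and check_all are invented for this example. It relies on the /-/ready endpoints that Prometheus and Thanos components expose and on the /health endpoint of VictoriaMetrics.

```shell
#!/usr/bin/env bash

# Probe one endpoint; succeeds only on an HTTP 2xx response within 5 seconds.
probe() {
  curl -fsS --max-time 5 -o /dev/null "$1"
}

# Check "name=url" pairs, print one line per target,
# and return the number of failed targets as the exit status.
check_all() {
  local failures=0 pair name url
  for pair in "$@"; do
    name=${pair%%=*}   # text before the first "="
    url=${pair#*=}     # text after the first "="
    if probe "$url" 2>/dev/null; then
      echo "OK   ${name}"
    else
      echo "FAIL ${name}"
      failures=$((failures + 1))
    fi
  done
  return "$failures"
}
```

For the ports used in this article, a call might look like `check_all "global=http://localhost:9090/-/ready" "thanos-query=http://localhost:9099/-/ready" "victoria-metrics=http://localhost:8428/health" || echo "some targets are unhealthy"`.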
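4.2 Failover test
    # Simulate a Prometheus instance failure
    docker stop prometheus-business-a

    # Verify the federation layer still answers queries
    # (--data-urlencode keeps curl from interpreting the braces in the selector;
    # expect up{job="federate-business-a"} to drop to 0 while the instance is down)
    curl -G "http://localhost:9090/api/v1/query" --data-urlencode 'query=up{job=~"business-a-.*"}'

    # Verify queries still work through Thanos
    curl -G "http://localhost:9099/api/v1/query" --data-urlencode 'query=up{job=~"business-a-.*"}'

    # Restore the instance
    docker start prometheus-business-a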
4.3 Data consistency check
    #!/bin/bash
    # data_consistency_check.sh

    QUERY="up"
    PROM_URL="http://localhost:9090"
    THANOS_URL="http://localhost:9099"

    echo "Checking data consistency..."

    PROM_RESULT=$(curl -s "${PROM_URL}/api/v1/query?query=${QUERY}" | jq '.data.result | length')
    THANOS_RESULT=$(curl -s "${THANOS_URL}/api/v1/query?query=${QUERY}" | jq '.data.result | length')

    echo "Prometheus result count: ${PROM_RESULT}"
    echo "Thanos result count: ${THANOS_RESULT}"

    if [ "${PROM_RESULT}" -eq "${THANOS_RESULT}" ]; then
        echo "✅ Data consistency check passed"
    else
        echo "⚠️ Data is inconsistent; check the configuration"
    fi
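The strict -eq comparison above can raise false alarms, because the two queries run at slightly different moments and targets may appear or disappear in between. It also errors out when either value is empty, for example after a failed curl. A tolerance-based comparison is one way to soften this. The sketch below uses a hypothetical helper, within_tolerance, and the percentage threshold is an assumption to tune per environment.

```shell
#!/usr/bin/env bash

# Succeed when two non-negative integer counts differ by at most
# pct percent of the larger one. Returns 2 if an input is not an integer.
within_tolerance() {
  local a=$1 b=$2 pct=$3 diff max
  [[ $a =~ ^[0-9]+$ && $b =~ ^[0-9]+$ && $pct =~ ^[0-9]+$ ]] || return 2
  diff=$(( a > b ? a - b : b - a ))
  max=$(( a > b ? a : b ))
  (( diff * 100 <= max * pct ))
}
```

In the script above, the final test could become `if within_tolerance "${PROM_RESULT}" "${THANOS_RESULT}" 5; then ...`, which accepts a difference of up to 5% and treats empty or non-numeric results as a failure.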
5. Performance tuning and monitoring
5.1 Key performance metrics
    # Important metrics for monitoring Prometheus itself
    key_metrics:
      - prometheus_tsdb_head_samples_appended_total               # sample ingestion rate
      - prometheus_tsdb_compactions_total                         # compaction runs
      - prometheus_rule_evaluation_duration_seconds               # rule evaluation time
      - prometheus_config_last_reload_success_timestamp_seconds   # config reloads
      - go_memstats_alloc_bytes                                   # memory usage
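Counters such as prometheus_tsdb_head_samples_appended_total only ever increase, so what matters is how fast they grow. The sketch below shows the basic arithmetic behind a per-second rate from two samples. It is a simplification: PromQL's rate() also extrapolates to the window boundaries and handles several resets within a window. The function name counter_rate is made up for this example.

```shell
#!/usr/bin/env bash

# Average per-second increase of a counter between two samples.
# Args: value1 time1 value2 time2 (times in seconds).
# A drop in value is treated as a counter reset, so the second
# value alone counts as the increase. Fails if time2 <= time1.
counter_rate() {
  awk -v v1="$1" -v t1="$2" -v v2="$3" -v t2="$4" 'BEGIN {
    if (t2 <= t1) { exit 1 }
    inc = (v2 >= v1) ? v2 - v1 : v2
    printf "%.2f\n", inc / (t2 - t1)
  }'
}
```

For instance, a counter that reads 1000 and then 4000 sixty seconds later has grown by 50 samples per second: `counter_rate 1000 0 4000 60` prints 50.00.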
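5.2 Alerting rules
    # prometheus_alerts.yml
    # job="prometheus" and job="thanos-query" assume scrape jobs with those
    # names exist; the configurations in this article do not define them.
    groups:
      - name: prometheus-ha
        rules:
          - alert: PrometheusDown
            expr: up{job="prometheus"} == 0
            for: 1m
            labels:
              severity: critical
            annotations:
              summary: "Prometheus instance {{ $labels.instance }} is down"
              description: "The Prometheus instance has been down for more than 1 minute"

          - alert: PrometheusConfigReloadFailed
            expr: prometheus_config_last_reload_successful != 1
            for: 5m
            labels:
              severity: warning
            annotations:
              summary: "Prometheus configuration reload failed"

          - alert: ThanosQueryDown
            expr: up{job="thanos-query"} == 0
            for: 2m
            labels:
              severity: critical
            annotations:
              summary: "Thanos Query is unavailable"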
5.3 Grafana dashboard
    {
      "dashboard": {
        "title": "Prometheus HA Monitoring",
        "panels": [
          {
            "title": "Instance status",
            "type": "stat",
            "targets": [
              { "expr": "sum(up{job=~'prometheus.*'})" }
            ]
          },
          {
            "title": "Sample ingestion rate",
            "type": "graph",
            "targets": [
              { "expr": "rate(prometheus_tsdb_head_samples_appended_total[5m])" }
            ]
          }
        ]
      }
    }
6. Production best practices
6.1 Resource planning recommendations
    # Suggested resource allocation for production
    production_specs:
      prometheus_global:
        cpu: "2 cores"
        memory: "8GB"
        disk: "200GB SSD"
      prometheus_business:
        cpu: "1 core"
        memory: "4GB"
        disk: "100GB SSD"
      victoria_metrics:
        cpu: "4 cores"
        memory: "16GB"
        disk: "1TB SSD"
      thanos_components:
        cpu: "1 core each"
        memory: "2GB each"
        disk: "50GB each"
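To sanity-check the disk figures above against a real workload, the Prometheus documentation suggests a rough estimate: needed disk space = retention time in seconds × ingested samples per second × bytes per sample, where a sample typically takes 1 to 2 bytes after compression. The sketch below applies that formula with integer arithmetic. The function name estimate_tsdb_bytes and the default of 2 bytes per sample are choices made for this example.

```shell
#!/usr/bin/env bash

# Rough local TSDB size in bytes:
#   retention_seconds * (series / scrape_interval_seconds) * bytes_per_sample
# Args: active_series scrape_interval_s retention_days [bytes_per_sample=2]
estimate_tsdb_bytes() {
  local series=$1 interval=$2 days=$3 bps=${4:-2}
  echo $(( series * days * 86400 * bps / interval ))
}
```

With 100,000 active series, a 15s scrape interval, and 7 days of retention, `estimate_tsdb_bytes 100000 15 7` prints 8064000000, roughly 8 GB. The estimate does not account for the WAL, series churn, or compaction overhead, so leave headroom on top of it.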
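6.2 Security hardening
    # Security configuration checklist
    security_checklist:
      - ✅ Enable HTTPS for encryption in transit
      - ✅ Configure access authentication (BasicAuth/OAuth)
      - ✅ Restrict network access (firewall rules)
      - ✅ Update component versions regularly
      - ✅ Monitor access logs for anomalies
      - ✅ Back up critical configuration files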
6.3 Operations automation script
    #!/bin/bash
    # prometheus_maintenance.sh

    # Back up the configuration
    backup_config() {
        DATE=$(date +%Y%m%d_%H%M%S)
        tar -czf "/backup/prometheus_config_${DATE}.tar.gz" ./config/
        echo "Configuration backed up to: /backup/prometheus_config_${DATE}.tar.gz"
    }

    # Rolling restart
    rolling_restart() {
        services=("prometheus-business-a" "prometheus-business-b" "prometheus-global")

        for service in "${services[@]}"; do
            echo "Restarting ${service}..."
            docker restart "${service}"
            sleep 30  # wait for the service to stabilize

            # Health check
            if ! docker ps | grep -q "${service}"; then
                echo "❌ ${service} failed to restart"
                exit 1
            fi
            echo "✅ ${service} restarted successfully"
        done
    }

    # Run maintenance
    backup_config
    rolling_restart
    echo "✅ Maintenance complete"
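The fixed sleep 30 in the rolling restart is either too long for a small instance or too short for one replaying a large WAL, and a running container is not necessarily ready to serve queries. Polling the instance's /-/ready endpoint until it answers is a common alternative. Here is a minimal polling helper; wait_until is a name made up for this sketch, and the timeout and interval values are assumptions to adjust.

```shell
#!/usr/bin/env bash

# Re-run a command until it succeeds or the timeout (seconds) elapses.
# Usage: wait_until <timeout_s> <interval_s> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 on timeout.
wait_until() {
  local timeout=$1 interval=$2 deadline
  shift 2
  deadline=$(( SECONDS + timeout ))
  until "$@"; do
    (( SECONDS >= deadline )) && return 1
    sleep "$interval"
  done
}
```

In the restart loop, the sleep could then be replaced with something like `wait_until 120 2 curl -fsS -o /dev/null "http://localhost:9091/-/ready"`, using the host port that maps to the restarted instance.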
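Summary
By combining federation with remote storage, we have built an enterprise-grade Prometheus high-availability architecture. Key points:
• Architecture design: layered monitoring, global federation, per-business isolation
• Storage: short-term local + long-term remote, balancing performance and cost
• High-availability guarantees: multi-instance deployment, automatic failover
• Ops-friendly: self-monitoring, timely alerts, simplified operations
This setup has run stably in production for 18+ months, handling 100,000+ time series with 99.9% availability.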