Mastering Prometheus from Scratch: Best Practices for Enterprise Monitoring and Alerting

Test environment

prometheus-2.26.0.linux-amd64.tar.gz, download URL: https://github.com/prometheus/prometheus/releases/download/v2.26.0/prometheus-2.26.0.linux-amd64.tar.gz
CentOS 7.9

Downloading and running Prometheus

# wget https://github.com/prometheus/prometheus/releases/download/v2.26.0/prometheus-2.26.0.linux-amd64.tar.gz
# tar xvzf prometheus-2.26.0.linux-amd64.tar.gz
# cd prometheus-2.26.0.linux-amd64
# ls
console_libraries  consoles  LICENSE  NOTICE  prometheus  prometheus.yml  promtool

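Before starting Prometheus, configure it.

Configuring Prometheus to monitor itself

Prometheus collects metrics from targets by scraping their metrics HTTP endpoints. Because Prometheus exposes data about itself in the same way, it can also scrape and monitor its own health.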
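To make the scrape model more concrete, here is a minimal sketch of reading the text a metrics endpoint returns. The sample text, the function name parse_metrics, and the regular expressions are assumptions made for illustration; the parser handles only simple lines (no escaped label values, no timestamps) and is not how Prometheus itself parses the format.

```python
import re

# Hypothetical excerpt of what a /metrics endpoint might return.
SAMPLE = """\
# HELP prometheus_target_interval_length_seconds Actual intervals between scrapes.
# TYPE prometheus_target_interval_length_seconds summary
prometheus_target_interval_length_seconds{interval="5s",quantile="0.99"} 5.0003
prometheus_target_interval_length_seconds_count{interval="5s"} 42
process_open_fds 31
"""

# metric_name, optional {labels}, then the sample value.
LINE_RE = re.compile(r'^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$')
LABEL_RE = re.compile(r'([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"')


def parse_metrics(text):
    """Return a list of (name, labels, value) tuples, skipping comment lines."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE metadata and blank lines carry no samples.
        match = LINE_RE.match(line)
        if match is None:
            continue  # Ignore anything this simplified parser does not understand.
        name, raw_labels, value = match.groups()
        labels = dict(LABEL_RE.findall(raw_labels or ""))
        samples.append((name, labels, float(value)))
    return samples


for name, labels, value in parse_metrics(SAMPLE):
    print(name, labels, value)
```

Each non-comment line becomes one sample: a metric name, a set of labels, and a numeric value. This is the same shape of data that later appears in the expression browser.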

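A Prometheus server that collects only data about itself is not very useful, but it is a good starting example. Save the following basic Prometheus configuration as a file named prometheus.yml (the directory unpacked from the archive already contains a file named prometheus.yml by default):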

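global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape.
# Here it is Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any time series scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      # For remote access, localhost can be replaced with a specific IP, for example 10.118.71.170.
      - targets: ['localhost:9090']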
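The configuration above sets a global scrape interval and then overrides it for one job. The following sketch shows that fallback logic in plain Python. The dict mirrors only the relevant keys of the file, the job named "example" is hypothetical, and effective_intervals is an illustrative helper rather than anything Prometheus provides.

```python
# Mirrors the relevant keys of the configuration above as a plain dict.
# The "example" job is hypothetical and only shows the fallback case.
CONFIG = {
    "global": {"scrape_interval": "15s"},
    "scrape_configs": [
        {"job_name": "prometheus", "scrape_interval": "5s"},
        {"job_name": "example"},
    ],
}


def effective_intervals(config):
    """Map each job name to the scrape interval that applies to it."""
    # Prometheus falls back to 1m when the global section sets no interval.
    default = config.get("global", {}).get("scrape_interval", "1m")
    return {
        job["job_name"]: job.get("scrape_interval", default)
        for job in config["scrape_configs"]
    }


print(effective_intervals(CONFIG))
```

A job-level scrape_interval wins over the global one; a job without its own value inherits the global setting.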

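For a complete description of the configuration options, see the configuration documentation.

Starting Prometheus

To start Prometheus with the newly created configuration file, change to the directory containing the Prometheus binary and run: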

# Start Prometheus.
# By default, Prometheus stores its database in ./data (flag --storage.tsdb.path).
./prometheus --config.file=prometheus.yml

Browse to the status page at localhost:9090. Give it a few seconds to collect data about itself from its own HTTP metrics endpoint.

You can also verify that Prometheus is serving metrics about itself by visiting its metrics endpoint (http://localhost:9090/metrics).

Opening the firewall port

# firewall-cmd --permanent --zone=public --add-port=9090/tcp
success
# firewall-cmd --reload
success

Using the expression browser

To use Prometheus's built-in expression browser, visit localhost:9090/graph and choose the Table tab within the Graph navigation menu (the Console tab in the Classic UI).

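Looking at the localhost:9090/metrics page, you can see that Prometheus exports a metric about itself named prometheus_target_interval_length_seconds (the actual amount of time between target scrapes). Enter it into the expression search box and click the Execute button. This returns a number of different time series (along with the latest value recorded for each), all with the metric name prometheus_target_interval_length_seconds but with different labels. These labels designate different latency percentiles and target group intervals.

If we are interested only in 99th percentile latencies, we can use this query to retrieve that information: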

prometheus_target_interval_length_seconds{quantile="0.99"}

To count the number of returned time series, modify the query as follows:

count(prometheus_target_interval_length_seconds)
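The two queries above can be mimicked in plain Python to show what a label matcher and count() do to a set of series. The series values below are invented for illustration, and select and count are hypothetical helpers, not part of any Prometheus client library.

```python
# Hypothetical snapshot of the series an instant query for
# prometheus_target_interval_length_seconds might return.
SERIES = [
    ({"interval": "5s", "quantile": "0.01"}, 4.9996),
    ({"interval": "5s", "quantile": "0.5"}, 5.0000),
    ({"interval": "5s", "quantile": "0.99"}, 5.0004),
    ({"interval": "15s", "quantile": "0.99"}, 15.0002),
]


def select(series, **matchers):
    """Keep series whose labels equal every matcher, like {label="value"}."""
    return [
        (labels, value)
        for labels, value in series
        # A missing label is treated as the empty string.
        if all(labels.get(name, "") == wanted for name, wanted in matchers.items())
    ]


def count(series):
    """Number of series in the result, like count(...) without a by clause."""
    return len(series)


print(select(SERIES, quantile="0.99"))
print(count(SERIES))
```

The matcher narrows the result to the series carrying quantile="0.99", while count collapses the whole result into a single number.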

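For more about the expression language, see the expression language documentation.

Using the graphing interface

To graph expressions, use the "Graph" tab.

For example, enter the following expression to graph the per-second rate of chunks being created in the self-scraped Prometheus: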

rate(prometheus_tsdb_head_chunks_created_total[1m])
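As a rough illustration of what a per-second rate over a window means, the sketch below divides a counter's increase by the elapsed time. It is a simplification under stated assumptions: the sample data is invented, simple_rate is a hypothetical helper, and the real rate() function additionally extrapolates to the window boundaries, so its results can differ slightly.

```python
def simple_rate(samples):
    """Approximate per-second rate of a counter.

    samples: list of (timestamp_seconds, counter_value), oldest first.
    Returns None when fewer than two samples fall inside the window.
    """
    if len(samples) < 2:
        return None
    increase = 0.0
    for (_, prev), (_, cur) in zip(samples, samples[1:]):
        # A drop means the counter was reset; count from zero after the reset.
        increase += cur - prev if cur >= prev else cur
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed


# Hypothetical one-minute window, scraped every 15 seconds.
window = [(0, 100), (15, 130), (30, 160), (45, 190), (60, 220)]
print(simple_rate(window))
```

Here the counter grows by 120 over 60 seconds, so the sketch reports 2 chunks per second.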

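Starting up some sample targets

Now let's add some sample targets for Prometheus to scrape.

Use the Node Exporter as the sample target; for more on using it, see the Node Exporter documentation.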

# wget https://github.com/prometheus/node_exporter/releases/download/v1.1.2/node_exporter-1.1.2.linux-amd64.tar.gz
# tar -xvzf node_exporter-1.1.2.linux-amd64.tar.gz
# cd node_exporter-1.1.2.linux-amd64
# ./node_exporter --web.listen-address 127.0.0.1:8001
# ./node_exporter --web.listen-address 127.0.0.1:8002
# ./node_exporter --web.listen-address 127.0.0.1:8003

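With each exporter running in its own terminal, there should now be example targets listening on http://localhost:8001/metrics, http://localhost:8002/metrics, and http://localhost:8003/metrics.

Configuring Prometheus to monitor the sample targets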

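Now we will configure Prometheus to scrape these new targets. Let's group all three endpoints into one job called "node". Suppose, however, that the first two endpoints are production targets, while the third represents a canary instance. To model this in Prometheus, we can add several groups of endpoints to a single job and attach extra labels to each group. In this example, we add the group="production" label to the first target group and group="canary" to the second.

To achieve this, add the following job definition to the scrape_configs section in prometheus.yml and restart the Prometheus instance. The modified prometheus.yml looks like this: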

global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

# A scrape configuration containing exactly one endpoint to scrape.
# Here it is Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any time series scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['10.118.71.170:9090']

  - job_name: 'node'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:8001', 'localhost:8002']
        labels:
          group: 'production'
      - targets: ['localhost:8003']
        labels:
          group: 'canary'
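To see how the target groups in the "node" job translate into labels, here is a small sketch that expands the job into one label set per target. The dict mirrors the job definition described above; expand_targets is a hypothetical helper, and it assumes the default behavior in which the instance label is set to the target address.

```python
# The "node" job, written as a plain dict for illustration.
NODE_JOB = {
    "job_name": "node",
    "static_configs": [
        {"targets": ["localhost:8001", "localhost:8002"],
         "labels": {"group": "production"}},
        {"targets": ["localhost:8003"],
         "labels": {"group": "canary"}},
    ],
}


def expand_targets(job):
    """Return the label set attached to series scraped from each target."""
    result = []
    for group in job["static_configs"]:
        for address in group["targets"]:
            # job comes from job_name; instance defaults to the target address.
            labels = {"job": job["job_name"], "instance": address}
            # Extra labels declared on the target group apply to all its targets.
            labels.update(group.get("labels", {}))
            result.append(labels)
    return result


for labels in expand_targets(NODE_JOB):
    print(labels)
```

Every series from the first two exporters ends up with group="production", and every series from the third with group="canary", which is what makes it possible to filter or aggregate by group in queries.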

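Viewing the targets (Status -> Targets)

Graph query

Configuring rules to aggregate scraped data into new time series

Though not a problem in our example, queries that aggregate over thousands of time series can get slow when computed ad hoc. To make this more efficient, Prometheus allows expressions to be prerecorded into completely new persisted time series via configured recording rules. Suppose we are interested in the per-second rate of CPU time (node_cpu_seconds_total) averaged over all CPUs per instance, measured over a 5-minute window, preserving the job, instance, and mode dimensions. We could write this as: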


avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))
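The aggregation in this query can be sketched in Python: group the per-CPU rates by the labels to keep, then average within each group. The input values are invented, and avg_by is a hypothetical helper that only imitates the grouping semantics of avg by (...).

```python
from collections import defaultdict

# Hypothetical per-CPU rates, shaped like the output of
# rate(node_cpu_seconds_total[5m]) for one instance with two CPUs.
RATES = [
    ({"job": "node", "instance": "localhost:8001", "cpu": "0", "mode": "idle"}, 0.90),
    ({"job": "node", "instance": "localhost:8001", "cpu": "1", "mode": "idle"}, 0.70),
    ({"job": "node", "instance": "localhost:8001", "cpu": "0", "mode": "user"}, 0.05),
    ({"job": "node", "instance": "localhost:8001", "cpu": "1", "mode": "user"}, 0.15),
]


def avg_by(series, keep):
    """Average values across series that share the same values for `keep`."""
    groups = defaultdict(list)
    for labels, value in series:
        # Labels not listed in `keep` (here: cpu) are dropped from the key.
        key = tuple(labels.get(name, "") for name in keep)
        groups[key].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


for key, value in avg_by(RATES, ("job", "instance", "mode")).items():
    print(key, value)
```

The cpu label disappears from the result because it is not in the by clause, leaving one averaged series per job, instance, and mode.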

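Run this query in the Graph tab to see the result.

Now, to record the time series resulting from this expression into a new metric called job_instance_mode:node_cpu_seconds:avg_rate5m, create a file with the following recording rule and save it as prometheus.rules.yml: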

groups:
  - name: cpu-node
    rules:
      - record: job_instance_mode:node_cpu_seconds:avg_rate5m
        expr: avg by (job, instance, mode) (rate(node_cpu_seconds_total[5m]))
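The recorded metric name follows the level:metric:operations naming convention that Prometheus recommends for recording rules. The sketch below simply splits the name into those three parts; split_rule_name is a hypothetical helper added here for illustration.

```python
def split_rule_name(name):
    """Split a recording rule name of the form level:metric:operations."""
    level, metric, operations = name.split(":")
    return {"level": level, "metric": metric, "operations": operations}


parts = split_rule_name("job_instance_mode:node_cpu_seconds:avg_rate5m")
print(parts)
```

Read this way, the name says that the series keeps the job, instance, and mode labels, is derived from node_cpu_seconds, and is the result of applying a 5-minute rate followed by an average.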

To make Prometheus pick up this new rule, add a rule_files statement to prometheus.yml. The configuration should now look like this:

global:
  scrape_interval: 15s # By default, scrape targets every 15 seconds.
  # Attach these labels to any time series or alerts when communicating with
  # external systems (federation, remote storage, Alertmanager).
  external_labels:
    monitor: 'codelab-monitor'

rule_files:
  - 'prometheus.rules.yml'

# A scrape configuration containing exactly one endpoint to scrape.
# Here it is Prometheus itself.
scrape_configs:
  # The job name is added as a label `job=<job_name>` to any time series scraped from this config.
  - job_name: 'prometheus'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['10.118.71.170:9090']

  - job_name: 'node'
    # Override the global default and scrape targets from this job every 5 seconds.
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:8001', 'localhost:8002']
        labels:
          group: 'production'
      - targets: ['localhost:8003']