Linux System Alerting and Automated Response Configuration

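Introduction

In modern IT operations, system monitoring and automated response are key to keeping services stable and available. Linux is the mainstream choice for enterprise servers, so the way its alerting and automated responses are configured directly affects business continuity. This article walks through how to configure alerting and automated response on Linux and offers operations engineers practical solutions they can adapt.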

Monitoring Metrics

Core System Metrics

CPU Monitoring

• CPU utilization (overall and per core)
• CPU load average (1, 5, and 15 minutes)
• Context switch count
• Interrupt count
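
As an illustration of how overall CPU utilization can be derived, the sketch below compares two samples of the aggregate `cpu` line from `/proc/stat`. The function names are hypothetical, and counting `iowait` as idle time is an assumption that some tools make differently.

```python
def parse_cpu_line(line):
    """Return (idle_ticks, total_ticks) from the aggregate 'cpu' line of /proc/stat."""
    fields = [int(v) for v in line.split()[1:]]
    # Field order: user nice system idle iowait irq softirq steal guest guest_nice
    idle = fields[3] + (fields[4] if len(fields) > 4 else 0)
    # guest and guest_nice are already included in user and nice, so stop at steal
    total = sum(fields[:8])
    return idle, total


def cpu_usage_percent(before, after):
    """CPU utilization in percent between two samples of the 'cpu' line."""
    idle0, total0 = parse_cpu_line(before)
    idle1, total1 = parse_cpu_line(after)
    total_delta = total1 - total0
    if total_delta <= 0:
        return 0.0
    return 100.0 * (1 - (idle1 - idle0) / total_delta)
```

In practice the two lines would be read from `/proc/stat` a second or more apart; a single sample only gives averages since boot.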

Memory Monitoring

• Memory utilization and free memory
• Swap usage
• Memory fragmentation
• Cache and buffer usage
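
The memory and swap figures above can be read from `/proc/meminfo`. The following is a minimal sketch with hypothetical helper names; it assumes a kernel that exposes `MemAvailable` (3.14 or later).

```python
def parse_meminfo(text):
    """Parse the contents of /proc/meminfo into a dict (values in kB where a unit is given)."""
    values = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            values[key.strip()] = int(parts[0])
    return values


def memory_usage_percent(info):
    """Share of memory that is not available to new workloads."""
    total = info["MemTotal"]
    return 100.0 * (total - info["MemAvailable"]) / total


def swap_usage_percent(info):
    """Share of swap in use; 0.0 when no swap is configured."""
    total = info.get("SwapTotal", 0)
    if total == 0:
        return 0.0
    return 100.0 * (total - info.get("SwapFree", 0)) / total
```

Using `MemAvailable` rather than `MemFree` avoids counting reclaimable page cache as used memory, which matches the expression used for the memory alert later in this article.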

Disk Monitoring

• Disk space utilization
• Disk I/O read and write rates
• Disk queue length
• File system inode usage
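
Space and inode utilization can both be obtained from a single `statvfs` call. This sketch uses hypothetical function names and counts root-reserved blocks as used, so its space figure may read slightly higher than the `Use%` column of `df`.

```python
import os


def usage_percent(used, total):
    """Percentage helper that tolerates file systems reporting a total of zero."""
    return 100.0 * used / total if total else 0.0


def filesystem_usage(path="/"):
    """Space and inode utilization of the file system that contains path."""
    st = os.statvfs(path)
    # f_bavail and f_favail count what an unprivileged user can still allocate
    total_bytes = st.f_blocks * st.f_frsize
    avail_bytes = st.f_bavail * st.f_frsize
    return {
        "space_percent": usage_percent(total_bytes - avail_bytes, total_bytes),
        "inode_percent": usage_percent(st.f_files - st.f_favail, st.f_files),
    }
```

Watching `inode_percent` alongside `space_percent` catches the case where a volume full of tiny files refuses new writes while still showing free space.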

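Network Monitoring

• Network interface traffic statistics
• Number of network connections
• Network error packet statistics
• Network latency and packet loss rate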
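
Per-interface traffic and error counters are exposed in `/proc/net/dev`. The sketch below, with hypothetical function and key names, parses that file's text and turns two samples into byte rates.

```python
def parse_net_dev(text):
    """Parse /proc/net/dev content into per-interface counters."""
    stats = {}
    for line in text.splitlines()[2:]:  # the first two lines are column headers
        name, sep, data = line.partition(":")
        if not sep:
            continue
        f = [int(v) for v in data.split()]
        # Columns 0-7 are receive counters, columns 8-15 are transmit counters
        stats[name.strip()] = {
            "rx_bytes": f[0], "rx_errs": f[2], "rx_drop": f[3],
            "tx_bytes": f[8], "tx_errs": f[10], "tx_drop": f[11],
        }
    return stats


def byte_rates(before, after, interval_seconds):
    """Receive and transmit rates in bytes per second for interfaces present in both samples."""
    rates = {}
    for name in before.keys() & after.keys():
        rates[name] = {
            "rx_bytes_per_s": (after[name]["rx_bytes"] - before[name]["rx_bytes"]) / interval_seconds,
            "tx_bytes_per_s": (after[name]["tx_bytes"] - before[name]["tx_bytes"]) / interval_seconds,
        }
    return rates
```

The counters are cumulative since the interface came up, so rates and error trends always come from the difference between two readings.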

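Application-Level Metrics

Process Monitoring

• Liveness of critical processes
• Process CPU and memory usage
• Process file descriptor usage
• Process port listening status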
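
File descriptor usage for a process can be sampled from the Linux `/proc` tree. This is a rough sketch with hypothetical helper names; reading another user's process usually requires root.

```python
import os


def open_fd_count(pid):
    """Number of file descriptors currently open in a process."""
    return len(os.listdir(f"/proc/{pid}/fd"))


def soft_fd_limit(pid):
    """Soft 'Max open files' limit of a process, or None if unlimited or not found."""
    with open(f"/proc/{pid}/limits") as f:
        for line in f:
            if line.startswith("Max open files"):
                value = line.split()[3]  # columns: name (3 words), soft, hard, unit
                return None if value == "unlimited" else int(value)
    return None


def fd_usage_percent(open_count, soft_limit):
    """Share of the soft limit in use; 0.0 when there is no usable limit."""
    return 100.0 * open_count / soft_limit if soft_limit else 0.0
```

A process that climbs steadily toward its soft limit is often leaking sockets or files, so alerting at a fraction of the limit gives time to react before it starts failing with "too many open files".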

Service Monitoring

• Service response time
• Service availability checks
• Service error rate statistics
• Service connection pool status
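
A basic availability and response-time probe can be as small as a timed TCP connect. The sketch below is an assumption about how such a check might look, not a replacement for protocol-aware health endpoints; `check_tcp` is a hypothetical name.

```python
import socket
import time


def check_tcp(host, port, timeout=3.0):
    """Try a TCP connect and return (is_up, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures
        return False, time.monotonic() - start
```

For example, `check_tcp("127.0.0.1", 3306)` would report whether something accepts connections on the MySQL port and how long the handshake took.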

Alerting System Architecture

Monitoring Data Collection Layer

System-Level Monitoring Tools

Use node_exporter to collect system metrics:

# Install node_exporter
wget https://github.com/prometheus/node_exporter/releases/download/v1.6.1/node_exporter-1.6.1.linux-amd64.tar.gz
tar xvfz node_exporter-1.6.1.linux-amd64.tar.gz
sudo cp node_exporter-1.6.1.linux-amd64/node_exporter /usr/local/bin/

# The unit below runs as the prometheus user; create it first if it does not exist
id prometheus > /dev/null 2>&1 || sudo useradd --system --no-create-home --shell /bin/false prometheus

# Create the systemd service
sudo tee /etc/systemd/system/node_exporter.service > /dev/null <<EOF
[Unit]
Description=Node Exporter
Wants=network-online.target
After=network-online.target

[Service]
User=prometheus
Group=prometheus
Type=simple
ExecStart=/usr/local/bin/node_exporter
Restart=always
RestartSec=3

[Install]
WantedBy=multi-user.target
EOF

sudo systemctl daemon-reload
sudo systemctl enable node_exporter
sudo systemctl start node_exporter

Custom Monitoring Scripts

Create a system health check script:

#!/bin/bash
# system_health_check.sh

# Configuration file
CONFIG_FILE="/etc/monitoring/health_check.conf"

# Default thresholds
CPU_THRESHOLD=80
MEMORY_THRESHOLD=85
DISK_THRESHOLD=90
LOAD_THRESHOLD=10

# Load overrides from the configuration file, if present
if [ -f "$CONFIG_FILE" ]; then
    source "$CONFIG_FILE"
fi

# Check CPU usage (100 minus the idle percentage reported by top)
check_cpu() {
    local cpu_usage
    cpu_usage=$(top -bn1 | awk -F',' '/Cpu\(s\)/ {for (i = 1; i <= NF; i++) if ($i ~ /id/) {gsub(/[^0-9.]/, "", $i); printf "%.1f", 100 - $i}}')
    if (( $(echo "$cpu_usage > $CPU_THRESHOLD" | bc -l) )); then
        echo "CRITICAL: CPU usage is ${cpu_usage}%"
        return 2
    elif (( $(echo "$cpu_usage > $((CPU_THRESHOLD - 10))" | bc -l) )); then
        echo "WARNING: CPU usage is ${cpu_usage}%"
        return 1
    fi
    return 0
}

# Check memory usage
check_memory() {
    local memory_usage
    memory_usage=$(free | awk '/^Mem:/ {printf "%.2f", $3 / $2 * 100}')
    if (( $(echo "$memory_usage > $MEMORY_THRESHOLD" | bc -l) )); then
        echo "CRITICAL: Memory usage is ${memory_usage}%"
        return 2
    elif (( $(echo "$memory_usage > $((MEMORY_THRESHOLD - 10))" | bc -l) )); then
        echo "WARNING: Memory usage is ${memory_usage}%"
        return 1
    fi
    return 0
}

# Check disk usage (the fullest file system wins)
check_disk() {
    local disk_usage
    # -P keeps each file system on one line; pseudo and optical file systems are excluded
    disk_usage=$(df -P -x tmpfs -x devtmpfs -x iso9660 | awk 'NR > 1 {sub(/%/, "", $5); print $5}' | sort -n | tail -1)
    if [ "$disk_usage" -gt "$DISK_THRESHOLD" ]; then
        echo "CRITICAL: Disk usage is ${disk_usage}%"
        return 2
    elif [ "$disk_usage" -gt "$((DISK_THRESHOLD - 10))" ]; then
        echo "WARNING: Disk usage is ${disk_usage}%"
        return 1
    fi
    return 0
}

# Check the 1-minute load average
check_load() {
    local load_avg
    load_avg=$(awk '{print $1}' /proc/loadavg)
    if (( $(echo "$load_avg > $LOAD_THRESHOLD" | bc -l) )); then
        echo "CRITICAL: System load is ${load_avg}"
        return 2
    elif (( $(echo "$load_avg > $((LOAD_THRESHOLD - 2))" | bc -l) )); then
        echo "WARNING: System load is ${load_avg}"
        return 1
    fi
    return 0
}

# Main check routine
main() {
    local exit_code=0
    local timestamp
    timestamp=$(date '+%Y-%m-%d %H:%M:%S')
    echo "[$timestamp] Starting system health check..."

    # Run each check and keep its return code
    check_cpu
    local cpu_result=$?
    check_memory
    local memory_result=$?
    check_disk
    local disk_result=$?
    check_load
    local load_result=$?

    # Determine the overall status: any CRITICAL wins, then any WARNING
    if [ $cpu_result -eq 2 ] || [ $memory_result -eq 2 ] || [ $disk_result -eq 2 ] || [ $load_result -eq 2 ]; then
        exit_code=2
    elif [ $cpu_result -eq 1 ] || [ $memory_result -eq 1 ] || [ $disk_result -eq 1 ] || [ $load_result -eq 1 ]; then
        exit_code=1
    fi

    echo "[$timestamp] Health check completed with exit code: $exit_code"
    exit $exit_code
}

main "$@"

Alert Rule Configuration

Prometheus Alert Rules

Create the alert rules file:

# /etc/prometheus/rules/system_alerts.yml
groups:
  - name: system_alerts
    rules:
      - alert: HighCPUUsage
        # rate() is used instead of irate(): the Prometheus documentation recommends rate() for alerting rules
        expr: 100 - (avg by(instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100) > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage detected"
          description: "CPU usage is above 80% for more than 5 minutes on {{ $labels.instance }}"

      - alert: HighMemoryUsage
        expr: (node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100 > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage detected"
          description: "Memory usage is above 85% for more than 5 minutes on {{ $labels.instance }}"

      - alert: DiskSpaceLow
        expr: (node_filesystem_avail_bytes{fstype!="tmpfs"} / node_filesystem_size_bytes{fstype!="tmpfs"}) * 100 < 10
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space"
          description: "Disk space is below 10% on {{ $labels.instance }}"

      - alert: SystemLoadHigh
        expr: node_load1 > 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High system load"
          description: "System load is above 10 for more than 5 minutes on {{ $labels.instance }}"

      - alert: ServiceDown
        expr: up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service is down"
          description: "{{ $labels.instance }} has been down for more than 1 minute"
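
The `for` clause in these rules means an alert only fires once its condition has held continuously for the given duration. The sketch below is a simplified model of that behaviour, not Prometheus's implementation; the class name and the use of plain timestamps in seconds are assumptions for illustration.

```python
class PendingAlert:
    """Tracks one alert through the states inactive -> pending -> firing."""

    def __init__(self, threshold, for_seconds):
        self.threshold = threshold
        self.for_seconds = for_seconds
        self.active_since = None  # time at which the condition first became true

    def update(self, value, now):
        """Feed one evaluation result and return the resulting state."""
        if value <= self.threshold:
            # Any evaluation below the threshold resets the timer
            self.active_since = None
            return "inactive"
        if self.active_since is None:
            self.active_since = now
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"
```

With `PendingAlert(80, 300)`, a CPU reading above 80 has to persist across evaluations for 300 seconds before the state becomes `firing`, which is why a short spike never reaches the notification stage.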

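Automated Response Mechanisms

Response Strategy Categories

Preventive Responses

• Resource pre-allocation
• Load balancing adjustments
• Cache warm-up
• Connection pool expansion

Remedial Responses

• Service restart
• Process cleanup
• Temporary file cleanup
• Log rotation

Scaling Responses

• Automatic scale-out
• Resource migration
• Traffic diversion
• Backup activation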

Automation Script Implementation

Automatic Service Restart Script

#!/bin/bash
# auto_restart_service.sh

SERVICE_NAME="$1"
LOG_FILE="/var/log/auto_restart.log"
MAX_RESTART_COUNT=3
RESTART_INTERVAL=60
# DingTalk robot webhook; replace YOUR_TOKEN with the robot's access token
DINGTALK_WEBHOOK="https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"

if [ -z "$SERVICE_NAME" ]; then
    echo "Usage: $0 <service-name>" >&2
    exit 1
fi

# Check whether the service is active
check_service_status() {
    systemctl is-active --quiet "$SERVICE_NAME"
    return $?
}

# Append a timestamped line to the log
log_message() {
    echo "$(date '+%Y-%m-%d %H:%M:%S') - $1" >> "$LOG_FILE"
}

# Send notifications
send_notification() {
    local message="$1"
    local severity="$2"

    # Email notification (the recipient address is redacted in the source; substitute a real one)
    echo "[$severity] $message" | mail -s "Service Alert: $SERVICE_NAME" "[email protected]"

    # DingTalk notification
    curl -X POST "$DINGTALK_WEBHOOK" \
        -H 'Content-Type: application/json' \
        -d "{\"msgtype\": \"text\", \"text\": {\"content\": \"[$severity] $message\"}}"
}

# Main restart logic
main() {
    local restart_count=0

    while [ $restart_count -lt $MAX_RESTART_COUNT ]; do
        if check_service_status; then
            log_message "Service $SERVICE_NAME is running normally"
            exit 0
        else
            restart_count=$((restart_count + 1))
            log_message "Attempting to restart $SERVICE_NAME (attempt $restart_count/$MAX_RESTART_COUNT)"
            systemctl restart "$SERVICE_NAME"
            # Give the service time to come up before checking again
            sleep "$RESTART_INTERVAL"

            if check_service_status; then
                log_message "Successfully restarted $SERVICE_NAME"
                send_notification "Service $SERVICE_NAME has been successfully restarted" "INFO"
                exit 0
            fi
        fi
    done

    log_message "Failed to restart $SERVICE_NAME after $MAX_RESTART_COUNT attempts"
    send_notification "CRITICAL: Failed to restart $SERVICE_NAME after $MAX_RESTART_COUNT attempts" "CRITICAL"
    exit 1
}

main "$@"

Automatic Disk Space Cleanup Script

#!/bin/bash
# disk_cleanup.sh

CLEANUP_PATHS=("/var/log" "/tmp" "/var/tmp" "/var/cache")
LOG_RETENTION_DAYS=7
TEMP_FILE_AGE=7

# Delete log files older than the retention period
cleanup_logs() {
    local log_path="$1"
    find "$log_path" -name "*.log" -type f -mtime +"$LOG_RETENTION_DAYS" -delete
    find "$log_path" -name "*.log.*" -type f -mtime +"$LOG_RETENTION_DAYS" -delete
}

# Delete old temporary files and empty subdirectories
cleanup_temp() {
    local temp_path="$1"
    find "$temp_path" -type f -mtime +"$TEMP_FILE_AGE" -delete
    # -mindepth 1 keeps find from removing the top-level directory itself
    find "$temp_path" -mindepth 1 -type d -empty -delete
}

# Clear caches
cleanup_cache() {
    # Clear the package manager cache
    if command -v apt-get &> /dev/null; then
        apt-get clean
    elif command -v yum &> /dev/null; then
        yum clean all
    fi

    # Drop the kernel page cache, dentries, and inodes (this frees memory, not disk space)
    sync && echo 3 > /proc/sys/vm/drop_caches
}

# Main cleanup routine
main() {
    echo "Starting disk cleanup process..."

    for path in "${CLEANUP_PATHS[@]}"; do
        if [ -d "$path" ]; then
            echo "Cleaning up $path..."
            case "$path" in
                "/var/log")
                    cleanup_logs "$path"
                    ;;
                "/tmp"|"/var/tmp")
                    cleanup_temp "$path"
                    ;;
                "/var/cache")
                    cleanup_cache
                    ;;
            esac
        fi
    done

    echo "Disk cleanup completed"
}

main "$@"

Alert Manager Configuration

Alertmanager Configuration

# /etc/alertmanager/alertmanager.yml
# Email addresses below are redacted in the source; substitute real ones.
global:
  smtp_smarthost: 'localhost:587'
  smtp_from: '[email protected]'

templates:
  - '/etc/alertmanager/templates/*.tmpl'

route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 10s
  repeat_interval: 1h
  receiver: 'default'
  routes:
    - match:
        severity: critical
      receiver: 'critical-alerts'
    - match:
        severity: warning
      receiver: 'warning-alerts'

receivers:
  - name: 'default'
    email_configs:
      - to: '[email protected]'
        # email_configs has no subject/body keys: the subject goes under headers, the body under text
        headers:
          Subject: 'Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}

  - name: 'critical-alerts'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'CRITICAL Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          CRITICAL Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
    webhook_configs:
      # 9093 is also Alertmanager's default listen port; if the webhook handler
      # runs on the same host, one of the two has to use a different port
      - url: 'http://localhost:9093/webhook'
        send_resolved: true

  - name: 'warning-alerts'
    email_configs:
      - to: '[email protected]'
        headers:
          Subject: 'WARNING Alert: {{ .GroupLabels.alertname }}'
        text: |
          {{ range .Alerts }}
          WARNING Alert: {{ .Annotations.summary }}
          Description: {{ .Annotations.description }}
          {{ end }}
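
To make the routing tree in this configuration concrete, the sketch below models how an alert's labels select a receiver: child routes are tried in order, the first match wins, and anything unmatched falls back to the top-level receiver. It is a simplified illustration that ignores options such as `continue` and nested routes; `ROUTES` and `pick_receiver` are hypothetical names.

```python
# Child routes in the order they appear under route.routes
ROUTES = [
    ({"severity": "critical"}, "critical-alerts"),
    ({"severity": "warning"}, "warning-alerts"),
]
DEFAULT_RECEIVER = "default"


def pick_receiver(labels):
    """Return the receiver name for an alert with the given labels."""
    for match, receiver in ROUTES:
        # A route matches only if every label it names has the expected value
        if all(labels.get(key) == value for key, value in match.items()):
            return receiver
    return DEFAULT_RECEIVER
```

An alert labelled `severity: critical` therefore reaches both the email and the webhook of `critical-alerts`, while an alert with no `severity` label only produces the default email.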

Third-Party Tool Integration

Webhook Handler

Create a webhook handler to trigger automated responses:

#!/usr/bin/env python3
# webhook_handler.py
import logging
import subprocess

from flask import Flask, jsonify, request

app = Flask(__name__)
logging.basicConfig(level=logging.INFO)


def handle_high_cpu(alert_data):
    """Handle a high CPU usage alert."""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling high CPU usage for {instance}")
    # Run the CPU optimization script
    subprocess.run(['/usr/local/bin/cpu_optimization.sh', instance])
    return {"status": "success", "action": "cpu_optimization"}


def handle_high_memory(alert_data):
    """Handle a high memory usage alert."""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling high memory usage for {instance}")
    # Run the memory cleanup script
    subprocess.run(['/usr/local/bin/memory_cleanup.sh', instance])
    return {"status": "success", "action": "memory_cleanup"}


def handle_disk_space_low(alert_data):
    """Handle a low disk space alert."""
    instance = alert_data.get('labels', {}).get('instance', '')
    logging.info(f"Handling low disk space for {instance}")
    # Run the disk cleanup script
    subprocess.run(['/usr/local/bin/disk_cleanup.sh'])
    return {"status": "success", "action": "disk_cleanup"}


def handle_service_down(alert_data):
    """Handle a service down alert."""
    instance = alert_data.get('labels', {}).get('instance', '')
    job = alert_data.get('labels', {}).get('job', '')
    logging.info(f"Handling service down for {job} on {instance}")
    # Run the restart script; this assumes the job label equals the systemd unit name
    subprocess.run(['/usr/local/bin/auto_restart_service.sh', job])
    return {"status": "success", "action": "service_restart"}


# Map alert names to handler functions (defined above, so no globals() lookup is needed)
AUTOMATION_MAPPING = {
    'HighCPUUsage': handle_high_cpu,
    'HighMemoryUsage': handle_high_memory,
    'DiskSpaceLow': handle_disk_space_low,
    'ServiceDown': handle_service_down,
}


@app.route('/webhook', methods=['POST'])
def webhook():
    """Handle an Alertmanager webhook notification."""
    try:
        data = request.get_json(silent=True) or {}
        responses = []

        for alert in data.get('alerts', []):
            # With send_resolved enabled Alertmanager also delivers resolved alerts; act only on firing ones
            if alert.get('status') != 'firing':
                continue
            alert_name = alert.get('labels', {}).get('alertname', '')
            handler_func = AUTOMATION_MAPPING.get(alert_name)
            if handler_func is not None:
                responses.append(handler_func(alert))
            else:
                logging.warning(f"No handler found for alert: {alert_name}")

        return jsonify({"responses": responses})
    except Exception as e:
        logging.error(f"Error processing webhook: {str(e)}")
        return jsonify({"error": str(e)}), 500


if __name__ == '__main__':
    # 9093 is also Alertmanager's default port; change one of them if both run on the same host
    app.run(host='0.0.0.0', port=9093)

DingTalk Integration

#!/bin/bash
# send_dingtalk_alert.sh

WEBHOOK_URL="https://oapi.dingtalk.com/robot/send?access_token=YOUR_TOKEN"
ALERT_TYPE="$1"
ALERT_MESSAGE="$2"
INSTANCE="$3"

# Pick a color based on the alert type
case "$ALERT_TYPE" in
    "CRITICAL")
        COLOR="red"
        ;;
    "WARNING")
        COLOR="yellow"
        ;;
    "INFO")
        COLOR="green"
        ;;
    *)
        COLOR="blue"
        ;;
esac

# Build the message
MESSAGE=$(cat <<EOF
{
    "msgtype": "markdown",
    "markdown": {
        "title": "System Alert Notification",
        "text": "## System Alert Notification\n\n**Severity**: <font color='$COLOR'>$ALERT_TYPE</font>\n\n**Instance**: $INSTANCE\n\n**Details**: $ALERT_MESSAGE\n\n**Time**: $(date '+%Y-%m-%d %H:%M:%S')\n\nPlease address the issue promptly."
    }
}
EOF
)

# Send the message
curl -X POST "$WEBHOOK_URL" \
    -H 'Content-Type: application/json' \
    -d "$MESSAGE"

Monitoring Data Visualization

Grafana Dashboard Configuration

Create the JSON configuration for a system monitoring dashboard:

{
  "dashboard": {
    "title": "Linux System Monitoring",
    "panels": [
      {
        "title": "CPU Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "100 - (avg by(instance) (irate(node_cpu_seconds_total{mode=\"idle\"}[5m])) * 100)"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                {"color": "green", "value": 0},
                {"color": "yellow", "value": 70},
                {"color": "red", "value": 90}
              ]
            }
          }
        }
      },
      {
        "title": "Memory Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "(node_memory_MemTotal_bytes - node_memory_MemAvailable_bytes) / node_memory_MemTotal_bytes * 100"
          }
        ]
      },
      {
        "title": "Disk Usage",
        "type": "stat",
        "targets": [
          {
            "expr": "(node_filesystem_size_bytes{fstype!=\"tmpfs\"} - node_filesystem_avail_bytes{fstype!=\"tmpfs\"}) / node_filesystem_size_bytes{fstype!=\"tmpfs\"} * 100"
          }
        ]
      }
    ]
  }
}

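Performance Optimization and Tuning

Monitoring Performance Optimization

Data Collection Optimization

• Adjust the scrape interval to balance accuracy against overhead
• Use data compression to reduce storage space
• Enforce a data retention policy
• Optimize query performance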

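Alert Optimization

• Set reasonable alert thresholds to avoid false positives
• Implement alert inhibition
• Configure alert grouping rules
• Review and adjust the alerting strategy regularly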
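
One common way to keep a threshold from producing repeated alerts when a metric hovers around it is hysteresis: raise the alert at one level and clear it only at a lower one. This technique is not described in the list above; it is offered as one possible sketch, with a hypothetical class name and thresholds.

```python
class HysteresisAlert:
    """Alert that raises at raise_at and clears only once the value drops to clear_at."""

    def __init__(self, raise_at, clear_at):
        if clear_at >= raise_at:
            raise ValueError("clear_at must be lower than raise_at")
        self.raise_at = raise_at
        self.clear_at = clear_at
        self.active = False

    def update(self, value):
        """Feed one sample and return whether the alert is active afterwards."""
        if not self.active and value >= self.raise_at:
            self.active = True
        elif self.active and value <= self.clear_at:
            self.active = False
        return self.active
```

With `HysteresisAlert(raise_at=85, clear_at=75)`, memory usage bouncing between 84% and 86% yields a single alert instead of one per crossing.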

System Resource Optimization

Memory Management

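#!/bin/bash
# memory_optimization.sh
# Memory tuning script (run as root)

# Drop the page cache
sync && echo 1 > /proc/sys/vm/drop_caches

# Lower the kernel's tendency to swap
echo 10 > /proc/sys/vm/swappiness

# Always allow memory overcommit; this changes allocation policy rather than reclaim,
# and it can make out-of-memory kills more likely under pressure
echo 1 > /proc/sys/vm/overcommit_memory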

Disk I/O Optimization

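#!/bin/bash
# disk_io_optimization.sh
# Disk I/O tuning script (run as root)

# Set the I/O scheduler: "noop" on legacy kernels, "none" on multi-queue (blk-mq) kernels
SCHEDULER_FILE=/sys/block/sda/queue/scheduler
if grep -qw noop "$SCHEDULER_FILE"; then
    echo noop > "$SCHEDULER_FILE"
else
    echo none > "$SCHEDULER_FILE"
fi

# Stop recording access times on the root file system (noatime already implies nodiratime)
mount -o remount,noatime,nodiratime /

# Adjust the disk queue depth
echo 32 > /sys/block/sda/queue/nr_requests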

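Fault Handling and Recovery

Handling by Fault Category

Hardware Faults

• Automatic failover on disk failure
• Automatic recovery from network failure
• Isolation of faulty memory

Software Faults

• Automatic restart of failed processes
• Service dependency checks
• Automatic restoration of configuration files

Network Faults

• Automatic retry of network connections
• Automatic load balancer failover
• Handling of DNS resolution failures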
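
Automatic connection retry is usually paired with a growing delay so that a struggling dependency is not hammered. The sketch below shows one way to do this with exponential backoff; the function name, defaults, and the choice to retry only on `OSError` are assumptions for illustration.

```python
import time


def retry_with_backoff(operation, attempts=5, base_delay=1.0, max_delay=30.0, sleep=time.sleep):
    """Call operation() until it succeeds, doubling the delay after each failure."""
    delay = base_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError:
            # ConnectionError and socket timeouts are subclasses of OSError
            if attempt == attempts:
                raise
            sleep(delay)
            delay = min(delay * 2, max_delay)
```

The `sleep` parameter exists so the waiting can be replaced in tests; in production the default `time.sleep` applies. Adding random jitter to each delay is a common refinement when many clients retry at once.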

Recovery Strategy Implementation

#!/bin/bash
# disaster_recovery.sh

BACKUP_DIR="/opt/backups"
CONFIG_BACKUP="$BACKUP_DIR/configs"
DATA_BACKUP="$BACKUP_DIR/data"

# Restore configuration files
restore_configs() {
    echo "Restoring configuration files..."

    # Restore system configuration (-a preserves ownership, permissions, and timestamps)
    cp -a "$CONFIG_BACKUP"/etc/. /etc/

    # Reload unit files
    systemctl daemon-reload

    # Restart the affected services
    systemctl restart nginx
    systemctl restart mysql
    systemctl restart redis
}

# Restore data
restore_data() {
    echo "Restoring data..."

    # Restore the database (-p prompts for the password, so this step is interactive)
    mysql -u root -p < "$DATA_BACKUP/mysql_backup.sql"

    # Restore file data
    rsync -av "$DATA_BACKUP/files/" /var/www/html/
}

# System health check
health_check() {
    echo "Performing health check..."

    # Check service status
    systemctl status nginx
    systemctl status mysql
    systemctl status redis

    # Check listening ports
    netstat -tuln | grep :80
    netstat -tuln | grep :3306
    netstat -tuln | grep :6379
}

# Main recovery flow
main() {
    echo "Starting disaster recovery process..."

    restore_configs
    restore_data
    health_check

    echo "Disaster recovery completed"
}

main "$@"