A Complete Guide to the EFK Log Management Solution
1. Overview
1.1 EFK Architecture Components
Elasticsearch:
• A full-text search engine built on Lucene
• Supports horizontal scaling and high availability
• Provides powerful aggregation queries
Fluentd:
• A unified log collection layer
• Supports many data sources and output targets
• Backed by a rich plugin ecosystem
Kibana:
• An intuitive web interface
• A wide range of chart types
• Dashboards and alerting
1.2 Technical Advantages
• Unified log management: centrally collect and manage logs from distributed systems
• Real-time analysis: near real-time log search and analysis
• Visualization: intuitive charts and dashboards
• High availability: cluster deployment and failover
• Scalability: scales flexibly with business needs
2. System Architecture Design
2.1 Overall Architecture
Application services → Fluentd Agent → Kafka/Redis → Fluentd Aggregator → Elasticsearch → Kibana
2.2 Architecture Layers
Log sources:
• Application logs
• System logs
• Container logs
• Network device logs
Collection layer:
• Fluentd Agents deployed on every node
• Local log collection and initial filtering
• Support for many input formats
Buffering layer:
• Kafka as the message queue
• Redis as a cache layer
• Buffers data and smooths out traffic peaks
Aggregation layer:
• Centralized processing by Fluentd Aggregators
• Data cleansing and formatting
• Routing and distribution logic
Storage layer:
• Elasticsearch cluster
• Index management and data sharding
• Data backup and recovery
Visualization layer:
• Kibana web interface
• Custom dashboards
• Alerting and notifications
2.3 High-Availability Design
Elasticsearch:
• Multi-node cluster deployment
• Separation of master and data nodes
• Replica configuration
• Automatic failover
Fluentd:
• Multiple instances
• Load balancing
• Failure detection and recovery
• Retry on delivery failure
Kibana:
• Multiple instances
• A load balancer in front
• Session affinity
3. Environment Preparation and Deployment
3.1 System Requirements
Hardware:
• CPU: 8 cores or more
• Memory: 16 GB or more
• Disk: SSD, 500 GB or more
• Network: gigabit
Software:
• OS: CentOS 7/8, Ubuntu 18.04/20.04
• Java: OpenJDK 11 or later
• Docker: 19.03 or later (optional)
• Kubernetes: 1.18 or later (optional)
3.2 Deploying Elasticsearch
# Download and install
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.15.0-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.15.0-linux-x86_64.tar.gz
cd elasticsearch-7.15.0/
# Configuration file config/elasticsearch.yml
cluster.name: efk-cluster
node.name: es-node-1
network.host: 0.0.0.0
http.port: 9200
discovery.type: single-node
# Start the service
./bin/elasticsearch
# Master node configuration
cluster.name: efk-cluster
node.name: es-master-1
node.master: true
node.data: false
network.host: 192.168.1.10
http.port: 9200
transport.port: 9300
discovery.seed_hosts: ["192.168.1.10", "192.168.1.11", "192.168.1.12"]
cluster.initial_master_nodes: ["es-master-1", "es-master-2", "es-master-3"]
# Data node configuration
cluster.name: efk-cluster
node.name: es-data-1
node.master: false
node.data: true
network.host: 192.168.1.20
http.port: 9200
transport.port: 9300
discovery.seed_hosts: ["192.168.1.10", "192.168.1.11", "192.168.1.12"]
# JVM settings (config/jvm.options); keep the heap at no more than
# 50% of RAM and below ~31 GB so compressed oops stay enabled
-Xms8g
-Xmx8g
-XX:+UseG1GC
-XX:G1HeapRegionSize=32m
# Note: -XX:+UseCGroupMemoryLimitForHeap was removed in JDK 10
# and must not be used with OpenJDK 11
# Kernel parameters (/etc/sysctl.conf)
vm.max_map_count=262144
fs.file-max=65536
# User limits (/etc/security/limits.conf)
elasticsearch soft nofile 65536
elasticsearch hard nofile 65536
elasticsearch soft nproc 4096
elasticsearch hard nproc 4096
3.3 Deploying Fluentd
# Install with the official script
curl -fsSL https://toolbelt.treasuredata.com/sh/install-redhat-td-agent4.sh | sh
# Or install with gem
gem install fluentd
# Install the required plugins
fluent-gem install fluent-plugin-elasticsearch
fluent-gem install fluent-plugin-kubernetes_metadata_filter
fluent-gem install fluent-plugin-rewrite-tag-filter
# fluent.conf
<source>
@type tail
path /var/log/app/*.log
pos_file /var/log/fluentd/app.log.pos
tag app.logs
format json
read_from_head true
refresh_interval 10
</source>
<filter app.logs>
@type record_transformer
<record>
hostname ${hostname}
timestamp ${time}
env production
</record>
</filter>
<match app.logs>
@type elasticsearch
host 192.168.1.10
port 9200
logstash_format true
logstash_prefix app-logs
logstash_dateformat %Y.%m.%d
reload_connections false
reconnect_on_error true
reload_on_failure true
<buffer>
flush_mode interval
flush_interval 5s
retry_max_interval 30
retry_forever true
</buffer>
</match>
# docker-compose.yml
version: '3.8'
services:
  fluentd:
    image: fluent/fluentd:v1.14-1
    container_name: fluentd
    volumes:
      - ./fluent.conf:/fluentd/etc/fluent.conf
      - /var/log:/var/log
    ports:
      - "24224:24224"
    environment:
      - FLUENTD_CONF=fluent.conf
3.4 Deploying Kibana
# Download and install
wget https://artifacts.elastic.co/downloads/kibana/kibana-7.15.0-linux-x86_64.tar.gz
tar -xzf kibana-7.15.0-linux-x86_64.tar.gz
cd kibana-7.15.0-linux-x86_64/
# Configuration file config/kibana.yml
server.port: 5601
server.host: "0.0.0.0"
elasticsearch.hosts: ["http://192.168.1.10:9200"]
kibana.index: ".kibana"
logging.dest: /var/log/kibana.log
# docker-compose.yml
version: '3.8'
services:
  kibana:
    image: kibana:7.15.0
    container_name: kibana
    environment:
      - ELASTICSEARCH_HOSTS=http://elasticsearch:9200
    ports:
      - "5601:5601"
    depends_on:
      - elasticsearch
4. Log Collection Strategies
4.1 Application Log Collection
<source>
@type tail
path /var/log/nginx/access.log
pos_file /var/log/fluentd/nginx.pos
tag nginx.access
format nginx
read_from_head true
</source>
<filter nginx.access>
@type parser
key_name message
reserve_data true
<parse>
@type regexp
expression /^(?<remote>[^ ]*) (?<host>[^ ]*) (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^\"]*?)(?: +\S*)?)?" (?<code>[^ ]*) (?<size>[^ ]*)(?: "(?<referer>[^\"]*)" "(?<agent>[^\"]*)")?$/
time_format %d/%b/%Y:%H:%M:%S %z
</parse>
</filter>
<source>
@type tail
path /var/log/containers/*.log
pos_file /var/log/fluentd/containers.log.pos
tag kubernetes.*
read_from_head true
<parse>
@type json
time_format %Y-%m-%dT%H:%M:%S.%NZ
</parse>
</source>
<filter kubernetes.**>
@type kubernetes_metadata
@id filter_kube_metadata
kubernetes_url "https://#{ENV['KUBERNETES_SERVICE_HOST']}:#{ENV['KUBERNETES_SERVICE_PORT_HTTPS']}"
verify_ssl true
ca_file /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
bearer_token_file /var/run/secrets/kubernetes.io/serviceaccount/token
</filter>
4.2 System Log Collection
<source>
@type syslog
port 514
bind 0.0.0.0
tag system.syslog
</source>
<filter system.syslog>
@type record_transformer
<record>
source_type syslog
hostname ${hostname}
</record>
</filter>
<source>
@type exec
command /usr/bin/vmstat 1 1 | tail -1
format tsv
keys r,b,swpd,free,buff,cache,si,so,bi,bo,in,cs,us,sy,id,wa,st
tag system.vmstat
run_interval 60s
</source>
4.3 Log Formatting and Parsing
<filter app.**>
@type parser
key_name message
reserve_data true
<parse>
@type json
</parse>
</filter>
<filter apache.**>
@type parser
key_name message
<parse>
@type regexp
expression /^(?<host>[^ ]*) [^ ]* (?<user>[^ ]*) \[(?<time>[^\]]*)\] "(?<method>\S+)(?: +(?<path>[^ ]*) +\S*)?" (?<code>[^ ]*) (?<size>[^ ]*)/
time_format %d/%b/%Y:%H:%M:%S %z
</parse>
</filter>
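Parser regexes like the Apache pattern above can be smoke-tested outside Fluentd. A minimal sketch using `grep -E` with an equivalent unnamed-group pattern (grep does not support the `(?<name>...)` named groups Fluentd uses; the sample log line is hypothetical):

```shell
# A hypothetical Apache combined-style access log line
line='192.168.1.100 - frank [10/Oct/2023:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326'
# Same shape as the Fluentd regexp, with the named groups removed:
# host, ident, user, [time], "METHOD path", status, size
pattern='^[^ ]+ [^ ]+ [^ ]+ \[[^]]+\] "[A-Z]+ [^"]+" [0-9]{3} [0-9]+'
if printf '%s\n' "$line" | grep -Eq "$pattern"; then
  echo "matched"
else
  echo "no match"
fi
```

Running parser changes through a check like this before deploying avoids silently unparsed records ending up under the fallback tag.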
5. Index Management and Optimization
5.1 Index Strategy
Index template (applied via PUT _index_template/app-logs):
{
"index_patterns":["app-logs-*"],
"template":{
"settings":{
"number_of_shards":3,
"number_of_replicas":1,
"refresh_interval":"30s",
"index.codec":"best_compression"
},
"mappings":{
"properties":{
"@timestamp":{
"type":"date"
},
"level":{
"type":"keyword"
},
"message":{
"type":"text",
"analyzer":"standard"
},
"hostname":{
"type":"keyword"
}
}
}
}
}
ILM policy (applied via PUT _ilm/policy/app-logs-policy):
{
"policy":{
"phases":{
"hot":{
"actions":{
"rollover":{
"max_size":"50gb",
"max_age":"7d"
}
}
},
"warm":{
"min_age":"7d",
"actions":{
"allocate":{
"number_of_replicas":0
}
}
},
"cold":{
"min_age":"30d",
"actions":{
"allocate":{
"number_of_replicas":0
}
}
},
"delete":{
"min_age":"90d"
}
}
}
}
5.2 Performance Optimization
Shard planning:
• Keep each shard between 10 GB and 50 GB
• Number of shards ≈ number of data nodes × 1-3
• Set the replica count based on availability requirements
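A quick back-of-the-envelope check of the sizing rule above, sketched in shell; the daily volume, retention, and target shard size are illustrative numbers, not figures from this guide:

```shell
# Estimate the primary shard count for one index generation:
# total data / target shard size, rounded up
daily_gb=20            # hypothetical daily log volume
retention_days=7       # hypothetical hot retention
target_shard_gb=30     # within the 10-50 GB guideline
total_gb=$((daily_gb * retention_days))
shards=$(( (total_gb + target_shard_gb - 1) / target_shard_gb ))  # ceiling division
echo "shards=$shards"
```

The result should then be sanity-checked against the "data nodes × 1-3" rule so every node carries a comparable share of the indexing load.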
{
"query":{
"bool":{
"must":[
{
"range":{
"@timestamp":{
"gte":"now-1h",
"lte":"now"
}
}
}
],
"filter":[
{
"term":{
"level":"ERROR"
}
}
]
}
},
"sort":[
{
"@timestamp":{
"order":"desc"
}
}
]
}
5.3 Storage Optimization
# Per-index settings (set via the index template or the _settings API)
index.codec: best_compression
index.mapping.total_fields.limit: 1000
index.refresh_interval: 30s
index.translog.flush_threshold_size: 512mb
# Node-level settings (elasticsearch.yml)
indices.memory.index_buffer_size: 30%
indices.memory.min_index_buffer_size: 48mb
indices.fielddata.cache.size: 20%
indices.queries.cache.size: 10%
6. Monitoring and Alerting
6.1 System Monitoring
# Cluster health
curl -X GET "localhost:9200/_cluster/health?pretty"
# Node statistics
curl -X GET "localhost:9200/_nodes/stats?pretty"
# Index statistics
curl -X GET "localhost:9200/_stats?pretty"
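The health endpoint returns a JSON body. A minimal sketch of pulling the `status` field out of a saved response with `sed`, for scripts that must run on hosts without `jq` (the sample response body is illustrative):

```shell
# A sample body as returned by GET _cluster/health
health='{"cluster_name":"efk-cluster","status":"green","number_of_nodes":3}'
# Extract the value of the "status" key without jq
status=$(printf '%s' "$health" | sed -E 's/.*"status":"([a-z]+)".*/\1/')
echo "$status"
```

In a live check the `health` variable would instead be filled by `curl -s "localhost:9200/_cluster/health"`; anything other than `green` (or `yellow`, depending on replica policy) should raise an alert.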
<system>
<log>
format json
time_format %Y-%m-%dT%H:%M:%S.%NZ
</log>
</system>
<source>
@type monitor_agent
bind 0.0.0.0
port 24220
</source>
6.2 Alerting Configuration
An X-Pack Watcher definition that fires when more than 10 ERROR entries arrive within five minutes:
{
"trigger":{
"schedule":{
"interval":"1m"
}
},
"input":{
"search":{
"request":{
"index":"app-logs-*",
"body":{
"query":{
"bool":{
"must":[
{
"range":{
"@timestamp":{
"gte":"now-5m"
}
}
},
{
"term":{
"level":"ERROR"
}
}
]
}
}
}
}
}
},
"condition":{
"compare":{
"ctx.payload.hits.total":{
"gt":10
}
}
},
"actions":{
"send_email":{
"email":{
"to":"[email protected]",
"subject":"Error Alert",
"body":"Found {{ctx.payload.hits.total}} errors in the last 5 minutes"
}
}
}
}
6.3 Performance Monitoring
Key metrics:
• Indexing rate (docs/sec)
• Query latency (ms)
• JVM heap usage
• Disk usage
• Network I/O
#!/bin/bash
# es_monitor.sh
ES_HOST="localhost:9200"
THRESHOLD_HEAP=80
THRESHOLD_DISK=85
# Highest heap usage across all nodes
HEAP_USAGE=$(curl -s "http://${ES_HOST}/_nodes/stats/jvm" | jq '.nodes | to_entries[] | .value.jvm.mem.heap_used_percent' | sort -n | tail -1)
if (( $(echo "$HEAP_USAGE > $THRESHOLD_HEAP" | bc -l) )); then
echo "WARNING: Heap usage is ${HEAP_USAGE}%"
fi
# Highest disk usage across all nodes (used percentage, i.e. 1 - available/total)
DISK_USAGE=$(curl -s "http://${ES_HOST}/_nodes/stats/fs" | jq '.nodes | to_entries[] | (1 - .value.fs.total.available_in_bytes / .value.fs.total.total_in_bytes) * 100' | sort -n | tail -1)
if (( $(echo "$DISK_USAGE > $THRESHOLD_DISK" | bc -l) )); then
echo "WARNING: Disk usage is ${DISK_USAGE}%"
fi
7. Security Configuration
7.1 Access Control
# elasticsearch.yml
xpack.security.enabled: true
xpack.security.transport.ssl.enabled: true
xpack.security.transport.ssl.verification_mode: certificate
xpack.security.transport.ssl.keystore.path: elastic-certificates.p12
xpack.security.transport.ssl.truststore.path: elastic-certificates.p12
# Create a user in the file realm
bin/elasticsearch-users useradd kibana_user -p password123 -r kibana_system
# Create a role via the security API (the elasticsearch-users tool manages users, not roles)
curl -u elastic -X POST "localhost:9200/_security/role/log_reader" -H 'Content-Type: application/json' -d'
{
"indices": [{ "names": ["app-logs-*"], "privileges": ["read"] }]
}'
7.2 Network Security
# Open the required ports
firewall-cmd --permanent --add-port=9200/tcp
firewall-cmd --permanent --add-port=9300/tcp
firewall-cmd --permanent --add-port=5601/tcp
firewall-cmd --permanent --add-port=24224/tcp
firewall-cmd --reload
# kibana.yml
elasticsearch.hosts: ["https://localhost:9200"]
elasticsearch.ssl.certificateAuthorities: ["/path/to/ca.crt"]
elasticsearch.ssl.certificate: "/path/to/client.crt"
elasticsearch.ssl.key: "/path/to/client.key"
7.3 Data Encryption
<match **>
@type elasticsearch
scheme https
ssl_verify true
ca_file /path/to/ca.crt
client_cert /path/to/client.crt
client_key /path/to/client.key
</match>
# kibana.yml (xpack.security.encryptionKey is a Kibana setting, not an
# elasticsearch.yml setting; it must be at least 32 characters long)
xpack.security.encryptionKey: "your-encryption-key-at-least-32-chars"
xpack.encryptedSavedObjects.encryptionKey: "another-key-at-least-32-chars-long"
8. Troubleshooting
8.1 Common Problems
# Elasticsearch will not start: check its log
tail -f /var/log/elasticsearch/elasticsearch.log
# Inspect the JVM settings of the running process
ps aux | grep elasticsearch
# Fluentd: validate the configuration file
fluentd --config /etc/fluentd/fluent.conf --dry-run
# Check file permissions on the log directory
ls -la /var/log/app/
# Check network connectivity to Elasticsearch
telnet elasticsearch-host 9200
8.2 Performance Problems
# Enable the search slow log (a per-index setting, not a cluster setting)
curl -X PUT "localhost:9200/app-logs-*/_settings" -H 'Content-Type: application/json' -d'
{
"index.search.slowlog.threshold.query.warn": "10s",
"index.search.slowlog.threshold.query.info": "5s"
}'
# Enlarge the indexing buffer: indices.memory.index_buffer_size is a
# static node setting, so set it in elasticsearch.yml and restart the node
indices.memory.index_buffer_size: 30%
8.3 Data Recovery
# Register a snapshot repository
curl -X PUT "localhost:9200/_snapshot/backup" -H 'Content-Type: application/json' -d'
{
"type": "fs",
"settings": {
"location": "/backup/elasticsearch"
}
}'
# Take a snapshot
curl -X PUT "localhost:9200/_snapshot/backup/snapshot_1?wait_for_completion=true"
# Restore a snapshot
curl -X POST "localhost:9200/_snapshot/backup/snapshot_1/_restore"
9. Best Practices
9.1 Architecture Design
Layered design:
• Collection layer: lightweight agents
• Aggregation layer: central processing nodes
• Storage layer: a dedicated storage cluster
• Presentation layer: the visualization interface
Capacity planning:
• Log volume: how much log data is produced per day
• Retention: how long data must be kept
• Query load: the number of concurrent queries
• Growth: projected business growth
9.2 Configuration Optimization
# Production settings (elasticsearch.yml); note that
# discovery.zen.minimum_master_nodes was removed in 7.x and is ignored
bootstrap.memory_lock: true
indices.memory.index_buffer_size: 30%
thread_pool.write.queue_size: 1000
gateway.recover_after_nodes: 2
# fluent.conf
<system>
workers 4
root_dir /var/log/fluentd
</system>
# Buffer settings (placed inside a <match> block)
<buffer>
@type file
path /var/log/fluentd/buffer
flush_mode interval
flush_interval 5s
chunk_limit_size 2MB
queue_limit_length 32
</buffer>
9.3 Operational Conventions
Use a structured log format:
{
"timestamp":"2024-01-15T10:30:00.000Z",
"level":"INFO",
"service":"user-service",
"message":"User login successful",
"userId":"12345",
"ip":"192.168.1.100",
"traceId":"trace-12345"
}
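A quick shell sketch that checks a log line for the mandatory fields of the format above; the sample line is the one from this section, and treating these particular keys as mandatory is an assumption for illustration:

```shell
# One structured log line, as in the example above
log='{"timestamp":"2024-01-15T10:30:00.000Z","level":"INFO","service":"user-service","message":"User login successful","userId":"12345","ip":"192.168.1.100","traceId":"trace-12345"}'
missing=0
# Require the core fields every service should emit
for key in timestamp level service message traceId; do
  if ! printf '%s' "$log" | grep -q "\"$key\":"; then
    echo "missing field: $key"
    missing=1
  fi
done
[ "$missing" -eq 0 ] && echo "log line OK"
```

A check like this can run in CI against sample output from each service, so malformed log lines are caught before they reach the pipeline.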
Index naming conventions:
• Application logs: app-logs-YYYY.MM.DD
• System logs: system-logs-YYYY.MM.DD
• Access logs: access-logs-YYYY.MM.DD
• Error logs: error-logs-YYYY.MM.DD
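The date suffix in these names is what `logstash_format` with `logstash_dateformat %Y.%m.%d` produces in the Fluentd output; generating one by hand in shell:

```shell
prefix="app-logs"
# Same pattern as logstash_dateformat %Y.%m.%d, e.g. app-logs-2024.01.15
index_name="${prefix}-$(date +%Y.%m.%d)"
echo "$index_name"
```

Daily indices following this convention are what the `app-logs-*` index template pattern and the ILM rollover policy in section 5 operate on.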
9.4 Monitoring and Alerting
Monitoring targets:
• Cluster status: green/yellow/red
• Node status: online/offline
• Index health: shard states
• Query performance: response times
Alerting policy:
• Abnormal cluster status: alert immediately
• Low disk space: alert ahead of time
• High query latency: alert after a delay window
• Abnormal log volume: trend-based alert
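Most of the alerting rules above reduce to comparing a metric against a limit. A minimal shell helper in the style of the es_monitor.sh script from section 6.3; the function name and the sample values are illustrative:

```shell
# check_threshold VALUE LIMIT LABEL -> prints ALERT when VALUE exceeds LIMIT
check_threshold() {
  if [ "$1" -gt "$2" ]; then
    echo "ALERT: $3 at $1 (limit $2)"
  else
    echo "OK: $3 at $1"
  fi
}
check_threshold 92 85 "disk usage %"
check_threshold 60 80 "heap usage %"
```

Wired to a cron job and a notification channel, this covers the "alert ahead of time" rules; the trend-based rules need a history of samples rather than a single comparison.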
10. Summary
The EFK stack combines Elasticsearch, Fluentd, and Kibana into a complete log management pipeline: lightweight agents collect logs on every node, a Kafka/Redis buffering layer absorbs traffic peaks, aggregators clean and route the data, Elasticsearch indexes and stores it, and Kibana visualizes it. Applied together, the deployment, index management, security, and monitoring practices in this guide take the stack from a single-node setup to a production-grade, highly available cluster.