Choosing a Log Monitoring and Alerting Stack
Technology options
Elastic Stack
- Elasticsearch: log persistence
- Filebeat: log collection
- Kibana: log display
- ElastAlert: log alerting
- Elastic Observability APM: metrics monitoring (Java agent)
Prometheus stack
- Promtail: log collection
- Loki: log aggregation
- filesystem/Cassandra/S3/MinIO: log persistence
- Grafana: log display
- Alertmanager: log alerting
- Prometheus: metrics monitoring
- Spring + Actuator + Prometheus + MeterBinder: custom business metrics
Cloud vendor
- Tencent Cloud CLS: https://cloud.tencent.com/document/product/614/51741
- CLS integration with Feishu: https://cloud.tencent.com/document/product/614/66236
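Pros and cons of each option
Elastic Stack
- A mature log collection solution with plenty of real-world practice and documentation to draw on
- EFK is widely used for distributed log collection, so the learning curve is comparatively low
- APM monitoring documentation is comparatively sparse and takes some exploration; alerting requires an external component
- Cluster management becomes more difficult
Prometheus stack
- A natural strength in APM monitoring
- A very good fit for today's cloud-native environments
- Alerting on APM performance metrics is very complete and simple to configure
- Alerting on collected logs is comparatively weak and more complex to configure
Cloud vendor
- In production, the CLS service provided by Tencent Cloud can be used directly
- Free
- Supports Feishu integration
- Requires learning the CLS configuration documentation
- APM monitoring depends on other products from the cloud vendor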
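Implementation overview
Elastic Stack
- Use logback to split and aggregate logs by level
- Install Filebeat on the host machine
- Configure an Index Pattern in Kibana to match the indices and visualize the logs
- Configure ElastAlert
  - Install Python 3.11 or later, or use Anaconda
  - Edit config.yml to set the ES host and username/password
  - In the same config, set run_every (query interval), buffer_time (buffer window), the rules directory for alert rules, and so on
  - Write the rule.yaml alert rule
    - Choose a suitable alert type
    - Configure filter, following the ES query DSL
    - Error logs can be matched through the tags configured in Filebeat
    - Configure the collection recovery time
    - Choose a suitable alert channel; webhook, DingTalk, Jira and others are supported natively, and further channels can be added by writing Python code as described in the official documentation
  - Start ElastAlert
- APM monitoring (optional)
  - The Elastic Stack's built-in APM runs as a Java agent: download the required JAR from Maven Central and start it as described in the official documentation
- Custom metrics monitoring: initially based on Spring Actuator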
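The webhook alert channel mentioned above needs something on the receiving end. The following is a minimal sketch of such a receiver using only the Python standard library. It assumes the alerter posts one matched log document as JSON with fields such as `@timestamp`, `tags` and `message`; those field names, the helper `summarize_match`, and the choice of port 3000 with path `/alert` (taken from the `http_post_url` in this document's rule example) are assumptions of the sketch, not something ElastAlert requires.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def summarize_match(match: dict) -> str:
    """Build a one-line summary from a posted match document (hypothetical helper)."""
    timestamp = match.get("@timestamp", "unknown time")
    tags = ",".join(match.get("tags", []))
    # Keep only the first line so stack traces do not flood the notification.
    lines = str(match.get("message", "")).splitlines()
    first_line = lines[0] if lines else ""
    return f"[{timestamp}] tags={tags} {first_line}"


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/alert":
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            payload = json.loads(self.rfile.read(length) or b"{}")
        except json.JSONDecodeError:
            self.send_error(400, "invalid JSON")
            return
        # A real handler would forward this to a chat tool; the sketch only prints.
        print(summarize_match(payload))
        self.send_response(200)
        self.end_headers()


def serve(port: int = 3000) -> None:
    """Block forever, accepting alert posts on localhost."""
    HTTPServer(("localhost", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the receiver. Keeping the formatting in a plain function makes it easy to test without opening a socket.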
Reference configuration
filebeat.yml
filebeat.inputs:
- type: log
  enabled: true
  paths:
    - C:\Users\JimWu\Desktop\test_log\info/*.log
  multiline.pattern: '^20' # multiline match pattern
  multiline.negate: true # merge the lines that do NOT match the pattern
  multiline.match: after # append them to the end of the preceding matching line
  tags: ["demo","info"]
- type: log
  enabled: true
  paths:
    - C:\Users\JimWu\Desktop\test_log\*-error-*.log
  multiline.pattern: '^20' # multiline match pattern
  multiline.negate: true # merge the lines that do NOT match the pattern
  multiline.match: after # append them to the end of the preceding matching line
  tags: ["demo","error"]
output.elasticsearch:
  hosts: ["localhost:9200"]
  username: "elastic"
  password: "elastic"
  indices:
    - index: "demo-%{+yyyy.MM.dd}"
      when.contains:
        tags: "demo"
setup.ilm.enabled: false
setup.template.name: "demo-log"
setup.template.pattern: "demo-dev-*"
setup.template.overwrite: false
setup.template.settings:
  index.number_of_shards: 1
  index.number_of_replicas: 1
processors:
- script:
    lang: javascript
    id: my_filter
    tag: enable
    source: >
      function process(event) {
        var str = event.Get("message");
        var reg = /\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}.\d{3}/;
        // Skip events without a leading timestamp instead of throwing on a null match.
        var found = str.match(reg);
        if (found) {
          event.Put("log_time", found[0]);
        }
      }
- timestamp:
    field: log_time
    timezone: Asia/Shanghai
    layouts:
      - '2006-01-02 15:04:05'
      - '2006-01-02 15:04:05.999'
    test:
      - '2019-06-22 16:33:51'
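Before deploying the Filebeat configuration, it can help to check the multiline rule and the timestamp pattern against a few real log lines. The sketch below mirrors both in Python. It assumes log lines begin with a `yyyy-MM-dd HH:mm:ss.SSS` timestamp using a dot before the milliseconds; the function names and the sample line are hypothetical and only illustrate the idea.

```python
import re
from datetime import datetime
from typing import List, Optional

# Assumed timestamp shape: "yyyy-MM-dd HH:mm:ss.SSS" with a dot before the millis.
TIMESTAMP_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}")


def extract_log_time(message: str) -> Optional[datetime]:
    """Return the first timestamp found in the message, or None if there is none."""
    found = TIMESTAMP_RE.search(message)
    if found is None:
        return None
    return datetime.strptime(found.group(0), "%Y-%m-%d %H:%M:%S.%f")


def starts_new_event(line: str) -> bool:
    """Mirror multiline.pattern '^20': a line starting with '20' opens a new event."""
    return line.startswith("20")


def group_events(lines: List[str]) -> List[str]:
    """Mirror negate: true / match: after by appending non-matching lines to the previous event."""
    events: List[str] = []
    for line in lines:
        if starts_new_event(line) or not events:
            events.append(line)
        else:
            events[-1] += "\n" + line
    return events


if __name__ == "__main__":
    sample = [
        "2023-04-11 15:37:42.887 ERROR demo - boom",  # hypothetical log line
        "\tat com.example.Foo.bar(Foo.java:1)",
        "2023-04-11 15:37:43.001 INFO demo - ok",
    ]
    for event in group_events(sample):
        print(extract_log_time(event), repr(event))
```

A stack trace line that does not start with `20` is folded into the preceding event, which is the behavior the multiline settings are meant to produce.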
APM
java -javaagent:/path/to/elastic-apm-agent-<version>.jar \
-Delastic.apm.service_name=my-application \
-Delastic.apm.server_urls=http://localhost:8200 \
-Delastic.apm.secret_token= \
-Delastic.apm.environment=production \
-Delastic.apm.application_packages=org.example \
-jar my-application.jar
ElastAlert rule reference
# Alert when the rate of events exceeds a threshold
# (Optional)
# Elasticsearch host
# es_host: elasticsearch.example.com
# (Optional)
# Elasticsearch port
# es_port: 14900
# (Optional) Connect with SSL to Elasticsearch
#use_ssl: True
# (Optional) basic-auth username and password for Elasticsearch
#es_username: someusername
#es_password: somepassword
# (Required)
# Rule name, must be unique
name: Demo frequency rule
# (Required)
# Type of alert.
# the frequency rule type alerts when num_events events occur with timeframe time
type: frequency
# (Required)
# Index to search, wildcard supported
index: demo-*
# (Required, frequency specific)
# Alert when this many documents matching the query occur within a timeframe
num_events: 1
# (Required, frequency specific)
# num_events must occur within this amount of time to trigger an alert
timeframe:
  minutes: 10
# (Required)
# A list of Elasticsearch filters used for find events
# These filters are joined with AND and nested in a filtered query
# For more info: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html
filter:
- term:
    tags: "error"
# (Required)
# The alert is use when a match is found
alert:
- "post"
http_post_url: "http://localhost:3000/alert"
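To make the meaning of `num_events` and `timeframe` in the frequency rule concrete, here is a simplified model of the check in Python. It is a sketch of the idea only, not ElastAlert's implementation: the function name is hypothetical, and the choices to treat the window boundary as inclusive and to reset the window after each match are assumptions made for the sketch.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Iterable, List


def frequency_matches(
    timestamps: Iterable[datetime], num_events: int, timeframe: timedelta
) -> List[datetime]:
    """Return the event times at which `num_events` events fall within `timeframe`."""
    window: deque = deque()
    hits: List[datetime] = []
    for ts in sorted(timestamps):
        window.append(ts)
        # Drop events that are older than the timeframe relative to the newest one.
        while ts - window[0] > timeframe:
            window.popleft()
        if len(window) >= num_events:
            hits.append(ts)
            window.clear()  # assumption: counting starts afresh after a match
    return hits


if __name__ == "__main__":
    base = datetime(2023, 4, 11, 7, 0, 0)
    events = [base, base + timedelta(minutes=3), base + timedelta(minutes=20)]
    # With num_events: 1, as in the rule above, every matching document triggers.
    print(len(frequency_matches(events, 1, timedelta(minutes=10))))  # prints 3
    # With num_events: 2, only the pair that is 3 minutes apart triggers.
    print(len(frequency_matches(events, 2, timedelta(minutes=10))))  # prints 1
```

With `num_events: 1`, every error document produces a match, so a higher threshold may be worth considering if error logs arrive in bursts.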
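Prometheus stack
Required components
- Promtail
- Loki
- Grafana
- Alertmanager
- Prometheus (optional)
- node_exporter (optional)
- Configure Promtail's scrape_configs so that info and error logs are collected separately, and set up the multiline matching rules
- Connect Promtail to Loki, which persists the logs
- Use Grafana Explore to view the logs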
- Integrate Loki's ruler with Alertmanager and configure the alert rules
- Write a webhook to handle the alerts
Log display and payload example
{
  "receiver": "web\\.hook",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "BusinessError",
        "app": "demo",
        "filename": "C:\\Users\\JimWu\\Desktop\\test_log\\demo1-error-2023-04-11-0.log",
        "level": "error",
        "severity": "critical"
      },
      "annotations": {
        "app": "demo",
        "summary": "error-log"
      },
      "startsAt": "2023-04-11T07:37:42.887385891Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "/graph?g0.expr=%28count_over_time%28%7Bapp%3D%22demo%22%2C+level%3D%22error%22%7D%5B1m%5D%29+%3E+0%29\u0026g0.tab=1",
      "fingerprint": "fc04d62f5f601ff1"
    }
  ],
  "groupLabels": {
    "alertname": "BusinessError"
  },
  "commonLabels": {
    "alertname": "BusinessError",
    "app": "demo",
    "filename": "C:\\Users\\JimWu\\Desktop\\test_log\\demo1-error-2023-04-11-0.log",
    "level": "error",
    "severity": "critical"
  },
  "commonAnnotations": {
    "app": "demo",
    "summary": "error-log"
  },
  "externalURL": "http://LAPTOP-1BQNQ8EO:9093",
  "version": "4",
  "groupKey": "{}:{alertname=\"BusinessError\"}",
  "truncatedAlerts": 0
}
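A webhook that receives a payload like the one above mostly needs to walk the `alerts` array and pull out a few labels and annotations. The following sketch shows one way to turn the payload into notification lines. The function name `format_alerts` and the message layout are hypothetical, and the sample is trimmed from the payload above to the fields the function reads.

```python
from typing import List


def format_alerts(payload: dict) -> List[str]:
    """Turn an Alertmanager webhook payload into one notification line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        annotations = alert.get("annotations", {})
        lines.append(
            "[{status}] {name} severity={severity} app={app}: {summary}".format(
                status=alert.get("status", "unknown"),
                name=labels.get("alertname", "unnamed"),
                severity=labels.get("severity", "none"),
                app=labels.get("app", "-"),
                summary=annotations.get("summary", ""),
            )
        )
    return lines


# Trimmed copy of the payload shown above.
sample = {
    "status": "firing",
    "alerts": [
        {
            "status": "firing",
            "labels": {
                "alertname": "BusinessError",
                "app": "demo",
                "level": "error",
                "severity": "critical",
            },
            "annotations": {"app": "demo", "summary": "error-log"},
        }
    ],
}

for line in format_alerts(sample):
    print(line)  # prints "[firing] BusinessError severity=critical app=demo: error-log"
```

The function reads every field with a default, so a payload that lacks optional labels still produces a usable line instead of raising an error.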