Choosing a Log Monitoring and Alerting Stack

Technology options

Elastic Stack

  • Elasticsearch: log persistence
  • Filebeat: log collection
  • Kibana: log display
  • ElastAlert: log alerting
  • Elastic Observability APM: metrics monitoring via the Java agent

Prometheus stack

  • promtail: log collection
  • Loki: log aggregation
  • filesystem/Cassandra/S3/MinIO: log persistence
  • Grafana: log display
  • Alertmanager: log alerting
  • Prometheus: metrics monitoring
  • Spring + Actuator + Prometheus + MeterBinder: custom business metrics

Cloud vendor

  • Tencent Cloud CLS: https://cloud.tencent.com/document/product/614/51741
  • CLS integration with Feishu: https://cloud.tencent.com/document/product/614/66236

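Pros and cons of each option

Elastic Stack

  1. A mature log collection solution with plenty of real-world practice and documentation to draw on.
  2. The EFK stack has a large user base for distributed log collection, so the learning cost is relatively low.
  3. APM documentation is relatively sparse and takes some trial and error; alerting requires an external component.
  4. Cluster management becomes more difficult.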

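Prometheus stack

  1. A natural fit for APM-style metrics monitoring.
  2. Well suited to current cloud-native environments.
  3. Alerting on performance metrics is mature and simple to configure.
  4. Log-based alerting is comparatively weak and more complex to configure.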

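Cloud vendor

  1. In production, the CLS service provided by Tencent Cloud can be used directly.
  2. Free.
  3. Supports integration with Feishu.
  4. Requires learning the CLS configuration documentation.
  5. APM monitoring depends on other products from the cloud vendor.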

Implementation overview

Elastic Stack

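  1. Use logback to split and aggregate logs into separate files by level.
  2. Install Filebeat on the host.
    1. Configure the inputs plugin.
    2. Configure the collection paths.
    3. Configure the multiline matching rule.
    4. Configure tags.
    5. Configure the output plugin.
      1. Choose whether to output to Logstash or directly to Elasticsearch.
      2. Configure the Elasticsearch index template rules.
      3. Configure the index shard and replica rules.
    6. Optionally configure processors to parse and format the timestamp.
  3. Configure an Index Pattern in Kibana to match the indices and visualize the logs.
  4. Configure ElastAlert.
    1. Install Python 3.11 or later, or use Anaconda.
    2. Edit config.yml to set the Elasticsearch host and username/password.
    3. In the config, set run_every (the query interval), buffer_time (the buffer window), the rules directory (rules_folder), and so on.
    4. Write the alert rule in rule.yaml.
      1. Choose a suitable rule type.
      2. Configure the filter, following the Elasticsearch query DSL.
      3. Error logs can be matched through the tags configured in Filebeat.
      4. Configure the collection recovery time.
    5. Choose a suitable alert channel. Webhook, DingTalk, Jira and others are supported natively; to add another channel, implement it in Python following the official documentation.
    6. Start ElastAlert.
  5. APM monitoring (optional).
    1. The APM that ships with the Elastic Stack runs as a Java agent: download the required JAR from Maven Central and start the application as described in the official documentation.
    2. Custom metrics monitoring, initially based on Spring Actuator.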
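
The multiline rule in step 2.3 is easiest to understand by simulating it. The following is a minimal Python sketch, not Filebeat's implementation, of what `multiline.pattern: '^20'` with `negate: true` and `match: after` does: a line starting with `20` (the year of the timestamp) opens a new event, and every other line, such as a stack trace line, is appended to the previous event. The function name `group_multiline` and the sample log lines are made up for illustration.

```python
import re


def group_multiline(lines, pattern=r"^20"):
    """Merge lines that do NOT match `pattern` into the preceding event."""
    start = re.compile(pattern)
    events = []
    for line in lines:
        if start.search(line) or not events:
            # A matching line (or the very first line) opens a new event.
            events.append(line)
        else:
            # negate: true + match: after -> append to the previous event.
            events[-1] += "\n" + line
    return events


# Hypothetical log lines: one error with a stack trace, then one info line.
raw = [
    "2023-04-11 15:37:42.887 ERROR demo - request failed",
    "java.lang.IllegalStateException: boom",
    "    at org.example.Demo.run(Demo.java:42)",
    "2023-04-11 15:37:43.001 INFO demo - recovered",
]
events = group_multiline(raw)
```

Here the four raw lines become two events, so the stack trace stays attached to the error line that produced it.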

Reference configuration

filebeat.yml

filebeat.inputs:
- type: log
  enabled: true
  paths:
    - C:\Users\JimWu\Desktop\test_log\info/*.log
  multiline.pattern: '^20'     # multiline rule: an event starts with a line beginning with 20
  multiline.negate: true       # lines that do NOT match the pattern are merged
  multiline.match: after       # append them to the end of the preceding matching line
  tags: ["demo","info"]
- type: log
  enabled: true
  paths:
    - C:\Users\JimWu\Desktop\test_log\*-error-*.log
  multiline.pattern: '^20'     # multiline rule: an event starts with a line beginning with 20
  multiline.negate: true       # lines that do NOT match the pattern are merged
  multiline.match: after       # append them to the end of the preceding matching line
  tags: ["demo","error"]
output.elasticsearch:
  hosts: ["localhost:9200"]
  username: "elastic"
  password: "elastic"
  indices:
    - index: "demo-%{+yyyy.MM.dd}"
      when.contains:
        tags: "demo"
setup.ilm.enabled: false       # the setting is "enabled", not "enable"
setup.template.name: "demo-log"
# Note: this pattern does not match the demo-yyyy.MM.dd indices written above;
# adjust one of the two if the template is meant to apply to them.
setup.template.pattern: "demo-dev-*"
setup.template.overwrite: false
setup.template.settings:
  index.number_of_shards: 1
  index.number_of_replicas: 1
processors:
- script:
    lang: javascript
    id: my_filter
    tag: enable
    source: >
      function process(event) {
        var str = event.Get("message");
        var reg = /\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}/;
        var m = str.match(reg);
        if (m) { event.Put("log_time", m[0]); }   // skip lines without a timestamp
      }
- timestamp:
    field: log_time
    timezone: Asia/Shanghai
    layouts:
      - '2006-01-02 15:04:05'
      - '2006-01-02 15:04:05.999'
    test:
      - '2019-06-22 16:33:51'
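
To make the intent of the `script` and `timestamp` processors in this configuration concrete, here is a rough Python equivalent: pull the leading timestamp out of the message, then interpret it as Asia/Shanghai local time. It is a sketch under assumptions, not what Filebeat runs. The name `extract_log_time` is made up, the milliseconds are treated as optional (the JavaScript regex requires them), and a fixed UTC+8 offset stands in for the Asia/Shanghai zone.

```python
import re
from datetime import datetime, timedelta, timezone

# Same shape as the regex in the script processor, with optional milliseconds.
TS_RE = re.compile(r"\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}(?:\.\d{3})?")

# Assumption: a fixed +08:00 offset is used in place of the Asia/Shanghai zone.
SHANGHAI = timezone(timedelta(hours=8))


def extract_log_time(message):
    """Return the first timestamp in `message` as an aware datetime, or None."""
    match = TS_RE.search(message)
    if match is None:
        return None
    text = match.group(0)
    fmt = "%Y-%m-%d %H:%M:%S.%f" if "." in text else "%Y-%m-%d %H:%M:%S"
    return datetime.strptime(text, fmt).replace(tzinfo=SHANGHAI)


log_time = extract_log_time("2023-04-11 15:37:42.887 ERROR demo - request failed")
```

Converting `log_time` to UTC gives 07:37:42.887, which is the form Elasticsearch stores in `@timestamp`.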

APM

java -javaagent:/path/to/elastic-apm-agent-<version>.jar \
-Delastic.apm.service_name=my-application \
-Delastic.apm.server_urls=http://localhost:8200 \
-Delastic.apm.secret_token= \
-Delastic.apm.environment=production \
-Delastic.apm.application_packages=org.example \
-jar my-application.jar

ElastAlert rule example

# Alert when the rate of events exceeds a threshold

# (Optional)
# Elasticsearch host
# es_host: elasticsearch.example.com

# (Optional)
# Elasticsearch port
# es_port: 14900

# (Optional) Connect with SSL to Elasticsearch
#use_ssl: True

# (Optional) basic-auth username and password for Elasticsearch
#es_username: someusername
#es_password: somepassword

# (Required)
# Rule name, must be unique
name: Demo frequency rule

# (Required)
# Type of alert.
# the frequency rule type alerts when num_events events occur with timeframe time
type: frequency

# (Required)
# Index to search, wildcard supported
index: demo-*

# (Required, frequency specific)
# Alert when this many documents matching the query occur within a timeframe
num_events: 1

# (Required, frequency specific)
# num_events must occur within this amount of time to trigger an alert
timeframe:
  minutes: 10

# (Required)
# A list of Elasticsearch filters used for find events
# These filters are joined with AND and nested in a filtered query
# For more info: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html
filter:
- term:
    tags: "error"

# (Required)
# The alert is use when a match is found
alert:
- "post"
http_post_url: "http://localhost:3000/alert"
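
The meaning of `num_events` and `timeframe` in a `frequency` rule can be illustrated with a small sliding-window simulation. This is a simplified Python sketch of the idea, not ElastAlert's own code: the function name `frequency_matches` is hypothetical, and the sketch assumes the counter starts over after each match.

```python
from datetime import datetime, timedelta


def frequency_matches(timestamps, num_events=1, timeframe=timedelta(minutes=10)):
    """Return the times at which `num_events` events fall within `timeframe`."""
    window = []
    fired = []
    for ts in sorted(timestamps):
        window.append(ts)
        # Drop events that are older than `timeframe` relative to the newest one.
        while ts - window[0] > timeframe:
            window.pop(0)
        if len(window) >= num_events:
            fired.append(ts)
            window.clear()  # assumption: counting starts over after a match
    return fired


base = datetime(2023, 4, 11, 15, 0)
errors = [base, base + timedelta(minutes=3), base + timedelta(minutes=20)]

# With num_events: 1, as in the rule above, every matching document fires.
every_error = frequency_matches(errors, num_events=1)
# With num_events: 2, only the pair that falls within 10 minutes fires.
bursts_only = frequency_matches(errors, num_events=2)
```

With `num_events: 1` every error document produces an alert, which suits low-volume business errors; raising the threshold trades sensitivity for less noise.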

Prometheus stack

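Required components

  • promtail
  • Loki
  • Grafana
  • Alertmanager
  • Prometheus (optional)
  • node_exporter (optional)

  1. Configure promtail's scrape_configs so that info and error logs are collected separately, and configure the multiline matching rule.
  2. Point promtail at Loki, which persists the logs.
  3. Use Grafana Explore to display the logs.
  4. Integrate Loki with Alertmanager and configure the alerting rules (the rules are evaluated on the Loki side, not by promtail).
  5. Write a webhook to handle the alerts.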
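
Step 5 can be sketched with the Python standard library alone. The example below is an illustration under assumptions, not a production receiver: `format_alerts`, `AlertHandler`, `serve`, and port 5001 are all made-up names and values, and the handler only prints each alert where a real one would forward it to Feishu, DingTalk, or mail.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def format_alerts(payload):
    """Turn an Alertmanager webhook payload into one text line per alert."""
    lines = []
    for alert in payload.get("alerts", []):
        labels = alert.get("labels", {})
        summary = alert.get("annotations", {}).get("summary", "")
        line = "[{}] {} app={} level={} {}".format(
            alert.get("status", "unknown"),
            labels.get("alertname", "?"),
            labels.get("app", "?"),
            labels.get("level", "?"),
            summary,
        )
        lines.append(line.strip())
    return lines


class AlertHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Alertmanager sends the alert group as a JSON body.
        length = int(self.headers.get("Content-Length", 0))
        payload = json.loads(self.rfile.read(length) or b"{}")
        for line in format_alerts(payload):
            print(line)  # placeholder: forward to a chat or mail channel here
        self.send_response(200)
        self.end_headers()


def serve(port=5001):
    # Hypothetical port; the Alertmanager webhook URL must point at it.
    HTTPServer(("", port), AlertHandler).serve_forever()
```

Calling `serve()` starts the receiver. Keeping the formatting in a plain function makes it easy to test without opening a socket.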

Log display and sample alert payload

{
  "receiver": "web\\.hook",
  "status": "firing",
  "alerts": [
    {
      "status": "firing",
      "labels": {
        "alertname": "BusinessError",
        "app": "demo",
        "filename": "C:\\Users\\JimWu\\Desktop\\test_log\\demo1-error-2023-04-11-0.log",
        "level": "error",
        "severity": "critical"
      },
      "annotations": {"app": "demo", "summary": "error-log"},
      "startsAt": "2023-04-11T07:37:42.887385891Z",
      "endsAt": "0001-01-01T00:00:00Z",
      "generatorURL": "/graph?g0.expr=%28count_over_time%28%7Bapp%3D%22demo%22%2C+level%3D%22error%22%7D%5B1m%5D%29+%3E+0%29\u0026g0.tab=1",
      "fingerprint": "fc04d62f5f601ff1"
    }
  ],
  "groupLabels": {"alertname": "BusinessError"},
  "commonLabels": {
    "alertname": "BusinessError",
    "app": "demo",
    "filename": "C:\\Users\\JimWu\\Desktop\\test_log\\demo1-error-2023-04-11-0.log",
    "level": "error",
    "severity": "critical"
  },
  "commonAnnotations": {"app": "demo", "summary": "error-log"},
  "externalURL": "http://LAPTOP-1BQNQ8EO:9093",
  "version": "4",
  "groupKey": "{}:{alertname=\"BusinessError\"}",
  "truncatedAlerts": 0
}
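
The `generatorURL` field in this payload carries the URL-encoded query that triggered the alert, which is handy to include in a notification. The sketch below decodes it with the Python standard library. The helper name `alert_expression` is made up, and the embedded JSON is a trimmed copy of the payload above in which the `\u0026` escape is written as a plain `&` (both decode to the same string).

```python
import json
from urllib.parse import parse_qs, urlsplit


def alert_expression(alert):
    """Extract the query expression from an alert's generatorURL, if present."""
    query = urlsplit(alert.get("generatorURL", "")).query
    values = parse_qs(query).get("g0.expr", [])
    return values[0] if values else None


# Trimmed copy of the payload shown above.
raw = (
    '{"alerts":[{"status":"firing","labels":{"alertname":"BusinessError"},'
    '"generatorURL":"/graph?g0.expr=%28count_over_time%28%7Bapp%3D%22demo%22'
    '%2C+level%3D%22error%22%7D%5B1m%5D%29+%3E+0%29&g0.tab=1"}]}'
)
payload = json.loads(raw)
expr = alert_expression(payload["alerts"][0])
print(expr)
```

The decoded expression shows the alert fires whenever at least one error line for the `demo` app appears within a one-minute window.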