Monitoring Platform Series

Monitoring server: Prometheus
Metrics collectors: Exporter
Alerting server: Alertmanager
Alerting middleware: PrometheusAlert
Dashboards: Grafana

Exporter Installation

Index of official exporter download locations

Windows-exporter Installation

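✅ Download
Download page for Windows-exporter
Choose windows_exporter-{{version}}-amd64.msi

Note: the file name must end in amd64.msi.

✅ Installation
Upload the installer to the server's desktop and run one of the following commands, adjusting the version number in the file name.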

  • Without MSSQL monitoring
msiexec /i C:\Users\Administrator\Desktop\windows_exporter-{{version}}-amd64.msi ENABLED_COLLECTORS=cpu,net,os,memory,process,tcp,textfile,cs,logical_disk,service,system /quiet
  • With MSSQL monitoring
msiexec /i C:\Users\Administrator\Desktop\windows_exporter-{{version}}-amd64.msi ENABLED_COLLECTORS=cpu,net,os,memory,process,tcp,mssql,textfile,cs,logical_disk,service,system /quiet

✅ Firewall
If Windows Firewall is enabled, add a rule that allows inbound TCP traffic on port 9182.

netsh advfirewall firewall add rule name="windows-exporter" dir=in action=allow protocol=TCP localport=9182

✅ Job configuration

  • prometheus.yaml configuration
- job_name: windows-exporter
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - refresh_interval: 10s
    files:
    - "/etc/config/jobs/windows.yaml"  
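  • windows.yaml configuration (mounted into the container as a file)
A file_sd file is a list of target groups (targets plus labels). job_name and static_configs belong only in prometheus.yaml, so the host name is carried in a nickname label here.
  1. Multiple hosts with different labels
- targets: ['192.168.1.117:9182']
  labels:
    os: WINDOWS
    env: dev
    nickname: dev_app_01
- targets: ['192.168.2.217:9182']
  labels:
    os: WINDOWS
    env: test
    nickname: test_app_01
  2. Multiple hosts sharing the same labels
- targets:
  - 192.168.2.217:9182
  - 192.168.2.218:9182
  - 192.168.2.219:9182
  labels:
    os: WINDOWS
    env: test
    nickname: test_app_01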
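
When the host list grows, the target file can be generated instead of edited by hand. The following is a minimal sketch, not part of the original setup: the function name build_target_groups and the host list are hypothetical, and it emits JSON, which file_sd accepts alongside YAML (the files entry in prometheus.yaml would then have to point at a .json file).

```python
import json


def build_target_groups(hosts, port=9182):
    """Group hosts with identical labels into one file_sd target group.

    `hosts` is a list of (address, labels) pairs; the port defaults to
    9182, the windows_exporter port used in this section.
    """
    groups = {}
    for address, labels in hosts:
        # Hosts with the same label set end up under the same key.
        key = tuple(sorted(labels.items()))
        groups.setdefault(key, []).append(f"{address}:{port}")
    return [{"targets": targets, "labels": dict(key)}
            for key, targets in groups.items()]


# Hypothetical inventory, reusing the example addresses from this section.
hosts = [
    ("192.168.1.117", {"os": "WINDOWS", "env": "dev"}),
    ("192.168.2.217", {"os": "WINDOWS", "env": "test"}),
    ("192.168.2.218", {"os": "WINDOWS", "env": "test"}),
]

print(json.dumps(build_target_groups(hosts), indent=2))
```

Writing the result to a temporary file and renaming it into place avoids Prometheus reading a half-written file on its next refresh.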

✅ Rule configuration

groups:
- name: Windows host information
  rules:
  - alert: CPU usage
    expr: round(100 - (avg by (instance) (irate(windows_cpu_time_total{mode="idle"}[5m])) * 100),0.01) > 60
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} CPU usage is too high!"
      description: "{{$labels.instance}} CPU usage is above 60% (current: {{$value}}%)"

  - alert: Memory usage
    expr: round(100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100),0.01) > 90
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} memory usage is too high!"
      description: "{{$labels.instance}} memory usage is above 90% (current: {{$value}}%)"

  - alert: Disk usage
    expr: round((1 - windows_logical_disk_free_bytes / windows_logical_disk_size_bytes) * 100, 0.01) > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} usage of disk volume {{$labels.volume}} is too high!"
      description: "{{$labels.instance}} disk volume usage is above 90% (current: {{$value}}%)"
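
The memory expression above is plain arithmetic followed by PromQL's round(v, 0.01). As a worked example, the sketch below reproduces that calculation in Python. The helper names and the sample byte counts are hypothetical; promql_round follows the documented behaviour of round (nearest multiple, ties rounded up).

```python
import math


def promql_round(value, to_nearest=1.0):
    """Round to the nearest multiple of `to_nearest`, ties rounding up."""
    inverse = 1.0 / to_nearest
    return math.floor(value * inverse + 0.5) / inverse


def memory_used_percent(free_bytes, total_bytes):
    """Same shape as the rule: 100 - free / total * 100, rounded to 0.01."""
    return promql_round(100 - (free_bytes / total_bytes) * 100, 0.01)


# Hypothetical host: 16 GiB installed, 1 GiB free.
used = memory_used_percent(1 * 1024**3, 16 * 1024**3)
print(used, "alert condition met:", used > 90)
```

The alert itself only fires once the condition has held for the whole "for" duration (2m for the memory rule).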

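Linux-exporter Installation

✅ Download
Download page for Linux-exporter (node_exporter)
Choose node_exporter-{{version}}.linux-amd64.tar.gz and upload it to the server's root directory; the commands in the following steps are run from there.

Note: the file name must end in amd64.tar.gz.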

✅ System preparation

  • If the firewall is enabled, allow inbound TCP traffic on port 9100 (the commands below also disable SELinux)
sed -i s/SELINUX\=enforcing/SELINUX\=disabled/g /etc/selinux/config
setenforce 0 
iptables -A INPUT -p tcp --dport 9100 -j ACCEPT
service iptables save
service iptables restart
  • If the firewall is not needed, stop and disable firewalld (SELinux is disabled as well)
systemctl stop firewalld  && systemctl disable firewalld
sed -i s/SELINUX\=enforcing/SELINUX\=disabled/g /etc/selinux/config
setenforce 0 

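✅ Installation
Upload the archive to the root directory and run the following commands, adjusting the version number.
On node_exporter 1.0 and later the systemd filter flag is named --collector.systemd.unit-include; --collector.systemd.unit-whitelist is the older spelling, so use whichever matches the installed version.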

mkdir -p /usr/share/exporter/node_exporter
tar xf node_exporter-{{version}}.linux-amd64.tar.gz  -C /usr/share/exporter/node_exporter  --strip-components=1


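# Create the service account only if it does not exist yet
id prometheus >/dev/null 2>&1 || useradd prometheus
# Write with > (not >>) so that rerunning this step does not append a second copy of the unit
cat  >  /usr/lib/systemd/system/node_exporter.service << EOF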
[Unit]
Description=node_exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/share/exporter/node_exporter/node_exporter \\
          --web.listen-address=:9100 \\
          --collector.systemd \\
          --collector.systemd.unit-whitelist=(sshd|nginx).service \\
          --collector.processes \\
          --collector.tcpstat
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now node_exporter

✅ Job configuration

  • prometheus.yaml configuration
- job_name: linux-exporter
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - refresh_interval: 10s
    files:
    - "/etc/config/jobs/linux.yaml"  
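  • linux.yaml configuration (mounted into the container as a file)
A file_sd file is a list of target groups (targets plus labels). job_name and static_configs belong only in prometheus.yaml, so the host name is carried in a nickname label here. The port is 9100, the node_exporter listen address configured above.
  1. Multiple hosts with different labels
- targets: ['192.168.1.117:9100']
  labels:
    os: Linux
    env: dev
    nickname: dev_app_01
- targets: ['192.168.2.217:9100']
  labels:
    os: Linux
    env: test
    nickname: test_app_01
  2. Multiple hosts sharing the same labels
- targets:
  - 192.168.2.217:9100
  - 192.168.2.218:9100
  - 192.168.2.219:9100
  labels:
    os: Linux
    env: test
    nickname: test_app_01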

✅ Rule configuration

groups:
- name: Linux host information
  rules:
  - alert: CPU usage
    expr: round(100-(avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by(instance)* 100) ,0.01)> 60
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} CPU usage is too high!"
      description: "{{$labels.instance}} CPU usage is above 60% (current: {{$value}}%)"

  - alert: Memory usage
    expr: round( (1 - (node_memory_MemFree_bytes+ node_memory_Buffers_bytes+ node_memory_Cached_bytes)/ node_memory_MemTotal_bytes)* 100 ,0.01)> 90
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} memory usage is too high!"
      description: "{{$labels.instance}} memory usage is above 90% (current: {{$value}}%)"

  - alert: Disk I/O
    expr: round(avg(irate(node_disk_io_time_seconds_total[1m])) by(instance)* 100,0.01) > 60
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} disk I/O utilization is too high!"
      description: "{{$labels.instance}} disk I/O utilization is above 60% (current: {{$value}}%)"

  - alert: Inbound bandwidth
    expr: round(((sum(rate (node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 1024/1024),0.01) > 100
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} inbound network bandwidth is too high!"
      description: "{{$labels.instance}} inbound network throughput is above 100 MiB/s (5-minute average rate). RX: {{$value}} MiB/s"

  - alert: Outbound bandwidth
    expr: round(((sum(rate (node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 1024/1024),0.01) > 100
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} outbound network bandwidth is too high!"
      description: "{{$labels.instance}} outbound network throughput is above 100 MiB/s (5-minute average rate). TX: {{$value}} MiB/s"

  - alert: TCP sessions
    expr: node_netstat_Tcp_CurrEstab > 10000
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} has too many ESTABLISHED TCP connections!"
      description: "{{$labels.instance}} has more than 10000 ESTABLISHED TCP connections (current: {{$value}})"

  - alert: Disk capacity
    expr: round(100-(node_filesystem_free_bytes{fstype=~"ext4|xfs",mountpoint!="/boot"}/node_filesystem_size_bytes {fstype=~"ext4|xfs",mountpoint!="/boot"}*100),0.01) > 80
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} usage of filesystem {{$labels.mountpoint}} is too high!"
      description: "{{$labels.instance}} filesystem usage is above 80% (current: {{$value}}%)"
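
The device filters in the bandwidth rules are regular expressions, not shell globs, and PromQL matches them against the whole label value. The sketch below illustrates the difference between a glob-style ending such as virbr* and the regex form virbr.*. It uses Python's re.fullmatch as a stand-in for Prometheus's anchored RE2 matching, which behaves the same for these simple patterns; the interface names are hypothetical.

```python
import re


def excluded(device, pattern):
    """Mimic a `device!~"pattern"` matcher: the regex must match the whole value."""
    return re.fullmatch(pattern, device) is not None


devices = ["eth0", "lo", "docker0", "veth1a2b", "br-4f2c", "virbr0"]

glob_style = "tap.*|veth.*|br.*|docker.*|virbr*|lo*"
regex_style = "tap.*|veth.*|br.*|docker.*|virbr.*|lo"

for pattern in (glob_style, regex_style):
    kept = [d for d in devices if not excluded(d, pattern)]
    print(pattern, "->", kept)
```

With the glob-style pattern, virbr0 is still counted, because virbr* means "virb" followed by any number of "r" characters and br.* cannot match a name that starts with "vir".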

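Mysql-exporter Installation

✅ Download
Download page for Mysql-exporter (mysqld_exporter)
Choose mysqld_exporter-{{version}}.linux-amd64.tar.gz and upload it to the server's root directory; the commands in the following steps are run from there.

Note: the file name must end in amd64.tar.gz.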

✅ System preparation

  • If the firewall is enabled, allow inbound TCP traffic on port 9104 (the commands below also disable SELinux)
sed -i s/SELINUX\=enforcing/SELINUX\=disabled/g /etc/selinux/config
setenforce 0 
iptables -A INPUT -p tcp --dport 9104 -j ACCEPT
service iptables save
service iptables restart
  • If the firewall is not needed, stop and disable firewalld (SELinux is disabled as well)
systemctl stop firewalld  && systemctl disable firewalld
sed -i s/SELINUX\=enforcing/SELINUX\=disabled/g /etc/selinux/config
setenforce 0 

✅ Database account (remember to change the password)

CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporter' WITH MAX_USER_CONNECTIONS 3;
GRANT  SELECT,PROCESS, REPLICATION CLIENT,REPLICATION SLAVE  ON *.* TO 'exporter'@'%';

✅ Installation
Upload the archive to the root directory and run the following commands, adjusting the version number.

mkdir -p /opt/exporter/mysqld_exporter
tar xf mysqld_exporter-{{version}}.linux-amd64.tar.gz  -C /opt/exporter/mysqld_exporter  --strip-components=1

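# Create the service account only if it does not exist yet
id prometheus >/dev/null 2>&1 || useradd prometheus
# Write with > (not >>) so that rerunning this step does not append a second copy of the unit
cat  >  /usr/lib/systemd/system/mysqld_exporter.service << EOF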
[Unit]
Description=mysqld_exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
Environment="DATA_SOURCE_NAME=exporter:exporter@(localhost:3306)/"
ExecStart=/opt/exporter/mysqld_exporter/mysqld_exporter \\
          --web.listen-address=:9104 \\
          --collect.slave_hosts
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now  mysqld_exporter

✅ Job configuration

  • prometheus.yaml configuration
- job_name: 'sql-exporter'
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - refresh_interval: 10s
    files:
    - "/etc/config/jobs/mysql.yaml"  # MySQL target file
  relabel_configs:
    - action: replace
      source_labels: [__address__]
      regex: (.*):([0-9]+)  # match host:port; each ( ) is a capture group
      replacement: $1       # keep only the first group (the host)
      target_label: "nodeip"
  • mysql.yaml configuration (mounted into the container as a file)
- targets: ['192.168.1.25:9104']
  labels:
    type: mysql
    env: dev
    nickname: mysql-dev
- targets: ['192.168.1.38:9104']
  labels:
    type: mysql
    env: test
    nickname: mysql-test
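
The relabel rule in the job above derives nodeip by matching __address__ against (.*):([0-9]+) and keeping the first capture group. The following sketch mimics that single replace step in Python so the effect can be checked on sample addresses. relabel_replace is a hypothetical helper, not a Prometheus API, and it assumes Python's re module agrees with Prometheus's RE2 engine for this pattern.

```python
import re


def relabel_replace(source_value, regex, replacement):
    """Mimic one `action: replace` step of relabel_configs.

    Prometheus anchors the regex at both ends, hence fullmatch.
    Returns None when the regex does not match; in that case
    Prometheus leaves the target label unchanged.
    """
    match = re.fullmatch(regex, source_value)
    if match is None:
        return None
    # Translate Prometheus-style $1 / ${1} references to Python's \g<1>.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    return match.expand(template)


print(relabel_replace("192.168.1.25:9104", r"(.*):([0-9]+)", "$1"))
print(relabel_replace("localhost", r"(.*):([0-9]+)", "$1"))
```

The first call prints the bare host, which becomes the nodeip label used in the alert descriptions; the second prints None because an address without a port does not match.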

✅ Rule configuration

groups:
- name: MySQL status
  rules:
  - alert: MysqlDown
    expr: mysql_up == 0
    for: 0m
    labels:
      severity: disaster
    annotations:
      description: "Database {{ $labels.nodeip }} is down"

  - alert: MysqlSlowQueries
    expr: ceil(increase(mysql_global_status_slow_queries[1m])) > 100
    for: 2m
    labels:
      severity: critical
    annotations:
      description: "More than 100 slow queries within 1 minute (count: {{ $value }})"

  - alert: MysqlInnodbLogWaits
    expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10
    for: 0m
    labels:
      severity: critical
    annotations:
      description: "MySQL InnoDB log writes are stalling (log waits per second: {{ $value }})"

  - alert: MysqlRestarted
    expr: mysql_global_status_uptime < 60
    for: 0m
    labels:
      severity: info
    annotations:
      description: "The database service has restarted"

  - alert: MysqlSlaveReplicationLag
    expr: mysql_slave_status_seconds_behind_master > 5000
    for: 1m
    labels:
      severity: critical
    annotations:
      description: "Replication lag behind the master is too high: {{ $value }} seconds"

  - alert: MysqlTooManyConnections(>60%)
    expr: max_over_time(mysql_global_status_threads_connected[1m]) > 3000
    for: 2m
    labels:
      severity: critical
    annotations:
      description: "Connections exceed 60% of the configured limit (5000); current connections: {{ $value }}"

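Rabbitmq-exporter Installation

✅ Download
Download page for rabbitmq-exporter
Choose rabbitmq_exporter_{{version}}-RC19_linux_amd64.tar.gz and upload it to the server's root directory.

Note: the file name must end in amd64.tar.gz.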

✅ System preparation

  • If the firewall is enabled, allow inbound TCP traffic on port 9099 (the commands below also disable SELinux)
sed -i s/SELINUX\=enforcing/SELINUX\=disabled/g /etc/selinux/config
setenforce 0 
iptables -A INPUT -p tcp --dport 9099 -j ACCEPT
service iptables save
service iptables restart
  • If the firewall is not needed, stop and disable firewalld (SELinux is disabled as well)
systemctl stop firewalld  && systemctl disable firewalld
sed -i s/SELINUX\=enforcing/SELINUX\=disabled/g /etc/selinux/config
setenforce 0 

✅ Installation
Change to the root directory and run the following commands, adjusting the version number.

mkdir -p /opt/exporter/rabbitmq_exporter
tar xf rabbitmq_exporter_{{version}}-RC19_linux_amd64.tar.gz  -C /opt/exporter/rabbitmq_exporter  --strip-components=1

✅ Edit config.json

vi /opt/exporter/rabbitmq_exporter/config.json
{
    "rabbit_url": "http://127.0.0.1:15672",
    "rabbit_user": "{{username}}",
    "rabbit_pass": "{{password}}",
    "publish_port": "9099",
    "publish_addr": "",
    "output_format": "TTY",
    "ca_file": "ca.pem",
    "cert_file": "client-cert.pem",
    "key_file": "client-key.pem",
    "insecure_skip_verify": false,
    "exlude_metrics": [],
    "include_queues": ".*",
    "skip_queues": "^$",
    "skip_vhost": "^$",
    "include_vhost": ".*",
    "rabbit_capabilities": "no_sort,bert",
    "enabled_exporters": [
            "exchange",
            "node",
            "overview",
            "queue"
    ],
    "timeout": 30,
    "max_queues": 0
}

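✅ Run the exporter as a systemd service

# The unit runs as the prometheus user; create it if it does not exist yet
id prometheus >/dev/null 2>&1 || useradd prometheus
# Write with > (not >>) so that rerunning this step does not append a second copy of the unit
cat  >  /usr/lib/systemd/system/rabbitmq_exporter.service << EOF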
[Unit]
Description=rabbitmq_exporter
After=network.target

[Service]
User=prometheus
Group=prometheus
ExecStart=/opt/exporter/rabbitmq_exporter/rabbitmq_exporter  -config-file=/opt/exporter/rabbitmq_exporter/config.json
[Install]
WantedBy=multi-user.target
EOF

systemctl daemon-reload
systemctl enable --now  rabbitmq_exporter

✅ Job configuration

  • prometheus.yaml configuration
- job_name: 'rabbitmq-exporter'
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  file_sd_configs:
  - refresh_interval: 10s
    files:
    - "/etc/config/jobs/rabbitmq.yaml"  # RabbitMQ target file
  relabel_configs:
    - action: replace
      source_labels: [__address__]
      regex: (.*):([0-9]+)  # match host:port; each ( ) is a capture group
      replacement: $1       # keep only the first group (the host)
      target_label: "nodeip"
  • rabbitmq.yaml configuration (mounted into the container as a file)
- targets:
  - 192.168.1.15:9099
  labels:   
    type: rabbitmq
    env: pro

✅ Rule configuration

groups:
- name: RabbitmqStatus
  rules:
  - alert: RabbitMQ down
    expr: rabbitmq_up == 0
    for: 0m
    labels:
      severity: disaster
    annotations:
      description: "RabbitMQ is down"

  - alert: RabbitMQ has too many unacknowledged messages
    expr: sum by(queue,vhost,rabbitmq_cluster) (rabbitmq_queue_messages_unacknowledged) > 1000
    for: 1m
    labels:
      severity: critical
    annotations:
      description: "Queue {{ $labels.queue }} in vhost {{ $labels.vhost }} has more than 1000 unacknowledged messages"

- name: RabbitmqBlockedRules
  rules:
  - alert: MQ queue backlog in vhost test exceeds 10
    expr: max by(business, vhost, rabbitmq_cluster, queue) (rabbitmq_queue_messages_ready{vhost="test"}) > 10
    for: 2m
    labels:
      severity: critical
    annotations:
      description: "Queue {{ $labels.vhost }}-{{ $labels.queue }} has a backlog of {{ $value }} messages"

Redis-exporter Installation

✅ Installation
Deploy redis_exporter on Kubernetes.

✅ Job configuration

  • prometheus.yaml configuration
- job_name: 'redis_exporter_targets'
  scrape_interval: 15s
  scrape_timeout: 10s
  metrics_path: /scrape
  scheme: http
  file_sd_configs:
  - refresh_interval: 10s
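    files:
    - "/etc/config/jobs/redis.yaml"  # Redis target file
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      regex: (.*)
      target_label: instance   # keep the Redis address as instance; otherwise every target would appear as redis-exporter:9121
      replacement: ${1}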
    - target_label: __address__
      replacement: redis-exporter:9121

- job_name: 'redis_exporter'
  static_configs:
    - targets:
      - redis-exporter:9121
  • redis.yaml configuration (mounted into the container as a file)
- targets:
  - redis://192.168.1.1:7000
  - redis://192.168.1.2:7000
  labels:
    business: sentinel
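
The three relabel rules in the redis_exporter_targets job implement the usual multi-target pattern: the listed Redis address becomes a URL parameter, and the scrape itself goes to the exporter. The sketch below walks through those steps for one target to show the request Prometheus ends up making. It assumes the second rule copies the target into the instance label; scrape_request is a hypothetical helper, not part of Prometheus.

```python
from urllib.parse import urlencode


def scrape_request(target, exporter="redis-exporter:9121",
                   metrics_path="/scrape", scheme="http"):
    """Apply the three relabel steps and return (scrape URL, instance label)."""
    labels = {"__address__": target}
    # Rule 1: copy the listed address into the ?target= URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Rule 2: keep the Redis address as the instance label.
    labels["instance"] = labels["__param_target"]
    # Rule 3: point the actual scrape at the exporter.
    labels["__address__"] = exporter
    query = urlencode({"target": labels["__param_target"]})
    url = f"{scheme}://{labels['__address__']}{metrics_path}?{query}"
    return url, labels["instance"]


url, instance = scrape_request("redis://192.168.1.1:7000")
print(url)
print(instance)
```

Because instance keeps the Redis address, a selector such as instance!~"redis-exporter:9121" can separate the per-Redis series from the exporter's own metrics scraped by the second job.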

✅ Rule configuration

groups:
- name: RedisStatus
  rules:
  - alert: RedisDown
    expr: redis_up{instance!~"redis-exporter:9121"} == 0
    for: 0m
    labels:
      severity: disaster
    annotations:
      description: "Redis instance {{ $labels.instance }} is down"

  - alert: RedisReplicationBroken
    expr: delta(redis_connected_slaves{instance!~"redis-exporter:9121"}[1m]) < 0
    for: 0m
    labels:
      severity: disaster
    annotations:
      description: "Redis cluster {{ $labels.business }} has lost a slave node"

  - alert: RedisClusterFlapping
    expr: changes(redis_connected_slaves{instance!~"redis-exporter:9121"}[1m]) > 1 
    for: 2m
    labels:
      severity: disaster
    annotations:
      description: "Redis cluster {{ $labels.business }}: slave connections are flapping; the current master is {{ $labels.instance }}"

  - alert: RedisAOFBackupStatus
    expr: redis_aof_last_bgrewrite_status{instance!~"redis-exporter:9121"} == 0
    for: 0m
    labels:
      severity: disaster
    annotations:
      description: "Redis cluster {{ $labels.business }} instance {{ $labels.instance }}: the last AOF rewrite failed"

  - alert: RedisOutOfConfiguredMaxmemory
    expr: round(redis_memory_used_bytes{instance!~"redis-exporter:9121"} / redis_memory_max_bytes{instance!~"redis-exporter:9121"} * 100,0.01) > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "Redis cluster {{ $labels.business }} instance {{ $labels.instance }} is using more than 90% of its configured maxmemory (current: {{ $value }}%)"

  - alert: RedisTooManyConnections
    expr: redis_connected_clients{instance!~"redis-exporter:9121"} > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "Redis cluster {{ $labels.business }} instance {{ $labels.instance }} has too many client connections (count: {{ $value }})"

  - alert: RedisRejectedConnections
    expr: increase(redis_rejected_connections_total{instance!~"redis-exporter:9121"}[1m]) > 0
    for: 0m
    labels:
      severity: disaster
    annotations:
      description: "Some connections to Redis have been rejected\n  VALUE = {{ $value }}\n  LABELS = {{ $labels }}"

Blackbox-exporter Installation

✅ Installation
Deploy Blackbox-exporter on Kubernetes.

✅ Job configuration

  • prometheus.yaml configuration
- job_name: 'black_box-port_check'
  scrape_interval: 15s
  scrape_timeout: 3s
  metrics_path: /probe
  params:
    module: [tcp_connect]
  file_sd_configs:
  - refresh_interval: 10s
    files:
    - "/etc/config/jobs/portcheck.yaml"  # port-check target file
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter-headless:9115  

- job_name: 'black_box-ping'
  scrape_interval: 15s
  scrape_timeout: 3s
  metrics_path: /probe
  params:
    module: [icmp]  
  file_sd_configs:
  - refresh_interval: 10s
    files:
    - "/etc/config/jobs/ping.yaml"  # ping target file
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter-headless:9115  

- job_name: 'black_box-http_4xx'
  scrape_interval: 15s
  scrape_timeout: 3s
  metrics_path: /probe
  params:
    module: [http_4xx]  
  file_sd_configs:
  - refresh_interval: 10s
    files:
    - "/etc/config/jobs/http_4xx.yaml"  # HTTP check target file
  relabel_configs:
    - source_labels: [__address__]
      target_label: __param_target
    - source_labels: [__param_target]
      target_label: instance
    - target_label: __address__
      replacement: blackbox-exporter-headless:9115  
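  • portcheck.yaml configuration (mounted into the container as a file)
# Port monitoring
- targets:
  - 192.168.1.120:6379
  labels:
    name: redis
  • ping.yaml configuration (mounted into the container as a file)
# Connectivity monitoring
- targets:
  - 192.168.1.90
  labels:
    name: DNS server
  • http_4xx.yaml configuration (mounted into the container as a file)
# HTTP status monitoring
- targets:
  - https://baidu.com
  labels:
    name: Baidu homepage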

✅ Rule configuration

groups:
- name: Connectivity information
  rules:
  - alert: SSLCertExpiringSoon
    expr: probe_ssl_earliest_cert_expiry{job="black_box-http_4xx"} - time() < 86400 * 30
    for: 10m
    labels:
      severity: disaster
      env: test
    annotations:
      summary: "{{$labels.name}} {{$labels.instance}} certificate is about to expire!"
      description: "{{$labels.name}} certificate expires in less than 30 days; endpoint: {{$labels.instance}}"

  - alert: Domain access failed
    expr: probe_success{job="black_box-ping",group="domain"} == 0
    for: 1m
    labels:
      severity: disaster
    annotations:
      summary: "{{$labels.name}} resolution failed"
      description: "{{$labels.name}} resolution failed"

  - alert: Port check failed
    expr: probe_success{job="black_box-port_check"} == 0
    for: 1m
    labels:
      severity: disaster
    annotations:
      summary: "Port check failed for {{$labels.name}}-{{$labels.instance}}"
      description: "Port on {{$labels.name}}-{{$labels.instance}} is unreachable"

  - alert: API call failed
    expr: probe_success{job="black_box-http_4xx"} == 0
    for: 1m
    labels:
      severity: disaster
    annotations:
      summary: "Call to API {{$labels.name}} failed; endpoint: {{$labels.instance}}"
      description: "Call to API {{$labels.name}} failed; endpoint: {{$labels.instance}}"
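
The certificate rule compares two Unix timestamps: the earliest certificate expiry reported by the probe and the current time, with 86400 * 30 seconds as the 30-day window. The sketch below replays that comparison with fixed, hypothetical dates; cert_expiring_soon is an illustrative helper, not part of Blackbox-exporter.

```python
from datetime import datetime, timezone

SECONDS_PER_DAY = 86400


def cert_expiring_soon(not_after_ts, now_ts, window_days=30):
    """Same comparison as the rule: expiry minus now is below the window."""
    return not_after_ts - now_ts < SECONDS_PER_DAY * window_days


# Hypothetical dates, all in UTC.
now = datetime(2024, 1, 1, tzinfo=timezone.utc).timestamp()
in_20_days = datetime(2024, 1, 21, tzinfo=timezone.utc).timestamp()
in_60_days = datetime(2024, 3, 1, tzinfo=timezone.utc).timestamp()

print(cert_expiring_soon(in_20_days, now))
print(cert_expiring_soon(in_60_days, now))
```

A certificate that has already expired gives a negative difference, so it also satisfies the condition and keeps the alert firing.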