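Monitoring platform series
Monitoring server: Prometheus
Metrics collectors: Exporter
Alerting server: Alertmanager
Alerting middleware: PrometheusAlert
Dashboards: Grafana
Exporter installation
Windows-exporter installation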
✅ Download
Windows-exporter download page
Choose windows_exporter-{{version}}-amd64.msi
Note that the file name must end in amd64.msi
✅ Installation
Upload the installer to the server's desktop and run one of the commands below, replacing {{version}} with the actual version number
- Without mssql monitoring
msiexec /i C:\Users\Administrator\Desktop\windows_exporter-{{version}}-amd64.msi ENABLED_COLLECTORS=cpu,net,os,memory,process,tcp,textfile,cs,logical_disk,service,system /quiet
- With mssql monitoring
msiexec /i C:\Users\Administrator\Desktop\windows_exporter-{{version}}-amd64.msi ENABLED_COLLECTORS=cpu,net,os,memory,process,tcp,mssql,textfile,cs,logical_disk,service,system /quiet
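The two commands differ only in the collector list. As a rough illustration, the command line could be assembled from that list as sketched below; the helper name build_install_command and its parameters are made up for this example and are not part of windows_exporter.

```python
# Base collectors taken from the install commands in this section.
BASE_COLLECTORS = ["cpu", "net", "os", "memory", "process", "tcp",
                   "textfile", "cs", "logical_disk", "service", "system"]


def build_install_command(msi_path, with_mssql=False):
    """Return the msiexec command line for a silent windows_exporter install.

    Hypothetical helper: it only formats a string, it does not run anything.
    """
    collectors = list(BASE_COLLECTORS)
    if with_mssql:
        # Place mssql right after tcp, matching the order used in this section.
        collectors.insert(collectors.index("tcp") + 1, "mssql")
    return "msiexec /i {} ENABLED_COLLECTORS={} /quiet".format(
        msi_path, ",".join(collectors))


print(build_install_command(r"C:\Users\Administrator\Desktop\windows_exporter.msi",
                            with_mssql=True))
```

Keeping the collector list in one place makes it harder for the two variants to drift apart.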
✅ Firewall
If the Windows firewall is enabled, add a rule that allows inbound traffic on port 9182
netsh advfirewall firewall add rule name="windows-exporter" dir=in action=allow protocol=TCP localport=9182
✅ Job configuration
- prometheus.yaml
    - job_name: windows-exporter
      scrape_interval: 15s
      scrape_timeout: 10s
      metrics_path: /metrics
      scheme: http
      file_sd_configs:
      - refresh_interval: 10s
        files:
        - "/etc/config/jobs/windows.yaml"
- windows.yaml (mounted into the container as a file). A file_sd file contains only target groups, so job_name and static_configs do not belong in it; the per-host name is kept in a nickname label instead
- Multiple hosts with different labels
    - targets: ['192.168.1.117:9182']
      labels:
        os: WINDOWS
        env: dev
        nickname: dev_app_01
    - targets: ['192.168.2.217:9182']
      labels:
        os: WINDOWS
        env: test
        nickname: test_app_01
- Multiple hosts sharing the same labels
    - targets:
      - 192.168.2.217:9182
      - 192.168.2.218:9182
      - 192.168.2.219:9182
      labels:
        os: WINDOWS
        env: test
        nickname: test_app_01
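File-based service discovery reads a list of target groups, each with a targets list and an optional labels map, and it accepts JSON files as well as YAML. The sketch below generates such a list for the hosts used above; the function names are illustrative, and writing the result to a file such as windows.json is left to the reader.

```python
import json


def target_group(hosts, port, labels):
    """Build one file_sd target group: host:port targets plus shared labels."""
    return {"targets": ["{}:{}".format(h, port) for h in hosts],
            "labels": dict(labels)}


def render_file_sd(groups):
    """Serialize target groups as JSON, one of the formats file_sd accepts."""
    return json.dumps(groups, indent=2, sort_keys=True)


groups = [
    # One host with its own labels.
    target_group(["192.168.1.117"], 9182, {"os": "WINDOWS", "env": "dev"}),
    # Several hosts sharing one label set.
    target_group(["192.168.2.217", "192.168.2.218", "192.168.2.219"], 9182,
                 {"os": "WINDOWS", "env": "test"}),
]
print(render_file_sd(groups))
```

Generating the file from an inventory keeps the "different labels" and "same labels" cases in one code path: hosts that share labels go into one group.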
✅ Rule configuration
groups:
- name: Windows host information
  rules:
  - alert: CPU usage
    expr: round(100 - (avg by (instance) (irate(windows_cpu_time_total{mode="idle"}[5m])) * 100), 0.01) > 60
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} CPU usage is too high!"
      description: "{{$labels.instance}} CPU usage is above 60% (current value: {{$value}}%)"
  - alert: Memory usage
    expr: round(100 - ((windows_os_physical_memory_free_bytes / windows_cs_physical_memory_bytes) * 100), 0.01) > 90
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} memory usage is too high!"
      description: "{{$labels.instance}} memory usage is above 90% (current value: {{$value}}%)"
  - alert: Disk usage
    expr: round((1 - windows_logical_disk_free_bytes / windows_logical_disk_size_bytes) * 100, 0.01) > 90
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} disk volume {{$labels.volume}} usage is too high!"
      description: "{{$labels.instance}} disk volume {{$labels.volume}} usage is above 90% (current value: {{$value}}%)"
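The three expressions share one pattern: convert a free or idle share into a used percentage, then round to two decimals. A small worked sketch of that arithmetic follows; the function names and sample numbers are invented for illustration and are not exporter output.

```python
def cpu_usage_percent(idle_fraction_per_core):
    """100 minus the average idle share across cores, as in the CPU rule."""
    avg_idle = sum(idle_fraction_per_core) / len(idle_fraction_per_core)
    return round(100 - avg_idle * 100, 2)


def used_percent(free_bytes, total_bytes):
    """Used share of memory or disk, as in the memory and disk rules."""
    return round((1 - free_bytes / total_bytes) * 100, 2)


# Two cores idle 25% and 35% of the time give 70% usage, above the 60% threshold.
print(cpu_usage_percent([0.25, 0.35]))
# 4 GiB free out of 16 GiB gives 75% usage, below the 90% threshold.
print(used_percent(4 * 2**30, 16 * 2**30))
```

PromQL's round(v, 0.01) rounds to the nearest multiple of 0.01, which is what round(x, 2) approximates here.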
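Linux-exporter installation
✅ Download
Linux-exporter (node_exporter) download page
Choose node_exporter-{{version}}.linux-amd64.tar.gz and upload it to the server's root directory
Note that the file name must end in amd64.tar.gz
✅ System preparation
- If the firewall is enabled, allow port 9100 through it (the commands below also disable SELinux)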
sed -i s/SELINUX\=enforcing/SELINUX\=disabled/g /etc/selinux/config
setenforce 0
iptables -A INPUT -p tcp --dport 9100 -j ACCEPT
service iptables save
service iptables restart
- If the firewall is not needed, stop and disable it
systemctl stop firewalld && systemctl disable firewalld
sed -i s/SELINUX\=enforcing/SELINUX\=disabled/g /etc/selinux/config
setenforce 0
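✅ Installation
With the archive in the root directory, run the commands below, replacing {{version}} with the actual version number. Newer node_exporter releases spell the systemd filter flag --collector.systemd.unit-include; keep --collector.systemd.unit-whitelist only if your version still accepts it (check node_exporter --help)
mkdir -p /usr/share/exporter/node_exporter
tar xf node_exporter-{{version}}.linux-amd64.tar.gz -C /usr/share/exporter/node_exporter --strip-components=1
useradd prometheus
cat > /usr/lib/systemd/system/node_exporter.service << EOF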
[Unit]
Description=node_exporter
After=network.target
[Service]
User=prometheus
Group=prometheus
ExecStart=/usr/share/exporter/node_exporter/node_exporter \\
--web.listen-address=:9100 \\
--collector.systemd \\
--collector.systemd.unit-whitelist=(sshd|nginx).service \\
--collector.processes \\
--collector.tcpstat
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now node_exporter
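Once the service is running, the exporter serves plain-text metrics on the configured port (9100 here) under /metrics. A minimal sketch for fetching and parsing that output is shown below; it assumes sample lines carry no trailing timestamp, and both function names are illustrative.

```python
from urllib.request import urlopen


def fetch_metrics(url, timeout=5):
    """Download the raw text exposition from an exporter endpoint."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def parse_samples(text):
    """Map each sample line ('name{labels} value') to a float; skip comments."""
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # HELP/TYPE comments and blank lines carry no samples
        series, _, value = line.rpartition(" ")
        samples[series] = float(value)
    return samples


sample = (
    "# HELP node_load1 1m load average.\n"
    "# TYPE node_load1 gauge\n"
    "node_load1 0.42\n"
    'node_cpu_seconds_total{cpu="0",mode="idle"} 1234.5\n'
)
print(parse_samples(sample))
```

Against a live host the call would look like parse_samples(fetch_metrics("http://127.0.0.1:9100/metrics")).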
✅ Job configuration
- prometheus.yaml
    - job_name: linux-exporter
      scrape_interval: 15s
      scrape_timeout: 10s
      metrics_path: /metrics
      scheme: http
      file_sd_configs:
      - refresh_interval: 10s
        files:
        - "/etc/config/jobs/linux.yaml"
- linux.yaml (mounted into the container as a file). As a file_sd file it contains only target groups, and the targets use port 9100, the address node_exporter listens on
- Multiple hosts with different labels
    - targets: ['192.168.1.117:9100']
      labels:
        os: Linux
        env: dev
        nickname: dev_app_01
    - targets: ['192.168.2.217:9100']
      labels:
        os: Linux
        env: test
        nickname: test_app_01
- Multiple hosts sharing the same labels
    - targets:
      - 192.168.2.217:9100
      - 192.168.2.218:9100
      - 192.168.2.219:9100
      labels:
        os: Linux
        env: test
        nickname: test_app_01
✅ Rule configuration
groups:
- name: Linux host information
  rules:
  - alert: CPU usage
    expr: round(100 - (avg(irate(node_cpu_seconds_total{mode="idle"}[5m])) by (instance) * 100), 0.01) > 60
    for: 10m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} CPU usage is too high!"
      description: "{{$labels.instance}} CPU usage is above 60% (current value: {{$value}}%)"
  - alert: Memory usage
    expr: round((1 - (node_memory_MemFree_bytes + node_memory_Buffers_bytes + node_memory_Cached_bytes) / node_memory_MemTotal_bytes) * 100, 0.01) > 90
    for: 2m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} memory usage is too high!"
      description: "{{$labels.instance}} memory usage is above 90% (current value: {{$value}}%)"
  - alert: Disk IO
    expr: round(avg(irate(node_disk_io_time_seconds_total[1m])) by (instance) * 100, 0.01) > 60
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} disk IO utilization is too high!"
      description: "{{$labels.instance}} disk IO utilization is above 60% (current value: {{$value}}%)"
  - alert: Inbound bandwidth
    expr: round(((sum(rate(node_network_receive_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 1024 / 1024), 0.01) > 100
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} inbound network bandwidth is too high!"
      description: "{{$labels.instance}} inbound bandwidth, averaged over 5 minutes, is above 100 MiB/s. RX rate: {{$value}} MiB/s"
  - alert: Outbound bandwidth
    expr: round(((sum(rate(node_network_transmit_bytes_total{device!~'tap.*|veth.*|br.*|docker.*|virbr.*|lo'}[5m])) by (instance)) / 1024 / 1024), 0.01) > 100
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} outbound network bandwidth is too high!"
      description: "{{$labels.instance}} outbound bandwidth, averaged over 5 minutes, is above 100 MiB/s. TX rate: {{$value}} MiB/s"
  - alert: TCP sessions
    expr: node_netstat_Tcp_CurrEstab > 10000
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} has too many ESTABLISHED TCP connections!"
      description: "{{$labels.instance}} has more than 10000 ESTABLISHED TCP connections (current value: {{$value}})"
  - alert: Disk capacity
    expr: round(100 - (node_filesystem_free_bytes{fstype=~"ext4|xfs",mountpoint!="/boot"} / node_filesystem_size_bytes{fstype=~"ext4|xfs",mountpoint!="/boot"} * 100), 0.01) > 80
    for: 1m
    labels:
      severity: critical
    annotations:
      summary: "{{$labels.instance}} filesystem {{$labels.mountpoint}} usage is too high!"
      description: "{{$labels.instance}} filesystem {{$labels.mountpoint}} usage is above 80% (current value: {{$value}}%)"
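Label matchers in PromQL are fully anchored, so the device filter in the bandwidth rules must match the whole interface name. A pattern such as virbr* means "virb followed by any number of r", which does not cover virbr0. The sketch below uses Python's re.fullmatch as a stand-in for that anchoring to compare two candidate filters; the constant and function names are illustrative.

```python
import re


def device_excluded(pattern, device):
    """Mimic an anchored regex matcher: the pattern must match the whole value."""
    return re.fullmatch(pattern, device) is not None


# Glob-style habits ("virbr*", "lo*") versus explicit regex forms.
LOOSE = r"tap.*|veth.*|br.*|docker.*|virbr*|lo*"
STRICT = r"tap.*|veth.*|br.*|docker.*|virbr.*|lo"

for dev in ["eth0", "lo", "virbr0", "docker0"]:
    print(dev, device_excluded(LOOSE, dev), device_excluded(STRICT, dev))
```

With the loose pattern, virbr0 is not excluded and its traffic is counted in the sum; the strict pattern excludes it as intended.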
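Mysql-exporter installation
✅ Download
Mysql-exporter (mysqld_exporter) download page
Choose mysqld_exporter-{{version}}.linux-amd64.tar.gz and upload it to the server's root directory
Note that the file name must end in amd64.tar.gz
✅ System preparation
- If the firewall is enabled, allow port 9104 through it (the commands below also disable SELinux)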
sed -i s/SELINUX\=enforcing/SELINUX\=disabled/g /etc/selinux/config
setenforce 0
iptables -A INPUT -p tcp --dport 9104 -j ACCEPT
service iptables save
service iptables restart
- If the firewall is not needed, stop and disable it
systemctl stop firewalld && systemctl disable firewalld
sed -i s/SELINUX\=enforcing/SELINUX\=disabled/g /etc/selinux/config
setenforce 0
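✅ Database account (change the password before use)
CREATE USER 'exporter'@'%' IDENTIFIED BY 'exporter' WITH MAX_USER_CONNECTIONS 3;
GRANT SELECT, PROCESS, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'exporter'@'%';
✅ Installation
With the archive in the root directory, run the commands below, replacing {{version}} with the actual version number. The unit file passes credentials through DATA_SOURCE_NAME, which older mysqld_exporter releases read; newer releases expect a my.cnf file passed with --config.my-cnf instead, so check the documentation of the version you install
mkdir -p /opt/exporter/mysqld_exporter
tar xf mysqld_exporter-{{version}}.linux-amd64.tar.gz -C /opt/exporter/mysqld_exporter --strip-components=1
useradd prometheus
cat > /usr/lib/systemd/system/mysqld_exporter.service << EOF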
[Unit]
Description=mysqld_exporter
After=network.target
[Service]
User=prometheus
Group=prometheus
Environment="DATA_SOURCE_NAME=exporter:exporter@(localhost:3306)/"
ExecStart=/opt/exporter/mysqld_exporter/mysqld_exporter \\
--web.listen-address=:9104 \\
--collect.slave_hosts
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now mysqld_exporter
✅ Job configuration
- prometheus.yaml
    - job_name: 'sql-exporter'
      scrape_interval: 15s
      scrape_timeout: 10s
      metrics_path: /metrics
      scheme: http
      file_sd_configs:
      - refresh_interval: 10s
        files:
        - "/etc/config/jobs/mysql.yaml"  # MySQL target file
      relabel_configs:
      - action: replace
        source_labels: [__address__]
        regex: (.*):([0-9]+)   # match host:port; each ( ) is a capture group
        replacement: $1        # use the first capture group (the host)
        target_label: "nodeip"
- mysql.yaml (mounted into the container as a file)
    - targets: ['192.168.1.25:9104']
      labels:
        type: mysql
        env: dev
        nickname: mysql-dev
    - targets: ['192.168.1.38:9104']
      labels:
        type: mysql
        env: test
        nickname: mysql-test
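The relabel rule in the job copies the host part of __address__ into a nodeip label, which the alert descriptions then reference. The sketch below reproduces that capture with Python's re module; as with the replace action, a non-matching address leaves the label unset (None here). The function name is illustrative.

```python
import re

# Same expression as the relabel rule; relabel regexes are fully anchored.
RELABEL_REGEX = re.compile(r"(.*):([0-9]+)")


def nodeip_from_address(address):
    """Return capture group 1 (the host), or None when the regex does not match."""
    match = RELABEL_REGEX.fullmatch(address)
    return match.group(1) if match else None


print(nodeip_from_address("192.168.1.25:9104"))
print(nodeip_from_address("no-port-here"))
```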
✅ Rule configuration
groups:
- name: MySQL status
  rules:
  - alert: MysqlDown
    expr: mysql_up == 0
    for: 0m
    labels:
      severity: disaster
    annotations:
      description: "Database {{ $labels.nodeip }} is down"
  - alert: MysqlSlowQueries
    expr: ceil(increase(mysql_global_status_slow_queries[1m])) > 100
    for: 2m
    labels:
      severity: critical
    annotations:
      description: "More than 100 slow queries logged within 1 minute, count: {{ $value }}"
  - alert: MysqlInnodbLogWaits
    expr: rate(mysql_global_status_innodb_log_waits[15m]) > 10
    for: 0m
    labels:
      severity: critical
    annotations:
      description: "MySQL InnoDB log writes are stalling, value: {{ $value }}"
  - alert: MysqlRestarted
    expr: mysql_global_status_uptime < 60
    for: 0m
    labels:
      severity: info
    annotations:
      description: "The database service has restarted"
  - alert: MysqlSlaveReplicationLag
    expr: mysql_slave_status_seconds_behind_master > 5000
    for: 1m
    labels:
      severity: critical
    annotations:
      description: "Replication lag behind the master is too high: {{ $value }} seconds"
  - alert: MysqlTooManyConnections(>60%)
    expr: max_over_time(mysql_global_status_threads_connected[1m]) > 3000
    for: 2m
    labels:
      severity: critical
    annotations:
      description: "Connection count exceeds 60% of the configured limit (5000); current connections: {{ $value }}"
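Rabbitmq-exporter installation
✅ Download
rabbitmq-exporter download page
Choose rabbitmq_exporter_{{version}}-RC19_linux_amd64.tar.gz and upload it to the server's root directory
Note that the file name must end in amd64.tar.gz
✅ System preparation
- If the firewall is enabled, allow port 9099 through it (the commands below also disable SELinux)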
sed -i s/SELINUX\=enforcing/SELINUX\=disabled/g /etc/selinux/config
setenforce 0
iptables -A INPUT -p tcp --dport 9099 -j ACCEPT
service iptables save
service iptables restart
- If the firewall is not needed, stop and disable it
systemctl stop firewalld && systemctl disable firewalld
sed -i s/SELINUX\=enforcing/SELINUX\=disabled/g /etc/selinux/config
setenforce 0
✅ Installation
Change to the root directory and run the commands below, replacing {{version}} with the actual version number
mkdir -p /opt/exporter/rabbitmq_exporter
tar xf rabbitmq_exporter_{{version}}-RC19_linux_amd64.tar.gz -C /opt/exporter/rabbitmq_exporter --strip-components=1
✅ Edit config.json
vi /opt/exporter/rabbitmq_exporter/config.json
{
  "rabbit_url": "http://127.0.0.1:15672",
  "rabbit_user": "{{username}}",
  "rabbit_pass": "{{password}}",
  "publish_port": "9099",
  "publish_addr": "",
  "output_format": "TTY",
  "ca_file": "ca.pem",
  "cert_file": "client-cert.pem",
  "key_file": "client-key.pem",
  "insecure_skip_verify": false,
  "exlude_metrics": [],
  "include_queues": ".*",
  "skip_queues": "^$",
  "skip_vhost": "^$",
  "include_vhost": ".*",
  "rabbit_capabilities": "no_sort,bert",
  "enabled_exporters": [
    "exchange",
    "node",
    "overview",
    "queue"
  ],
  "timeout": 30,
  "max_queues": 0
}
✅ Run the exporter as a service
The unit runs as the prometheus user; if that user does not exist on this host yet, create it first with useradd prometheus
cat > /usr/lib/systemd/system/rabbitmq_exporter.service << EOF
[Unit]
Description=rabbitmq_exporter
After=network.target
[Service]
User=prometheus
Group=prometheus
ExecStart=/opt/exporter/rabbitmq_exporter/rabbitmq_exporter -config-file=/opt/exporter/rabbitmq_exporter/config.json
[Install]
WantedBy=multi-user.target
EOF
systemctl daemon-reload
systemctl enable --now rabbitmq_exporter
✅ Job configuration
- prometheus.yaml
    - job_name: 'rabbitmq-exporter'
      scrape_interval: 15s
      scrape_timeout: 10s
      metrics_path: /metrics
      scheme: http
      file_sd_configs:
      - refresh_interval: 10s
        files:
        - "/etc/config/jobs/rabbitmq.yaml"  # RabbitMQ target file
      relabel_configs:
      - action: replace
        source_labels: [__address__]
        regex: (.*):([0-9]+)   # match host:port; each ( ) is a capture group
        replacement: $1        # use the first capture group (the host)
        target_label: "nodeip"
- rabbitmq.yaml (mounted into the container as a file)
    - targets:
      - 192.168.1.15:9099
      labels:
        type: rabbitmq
        env: pro
✅ Rule configuration
groups:
- name: RabbitmqStatus
  rules:
  - alert: RabbitMQ down
    expr: rabbitmq_up == 0
    for: 0m
    labels:
      severity: disaster
    annotations:
      description: "RabbitMQ is down"
  - alert: RabbitMQ has too many unacknowledged messages
    expr: sum by(queue,vhost,rabbitmq_cluster) (rabbitmq_queue_messages_unacknowledged) > 1000
    for: 1m
    labels:
      severity: critical
    annotations:
      description: "Queue {{ $labels.queue }} in vhost {{ $labels.vhost }} has more than 1000 unacknowledged messages"
- name: RabbitmqBlockedRules
  rules:
  - alert: Queue backlog in vhost test exceeds 10
    expr: max by(business, vhost, rabbitmq_cluster, queue) (rabbitmq_queue_messages_ready{vhost="test"}) > 10
    for: 2m
    labels:
      severity: critical
    annotations:
      description: "Queue {{ $labels.vhost }}-{{ $labels.queue }} has {{ $value }} messages waiting"
Redis-exporter installation
✅ Installation
Deploy redis_exporter on k8s
✅ Job configuration
- prometheus.yaml. The second relabel rule writes the Redis address into the instance label, which the alert rules rely on to tell Redis instances apart from the exporter itself
    - job_name: 'redis_exporter_targets'
      scrape_interval: 15s
      scrape_timeout: 10s
      metrics_path: /scrape
      scheme: http
      file_sd_configs:
      - refresh_interval: 10s
        files:
        - "/etc/config/jobs/redis.yaml"  # Redis target file
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        regex: (.*)
        target_label: instance
        replacement: ${1}
      - target_label: __address__
        replacement: redis-exporter:9121
    - job_name: 'redis_exporter'
      static_configs:
      - targets:
        - redis-exporter:9121
- redis.yaml (mounted into the container as a file)
    - targets:
      - redis://192.168.1.1:7000
      - redis://192.168.1.2:7000
      labels:
        business: sentinel
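The relabel rules in this job implement the multi-target pattern: each listed Redis address becomes the target query parameter, and the HTTP request itself goes to the exporter. The sketch below reconstructs the resulting scrape URL; the function name and defaults are illustrative, taken from the job definition above.

```python
from urllib.parse import urlencode


def scrape_url(target, exporter="redis-exporter:9121", path="/scrape", scheme="http"):
    """Follow the relabel steps and return the URL Prometheus would request."""
    labels = {"__address__": target}
    # Step 1: copy the listed address into the 'target' URL parameter.
    labels["__param_target"] = labels["__address__"]
    # Step 2 (not needed for the URL): copy it into a visible label as well.
    # Step 3: point the request at the exporter instead of Redis itself.
    labels["__address__"] = exporter
    query = urlencode({"target": labels["__param_target"]})
    return "{}://{}{}?{}".format(scheme, labels["__address__"], path, query)


print(scrape_url("redis://192.168.1.1:7000"))
```

One exporter can therefore serve any number of Redis instances; adding an instance only means adding a line to redis.yaml.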
✅ Rule configuration
groups:
- name: RedisStatus
  rules:
  - alert: RedisDown
    expr: redis_up{instance!~"redis-exporter:9121"} == 0
    for: 0m
    labels:
      severity: disaster
    annotations:
      description: "Redis instance {{ $labels.instance }} is down"
  - alert: RedisReplicationBroken
    expr: delta(redis_connected_slaves{instance!~"redis-exporter:9121"}[1m]) < 0
    for: 0m
    labels:
      severity: disaster
    annotations:
      description: "Redis cluster {{ $labels.business }} lost a replica"
  - alert: RedisClusterFlapping
    expr: changes(redis_connected_slaves{instance!~"redis-exporter:9121"}[1m]) > 1
    for: 2m
    labels:
      severity: disaster
    annotations:
      description: "Replica count of Redis cluster {{ $labels.business }} keeps changing; the current master is {{ $labels.instance }}"
  - alert: RedisAOFBackupStatus
    expr: delta(redis_aof_last_bgrewrite_status{instance!~"redis-exporter:9121"}[1m]) > 0
    for: 0m
    labels:
      severity: disaster
    annotations:
      description: "Redis cluster {{ $labels.business }} instance {{ $labels.instance }}: the last AOF rewrite reported an abnormal status"
  - alert: RedisOutOfConfiguredMaxmemory
    expr: round(redis_memory_used_bytes{instance!~"redis-exporter:9121"} / redis_memory_max_bytes{instance!~"redis-exporter:9121"} * 100, 0.01) > 90
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "Redis cluster {{ $labels.business }} instance {{ $labels.instance }} is using more than 90% of its configured maxmemory, value: {{ $value }}%"
  - alert: RedisTooManyConnections
    expr: redis_connected_clients{instance!~"redis-exporter:9121"} > 1000
    for: 2m
    labels:
      severity: warning
    annotations:
      description: "Redis cluster {{ $labels.business }} instance {{ $labels.instance }} has too many connections, count: {{ $value }}"
  - alert: RedisRejectedConnections
    expr: increase(redis_rejected_connections_total{instance!~"redis-exporter:9121"}[1m]) > 0
    for: 0m
    labels:
      severity: disaster
    annotations:
      description: "Some connections to Redis have been rejected\n VALUE = {{ $value }}\n LABELS = {{ $labels }}"
Blackbox-exporter installation
✅ Installation
Deploy Blackbox-exporter on k8s
✅ Job configuration
- prometheus.yaml
    - job_name: 'black_box-port_check'
      scrape_interval: 15s
      scrape_timeout: 3s
      metrics_path: /probe
      params:
        module: [tcp_connect]
      file_sd_configs:
      - refresh_interval: 10s
        files:
        - "/etc/config/jobs/portcheck.yaml"  # port-check target file
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter-headless:9115
    - job_name: 'black_box-ping'
      scrape_interval: 15s
      scrape_timeout: 3s
      metrics_path: /probe
      params:
        module: [icmp]
      file_sd_configs:
      - refresh_interval: 10s
        files:
        - "/etc/config/jobs/ping.yaml"  # ping target file
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter-headless:9115
    - job_name: 'black_box-http_4xx'
      scrape_interval: 15s
      scrape_timeout: 3s
      metrics_path: /probe
      params:
        module: [http_4xx]
      file_sd_configs:
      - refresh_interval: 10s
        files:
        - "/etc/config/jobs/http_4xx.yaml"  # HTTP check target file
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-exporter-headless:9115
- portcheck.yaml (mounted into the container as a file)
    # Port checks
    - targets:
      - 192.168.1.120:6379
      labels:
        name: redis
- ping.yaml (mounted into the container as a file)
    # Connectivity checks
    - targets:
      - 192.168.1.90
      labels:
        name: DNS server
- http_4xx.yaml (mounted into the container as a file)
    # HTTP status checks
    - targets:
      - https://baidu.com
      labels:
        name: Baidu home page
✅ Rule configuration
groups:
- name: Connectivity
  rules:
  - alert: SSLCertExpiringSoon
    # The job matcher must name a job defined in prometheus.yaml; the HTTP job above is black_box-http_4xx
    expr: probe_ssl_earliest_cert_expiry{job="black_box-http_4xx"} - time() < 86400 * 30
    for: 10m
    labels:
      severity: disaster
      env: test
    annotations:
      summary: "{{$labels.name}} {{$labels.instance}} certificate is about to expire!"
      description: "The certificate for {{$labels.name}} expires in less than 30 days; endpoint: {{$labels.instance}}"
  - alert: Domain unreachable
    # Fires only for ping targets that carry a group: domain label, which the sample ping.yaml does not set
    expr: probe_success{job="black_box-ping",group="domain"} == 0
    for: 1m
    labels:
      severity: disaster
    annotations:
      summary: "{{$labels.name}} resolution failed"
      description: "{{$labels.name}} resolution failed"
  - alert: Port check failed
    expr: probe_success{job="black_box-port_check"} == 0
    for: 1m
    labels:
      severity: disaster
    annotations:
      summary: "Port check failed for {{$labels.name}}-{{$labels.instance}}"
      description: "{{$labels.name}}-{{$labels.instance}} port is unreachable"
  - alert: API call failed
    expr: probe_success{job="black_box-http_4xx"} == 0
    for: 1m
    labels:
      severity: disaster
    annotations:
      summary: "Call to {{$labels.name}} failed; endpoint: {{$labels.instance}}"
      description: "Call to {{$labels.name}} failed; endpoint: {{$labels.instance}}"
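The certificate rule compares the expiry timestamp reported by the probe with the current time and fires when fewer than 30 days remain. A small worked sketch of that comparison follows; the function names and the sample timestamps are illustrative.

```python
import time

THIRTY_DAYS = 86400 * 30  # same window as the SSLCertExpiringSoon expression


def seconds_until_expiry(not_after_epoch, now=None):
    """Remaining lifetime: expiry timestamp minus the current Unix time."""
    now = time.time() if now is None else now
    return not_after_epoch - now


def expiring_soon(not_after_epoch, now=None, window=THIRTY_DAYS):
    """True when the certificate expires within the alert window."""
    return seconds_until_expiry(not_after_epoch, now) < window


now = 1_700_000_000  # fixed reference time so the example is reproducible
print(expiring_soon(now + 86400 * 10, now=now))  # 10 days left: inside the window
print(expiring_soon(now + 86400 * 45, now=now))  # 45 days left: outside the window
```

Because the rule also has for: 10m, the condition must hold for ten minutes before the alert is sent.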