A strange Prometheus problem that I can't figure out

This topic was created 1793 days ago; the information in it may have evolved or changed since then.

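Problem: firing alerts work as expected, but resolved notifications do not.

Example: take the CPU rule below. If CPU usage exceeds the configured threshold of 90, I do receive the alert, as follows:

·Status: firing

·StartsAt: 2020-10-18T10:44:22.718516222Z

·Discription: CPU usage of pod xx-xx-testpod in namespace xx has reached 99.89.

But when the resolved notification arrives, the value in it is often still above 90%... (some resolved notifications do look normal)

·Status: resolved

·StartsAt: 2020-10-18T10:44:22.718516222Z

·Discription: CPU usage of pod xx-xx-testpod in namespace xx has reached 98.11.

As I understand it, a resolved notification should only be triggered after cpu < 90 has held for the resolve_timeout value. Is something misconfigured somewhere?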

All related components are deployed in a k8s cluster through the Kubernetes operator.

PrometheusRule:

- alert: PodCPUOvercommit
  annotations:
    description: CPU usage of pod {{ $labels.pod }} in namespace {{ $labels.namespace }} has reached {{ printf "%.2f" $value }}.
  expr: |
    100 * (sum(rate(container_cpu_usage_seconds_total{namespace!~"monitoring",container!=""}[1m])) by (pod,namespace) / sum(kube_pod_container_resource_limits_cpu_cores{namespace!~"monitoring",container!=""}) by (pod,namespace)) > 90
  for: 3m

Alertmanager configuration:

global:
  resolve_timeout: 1m
route:
  group_by: ['alertname']
  group_wait: 10s
  group_interval: 1m
  repeat_interval: 12h
  receiver: 'webhook'
receivers:
- name: 'webhook'
  webhook_configs:
  - url: 'http://x.x.x.x:xx/'
    send_resolved: true

12 replies • 2024-05-29 17:28:16 +08:00

cdlixucd

2020-10-19 11:32:45 +08:00

https://aleiwu.com/post/prometheus-alert-why/
Take a look at this. I stumbled on it by chance, although I haven't started playing with microservices yet.

HaroldChen

2020-10-19 15:03:08 +08:00

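@cdlixucd Yes, good article. It mentions problems caused by scrape and evaluation being out of sync, but sending the resolved notification shouldn't be affected by that. Since the notification was sent at all, the value must have been judged below 90 for at least 2 cycles (my scrape and evaluation intervals are both 30s). So the value in the resolved notification should definitely be below 90.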

scarletass

2020-10-19 15:27:28 +08:00

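The expression is clear: it computes the average CPU over 1 minute. You are comparing it against the peak value, which is bound to be wrong~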

Maco

2020-10-19 17:09:13 +08:00

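You can refer to this article, which explains it very clearly:
https://www.robustperception.io/why-do-resolved-notifications-contain-old-values
Once the metric is back to normal, the alerting expression no longer returns a value, so Prometheus naturally has no idea what the current value is.
The article's suggestion is to link to a dashboard from the alert message.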
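To make that mechanism concrete, here is a rough Python simulation (not Prometheus code). It assumes the description is only re-rendered when the expression returns a value, and it ignores the `for: 3m` pending period and notification grouping. The names `evaluate` and `simulate` and the sample values are made up for illustration.

```python
THRESHOLD = 90.0


def evaluate(cpu_percent):
    """Mimic `expr > 90`: return the value when above the threshold, else None (empty result)."""
    return cpu_percent if cpu_percent > THRESHOLD else None


def simulate(samples):
    """Walk through successive rule evaluations and record the notifications sent."""
    notifications = []
    firing = False
    last_value = None  # last value the expression returned, used to render the description
    for cpu in samples:
        result = evaluate(cpu)
        if result is not None:
            # The expression returned a sample, so the annotation can be refreshed.
            last_value = result
            if not firing:
                firing = True
                notifications.append(("firing", last_value))
        elif firing:
            # Empty result: the alert resolves, but there is no new value to render,
            # so the resolved notification carries the last abnormal value.
            firing = False
            notifications.append(("resolved", last_value))
    return notifications


if __name__ == "__main__":
    for status, value in simulate([85.0, 99.89, 98.11, 72.4, 60.0]):
        print(f"{status}: CPU usage has reached {value:.2f}")
```

In this sketch the resolved entry reports 98.11 even though the sample that actually cleared the alert was 72.4, which matches the behaviour described in the question.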

HaroldChen

2020-10-19 17:10:03 +08:00

@scarletass Peak? This {{ printf "%.2f" $value }} is also a value computed by expr.

Maco

2020-10-19 17:15:35 +08:00

The value in a resolved notification is the last value computed from your alerting expression, in other words the last abnormal value.

HaroldChen

2020-10-19 19:09:47 +08:00

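@Maco Thanks, I finally understand why this happens -.-. Do firing and resolved notifications have to share the same description message template, or is there a way to tell them apart? When firing, show the current CPU value; once resolved, just say the CPU has recovered and add a Grafana link.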
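
One possible approach, since the receiver here is a webhook: the Alertmanager webhook payload carries a `status` field on every alert, so the receiving service can branch on it when building the message. Below is a minimal Python sketch of that idea. It is not taken from this thread; `format_alert`, `format_payload` and the dashboard URL are hypothetical names and values, and only the payload fields `alerts`, `status`, `labels` and `annotations` come from the webhook format.

```python
# Hypothetical dashboard address, used only for illustration.
DASHBOARD_URL = "http://grafana.example/d/pod-cpu"


def format_alert(alert, dashboard_url=DASHBOARD_URL):
    """Build a message for one alert from an Alertmanager webhook payload."""
    labels = alert.get("labels", {})
    pod = labels.get("pod", "unknown")
    namespace = labels.get("namespace", "unknown")
    if alert.get("status") == "resolved":
        # Skip the stale description on resolve and point at the dashboard instead.
        return (f"[RESOLVED] CPU usage of pod {pod} in namespace {namespace} "
                f"has recovered. Dashboard: {dashboard_url}")
    description = alert.get("annotations", {}).get("description", "")
    return f"[FIRING] {description} Dashboard: {dashboard_url}"


def format_payload(payload, dashboard_url=DASHBOARD_URL):
    """Format every alert contained in one webhook request body."""
    return [format_alert(a, dashboard_url) for a in payload.get("alerts", [])]


if __name__ == "__main__":
    payload = {
        "status": "resolved",
        "alerts": [{
            "status": "resolved",
            "labels": {"alertname": "PodCPUOvercommit",
                       "namespace": "xx", "pod": "xx-xx-testpod"},
            "annotations": {"description": "CPU usage of pod xx-xx-testpod "
                                           "in namespace xx has reached 98.11."},
        }],
    }
    for line in format_payload(payload):
        print(line)
```

With this kind of branching the rule's description can stay as it is, and the stale 98.11 simply never shows up in the resolved message.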