On the meaning of the fail_timeout=0 parameter in NGINX upstream configuration

ctsed

2016-12-23 16:58:39 +08:00

A dynamic application can also hit a failed disk or run out of memory.

glasslion

2016-12-23 17:04:00 +08:00

In that case, wouldn't raising max_fails be the more sensible fix for dynamic applications?

finab

2016-12-23 17:04:11 +08:00

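@ctsed Don't split hairs. The point is that a static backend rarely returns a 500, and when it does, it is because of something like a full disk.

A dynamic backend, by contrast, is quite likely to return a 500, and the share of those 500s caused by disk problems is very, very small.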

est

2016-12-23 17:09:28 +08:00

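This is a gotcha. You only find it by reading the documentation carefully. A professional SA who doesn't know about it loses points.

Another point: using HTTP/1.1 with Connection: keep-alive improves efficiency.

By default nginx sends Connection: close to the upstream.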
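
For reference, a minimal sketch of how upstream keep-alive is usually enabled. The upstream name `app_backend`, the address, and the pool size of 16 are hypothetical values chosen for illustration; the blocks are assumed to sit inside the `http` context.

```nginx
# Sketch only: "app_backend", the address and the pool size are assumptions.
upstream app_backend {
    server 127.0.0.1:8000;
    # Keep up to 16 idle connections to the upstream open per worker process.
    keepalive 16;
}

server {
    listen 80;

    location / {
        proxy_pass http://app_backend;
        # Speak HTTP/1.1 to the upstream so connections can be reused.
        proxy_http_version 1.1;
        # Clear the Connection header so "close" is not sent to the upstream.
        proxy_set_header Connection "";
    }
}
```

All three pieces are needed together: the `keepalive` pool in the upstream block, HTTP/1.1 toward the upstream, and an empty Connection header.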

ctsed

2016-12-23 17:11:04 +08:00

r#3 @finab If 500s are that rare, isn't this feature pretty much useless?

qq286735628

2016-12-23 17:18:32 +08:00

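If you consider the backend's 500s occasional and acceptable, you should raise max_fails to tolerate those occasional errors.
Otherwise, when a backend really does fail continuously, nginx's automatic upstream ejection mechanism is useless.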
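
A rough sketch of what raising max_fails could look like. The upstream name, addresses, and the values 5 and 10s are illustrative assumptions, not recommendations from this thread.

```nginx
# Sketch only: names, addresses and thresholds are assumptions.
upstream app_backend {
    # A server is marked unavailable only after 5 failed attempts within 10s,
    # and then stays out of rotation for 10s before being tried again.
    server 10.0.0.1:8000 max_fails=5 fail_timeout=10s;
    server 10.0.0.2:8000 max_fails=5 fail_timeout=10s;
}
```

With a higher max_fails, an isolated error no longer takes the server out of rotation, while a server that keeps failing is still ejected.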

lhbc

2016-12-23 17:23:57 +08:00

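The OP's understanding is incorrect.
1. If the upstream has only one server, max_fails and fail_timeout have no effect.

2. If the upstream has multiple servers, then once a server reaches max_fails errors (counted within the fail_timeout window), it is taken out of rotation for the fail_timeout period.
If all servers fail, nginx clears this state and round-robins across all of them.

In other words, however you configure it, nginx makes sure the upstream has a usable server.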

lhbc

2016-12-23 17:39:16 +08:00

@qq286735628 +1

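For origin fetches across multiple servers with occasional errors, raising max_fails is enough; just make sure fail_timeout can still do its job.
For a single origin, there is no need to set these two parameters at all.

Whether the backend is purely static or a mix of static and dynamic,
you should use proxy_next_upstream and proxy_cache_use_stale to keep static resources available.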
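
One possible shape of that setup, as a sketch under assumptions: the cache path, the zone name `static_cache`, the upstream name `origin`, the addresses, and the 10-minute validity are all hypothetical, and the blocks are assumed to sit inside the `http` context.

```nginx
# Sketch only: paths, names, addresses and durations are assumptions.
proxy_cache_path /var/cache/nginx/static keys_zone=static_cache:10m;

upstream origin {
    server 10.0.0.1:8080;
    server 10.0.0.2:8080;
}

server {
    listen 80;

    location /static/ {
        proxy_pass http://origin;
        proxy_cache static_cache;
        proxy_cache_valid 200 10m;
        # On errors, timeouts or 5xx responses, retry the request on the next server.
        proxy_next_upstream error timeout http_500 http_502 http_503 http_504;
        # If the origin still fails, serve a stale cached copy instead of an error.
        proxy_cache_use_stale error timeout updating http_500 http_502 http_503 http_504;
    }
}
```

Retrying on 5xx is limited to the static location here; whether it is safe for dynamic requests depends on the application.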

Livid

2016-12-23 17:45:10 +08:00

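Sometimes the situation is this: the backend still has capacity, but it goes to waste because the fail_timeout and max_fails values are poorly chosen. That is also why the error log sometimes shows no live upstreams while connecting to upstream even though the backend is clearly still alive.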

wupher

2016-12-23 17:55:51 +08:00

Learned something. I happened to run into a similar problem recently; I'll try this at home tonight.

banxi1988

2016-12-23 18:06:59 +08:00

I had never configured fail_timeout before.
Looks like I need to pay attention to this from now on. My backend is fairly likely to throw 500 errors.

fangpeishi

2016-12-23 18:10:54 +08:00

justfly

2016-12-23 19:33:25 +08:00

The OP's approach is wrong.

The key point is that nginx should not treat 500+ HTTP codes as failures. Use proxy_next_upstream to decide when to switch backends.

The cases of error, timeout and invalid_header are always considered unsuccessful attempts, even if they are not specified in the directive. The cases of http_500, http_502, http_503 and http_504 are considered unsuccessful attempts only if they are specified in the directive. The cases of http_403 and http_404 are never considered unsuccessful attempts.

With fail_timeout set to 0, a blocked backend causes many unnecessary attempts, which in turn hurts throughput.
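
A minimal sketch of the proxy_next_upstream point above, assuming a hypothetical upstream named `app_backend` defined elsewhere. The value shown is also the documented default, written out here only to make the behavior explicit.

```nginx
# Sketch only: "app_backend" is an assumed upstream defined elsewhere.
location / {
    proxy_pass http://app_backend;
    # Only connection errors and timeouts cause a retry on another server.
    # http_500 and friends are not listed, so a 500 from the application
    # is passed to the client and is not counted as an unsuccessful attempt.
    proxy_next_upstream error timeout;
}
```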

Livid

2017-07-25 09:36:07 +08:00

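Today I hit a nasty no live upstreams while connecting to upstream error in a production environment. I tried setting both max_fails and fail_timeout to 0, and that seems to have fixed it.

Previously, every upstream timed out error was followed by a long run of no live upstreams while connecting to upstream (which was really the default 10-second fail_timeout wait). With these two settings added, there is now only the occasional upstream timed out.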
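
The two settings described above would look roughly like this; the upstream name and addresses are hypothetical placeholders.

```nginx
# Sketch only: "app_backend" and the addresses are assumptions.
upstream app_backend {
    # max_fails=0 disables the accounting of failed attempts, so a server
    # is never marked unavailable because of earlier errors.
    server 10.0.0.1:8000 max_fails=0 fail_timeout=0;
    server 10.0.0.2:8000 max_fails=0 fail_timeout=0;
}
```

The trade-off raised earlier in the thread still applies: with failure accounting off, a backend that is genuinely down keeps receiving attempts.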

keakon

2017-07-25 10:14:13 +08:00

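I looked at the source: when nginx sees a connect timeout, a read timeout, or a status code >= 300, it tries the next upstream. If that one succeeds, nginx serves its response; if it fails, nginx returns the failure itself.
So for dynamic servers, this really should be disabled.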